IMDb Data – Now available in Amazon S3

  • 5
  • Announcement
  • Updated 3 weeks ago
  • (Edited)
This is an announcement for customers of the IMDb bulk data available via FTP.

We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.

For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.


In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:


  • Data refresh frequency is now daily (previously weekly).
  • IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
  • The files are in tab-separated values (TSV) format.
  • The sets of data we provide are updated to only include the essential ones that help with matching and linking to an IMDb title or name.
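As a rough illustration of what the points above mean for anyone consuming the files, here is a minimal Python sketch that parses TSV text and indexes the rows by identifier. The column names (`tconst`, `primaryTitle`, `startYear`) and the sample rows are assumptions made up for this example; the actual file layout is documented at www.imdb.com/interfaces.

```python
import csv
import io

# Hypothetical sample data. The column names and rows are assumptions for
# illustration only; they are not taken from the announcement.
SAMPLE_TSV = (
    "tconst\tprimaryTitle\tstartYear\n"
    "tt0000001\tExample Title One\t1999\n"
    "tt0000002\tExample Title Two\t2005\n"
)


def index_by_identifier(tsv_text, id_column="tconst"):
    """Parse TSV text and return a dict mapping identifier -> row dict."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary text
    )
    return {row[id_column]: row for row in reader}


titles = index_by_identifier(SAMPLE_TSV)
print(titles["tt0000002"]["primaryTitle"])
```

Because every row carries its own identifier, a lookup or a link back to IMDb needs no title parsing at all.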
As part of housekeeping on the FTP site, the data files will no longer be updated. The list data files will continue to be available at two locations (see below) until February 28, 2017. We strongly encourage FTP site users to switch to the S3 solution at the earliest opportunity to ensure their applications continue to work without interruption.

ftp://ftp.funet.fi/pub/mirrors/ftp.imdb.com/pub/frozendata

ftp://ftp.fu-berlin.de/pub/misc/movies/database/frozendata

 If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license.  The license grants you access to our content via an XML web service, plus the right to use the content in your product or service.  If that interests you, please email licensing@imdb.com.

 If you have any questions or concerns, please share your feedback in this thread.

 Thank you for your continued support.

sv, Official Rep

  • 31 Posts
  • 18 Reply Likes

Posted 2 years ago

  • 5

JW

  • 1 Post
  • 3 Reply Likes
You must be joking. Some sort of convoluted user-registration and user-pays is somehow going to be better than ftp? Seems very stupid and totally unnecessary if you ask me. And AWS would have to be the most convoluted and non-user-friendly system anyone has ever invented.
(Edited)

Scott

  • 2 Posts
  • 5 Reply Likes
For my use case, a personal TV listings application, I am missing aka-titles.list, ratings.list and language.list.

I use aka-titles.list to match TV listing data to IMDB info, because the TV listing data often uses an alternative name for movies.

For ratings.list, unless I am missing something, the top 250 information is not available in the new format, which I use to highlight those movies in the TV listings as well as to track my progress in watching those movies.

I use language.list to show the original language of movies as opposed to the language being broadcast.

Luca Canali

  • 2 Posts
  • 5 Reply Likes
For my personal TV application the aka-titles.list is really a requirement. Consider that outside the USA you really need to use the translated titles, as the English and original titles are just unknown to most people.

The keyword.list data is also required, as it allows movies to be categorized.

Also, please continue to share the data files via FTP/HTTP. Most people won't start using and paying for S3 for their personal projects just to obtain such data.

You are just forcing people to start scraping imdb.com to obtain the same information that before was easily and freely obtainable from FTP/HTTP.

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
It is sad to see the end of what Col Needham started.

I wonder what he would think of this direction for his project. It certainly flies in the face of the entire nature of IMDB's history and values.

Those who continue to need this sort of access might want to check out https://www.themoviedb.org

Luca Canali

  • 2 Posts
  • 5 Reply Likes
Even more frustrating is that the reason for doing this is not understandable.

Really, why force people to use S3 instead of a normal FTP/HTTP download? And why remove so much useful info?

Would be interesting to know if Col Needham is really aware of this...

Vincent Fournols

  • 2901 Posts
  • 4888 Reply Likes
I am afraid so, as he replied to a post above...

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
I missed that, thanks. I'll note that he didn't really say anything supporting the move, mostly just how to do something. That he didn't announce this, and hasn't voiced any support for it to date, speaks volumes.
After so long it seems a shame to move to another service. IMDB used to be synonymous with the best source of movie data. Guess it's now a marketing thing.
What a waste of decades of people contributing to make it better, and now it's something to sell Amazon goods and services.

TheObviousOne

  • 1 Post
  • 8 Reply Likes
This seems to negate the bidirectional, symbiotic nature of IMDb. It seems to purposefully ignore that IMDb wouldn't be the "same" if it weren't for both the contributors and ... IMDb itself. One without the other(s), and the IMDb data of today (what "IMDb" is at its core) would be of many times "poorer" quality. It is true that there is a fundamental asymmetry in how IMDb works, but that has never been the problem; if anything, it is its strength: the contributors who contribute the most certainly aren't the ones who benefit from or consume the most data; contributors give time, and IMDb gives time and all the (necessary) rest.

This "move" from "IMDb" seems unnecessary because there are other models that would greatly reduce operational costs and data consumption without subjecting the bona fide data consumers to this convoluted, Rube Goldberg-esque, nonsensical process. The first that comes to mind, without putting too much thought into it, and with trivial impact (for both interested parties), would be to maintain the FTP process but, instead of being anonymous, tie it to a previously assigned user:password (maybe the same as the IMDb login credentials?) and to a specific time-based quota (X files/bytes could be downloaded daily/monthly/yearly/whatever per FTP/IMDb user). That would separate the "abusers" from the "little guy". It's ironic that the ones who would be most damaged by the "new" process are the ones who "consume" the least and, probably, contribute the most. This "move" equates to shooting birds with a cannon, not the right tool for the job: if the aim is to deter and fight abuse [I infer, as no "plausible" motive was forwarded, the other alternative being some misguided corporate bottom-line measure that simply didn't account for all the factors in their equation -- meaning, there's more to be lost by possibly alienating contributors than to be gained], why does it "burn" all the rest (incidental users: the vast majority, I surmise) at the same time? So unfair and a whole lot of "bad-Karma points"...
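To make the quota idea above concrete, here is a minimal in-memory Python sketch of a per-user daily byte quota. The class name, method name and the 1000-byte limit are all hypothetical, chosen only for illustration; nothing here reflects how IMDb's systems actually work, and a real service would need persistent storage and authentication.

```python
from collections import defaultdict
from datetime import date


class DailyQuota:
    """Track bytes downloaded per user per calendar day (in memory only)."""

    def __init__(self, max_bytes_per_day):
        self.max_bytes_per_day = max_bytes_per_day
        self._used = defaultdict(int)  # (user, day) -> bytes served so far

    def try_consume(self, user, num_bytes, today=None):
        """Record the download and return True if it fits today's quota."""
        day = today or date.today()
        key = (user, day)
        if self._used[key] + num_bytes > self.max_bytes_per_day:
            return False  # over quota: refuse without recording anything
        self._used[key] += num_bytes
        return True


# Hypothetical limit of 1000 bytes per user per day.
quota = DailyQuota(max_bytes_per_day=1000)
```

An occasional user stays far below such a limit and never notices it, while a heavy automated consumer hits it quickly, which is the separation the post argues for.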

Plus, this "move" from "IMDb" also seems shortsighted because it fails to understand the fundamental reason why IMDb is, well, IMDb. It wouldn't be the same quality-wise (meaning data quality) and quantity-wise without the army(s) of busy bees (AKA contributors) constantly mending/tending/pruning/trimming/grafting, 24/7, every square inch of their "digital property". Try to convert that (if you can) to the army(s) of employees that would have to be paid (plus benefits, plus ...) to work fewer hours and with lesser effectiveness (a labor of love isn't really labor, and it can't really be matched). That would be the scenario of having IMDb without happy/motivated contributors (which is where "moves" just like this one would lead, eventually, down the road).

And if IMDb considers that this isn't a problem, because it's "too big to fail" right now and it can do as it pleases just ... because, I'll leave you with this 'food for thought': if (hopefully not) for some unfortunate reason IMDb ceased to exist today, and all the data were lost, a new IMDb would rise tomorrow (let's call it, for argument's sake, NewIMDb), the (existing) contributors would flock over to it, and in a short amount of time something resembling (Old)IMDb would be there, standing proud and tall -- (Old)IMDb nothing but a fading memory... If the exact opposite were to happen (meaning, all the contributors ceased, for some unwanted, unforeseen reason, to support, visit and contribute every minute to the collective hive that is IMDb), the same (i.e., IMDb) would slowly -- but surely -- wither and "die" (vanish).

The longevity (and quality) of IMDb is due to its openness and two-way street nature, not in spite of it.

Final disclaimer: I love IMDb and want it to outlive the planet itself, but I couldn't just watch it doing this to itself and stand idle. I also know this isn't a "conversation" (a conversation doesn't start: "we are doing this, what do you think?"; a conversation starts with: "what do you think of the possibility of changing from this to that?") but I hope smarter (smartest) heads will prevail. For IMDb's sake and success. Including in that all of IMDb's ecosystem. Thank you.



PS: this is what I would also have written if, all of a sudden, Wikipedia started erecting some kind of fences/barriers to some type of its content. NewWikipedia would be fast in its place, without missing a beat...
(Edited)

David Chappelle

  • 6 Posts
  • 8 Reply Likes
The main difference between IMDb and other user-contributed data sources like Wikipedia is that IMDb has a large userbase of industry professionals who are required to update their projects and remove false information that crops up, at least for new and upcoming releases.  So any competitor to IMDb would have to find a replacement for that data.

I have a big investment in code to parse the current .list files, so like many people, I would prefer that the format remain as it is.  I'd even be willing to pay an amount proportional to my personal, non-commercial use to access the data - perhaps $40 per year for weekly updates via FTP.

Vincent Fournols

  • 2901 Posts
  • 4886 Reply Likes
Two reactions to this:
1. I find the new data sets much easier to integrate since they are built around the title and name codes (tt0000000 and nm0000000) as stable keys, whereas in the previous sets the IMDb title was the (very heavy) key, and moreover subject to change over time: typo corrections, language changes... (And I have also invested heavily myself in parsing scripts for the former .list files.)

2. The role of IMDB for US industry professionals has narrowed the initial universal scope of the database, which now sticks only to credits (for what it is worth...). E.g. the uncredited actors, the fired directors, etc. have been removed and are only cited in trivia. This leaves room for less industry-driven competitors.
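The first point above, about stable keys, can be sketched in a few lines of Python. All records below are made up for illustration (the identifier, the titles and the "watched" note are hypothetical). The sketch shows a local table keyed by title string losing its match after a typo correction, while the same table keyed by identifier keeps it.

```python
# Hypothetical records: the same film before and after a title typo correction.
old_snapshot = {"tt0000001": "Exmaple Film (1999)"}
new_snapshot = {"tt0000001": "Example Film (1999)"}

# A local table keyed by the title string, as with the old .list files.
my_notes_by_title = {"Exmaple Film (1999)": "watched"}
# The same local table keyed by the identifier instead.
my_notes_by_id = {"tt0000001": "watched"}


def match_by_title(snapshot, notes):
    """Join local notes to a snapshot using the title string as the key."""
    return {title: notes[title] for title in snapshot.values() if title in notes}


def match_by_id(snapshot, notes):
    """Join local notes to a snapshot using the identifier as the key."""
    return {tid: notes[tid] for tid in snapshot if tid in notes}


print(match_by_title(new_snapshot, my_notes_by_title))  # {} (match lost)
print(match_by_id(new_snapshot, my_notes_by_id))        # {'tt0000001': 'watched'}
```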

sv, Official Rep

  • 31 Posts
  • 18 Reply Likes
We greatly appreciate our customers sharing their use cases and concerns with the new IMDb Datasets. After careful consideration, we will be adding TitleAKAs to the datasets in S3. This will be available in the coming weeks in the S3 bucket: imdb-datasets. We will update this thread and the www.imdb.com/interfaces page once we have more details.

Thank you for your continued support.
(Edited)

Brian Risselada

  • 30 Posts
  • 34 Reply Likes
I appreciate that the AKAs were added. I would like it if you can keep adding more data until we have everything that we used to have before. In particular I'd like to have attributes for director roles listed. After that I'd love to have the full list of genres for each film, company credits, production countries, and languages spoken as some of the next items added.

Nick

  • 3 Posts
  • 11 Reply Likes
I am horrendously disappointed by this change. I actually legitimately cried when my IMDB database updater threw a 404.
I get that you guys want money and to promote S3, but it's kind of a dirty tactic to force me to pay when I have spent almost two years developing and creating my own little in-house 'Netflix' service for my personal collection of movies and TV, built around all of the data that you guys so graciously provided us with before.
My system was already set up to prevent strain on bandwidth on your end: only once a month would it check for updates and then re-download updates/diffs, because I use it all...
I use the titles and all the akas; keywords/taglines/writers/quotes/ratings-reasons/locations/genres/composers/actresses/actors/directors/countries/distributors are all intricately tied into my searching; running-times, ratings, release dates, and various other files are all meshed into a distributed MySQL database, allowing me near real-time data draw.
To know I am going to HAVE to re-build, re-structure, and re-engineer the entire way my system is built around your data truly means I will probably have to scrap and abandon the entire project.
Although it was just for my own personal fun, it saddens me to see you guys make a change that breaks two years of my fun-time work.

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
Very well said.
Given that the clients are now paying the cost of transmitting the data, it is totally unclear why imdb is so focused on withholding data.
This feels non-technical in nature: the use of this alternate data cannot possibly be hurting the revenue model for IMDB, and with "requestor pays" there is no cost to transmit the data.
It is all very curious, and I wish imdb(amazon) would more openly explain why they are doing this.

Nick

  • 3 Posts
  • 11 Reply Likes
I agree as well. I am very curious as to why, or what motivation they have, to make this change. I mean, it's been that way for 20 years now, so why the sudden change? I get and am totally all for improving/innovation, but to entirely cut people out seems backwards.

josh

  • 1 Post
  • 5 Reply Likes
Have you considered also making the new files available via FTP on a less frequent schedule?

For example, you can have daily updates on S3 and monthly updates on FTP.

In this way you can promote S3 and at the same time not disrupt personal and open source projects. S3 is just not an option for a lot of us.

Having keywords info would also be very good.

Thanks

valen

  • 3 Posts
  • 8 Reply Likes

This is a braindead idea. From someone who doesn’t understand (or doesn’t want to understand) why IMDb came to be the number one site for movie & TV data.

I’ll explain.

Countless Contributors offered their time, for free (and time is more valuable than money, once it’s gone, it’s gone, one cannot recover it, unlike money), during thousands of hours, spanning decades to make sure IMDb’s data was correct and up-to-date. Imagine how many employees, man-hours, health benefits and all the rest that IMDb saved since the start... Just imagine...

Then, once the data is there -- good, verified, and perfect -- they sell it. Multiple times. Many, many times. More and more as time goes by. There are many ways they can profit from it (direct and indirect). Even when the access is free there are valuable tie-ins (you can buy the movie from Amazon, for instance; movie theaters; merchandising) where money can be made. All of this is fine. Really. It’s a business. It is data, and it can be “resold” (because it is better than the data other companies can offer) over and over, until infinity. It’s a good consequence of being just data and not something physical (that you cannot sell more than once).

Summarizing, until now: many, many people give away their time, for free, to make good data for IMDb, which in turn makes ample use of it, as many times as it wants, to make money.

Now picture this. You know those people, who give away their time to make something for free for you that you can make money from later, as many times as you want? Here’s an idea, why not make them pay too? Right? Neat idea, isn’t it? They gave their time away so they should give away their money as well, right? Right.

A lousy analogy would be an airline making the pilots and crew buy tickets to be able to board the plane they are supposed to take to its destination. They make the same trip, right? Why shouldn’t they pay, right? It makes perfect sense, right? Right.

The point is: it’s not that they will lose money if they don’t charge Contributors: that can be done in many, many other ways. It’s not that they need to do this. It’s that they want to do this.

I want to leave here some rhetorical questions that boggle the mind.

1 – Why should a Contributor (any Contributor) keep contributing after this? Why should anyone want to? Why should they not contribute their time where it is really appreciated?

2 – Will you, from now on, start to pay the Contributors for their contribution? Because every coin has 2 sides. Because you cannot have it both ways, you cannot have your cake and eat it too: if you charge Contributors it means you say the information (which they provided in the first place) is valuable and so it should be remunerated. Or you don’t charge them for it, because you also got it from them for free. As I said, it can’t be both. Because we all have these things called little grey cells (borrowing from Poirot here)...
And if you retort: “It’s too difficult to pay, because it’s complicated accounting, because...” and so on and so on... I’ll give you back your own “solution”. The “unnatural” solution you are trying to force-feed to people right now, but this one actually makes perfect sense. Also a simple solution. For every, say, 10 contributions, give each Contributor an Amazon Gift Card with a token value of, say, 1 Dollar/Euro/Pound. There. No need to thank me.

3 – Why did you think you could do this and everyone would be fine with it? Why do you think it’s okay to ask for money from the ones that help you most and they would be fine with it and everything would be fine afterwards? And don’t give me the “added value” line because it doesn’t pass the smell test: you cannot add value if you take away loads and loads of data. Just can’t.

4 – Why is this so rushed and quiet and through the summer (I bet many aren’t even aware of this)?

As I said, rhetorical questions.

Braindead, I say.


(Edited)

Vincent Fournols

  • 2901 Posts
  • 4888 Reply Likes
I hope that Amazon's answer to your questions (which I support, along with your position) is as effective and swift as their customer service... But this is probably the critical point: in a previous answer above, the Amazon "official" speaks of customers, when the people reacting here to this move are contributors, probably as thin as my little finger compared to a Saturday night crowd at the movies... It is all about the balance of power. Considering the weight IMDb has gained in the industry (at least on the US side), I am not sure that all the new data IMDb gets every week comes from disinterested contributors like us.

I have another question for Amazon: what would be the cost of maintaining the FTP files, say on a monthly basis instead of a weekly one? I would be ready to pay a (very reasonable) fee to keep access to this data, especially if it is discounted when I contribute to the data feed and updates, as proposed by Valen above. This would seem fair in your merchandising and monetization of what used to be a fantastic collaborative, open and free project, don't you think?

V.

valen

  • 3 Posts
  • 8 Reply Likes

Actually no. People do not understand that Contributors are the real "secret sauce" behind IMDb’s success. They (Contributors) really sell themselves short. Data will "rot" very easily, if it goes without constant verification and supervision. Think of it as a green lawn. Without proper and constant maintenance it would turn into a "jungle" of sorts in a short amount of time.

Somebody mentioned Wikipedia, and it is a bit like that too. Without constant attention Wikipedia would quickly become unusable and worthless. And this change is even more braindead because now more than ever they will need to keep their Contributors happy, because data input will grow more and more as time passes. And this is the time they thought would be a good time to show Contributors (which they'll need more and more of as time goes by) this (symbolic) huge middle finger. As plain common sense, one does not (should not) antagonize the ones that one depends on.

Another way of looking at it (which is also one of my favorites) is that IMDb is the longest-running, most successful case of Crowd Funding ever (it's still going at it), anticipating that movement by many, many years. And the reason it worked so well and so successfully, until now, is 2 things, IMO: 1. Instead of money (which people tend to have more difficulty parting with, curiously enough) it asked "only" for time (which people donate more easily even if it is more valuable); and 2. The "rewards" were also very simple and easily understandable: "you give us your time and we won't cut you off when you want to make use of the data, for your own private pleasure". There’s nothing simpler than that! And so, what do they want to do now? 1. "Give us your time AND your money too"; and 2. "Rewards? No. None of that. Here’s a reduced version of the data we chose for you IF you pay for it, like any other customer. But no rewards." Genius!

You rightly pointed out that data is more and more coming from other interested parties. But that is only a small part of it. All data has errors, inconsistencies, typos, or it can be just plain wrong. "Dumping" data is just the first small step. It's the eyes of the thousands (more than that) that little by little turn it into quality data. Elsewhere there was a comparison with bees, but I see it more as an ant colony where each ant (Contributor) does just a little bit every time, but that little bit is essential for the success of the whole colony. And one ant is dispensable, but if you take all the ants from the colony it'll disappear very soon. These thousands and thousands of micro-corrections forwarded by Contributors resulted in the IMDb data of today. And don't get me wrong, there are errors there now. There will always be errors there. Just not the same ones, and their importance and scale will be smaller and smaller with time. Provided happy Contributors still do their work, as always.

If I can misquote from "Soylent Green": "IMDb is people!". In this case, Contributors. That is the reason it has the better (best) data. And you can quote me on that.

(Edited)

Terry Flynn

  • 4 Posts
  • 11 Reply Likes
I have no issue paying to download IMDb datasets for my own personal use.

What I do have a problem with is moving to a paid model that does not support the same 40+ frequently updated datasets that have been provided for free for so many years. I think IMDb needs to re-address this decision as it will affect people like myself.

I download and insert all datasets into a MySQL database. This has allowed me to develop specific applications and just data mine (discovering new uses for the data). I've taught myself SQL from data-mining IMDb data.

Doc Magro

  • 1 Post
  • 2 Reply Likes
This matches what I do with it, except additionally I am moving toward mining the data for non-profit academic research. I agree the most disappointing thing is the extremely limited data this seems to be turning into.

On a curiosity note, I would love to connect with you and see what you are doing with the data.

Col Needham, Official Rep

  • 6834 Posts
  • 4818 Reply Likes
Official Response
Thanks for the feedback so far on this thread.  Please do continue to post and we will try to take as much as possible into account. This post answers some of the questions raised and there will be further updates based on the next round of feedback. 

On the S3 access issues, we now have a working prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges.  Please watch for an announcement as we convert this into production code. The only thing needed will be an ordinary IMDb user account attached to a valid email address.  We still intend to also make the data available via S3 for those people who find the AWS access tools more convenient and can stay within the free tier of AWS.    

On the general data availability, we are adding the AKA titles to the basic data set accessible to everyone.  Longer term, we are looking at the possibility of daily diff files for at least some of the data in the basic set. 

On the point about contributors, we are looking at extending the range of data available via the http solution based on your contribution history and volume. For top contributors and those people using the data to help us clean it via bulk corrections, this is likely to extend far beyond the current set of data even on the FTP site.  It is not our intention to deny access to the data to those people who have genuinely helped to build it over the years and who want to continue to improve IMDb. We aim to also be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case by case basis. This latter part may take some time to become a fully formed solution so please bear with us. 

The background to all of this is that there is a huge multi-year technology migration project which is nearing completion at IMDb. We have too many complicated old systems around which have been slowing the overall pace of development (I add a bit more detail to this on https://getsatisfaction.com/imdb/topics/why-doesnt-imdb-staff-ever-consult-with-the-contributor-base...).  The move to the new technology has been providing the opportunity to look at the way we operate different parts of the IMDb service.  One of the oldest software systems is the one which publishes the FTP data, and we will soon no longer even be able to generate the .list files once the final pieces of the old IMDb system are decommissioned; at least not without re-writing all of the publication software to connect to the new system and produce an extremely difficult to manipulate text file format which was designed 27 years ago and has not changed in 21 years. Instead, we decided that it would be better to publish the data via a modern system (S3 and soon over https) in a modern format which can be more easily parsed.  The other problem with FTP is that we have no idea how many people are using the data and for what purpose, nor do we know what additional things they may want from the data. From feedback over the years, we knew some of your requirements already, notably (a) access to the title and name constant data (b) an easier to parse format (c) information to help in matching other catalogs to IMDb (d) more frequent updates.  We found ourselves having to guess the remaining requirements until we decided the best way forward was to move the data to a new location within the FTP sites, post an announcement on Get Satisfaction (this thread) and then wait to gather feedback before replying and figuring out what steps to take next (this reply). 

We hope this helps.  We have plenty to be working upon in the meantime, and we will follow-up as we deliver parts of the above. 

Col
Founder & CEO, IMDb.com. 

Jeorj Euler

  • 7265 Posts
  • 9473 Reply Likes
Thanks for your feedback, Col Needham. I'm pleased to know that you have big plans. I hope the things of the future wind up having more merits than demerits.

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
Thank you for responding. At least the timing and some of the motivation for this are clearer.
However, without the ability to answer the simple question "who is in this movie", the data is essentially useless to me.
Sad day for imdb, when it switches from "here is some awesome data, we are excited to see what you do with it" to "Convince us why we should let you see our data".
Reluctantly I am switching to another data provider, themoviedb.org

valen

  • 3 Posts
  • 8 Reply Likes

Sad day for imdb, when it switches from "here is some awesome data, we are excited to see what you do with it" to "Convince us why we should let you see our data"

 

You said it best, right there.

There are 2 main ways to "consume" IMDb data (files): 1 - The Static Mode: people already have a use for the data and go there and grab it for their usual needs; 2 - The Discovery/Explorer/Research Mode: people get some data, and then detect some patterns, get/try different data (files) and see some more patterns, get new ideas and connections, test theories, invent new uses for the data, detect errors/inconsistencies, get yet another batch of different files to retest other hypotheses...

While the first type of Users is perfectly fine and has an important place (plus, those same Users could already have been or they have the potential to become a type 2 User) the real "power" of (IMDb) data use is in the 2nd type.

IMDb wants to categorize/"lock" people in "typical", static/monolithic use cases (which they can't, because there is no such thing when one gets access to such rich and diverse quality data -- the richer the data, the more unlimited the possibilities) without understanding that exploration is where the true power and possibility of the unimpeded access people enjoyed until now lies. This is also the mode where more errors and problems with the data are bound to be discovered and, thus, reported/corrected.

So, they are (re)enabling type 1 Users (which is good), but type 2 Users (where the true value, for both IMDb and the Users/Contributors, really lies) are shackled and constrained in typical use-case boxes that have little or no real use to them (because the most desirable "consumer" of data, for IMDb, is the one that has no [pre]set idea what she/he'll do tomorrow with that data; the ones that ask themselves "What if...?" and then go about checking that out). And these Type 2 Users are also the main Contributors (if not to IMDb directly, at least indirectly) to the Film/TV community. They make all of us appreciate and understand the interconnected nature of the art form. And that'll always return, in the end, to IMDb, in one form or another, because the more (quality) information and the bigger the community, the more people will turn to IMDb (because it is one of the best, most popular places to know more). Curtailing or impeding Type 2 users, in any way, is a substantial self-inflicted wound (the proverbial "shot in the foot").

In spite of building one of the most successful stories of data gathering/maintenance in history, they seem to lack a basic understanding of how this was achieved, how the whole direct/indirect feedback loop works, how the ample availability of their (almost) raw data was like seed(s) for fertile ground(s) -- where they could/would reap the benefits, severalfold, later, down the line, directly or indirectly. This kind of decision seems to blissfully ignore how IMDb arrived at this point in time.

This is akin to the Captain of the Titanic failing to acknowledge the “invisible” 90% of the iceberg that lay below the waterline.

(Edited)

Nick

  • 3 Posts
  • 11 Reply Likes
I don't think I can or ever will understand or be able to wrap my head around this line of logic.

We created this awesome movie database, let's make an API
We made this awesome API, let's give everyone access to all the data!
20 years pass
Now that we've given everyone all the data for 20 years, and the API is old
let's update it!
People can't be using ALL our data, right? No way that people are actually using all the data we provided them over the course of 20 years just because we can't see analytics for it. Not possible. Nope.
Well, just remove access to 9/10ths of the data then.
It's fine if you guys change the format and work to update the way it's parsed, remove redundancy etc.; I am ALL for that, trust me. But plain and simple, not giving me something I already use is going to destroy so many projects and development ecosystems. I don't think that we have a number that could go high enough to represent the amount you are killing by removing access to so much of the data. You're literally killing an unending (infinite) number of projects.
(Edited)
Photo of Ron

Ron

  • 190 Posts
  • 131 Reply Likes
On the point about contributors, we are looking at extending the range of data available via the http solution based on your contribution history and volume.

We aim to also be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case by case basis. This latter part may take some time to become a fully formed solution so please bear with us

Any update(s) on the above?
Photo of Manfred Polak

Manfred Polak

  • 5 Posts
  • 7 Reply Likes
That is a huge drawback for me. I'm not only a film buff, I also write articles about old and rare films for a non-commercial blog, and I do extensive research for that purpose. I have been using the IMDb list files for more than 15 years, and I download the updates almost every week (in the first years the diff files; later, once I had DSL, the list files). I wrote a script (which makes use of wget) that determines if new files are available, so the download is started automatically.
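
For readers who want to do something similar, here is a minimal sketch of that kind of update check. It is not the wget-based script described above, just an illustration in Python. It assumes the server reports a modification time as an HTTP-style date string, and `needs_download` is a hypothetical helper name.

```python
import os
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_download(remote_last_modified: str, local_path: str) -> bool:
    """Return True if the local copy is missing or older than the remote one.

    remote_last_modified is an HTTP-style date string such as
    "Fri, 22 Sep 2017 00:00:00 GMT" (assumed to be reported by the server).
    """
    remote_time = parsedate_to_datetime(remote_last_modified)
    if remote_time.tzinfo is None:
        # Treat a timestamp without zone information as UTC.
        remote_time = remote_time.replace(tzinfo=timezone.utc)

    if not os.path.exists(local_path):
        return True

    local_time = datetime.fromtimestamp(os.path.getmtime(local_path), tz=timezone.utc)
    return remote_time > local_time
```

wget's own `-N` (`--timestamping`) option performs a comparable check, so a plain wget call in a scheduled job may be enough in practice.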

And yes, I need ALL list files. Not every file every day, but every file from time to time. I'm using the program AMDbFront (don't look for it - it has disappeared from the Internet since its author didn't develop it any more) to convert the files into a MySQL database. AMDbFront is also the viewer for the data. I'm using it in GUI mode almost daily, but sometimes I make complex queries using SQL. One example: A few years ago, scientists from Northwestern University developed a method to determine automatically which are the most culturally significant films (the winner was THE WIZARD OF OZ), and they used the IMDb list files for that purpose (movie-links.list in particular). Here is their paper:

http://www.pnas.org/content/112/5/1281.full

I managed to reproduce their most important result (the long-gap citation count) with my local IMDb data, using a SQL query I wrote. The cited article only covers US films, but I used the method to create respective lists for many other countries, and I published my results in the above-mentioned blog.
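
A rough sketch of the idea, in Python rather than the SQL query mentioned above. It assumes the long-gap citation count of a film is the number of references to it from films released at least some number of years later. The 25-year default is an assumption to be checked against the paper, and the film identifiers in any input are whatever keys you use locally.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def long_gap_citation_counts(
    release_year: Dict[str, int],
    links: Iterable[Tuple[str, str]],
    min_gap_years: int = 25,  # assumed threshold, verify against the paper
) -> Dict[str, int]:
    """Count, per cited film, the references made by much later films.

    links holds (citing_film, cited_film) pairs, e.g. derived from
    movie-links.list; films without a known release year are skipped.
    """
    counts: Counter = Counter()
    for citing, cited in links:
        if citing not in release_year or cited not in release_year:
            continue
        if release_year[citing] - release_year[cited] >= min_gap_years:
            counts[cited] += 1
    return dict(counts)
```

Sorting the returned dictionary by value in descending order then gives a ranking per country or per decade, depending on which films are passed in.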

I also wrote a script (in VBScript) which adds a table to the MySQL database that contains all films I have on DVD or Blu-ray. The table contains the title (exactly as it's in movies.list) and flags for seen/unseen, region code and short/long films. That information is taken from a text file I maintain for that purpose.  With appropriate SQL queries, I can answer questions like "how many short films from France from the 1930s do I have on DVD" or "who is the actor/actress with whom I have the most films on DVD"?
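
To give an idea of what such a query can look like, here is a small self-contained sketch using Python's built-in sqlite3 module instead of MySQL. The table and column names (`movies`, `my_discs`, `is_short` and so on) are hypothetical and do not reflect the schema AMDbFront creates; the rows are made-up demo data.

```python
import sqlite3

# Hypothetical schema: one table of films, one table of discs I own.
SCHEMA = """
CREATE TABLE movies (title TEXT PRIMARY KEY, year INTEGER, country TEXT, is_short INTEGER);
CREATE TABLE my_discs (title TEXT PRIMARY KEY, seen INTEGER, region TEXT);
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database filled with made-up demo rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO movies VALUES (?, ?, ?, ?)",
        [
            ("Demo Short A (1934)", 1934, "France", 1),
            ("Demo Short B (1951)", 1951, "France", 1),
            ("Demo Feature C (1937)", 1937, "France", 0),
            ("Demo Short D (1936)", 1936, "France", 1),
        ],
    )
    conn.executemany(
        "INSERT INTO my_discs VALUES (?, ?, ?)",
        [
            ("Demo Short A (1934)", 1, "2"),
            ("Demo Short B (1951)", 0, "2"),
            ("Demo Feature C (1937)", 1, "2"),
        ],
    )
    return conn


def count_owned_shorts(conn: sqlite3.Connection, country: str, decade_start: int) -> int:
    """Count owned short films from one country released in one decade."""
    row = conn.execute(
        """
        SELECT COUNT(*)
        FROM my_discs AS d
        JOIN movies AS m ON m.title = d.title
        WHERE m.country = ?
          AND m.is_short = 1
          AND m.year BETWEEN ? AND ?
        """,
        (country, decade_start, decade_start + 9),
    ).fetchone()
    return row[0]
```

Joining on the exact title string mirrors the approach described above, where the disc list stores titles exactly as they appear in movies.list.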

Well, all of this will become impossible with the new dataset format. I surely won't switch to it (even if it were free of charge); instead, I will freeze my installation at its current state. That's the lesser of two evils for me.
Photo of sv

sv, Official Rep

  • 31 Posts
  • 18 Reply Likes
Official Response
We are currently working on a solution for data access via an HTTP endpoint as an alternative to the direct AWS S3 access. As part of this solution, we are also looking into a contributor-exclusive solution, providing extended datasets based on a person's contribution history and volume. Given these developments, we are postponing the shutdown of the IMDb FTP sites to November 7, 2017.

Please stay tuned for more updates.  Thanks!
(Edited)
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
My takeaway from Col Needham's recent post was that the primary issue standing between providing complete data and not was the significant development effort to make the data available from the new systems.

But now, it sounds from this post as if IMDB will have the logic in place to provide fuller datasets, and will provide them to contributors.

Now it appears to be purely that IMDB wishes simply to prevent the public from accessing the collected data for free.

This appears to no longer be about development efforts, rather it appears to be exclusively about cutting off free data. 

Frankly, IMDB needs to get its story straight.  If the data is in fact available, and  more complete data sets would be there for us ... if we provide IMDB with free (aka donated) labor, then the entire premise of Col's justification looks shaky .. at best.  

I hope you guys come to your senses, but I no longer believe we are being told the real story here, so I have now completed moving my data sources away from IMDB, and will donate both time and money elsewhere.

I've enjoyed using IMDB's data since well before it sold out to Amazon, and it's sad to see so many years of cooperation trampled by this ill-considered project.
Photo of Jeorj Euler

Jeorj Euler

  • 7238 Posts
  • 9426 Reply Likes
Thanks for postponing the shutdown, IMDb staff. We can at least grant y'all that.
(Edited)
Photo of Jeorj Euler

Jeorj Euler

  • 7238 Posts
  • 9426 Reply Likes
Due to my notification settings, I was informed: "Nobody liked your comment". That's kind of funny like a pun. Thanks, Nobody.
Photo of Gardner von Holt

Gardner von Holt

  • 16 Posts
  • 28 Reply Likes
With the revised shutdown date now days away, I note that none of the promised updates have occurred.

Really a shame to see an organization lose its way.
Photo of Jeorj Euler

Jeorj Euler

  • 7238 Posts
  • 9426 Reply Likes
I can foresee the IMDb contributor base potentially splintering into "patriots" and "loyalists". I may be no patriot, but the loyalists are really starting to disgust me.
Photo of Jonathan Yoni Revah

Jonathan Yoni Revah

  • 4 Posts
  • 8 Reply Likes
The switch, in its current implementation, is horrible.

IMDb is a community-driven website that relies on the mass of users for nearly everything, from reviews to ratings, to episode and movie release dates, yet somehow most of those things are missing from the data dumps. You owe it to the community to give back, and 'complete' data dumps in the form of an S3 bucket where the devs pay for bandwidth are the least you could do.

The tens of thousands of users that left reviews or ratings didn't do so for the benefit of a corporation. We contribute information to large repositories like Wikipedia or IMDb because we want people to have access to it, and we do so hoping that the gatekeepers will do their best to keep it all out there and easily available... but instead you guys have gone the opposite way. Everything needs to be accessed through your interfaces or apps, what you do give back is anorexic in comparison to what you take, and yet you still rely on users to feed you information for your business model to even work...

I urge you to seriously reconsider this philosophy or at the very least have a moment of honesty with the developer community and explain yourselves better. There is no reason to have omitted all of this information and I'm starting to think that there is also no reason to contribute or rely on your website.

You guys have spent the past 30 years harvesting your users for data while providing decent dumps of your database, and now that we've all learned to rely on you guys, you're taking that away. Take a page from Google: "Don't be evil".
(Edited)
Photo of timon

timon

  • 8 Posts
  • 7 Reply Likes
"This more robust and reliable solution will replace the IMDb FTP sites, which will be retired on December 28, 2017."

How about actors.list.gz? Has it not been updated since Sep 22, 2017?

-timon
Photo of Jeorj Euler

Jeorj Euler

  • 7265 Posts
  • 9467 Reply Likes
They're making off like bandits with a whole lot of customer-generated content.
Photo of timon

timon

  • 8 Posts
  • 7 Reply Likes
Who owns this content? I have not read any "small print" carefully (at all).
Photo of Chris H.

Chris H., Employee

  • 76 Posts
  • 100 Reply Likes
Official Response
Thank you for the continued feedback.

Earlier in this thread, Col referred to a prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. This system will require an ordinary IMDb user account attached to a valid email address. However, this system is not yet quite ready for production so to help address some of the concerns raised about the 'Requester Pays' access via S3, today we activated an https entry point to provide access to the basic datasets. This https location is here, https://datasets.imdbws.com/ The page http://www.imdb.com/interfaces/ has been updated with this information.

We are finalizing the extended datasets and access model and I will post an update about that as soon as it is ready. 

The final build of the data that gets published to the FTP mirrors occurred yesterday so those mirrors contain the final FTP snapshot. While the data on the FTP servers will not be updated going forward, we will not remove the data for at least the next few weeks so people who need that data can still download it.
 
Photo of Vincent Fournols

Vincent Fournols

  • 2901 Posts
  • 4837 Reply Likes
A first quick look at title.basics.tsv.gz is very promising, and much more satisfactory than the former FTP offer.
Could you please state the charset used to publish the text files?
Photo of Chris H.

Chris H., Employee

  • 76 Posts
  • 100 Reply Likes
Hello Vincent,

    The files are in the UTF-8 character set. I have pushed out an update to the http://www.imdb.com/interfaces/ page to add that to the file details section.

Best regards,
Chris.
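
For anyone loading the files, here is a minimal reading sketch in Python. It relies on the files being gzip-compressed, tab-separated and UTF-8 encoded, as stated in this thread. It also assumes that missing values are written as `\N`, which should be checked against the file details on the interfaces page. `read_imdb_tsv` is a hypothetical helper name.

```python
import csv
import gzip
from typing import Dict, Iterator, Optional


def read_imdb_tsv(path: str) -> Iterator[Dict[str, Optional[str]]]:
    """Yield one dict per data row of a gzip-compressed, UTF-8 TSV file.

    Fields equal to the literal marker \\N (assumed to mean "missing")
    are returned as None.
    """
    with gzip.open(path, mode="rt", encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quote characters inside titles as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield {
                key: (None if value == "\\N" else value)
                for key, value in row.items()
            }
```

Because the function is a generator, a large file such as title.basics.tsv.gz can be processed row by row without loading it into memory at once.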
Photo of Owen Rees

Owen Rees

  • 203 Posts
  • 304 Reply Likes
My quick look at title.akas.tsv.gz suggests it is UTF-8 - tt0000008,5 is Cyrillic and looks sensible, emacs reports that the underlying file was UTF-8 and displays the characters correctly.

That Cyrillic aka title does not appear in the aka-titles.list.gz file on the FTP site.

It looks to me as if the new system is including data that the old system could not handle.
Photo of Manfred Polak

Manfred Polak

  • 5 Posts
  • 7 Reply Likes
The latest and final FTP snapshot is online now. But once again there is no updated actors.list! It's still from September 22. That's very annoying and embarrassing - 3 months should have been time enough to fix this problem. At least the final dump should contain recent files only.

By the way, on the German FTP server there is a new directory "frozendata". The files are now available both in
ftp://ftp.fu-berlin.de/pub/misc/movie... and
ftp://ftp.fu-berlin.de/pub/misc/movie..., but I guess that in the long run, only "frozendata" will remain.
Photo of lucacanali

lucacanali

  • 1 Post
  • 2 Reply Likes
Thanks for providing the HTTPS interface. It's really appreciated.

Something I don't understand is about the genre information. For all the movies that have more than 3 genres, only the first 3 in alphabetical order are reported.
For example for "Dunkirk" (tt5013056) the War and Thriller ones are omitted.
And Dunkirk really is a "War" film... I cannot get the reason for this limitation.

In fact, imdb.com also has this problem. The top bar lists only 3 genres, but the "Genres:" section further down has all of them.

It would also be really great if you could add the "Country" and the "Keywords" information. This info is really important to categorize movies. Without it I have to resort to scraping your site, which is something I would like to avoid.

Thanks,
Luca
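
A small sketch of how one might list the titles that could be affected, in Python. It assumes each row is a dict with a `tconst` key and a comma-separated `genres` field, with `\N` standing for a missing value. Both function names are hypothetical, and a title with exactly three genres is only a candidate, since it may genuinely have just three.

```python
from typing import Dict, Iterable, List, Optional


def split_genres(value: Optional[str]) -> List[str]:
    """Split a comma-separated genres field; empty or \\N gives an empty list."""
    if not value or value == "\\N":
        return []
    return value.split(",")


def titles_at_genre_cap(rows: Iterable[Dict[str, Optional[str]]], cap: int = 3) -> List[str]:
    """Return the ids of titles whose genre list has exactly `cap` entries."""
    return [
        row["tconst"]
        for row in rows
        if len(split_genres(row.get("genres"))) == cap
    ]
```

The resulting ids can then be compared against the full genre list on the site for a sample of titles.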
Photo of chuck.kahn

chuck.kahn

  • 14 Posts
  • 7 Reply Likes
I don't see an equivalent to the ftp interface's business.list.  Will this be coming later to the S3 interface?
Photo of Brian Risselada

Brian Risselada

  • 30 Posts
  • 34 Reply Likes
Can we please get a response to this?
Photo of Tim Griffin

Tim Griffin

  • 1 Post
  • 3 Reply Likes
We have been using the FTP data in our undergraduate database course at the University of Cambridge, UK. See https://www.cl.cam.ac.uk/teaching/1718/Databases/materials.html. I was very sad to see that this data is no longer available. Yes, they were a pain to parse, but that was a real issue. However, it was one of the richest large data sets on the web that students did not require domain-specific training in order to understand. I guess I'll be using the dump I snagged in August of 2017 for the next couple of years until I find another data source for the course. BTW, the new data file title.principals.tsv.gz has been empty every time I have looked. Perhaps this will contain the data I need? For example, we want actors in movies so we can do the "Kevin Bacon number" type queries.
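
As an illustration of that kind of query, here is a minimal breadth-first search sketch in Python. It assumes the credits are available as (title id, person id) pairs, for example extracted from title.principals.tsv.gz once it is populated. The function name `bacon_numbers` and the input shape are assumptions for this example.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, Set, Tuple


def bacon_numbers(credits: Iterable[Tuple[str, str]], source: str) -> Dict[str, int]:
    """Return the co-appearance distance from `source` to every reachable person.

    credits holds (title_id, person_id) pairs. Two people are one step
    apart if they share at least one title. Unreachable people are omitted.
    """
    titles_of: Dict[str, Set[str]] = defaultdict(set)
    people_in: Dict[str, Set[str]] = defaultdict(set)
    for title_id, person_id in credits:
        titles_of[person_id].add(title_id)
        people_in[title_id].add(person_id)

    distance = {source: 0}
    queue = deque([source])
    seen_titles: Set[str] = set()
    while queue:
        person = queue.popleft()
        for title in titles_of[person]:
            if title in seen_titles:
                continue  # its cast was already expanded at an equal or smaller distance
            seen_titles.add(title)
            for co_star in people_in[title]:
                if co_star not in distance:
                    distance[co_star] = distance[person] + 1
                    queue.append(co_star)
    return distance
```

In a database course the same result is usually expressed with repeated self-joins or a recursive query; the in-memory version above is only meant to show the traversal.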

Photo of Chris H.

Chris H., Employee

  • 76 Posts
  • 100 Reply Likes
Thank you for reporting the missing data from the principals file, I have reported it to the tech team for fixing.

Best regards,
Chris.
Photo of Chris H.

Chris H., Employee

  • 76 Posts
  • 100 Reply Likes
title.principals.tsv.gz now has the data populated again.

Best regards,
Chris.