We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.
For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.
In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:
- Data refresh frequency is now daily (previously weekly).
- IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
- The files are in tab separated values (TSV) format.
- The sets of data we provide are updated to only include the essential ones that help with matching and linking to an IMDb title or name.
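As a rough illustration of what the tab-separated format with embedded identifiers means for a consumer, here is a minimal Python sketch. The column names (`tconst`, `primaryTitle`, `startYear`), the `\N` missing-value marker and the second sample row are assumptions made for the example, not something stated in this announcement; check www.imdb.com/interfaces for the actual file layout.

```python
import csv
import io

# Assumption: absent values are written as a literal backslash-N.
MISSING = r"\N"


def parse_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
    )
    rows = []
    for row in reader:
        # Map the missing-value marker to None so callers can test for it.
        rows.append({key: (None if value == MISSING else value)
                     for key, value in row.items()})
    return rows


# Hypothetical sample; only the tt0068646 identifier is a real IMDb ID.
SAMPLE = (
    "tconst\tprimaryTitle\tstartYear\n"
    "tt0068646\tThe Godfather\t1972\n"
    "tt0000000\tHypothetical Title\t\\N\n"
)

rows = parse_tsv(SAMPLE)
by_id = {row["tconst"]: row for row in rows}  # identifier makes matching easy
```

Keying rows by the identifier column is what makes matching against another catalog or linking back to an IMDb page straightforward.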
If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license. The license grants you access to our content via an XML web service, plus the right to use the content in your product or service. If that interests you, please email firstname.lastname@example.org.
If you have any questions or concerns, please share your feedback in this thread.
Thank you for your continued support.
This is a braindead idea. From someone who doesn’t understand (or doesn’t want to understand) why IMDb came to be the number one site for movie & TV data.
Countless Contributors offered their time, for free (and time is more valuable than money, once it’s gone, it’s gone, one cannot recover it, unlike money), during thousands of hours, spanning decades to make sure IMDb’s data was correct and up-to-date. Imagine how many employees, man-hours, health benefits and all the rest that IMDb saved since the start... Just imagine...
Then, once the data is there, good, verified, and perfect, they sell it. Multiple times. Many, many times. More and more as time goes by. There are many ways they can profit from it (direct and indirect). Even when the access is free there are valuable tie-ins (you can buy the movie from Amazon for instance, movie theaters, merchandising) where money can be made. All of this is fine. Really. It’s a business. It is data, and it can be “resold” (because it is better than the data other companies can offer) over and over, until infinity. It’s a good consequence of being just data and not something physical (that you cannot sell more than once).
Summarizing, until now: many, many people give away their time, for free, to make good data for IMDb, which in turn makes ample use of it, as many times as it wants, to make money.
Now picture this. You know those people, who give away their time to make something for free for you that you can make money from later, as many times as you want? Here’s an idea, why not make them pay too? Right? Neat idea, isn’t it? They gave their time away so they should give away their money as well, right? Right.
A lousy analogy would be an airline making the pilots and crew buy tickets to be able to board the plane they are supposed to take to its destination. They make the same trip, right? Why shouldn’t they pay, right? It makes perfect sense, right? Right.
The point is: it’s not that they will lose money if they don’t charge Contributors: that can be done in many, many other ways. It’s not that they need to do this. It’s that they want to do this.
I want to leave here some rhetorical questions that boggle the mind.
1 – Why should a Contributor (any Contributor) keep contributing after this? Why should anyone want to? Why should they not contribute their time where it is really appreciated?
2 – Will you, from now on, start to pay the Contributors for their contributions? Because every coin has 2 sides. Because you cannot have it both ways, you cannot have your cake and eat it too: if you charge Contributors, it means you say the information (which they provided in the first place) is valuable, and so it should be remunerated. Or you don’t charge them for it, because you also got it from them for free. As I said, it can’t be both. Because we all have these things called little grey cells (borrowing from Poirot here)...
And if you retort: “It’s too difficult to pay, because it’s complicated accounting, because...” and so on and so on... I’ll give you back your own “solution”. The “unnatural” solution you are trying to force-feed to people right now, but this one actually makes perfect sense. Also a simple solution. For every, say, 10 contributions from a Contributor, give an Amazon Gift Card with a token value of, say, 1 Dollar/Euro/Pound. There. No need to thank me.
3 – Why did you think you could do this and everyone would be fine with it? Why do you think it’s okay to ask for money from the ones that help you most, and that they would be fine with it and everything would be fine afterwards? And don’t give me the “added value” line because it doesn’t pass the smell test: you cannot add value if you take away loads and loads of data. Just can’t.
4 – Why is this so rushed and quiet and through the summer (I bet many aren’t even aware of this)?
As I said, rhetorical questions.
Braindead, I say.
What I do have a problem with is moving to a paid model that does not support the same 40+, frequently updated, datasets that have been provided for free for so many years. I think IMDb need to re-address this decision as it will affect people like myself.
I download and insert all datasets into a MySQL database. This has allowed me to develop specific applications and just data mine (discovering new uses for the data). I've taught myself SQL from data-mining IMDb data.
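For readers who want to try the same workflow with the newer tab-separated files, a minimal sketch follows. It uses Python's built-in `sqlite3` as a stand-in for MySQL so that it runs without a server; the `load_tsv` helper, the table name, the column names and the `\N` missing-value marker are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def load_tsv(conn, table, text):
    """Create `table` from the TSV header row and insert every data row."""
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    header = next(reader)
    # Every column is stored as TEXT; casting is left to later queries.
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        # Assumption: a literal backslash-N marks a missing value.
        ([None if value == r"\N" else value for value in row]
         for row in reader),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
# Tiny sample standing in for a real downloaded file.
load_tsv(conn, "names",
         "nconst\tprimaryName\tdeathYear\n"
         "nm0000033\tAlfred Hitchcock\t1980\n"
         "nm0000000\tHypothetical Person\t\\N\n")
name = conn.execute(
    'SELECT primaryName FROM names WHERE nconst = ?', ("nm0000033",)
).fetchone()[0]
```

The same pattern should carry over to MySQL with a suitable driver, although bulk-loading statements would likely be faster for full-size files.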
So, I'm now in the position of urging for a much more comprehensive data set to be made available or for the advanced search tools to be more advanced. In addition, basically, Luca Canali is right. Amazon/IMDb is taking things away from us. Whatever it is that we will be gaining in return remains to be seen.
On the S3 access issues, we now have a working prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. Please watch for an announcement as we convert this into production code. The only thing needed will be an ordinary IMDb user account attached to a valid email address. We still intend to also make the data available via S3 for those people who find the AWS access tools more convenient and can stay within the free tier of AWS.
On the general data availability, we are adding the AKA titles to the basic data set accessible to everyone. Longer term, we are looking at the possibility of daily diff files for at least some of the data in the basic set.
On the point about contributors, we are looking at extending the range of data available via the HTTP solution based on your contribution history and volume. For top contributors and those people using the data to help us clean it via bulk corrections, this is likely to extend far beyond the current set of data even on the FTP site. It is not our intention to deprive of access to the data those people who have genuinely helped to build it over the years and who want to continue to improve IMDb. We also aim to be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case-by-case basis. This latter part may take some time to become a fully formed solution, so please bear with us.
The background to all of this is that there is a huge multi-year technology migration project which is nearing completion at IMDb. We have too many complicated old systems around which have been slowing the overall pace of development (I add a bit more detail to this on https://getsatisfaction.com/imdb/topics/why-doesnt-imdb-staff-ever-consult-with-the-contributor-base...). The move to the new technology has been providing the opportunity to look at the way we operate different parts of the IMDb service.
One of the oldest software systems is the one which publishes the FTP data, and we will soon no longer even be able to generate the .list files once the final pieces of the old IMDb system are decommissioned; at least not without re-writing all of the publication software to connect to the new system and produce an extremely difficult-to-manipulate text file format which was designed 27 years ago and has not changed in 21 years. Instead, we decided that it would be better to publish the data via a modern system (S3 and soon over https) in a modern format which can be more easily parsed.
The other problem with FTP is that we have no idea how many people are using the data and for what purpose, nor do we know what additional things they may want from the data. From feedback over the years, we knew some of your requirements already, notably (a) access to the title and name constant data, (b) an easier-to-parse format, (c) information to help in matching other catalogs to IMDb, and (d) more frequent updates. We found ourselves having to guess the remaining requirements until we decided the best way forward was to move the data to a new location within the FTP sites, post an announcement on Get Satisfaction (this thread) and then wait to gather feedback before replying and figuring out what steps to take next (this reply).
We hope this helps. We have plenty to be working upon in the meantime, and we will follow-up as we deliver parts of the above.
Founder & CEO, IMDb.com.
And yes, I need ALL list files. Not every file every day, but every file from time to time. I'm using the program AMDbFront (don't look for it - it has disappeared from the Internet since its author didn't develop it any more) to convert the files into a MySQL database. AMDbFront is also the viewer for the data. I'm using it in GUI mode almost daily, but sometimes I make complex queries using SQL. One example: A few years ago, scientists from Northwestern University developed a method to determine automatically which are the most culturally significant films (the winner was THE WIZARD OF OZ), and they used the IMDb list files for that purpose (movie-links.list in particular). Here is their paper:
I managed to reproduce their most important result (the long-gap citation count) with my local IMDb data, using a SQL query I wrote. The cited article only covers US films, but I used the method to create respective lists for many other countries, and I published my results in the above-mentioned blog.
I also wrote a script (in VBScript) which adds a table to the MySQL database that contains all films I have on DVD or Blu-ray. The table contains the title (exactly as it's in movies.list) and flags for seen/unseen, region code and short/long films. That information is taken from a text file I maintain for that purpose. With appropriate SQL queries, I can answer questions like "how many short films from France from the 1930s do I have on DVD" or "who is the actor/actress with whom I have the most films on DVD"?
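A minimal sketch of the kind of query described above, using Python's built-in `sqlite3` instead of MySQL so it runs without a server. The two-table schema (`movies`, `owned`), its column names and all rows are hypothetical placeholders, not the poster's actual database or real titles.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: one table derived from the IMDb data, one table of
# titles owned on disc, joined on the exact title string.
conn.executescript("""
CREATE TABLE movies (title TEXT PRIMARY KEY, year INTEGER,
                     country TEXT, is_short INTEGER);
CREATE TABLE owned  (title TEXT PRIMARY KEY, seen INTEGER, region TEXT);
""")
conn.executemany("INSERT INTO movies VALUES (?, ?, ?, ?)", [
    ("Placeholder Short A (1932)", 1932, "France", 1),
    ("Placeholder Short B (1938)", 1938, "France", 1),
    ("Placeholder Feature C (1935)", 1935, "France", 0),
    ("Placeholder Short D (1946)", 1946, "France", 1),
])
conn.executemany("INSERT INTO owned VALUES (?, ?, ?)", [
    ("Placeholder Short A (1932)", 1, "2"),
    ("Placeholder Feature C (1935)", 0, "2"),
    ("Placeholder Short D (1946)", 1, "2"),
])

# "How many short films from France from the 1930s do I have on DVD?"
QUERY = """
SELECT COUNT(*)
FROM owned AS o
JOIN movies AS m ON m.title = o.title
WHERE m.is_short = 1
  AND m.country = 'France'
  AND m.year BETWEEN 1930 AND 1939
"""
owned_french_shorts_1930s = conn.execute(QUERY).fetchone()[0]
```

Joining on the title string mirrors the description above; with the newer datasets a join on the title identifier would presumably be more robust.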
Well, this all will become impossible with the new dataset format. I surely won't switch to it (even if it would be free of costs), but I will freeze my installation at the current state. That's the lesser of two evils for me.
The HTTP access should be a good alternative way to obtain the data, I guess.
And I'm sure that adding the AKA titles will help a lot of users, including myself.
Will the languages of the movies be included as well, by any chance?
AKA titles are mainly important when dealing with non-English movies, but I think there is no way with the new data files to determine whether a movie is English or non-English.
I was wondering if there is a channel through which I could ask IMDb for authorisation to use IMDb movie synopsis data in my thesis.
I am aware of the 'IMDb Data – Now available in Amazon S3' announcement, but I was not able to find an interface that publishes movie synopses.
Your response is greatly appreciated.
Please stay tuned for more updates. Thanks!
I can understand why a transition needs to be made and that it's not easy to achieve parity in the new system. My use case is a little different from the others on this list so I thought I'd chime in.
I do economic research on the television industry, asking questions like how characteristics of production companies affect the quality of the show. To that end I use a bunch of data that's not included in the new subset on S3. AKA titles definitely, as mentioned by others; they are useful for matching across different datasets. The distributor and production company list files help me track which shows were on which networks as well as which were affiliated with each production company. Producer and writer lists let me, for example, connect a show with an Emmy-winning executive producer or creator. Lists like language and runtime help me screen out noise from the data, especially for shows that were not very popular and may be incorrectly labelled in other variables. And the full set of genres is important to capture all the shows in a category; if a show has more than 3 genres it may not be included if, for example, I try to understand what is relevant for ratings in comedies. For ratings, knowing the distribution of ratings is useful to understand how targeted a show was.
Anyway I hope that is useful. Happy to expand on this more if it would help with your triaging of features.
I just checked and I have submitted 1,014 updates to IMDB, which I suspect would appear to be a pittance compared to the top contributors, but I hope it is still evidence that I care about the completeness and accuracy of the IMDB database.
One of my favorite things on the IMDB site is the advanced search, but I wanted to do so much more and I wanted to integrate the data with my own custom algorithms and personal logs. I discovered the .list files provided by IMDB and this has been my basis for an exciting adventure. I am not a professional IT person or programmer. But to do what I wanted I taught myself Access and eventually SQL and Python and Django. I only download the files again about every year because I don't watch a lot of new films but I love adding and updating the data for older and more obscure films.
I'll admit that when I first started, even though I didn't have any database or programming experience, I found the .list files seemed antiquated, but I wrote my own programs to extract the data and place them into an SQL database to work the way I wanted it to. So I am kind of glad that IMDB is moving to some new structures that will hopefully be easier to use even though it will probably mean countless hours for me to rewrite a lot of my program for getting the updated data into my database.
The thing I am most concerned about though is making sure all of the data is still available. I use almost all of it, and what is currently in the S3 files would not be worth me using anymore. So I am glad that it has been indicated that there may be means for the rest of the data to be available still. I will list the information below that I use and how I use it.
What I use the most:
Movie (with stats like release year, number of votes, and rating)
Company (production and distribution)
Country of production
Person (all persons of all roles, and all of their credits on all movies, but most of all directors)
What I also use often:
Film negative format
Printed film format
What I also use but rarely:
Also I used ALL OF THE ATTRIBUTES of these items
All of this is strictly for personal use. It is primarily for me to log films I've seen and track data about them and to run interesting queries in the database to find interesting results and patterns within the database regarding connections between all of these stats listed above. However it also can result in me identifying information that needs to be updated in the IMDB database that I can then submit updates for. I have had thoughts before about some day creating something available for public use, but that may be a pipe dream, and if I did ever do it I would of course pay for the license to do so.
So I hope that through this transition all of this information will be made available in a complete and easy way for contributors like me who wish to have the information.
One other thing I'd like to request: the most useful thing you could add to the datasets you make available would be for movies and for the people to list the IMDB ID used in the IMDB URL for each of the movies. For instance in the dataset for movies for the entry for "The Godfather" it would list the ID as tt0068646 which corresponds to the webpage for "The Godfather" which is http://www.imdb.com/title/tt0068646/. Or for instance for the person Alfred Hitchcock the ID nm0000033 which refers to his page at http://www.imdb.com/name/nm0000033/
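The mapping from identifier to page URL described above can be sketched in a few lines of Python. The `imdb_url` helper is hypothetical, and the rule that identifiers are `tt` or `nm` followed by digits is an assumption generalized from the two examples in this post.

```python
import re

# Assumption: title IDs start with "tt" and name IDs with "nm",
# each followed by digits only.
_ID_PATTERN = re.compile(r"^(tt|nm)\d+$")


def imdb_url(identifier):
    """Return the IMDb page URL for a title (tt...) or name (nm...) ID."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a recognised IMDb identifier: {identifier!r}")
    kind = "title" if match.group(1) == "tt" else "name"
    return f"http://www.imdb.com/{kind}/{identifier}/"


godfather_url = imdb_url("tt0068646")
hitchcock_url = imdb_url("nm0000033")
```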
Thank you for considering my situation.
IMDb is a community-driven website that relies on the mass of users for nearly everything, from reviews to ratings, to episode and movie release dates, yet somehow most of those things are missing from the data dumps. You owe it to the community to give back, and 'complete' data dumps in the form of an S3 bucket where the devs pay for bandwidth are the least you could do.
The tens of thousands of users that left reviews or ratings didn't do so for the benefit of a corporation. We contribute information to large repositories like Wikipedia or IMDb because we want people to have access to it, and we do so hoping that the gatekeepers will do their best to keep it all out there and easily available... but instead you guys have gone the opposite way. Everything needs to be accessed through your interfaces or apps, what you do give back is anorexic in comparison to what you take, and yet you still rely on users to feed you information for your business model to even work...
I urge you to seriously reconsider this philosophy or at the very least have a moment of honesty with the developer community and explain yourselves better. There is no reason to have omitted all of this information and I'm starting to think that there is also no reason to contribute or rely on your website.
You guys have spent the past 30 years harvesting your users for data while providing decent dumps of your database, and now that we've all learned to rely on you guys, you're taking that away. Take a page from Google: "Don't be evil".
Thank you for your continued feedback. As we review your feedback and work on the HTTP solutions described by Col and sv above, we are revising the shutdown date of the IMDb FTP sites to December 28, 2017.
More updates will be provided closer to that time. Thank you.
Chris
Could some IMDb staff confirm when the last batch of flat files/data sets will be issued and made available on the FTP sites?
The Berlin one still mentions 2017-09-10, a Col Needham post mentions November 7, and the latest set available was issued on November 24...
Thank you in advance for some clarification of the roadmap!
Earlier in this thread, Col referred to a prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. This system will require an ordinary IMDb user account attached to a valid email address. However, this system is not yet quite ready for production, so to help address some of the concerns raised about the 'Requester Pays' access via S3, today we activated an https entry point to provide access to the basic datasets. The https location is https://datasets.imdbws.com/. The page http://www.imdb.com/interfaces/ has been updated with this information.
We are finalizing the extended datasets and access model and I will post an update about that as soon as it is ready.
The final build of the data that gets published to the FTP mirrors occurred yesterday so those mirrors contain the final FTP snapshot. While the data on the FTP servers will not be updated going forward, we will not remove the data for at least the next few weeks so people who need that data can still download it.
I've invested a tremendous amount of time in creating code to parse these dumps. At my current bill rate, it's tens of thousands of dollars. As a film enthusiast and software professional, these dumps were a great way to do both at the same time.
While this is very sad for my home, for-fun project, I think it is unfair to think that Amazon is doing this as a money grab. Amazon has world class data infrastructure in AWS, and it's only natural that eventually they would want to move people onto AWS and away from legacy systems that were built in the 90s. I don't know much about the internals of IMDb, but I would expect that the system that created the FTP dumps is at least a couple generations away from the system that feeds IMDb.com.
That said, are there any tutorials on specifically getting IMDb data through S3? I am pretty comfortable with the programming involved, but have not worked with Amazon S3 before. imdb.com/interfaces states the entry point is https://datasets.imdbws.com/, but I need details on how to construct the SOAP or REST calls.
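One possible reading of the https entry point is that it serves the dataset files as plain downloads, in which case no SOAP or REST call needs to be constructed. The Python sketch below works under that assumption; the helper names are hypothetical, and the file name `name.basics.tsv.gz` is taken from a later post in this thread rather than from official documentation.

```python
import gzip
import itertools
import shutil
import urllib.request

BASE_URL = "https://datasets.imdbws.com/"


def dataset_url(filename):
    """Build the download URL for one dataset file (assumed flat layout)."""
    return BASE_URL + filename


def download_dataset(filename, dest_path):
    """Stream one gzip-compressed dataset file to disk with a plain GET."""
    with urllib.request.urlopen(dataset_url(filename)) as response, \
            open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


def first_lines(gz_path, count=5):
    """Return the first `count` lines of a gzip-compressed text file."""
    with gzip.open(gz_path, "rt", encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in itertools.islice(handle, count)]


# Example usage (performs a network request, so it is left commented out):
# download_dataset("name.basics.tsv.gz", "name.basics.tsv.gz")
# print(first_lines("name.basics.tsv.gz"))
```

Streaming with `shutil.copyfileobj` avoids holding a whole file in memory, which matters if the files are large.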
I was so happy that this new file now includes the language of the movie as well,
but either the languages of the movies are wrong in that file, or something is missing.
If a movie does not have an akas title, is it in English by default?
And why do movies with type = original and isOriginalTitle = 1 have no language defined at all?
Thanks in advance for any help.
This seems to go in the right direction.
I would like to download special customized fields to my watchlist download or customized list that include Title, Release Year, Country of origin, type of title (movie, tv, miniseries, etc...), cast + character, director, description. The watchlist download currently has title, full date, director, type of show, and numerous links & ratings (that I do not want). WHO CAN HELP ME WITH THIS??? If I need IMDBpro, I will definitely get it. By the way does Amazon own IMDB? What about the software programs referenced -- do I need those like open source software like Linx, Apache, GNU and Linux utilities. I am just sole proprietor helping an inmate with compiling movie data not a major corporation. PLEASE HELP PLEASE HELP. YOU CAN REACH ME AT email@example.com, firstname.lastname@example.org, email@example.com and/or 540 915 0683
in ttprincipals.principalCast (and probably the other multivalued fields), I cannot figure out the sorting criteria: it is neither the one displayed on screen, nor the nn9999999 code itself, nor the resulting alphabetical order.
Please, could an IMDB rep clarify this?
Thanks in advance.
According to your website (http://www.imdb.com/interfaces/): "The dataset files can be accessed and downloaded from https://datasets.imdbws.com/. The data is refreshed daily." I'm looking at the downloadable file name.basics.tsv.gz. According to that file (downloaded 12/30/2017), Victor Brooks (nm0003499) is not deceased, but if you look up nm0003499 on your website, he died in 1999. Same for Leslie Adams (nm0011145): he is alive according to name.basics.tsv.gz, but if you look nm0011145 up on your website he is deceased as of 1993. Are these dataset files no longer updated? Thanks
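A check like the one described above can be scripted. The Python sketch below scans tab-separated lines for given name identifiers and reports whether a death year is recorded. The column names (`nconst`, `primaryName`, `deathYear`) and the `\N` missing-value marker are assumptions about the file layout, and the sample rows only mirror what this post reports seeing in the file, not verified data.

```python
import csv


def find_names(lines, wanted_ids):
    """Return {nconst: row} for each wanted ID found in the TSV lines."""
    wanted = set(wanted_ids)
    found = {}
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row["nconst"] in wanted:
            found[row["nconst"]] = row
            if len(found) == len(wanted):
                break  # stop early once every requested ID has been seen
    return found


def is_marked_deceased(row):
    """True when the row carries a death year (assumed marker: backslash-N)."""
    return row["deathYear"] != r"\N"


# Sample lines mirroring what the post reports for the downloaded file.
SAMPLE_LINES = [
    "nconst\tprimaryName\tdeathYear\n",
    "nm0003499\tVictor Brooks\t\\N\n",
    "nm0011145\tLeslie Adams\t\\N\n",
]
found = find_names(SAMPLE_LINES, ["nm0003499", "nm0011145"])
not_marked = sorted(key for key, row in found.items()
                    if not is_marked_deceased(row))

# With a real download, the lines could come from the compressed file:
# import gzip
# with gzip.open("name.basics.tsv.gz", "rt", encoding="utf-8") as handle:
#     found = find_names(handle, ["nm0003499", "nm0011145"])
```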