We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3 via an HTTPS link. Using the new interface, customers can bulk-access IMDb title and name data.
For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.
In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:
- Data refresh frequency is now daily (previously weekly).
- IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
- The files are in tab-separated values (TSV) format.
- The sets of data we provide are updated to only include the essential ones that help with matching and linking to an IMDb title or name.
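As a minimal sketch of reading a file in this format, assuming a header row and the `\N` marker for missing values that appears in the sample rows quoted later in this thread (the function name `parse_tsv` is hypothetical, not part of any IMDb tooling):

```python
import csv
import io

NULL_MARKER = r"\N"  # assumption: missing values are written as \N, as in the sample rows in this thread


def parse_tsv(text):
    """Yield one dict per data row, mapping the \\N marker to None."""
    # QUOTE_NONE keeps quotation marks inside titles as literal characters.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield {key: (None if value == NULL_MARKER else value) for key, value in row.items()}


# Header and row taken from the sample quoted in this thread, with tabs restored.
sample = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
)
rows = list(parse_tsv(sample))
print(rows[0]["tconst"], rows[0]["endYear"], rows[0]["genres"])
```

Keying each row by column name means the `tconst` identifier can be used directly to link a row back to IMDb.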
If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license. The license grants you access to our content via an XML web service, plus the right to use the content in your product or service. If that interests you, please email email@example.com.
If you have any questions or concerns, please share your feedback in this thread.
Thank you for your continued support.
I use aka-titles.list to match TV listing data to IMDB info, because the TV listing data often uses an alternative name for movies.
For ratings.list, unless I am missing something, the top 250 information is not available in the new format, which I use to highlight those movies in the TV listings as well as to track my progress in watching those movies.
I use language.list to show the original language of movies as opposed to the language being broadcast.
The keyword.list data is also required, as it allows me to categorize movies.
Also, please continue to share the data files over FTP/HTTP. Most people won't start using and paying for S3 for their personal projects just to obtain such data.
You are just forcing people to start scraping imdb.com to obtain the same information that before was easily and freely obtainable from FTP/HTTP.
I wonder what he would think of this direction for his project; it certainly flies in the face of the entire nature of IMDb's history and values.
Those who continue to need this sort of access might want to check out https://www.themoviedb.org
This "move" from "IMDb" seems unnecessary because there are other models that would greatly reduce operational costs and data consumption without subjecting the bona fide data consumers to this convoluted, Rube Goldberg-esque, nonsensical process. The first that comes to mind, without putting too much thought into it and with trivial impact for both interested parties, would be to keep the FTP process but, instead of being anonymous, tie it to a previously assigned user:password (maybe the same as the IMDb login credentials?) and to a specific time-based quota (X files/bytes could be downloaded daily/monthly/yearly/whatever per FTP/IMDb user). That would separate the "abusers" from the "little guy". It is ironic that the ones most damaged by the "new" process would be the ones that "consume" the least and, probably, contribute the most. This "move" equates to shooting birds with a cannon, not the right tool for the job. If the aim is to deter and fight abuse (I infer, as no "plausible" motive was put forward; the other alternative being some misguided corporate bottom-line measure that simply didn't account for all the factors in its equation, meaning there's more to be lost by possibly alienating contributors than to be gained), why does it "burn" all the rest (incidental users, the vast majority, I surmise) at the same time? So unfair, and a whole lot of "bad-Karma points"...
Plus, this "move" from "IMDb" also seems shortsighted because it fails to understand the fundamental reason why IMDb is, well, IMDb. It wouldn't be the same quality-wise (meaning data quality) and quantity-wise without the army(s) of busy bees (AKA contributors) constantly mending/tending/pruning/trimming/grafting, 24/7, every square inch of their "digital property". Try to convert that (if you can) to the army(s) of employees that would have to be paid (plus benefits, plus ...) to work fewer hours and with lesser effectiveness (a labor of love isn't really labor, and it can't really be matched). That would be the scenario of having IMDb without happy/motivated contributors (which is where "moves" just like this one would lead, eventually, down the road).
And if IMDb considers that this isn't a problem, because it's "too big to fail" right now and can do as it pleases just ... because, I'll leave you with this 'food for thought': if (hopefully not) for some unfortunate reason IMDb ceased to exist today, and all the data were lost, a new IMDb would rise tomorrow (let's call it, for argument's sake, NewIMDb), the (existing) contributors would flock to it, and in a short amount of time something resembling (Old)IMDb would be there, standing proud and tall -- (Old)IMDb nothing but a fading memory... If the exact opposite were to happen (meaning all the contributors ceased, for some unwanted, unforeseen reason, to support, visit and contribute every minute to the collective hive that is IMDb), the same (i.e., IMDb) would slowly -- but surely -- wither and "die" (vanish).
The longevity (and quality) of IMDb is due to its openness and two-way street nature, not in-spite of it.
Final disclaimer: I love IMDb and want it to outlive the planet itself, but I couldn't just watch it doing this to itself and stand idle. I also know this isn't a "conversation" (a conversation doesn't start: "we are doing this, what do you think?"; a conversation starts with: "what do you think of the possibility of changing from this to that?") but I hope smarter (smartest) heads will prevail. For IMDb's sake and success. Including in that all of IMDb's ecosystem. Thank you.
PS: this is what I would also have written if, all of a sudden, Wikipedia started erecting some kind of fences/barriers to some type of its content. NewWikipedia would be fast in its place, without missing a beat...
I get that you guys want money and want to promote S3, but it's kind of a dirty tactic to force me to pay for something after I have spent almost two years developing my own little in-house 'Netflix' service for my personal collection of movies and TV, built around all of the data that you guys so graciously provided us with before.
My system was already set up to prevent strain on bandwidth on your end: only once a month would it check for updates and then re-download updates/diffs, because I use it all...
I use the titles and all the akas; keywords/taglines/writers/quotes/ratings-reasons/locations/genres/composers/actresses/actors/directors/countries/distributors are all intricately tied into my searching; and running-times, ratings, release dates, and various other files are all meshed into a distributed MySQL database, allowing me near real-time data draw.
Knowing that I am going to HAVE to re-build, re-structure, and re-engineer the entire way my system is built around your data truly means I will probably have to scrap and abandon the entire project.
Although it was just for my own personal fun, it saddens me to see you guys make a change that breaks two years of my fun-time work.
For example, you can have daily updates on S3 and monthly updates on FTP.
In this way you can promote S3 without disrupting personal and open source projects. S3 is just not an option for a lot of us.
Having keyword info would also be very good.
This is a braindead idea. From someone who doesn’t understand (or doesn’t want to understand) why IMDb came to be the number one site for movie & TV data.
Countless Contributors offered their time, for free (and time is more valuable than money: once it’s gone, it’s gone, and one cannot recover it, unlike money), over thousands of hours spanning decades, to make sure IMDb’s data was correct and up-to-date. Imagine how many employees, man-hours, health benefits and all the rest IMDb has saved since the start... Just imagine...
Then, once the data is there (good, verified, and perfect), they sell it. Multiple times. Many, many times. More and more as time goes by. There are many ways they can profit from it (direct and indirect). Even when the access is free there are valuable tie-ins (you can buy the movie from Amazon, for instance; movie theaters; merchandising) where money can be made. All of this is fine. Really. It’s a business. It is data, and it can be “resold” (because it is better than the data other companies can offer) over and over, until infinity. It’s a good consequence of being just data and not something physical (that you cannot sell more than once).
Summarizing, until now: many, many people give away their time, for free, to make good data for IMDb, which in turn makes ample use of it, as many times as it wants, to make money.
Now picture this. You know those people who give away their time to make something for you, for free, that you can make money from later, as many times as you want? Here’s an idea, why not make them pay too? Right? Neat idea, isn’t it? They gave their time away so they should give away their money as well, right? Right.
A lousy analogy would be an airline making the pilots and crew buy tickets to board the plane they are supposed to fly to its destination. They make the same trip, right? Why shouldn’t they pay, right? It makes perfect sense, right? Right.
The point is: it’s not that they will lose money if they don’t charge Contributors: that can be done in many, many other ways. It’s not that they need to do this. It’s that they want to do this.
I want to leave here some rhetorical questions that boggle the mind.
1 – Why should a Contributor (any Contributor) keep contributing after this? Why should anyone want to? Why should they not contribute their time where it is really appreciated?
2 – Will you, from now on, start to pay the Contributors for their contribution? Because every coin has 2 sides. Because you cannot have it both ways, you cannot have your cake and eat it too: if you charge Contributors it means you say the information (which they provided in the first place) is valuable and so it should be remunerated. Or you don’t charge them for it, because you also got it from them for free. As I said, it can’t be both. Because we all have these things called little grey cells (borrowing from Poirot here)...
And if you retort: “It’s too difficult to pay, because it’s complicated accounting, because...” and so on and so on... I’ll give you back your own “solution”. The “unnatural” solution you are trying to force-feed to people right now, but this one actually makes perfect sense. Also a simple solution. For every, say, 10 contributions, give each Contributor an Amazon Gift Card with a token value of, say, 1 Dollar/Euro/Pound. There. No need to thank me.
3 – Why did you think you could do this and everyone would be fine with it? Why do you think it’s okay to ask for money from the ones that help you most, and that they would be fine with it and everything would be fine afterwards? And don’t give me the “added value” line because it doesn’t pass the smell test: you cannot add value if you take away loads and loads of data. Just can’t.
4 – Why is this so rushed and quiet and through the summer (I bet many aren’t even aware of this)?
As I said, rhetorical questions.
Braindead, I say.
What I do have a problem with is moving to a paid model that does not support the same 40+, frequently updated, datasets that have been provided for free for so many years. I think IMDb need to re-address this decision as it will affect people like myself.
I download and insert all datasets into a MySQL database. This has allowed me to develop specific applications and just data mine (discovering new uses for the data). I've taught myself SQL from data-mining IMDb data.
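A minimal sketch of this kind of load step, assuming tab-separated input with a header row and `\N` for missing values. It uses Python's built-in SQLite module as a stand-in for MySQL so that it runs without a server; the table layout, the `load_titles` helper and the `tt9999999` row are hypothetical:

```python
import csv
import io
import sqlite3


def load_titles(conn, tsv_text):
    """Create a small titles table and fill it from tab-separated text."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS titles ("
        "tconst TEXT PRIMARY KEY, primary_title TEXT, start_year INTEGER, genres TEXT)"
    )
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        # Map the \N marker to None so it is stored as SQL NULL.
        clean = {k: (None if v == "\\N" else v) for k, v in row.items()}
        year = int(clean["startYear"]) if clean["startYear"] is not None else None
        records.append((clean["tconst"], clean["primaryTitle"], year, clean["genres"]))
    conn.executemany("INSERT OR REPLACE INTO titles VALUES (?, ?, ?, ?)", records)
    conn.commit()
    return len(records)


# The second row is invented for illustration only.
sample = (
    "tconst\tprimaryTitle\tstartYear\tgenres\n"
    "tt0000001\tCarmencita\t1894\tDocumentary,Short\n"
    "tt9999999\tHypothetical Title\t\\N\t\\N\n"
)
conn = sqlite3.connect(":memory:")
loaded = load_titles(conn, sample)
missing_year = conn.execute("SELECT COUNT(*) FROM titles WHERE start_year IS NULL").fetchone()[0]
print(loaded, missing_year)
```

Once the rows are in a table, ad hoc SQL of the kind described above can be run against it.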
On the S3 access issues, we now have a working prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. Please watch for an announcement as we convert this into production code. The only thing needed will be an ordinary IMDb user account attached to a valid email address. We still intend to also make the data available via S3 for those people who find the AWS access tools more convenient and can stay within the free tier of AWS.
On the general data availability, we are adding the AKA titles to the basic data set accessible to everyone. Longer term, we are looking at the possibility of daily diff files for at least some of the data in the basic set.
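Until such diff files exist, a consumer could compute a comparable diff locally from two daily snapshots. The sketch below assumes each snapshot has been read into a dict keyed by identifier; the format of any official diff files is not specified in this thread, and `diff_snapshots` and the `tt9999999` entry are hypothetical:

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts keyed by identifier) and report the differences."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {key: new[key] for key in old.keys() & new.keys() if old[key] != new[key]}
    return added, removed, changed


# Yesterday's and today's rows for the same title, plus one invented new title.
yesterday = {"tt0000001": {"primaryTitle": "Carmencita", "genres": "Documentary,Short"}}
today = {
    "tt0000001": {"primaryTitle": "Carmencita", "genres": None},
    "tt9999999": {"primaryTitle": "Hypothetical Title", "genres": None},
}
added, removed, changed = diff_snapshots(yesterday, today)
print(sorted(added), removed, sorted(changed))
```

Only the added and changed rows then need to be written to a local database, which keeps a daily refresh cheap.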
On the point about contributors, we are looking at extending the range of data available via the HTTP solution based on your contribution history and volume. For top contributors and those people using the data to help us clean it via bulk corrections, this is likely to extend far beyond the current set of data even on the FTP site. It is not our intention to deny access to the data to those people who have genuinely helped to build it over the years and who want to continue to improve IMDb. We aim to also be able to grant specific permissions to specific customers for specific extra subsets of data as required on a case by case basis. This latter part may take some time to become a fully formed solution so please bear with us.
The background to all of this is that there is a huge multi-year technology migration project which is nearing completion at IMDb. We have too many complicated old systems around which have been slowing the overall pace of development (I add a bit more detail to this on https://getsatisfaction.com/imdb/topics/why-doesnt-imdb-staff-ever-consult-with-the-contributor-base...). The move to the new technology has been providing the opportunity to look at the way we operate different parts of the IMDb service. One of the oldest software systems is the one which publishes the FTP data, and we will soon no longer even be able to generate the .list files once the final pieces of the old IMDb system are decommissioned; at least not without re-writing all of the publication software to connect to the new system and produce an extremely difficult-to-manipulate text file format which was designed 27 years ago and has not changed in 21 years. Instead, we decided that it would be better to publish the data via a modern system (S3 and soon over https) in a modern format which can be more easily parsed. The other problem with FTP is that we have no idea how many people are using the data and for what purpose, nor do we know what additional things they may want from the data. From feedback over the years, we knew some of your requirements already, notably (a) access to the title and name constant data (b) an easier to parse format (c) information to help in matching other catalogs to IMDb (d) more frequent updates. We found ourselves having to guess the remaining requirements until we decided the best way forward was to move the data to a new location within the FTP sites, post an announcement on Get Satisfaction (this thread) and then wait to gather feedback before replying and figuring out what steps to take next (this reply).
We hope this helps. We have plenty to be working on in the meantime, and we will follow up as we deliver parts of the above.
Founder & CEO, IMDb.com.
And yes, I need ALL list files. Not every file every day, but every file from time to time. I'm using the program AMDbFront (don't look for it - it has disappeared from the Internet since its author didn't develop it any more) to convert the files into a MySQL database. AMDbFront is also the viewer for the data. I'm using it in GUI mode almost daily, but sometimes I make complex queries using SQL. One example: A few years ago, scientists from Northwestern University developed a method to determine automatically which are the most culturally significant films (the winner was THE WIZARD OF OZ), and they used the IMDb list files for that purpose (movie-links.list in particular). Here is their paper:
I managed to reproduce their most important result (the long-gap citation count) with my local IMDb data, using a SQL query I wrote. The cited article only covers US films, but I used the method to create respective lists for many other countries, and I published my results in the above-mentioned blog.
I also wrote a script (in VBScript) which adds a table to the MySQL database that contains all films I have on DVD or Blu-ray. The table contains the title (exactly as it's in movies.list) and flags for seen/unseen, region code and short/long films. That information is taken from a text file I maintain for that purpose. With appropriate SQL queries, I can answer questions like "how many short films from France from the 1930s do I have on DVD" or "who is the actor/actress with whom I have the most films on DVD"?
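A query of this kind can be sketched as follows. This is only an illustration under assumptions: it uses Python's built-in SQLite module rather than MySQL and VBScript, the two-table layout and the `count_owned_shorts` helper are hypothetical, and every row is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE movies (title TEXT PRIMARY KEY, year INTEGER, country TEXT);
    CREATE TABLE collection (title TEXT PRIMARY KEY, seen INTEGER, region_code INTEGER, is_short INTEGER);
    """
)
# Hypothetical rows, invented for illustration only.
conn.executemany("INSERT INTO movies VALUES (?, ?, ?)", [
    ("Hypothetical Short A (1934)", 1934, "France"),
    ("Hypothetical Short B (1951)", 1951, "France"),
    ("Hypothetical Feature C (1936)", 1936, "France"),
])
conn.executemany("INSERT INTO collection VALUES (?, ?, ?, ?)", [
    ("Hypothetical Short A (1934)", 1, 2, 1),
    ("Hypothetical Short B (1951)", 0, 2, 1),
    ("Hypothetical Feature C (1936)", 1, 2, 0),
])


def count_owned_shorts(conn, country, first_year, last_year):
    """Count owned short films from one country within a range of years."""
    query = (
        "SELECT COUNT(*) FROM collection AS c "
        "JOIN movies AS m ON m.title = c.title "
        "WHERE c.is_short = 1 AND m.country = ? AND m.year BETWEEN ? AND ?"
    )
    return conn.execute(query, (country, first_year, last_year)).fetchone()[0]


french_1930s_shorts = count_owned_shorts(conn, "France", 1930, 1939)
print(french_1930s_shorts)
```

The join is on the exact title string, mirroring the collection table described above, so it only works while both tables spell titles identically.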
Well, all of this will become impossible with the new dataset format. I surely won't switch to it (even if it were free of charge), but I will freeze my installation at its current state. That's the lesser of two evils for me.
Please stay tuned for more updates. Thanks!
IMDb is a community-driven website that relies on the mass of users for nearly everything, from reviews to ratings to episode and movie release dates, yet somehow most of those things are missing from the data dumps. You owe it to the community to give back, and 'complete' data dumps in the form of an S3 bucket where the devs pay for bandwidth are the least you could do.
The tens of thousands of users that left reviews or ratings didn't do so for the benefit of a corporation. We contribute information to large repositories like Wikipedia or IMDb because we want people to have access to it, and we do so hoping that the gatekeepers will do their best to keep it all out there and easily available... but instead you guys have gone the opposite way. Everything needs to be accessed through your interfaces or apps, what you do give back is anorexic in comparison to what you take, and yet you still rely on users to feed you information for your business model to even work...
I urge you to seriously reconsider this philosophy or at the very least have a moment of honesty with the developer community and explain yourselves better. There is no reason to have omitted all of this information and I'm starting to think that there is also no reason to contribute or rely on your website.
You guys have spent the past 30 years harvesting your users for data while providing decent dumps of your database, and now that we've all learned to rely on you guys, you're taking that away. Take a page from Google: "Don't be evil".
Earlier in this thread, Col referred to a prototype of a system which can make the same S3 data available to you via HTTP from IMDb directly without requiring any S3 registration and free from any possibility of AWS charges. This system will require an ordinary IMDb user account attached to a valid email address. However, this system is not yet quite ready for production, so to help address some of the concerns raised about the 'Requester Pays' access via S3, today we activated an HTTPS entry point to provide access to the basic datasets. The HTTPS location is https://datasets.imdbws.com/. The page http://www.imdb.com/interfaces/ has been updated with this information.
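A minimal sketch of fetching and reading one file from that location. It assumes the files are gzip-compressed TSV and that a file named `title.basics.tsv.gz` is published there; the interfaces page is the place to confirm actual file names. The helpers `dataset_url`, `fetch` and `read_gzipped_tsv` are hypothetical, and the demonstration at the bottom runs offline on in-memory data rather than calling `fetch`:

```python
import csv
import gzip
import io
import urllib.request

BASE_URL = "https://datasets.imdbws.com/"


def dataset_url(file_name):
    """Build the download URL for one dataset file."""
    return BASE_URL + file_name


def fetch(file_name, timeout=60):
    """Download one file and return its raw (still gzipped) bytes."""
    with urllib.request.urlopen(dataset_url(file_name), timeout=timeout) as response:
        return response.read()


def read_gzipped_tsv(raw_bytes):
    """Decompress gzipped TSV bytes and return the rows as a list of dicts."""
    text = gzip.decompress(raw_bytes).decode("utf-8")
    reader = csv.DictReader(io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Offline demonstration: compress a tiny sample instead of hitting the network.
sample = "tconst\tprimaryTitle\ntt0000001\tCarmencita\n"
rows = read_gzipped_tsv(gzip.compress(sample.encode("utf-8")))
print(dataset_url("title.basics.tsv.gz"), rows[0]["primaryTitle"])
```

For a real run, `read_gzipped_tsv(fetch("title.basics.tsv.gz"))` would load the whole file into memory, so a streaming reader may suit large files better.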
We are finalizing the extended datasets and access model and I will post an update about that as soon as it is ready.
The final build of the data that gets published to the FTP mirrors occurred yesterday so those mirrors contain the final FTP snapshot. While the data on the FTP servers will not be updated going forward, we will not remove the data for at least the next few weeks so people who need that data can still download it.
All entries have been replaced with \N now. As far as I can see, not a single movie has a genre anymore.
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
tt0000001 short Carmencita Carmencita 0 1894 \N 1 Documentary,Short
tconst titleType primaryTitle originalTitle isAdult startYear endYear runtimeMinutes genres
tt0000001 short Carmencita Carmencita 0 1894 \N 1 \N
Did this happen by accident? Or is this data no longer available, period?
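A check like the following could flag this kind of regression after each download. It is a minimal sketch that assumes tab-separated input with a header row; `null_fraction` is a hypothetical helper, and the two samples are the rows quoted above with tabs restored:

```python
import csv
import io


def null_fraction(tsv_text, column):
    """Return the share of data rows whose value in `column` is the \\N marker."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    total = 0
    missing = 0
    for row in reader:
        total += 1
        if row[column] == "\\N":
            missing += 1
    return missing / total if total else 0.0


header = "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\tstartYear\tendYear\truntimeMinutes\tgenres\n"
with_genres = header + "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\tDocumentary,Short\n"
without_genres = header + "tt0000001\tshort\tCarmencita\tCarmencita\t0\t1894\t\\N\t1\t\\N\n"

print(null_fraction(with_genres, "genres"), null_fraction(without_genres, "genres"))
```

A sudden jump in this fraction for a column between two daily files suggests a publishing problem rather than a real data change.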