API/Bulk Data Access

  • 19
  • Idea
  • Updated 1 year ago
  • Implemented
Hi!

We’re in the process of reviewing how we make our data available to the outside world with the goal of making it easier for anyone to innovate and answer interesting questions with the data. If you use our current ftp solution to get data [http://www.imdb.com/interfaces] or are thinking about it, we’d love to get your feedback on the current process for accessing data and what we could do to make it easier for you to use in the future. We have some specific questions below, but would be just as happy hearing about how you access and use IMDb data to make a better overall experience.

1. What works/doesn’t work for you with the current model?
2. Do you access entertainment data from other sources in addition to IMDb?
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
5. Does how you plan on using the data impact how you want to have it delivered?
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
7. Are our T&Cs easy for you to understand and follow?


Thanks for your time and feedback!

Regards,

Aaron
IMDb.com

Gideon, Employee

  • 6 Posts
  • 4 Reply Likes

Posted 4 years ago


Mike Jensen

  • 1 Post
  • 0 Reply Likes
1. What works/doesn't work for you with the current model?
I'm having a hard time getting the correct data. The structure of the whole dump is kind of confusing, and the format is not easy to work with.
2. Do you access entertainment data from other sources in addition to IMDb?
Movie related, no.
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
A large CSV file, for the ability to import the whole database, would be really nice.
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
Definitely, as long as there were no restrictions on connections or number of calls.
5. Does how you plan on using the data impact how you want to have it delivered?
No
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
JSON would be just fine
7. Are our T&Cs easy for you to understand and follow?
Yes

Chris Boshoff

  • 1 Post
  • 0 Reply Likes
I vote for a JSON REST API.

quaz

  • 1 Post
  • 0 Reply Likes
1. What works/doesn’t work for you with the current model?
It works within a quite complicated desktop database, I've been programming for over 10 years now. At some point I used most of the offline files with help of amdb, but this was rather cumbersome. Now I parse some files directly, but I quite hate it. Mostly due to the lack of the IMDb-ID. I regularly only use ratings.list at the moment. My sweet spot for the amount of movies is 50k, although I keep using the whole 3.3 million movies.list for checking. (The 50k include 500m+ votes, the whole rest has a measly 10% of that)
All in all I'm primarily happy that the current model exists at all.

2. Do you access entertainment data from other sources in addition to IMDb?
Yes, several. But unfortunately my second favourite movie site moviepilot.de discarded its API some time ago. And deep linking with an IMDb-ID stopped some time after that.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
YAY, primary keys. Although 200+ chars long movie titles can be primary keys too... :(
I don't know how many hours and perhaps days of my life I wasted, only because the existing text files don't contain the IMDb-ID. So more useful for sure. Downloading huge amounts of data is really a big problem for me nowadays, but I'd be happy to get only parts of the data from time to time.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
Depends on the abilities and restrictions of the API. If I could get the whole data for my 50k within a reasonable time (let's say a month), it could be an improvement. Since my old method to update my ratings doesn't work any more, I'd especially like the possibility to cast votes via API. I got several thousand in queue :)

5. Does how you plan on using the data impact how you want to have it delivered?
I take what I can get. Provide convenient ways to get the data and I'll think of a usage. The longer I think about it, the nicer API sounds.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
JSON is fine. I wouldn't mind some kind of Microsoft database files, but that's only me :)

7. Are our T&Cs easy for you to understand and follow?
Easy to understand: somewhat. Easy to follow: no :)


Is there a time frame for the possible changes?

Ballstone International

  • 1 Post
  • 2 Reply Likes
Parsing your data is a bit prohibitive. Please move towards any standard. CSV, XML, or JSON.

My preference is a simple CSV format. Each file can contain the id and the relevant data. Let us import the data to the database that suits us. Users of different platforms can post scripts if they are required.
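A minimal sketch of the kind of import this would allow, assuming a hypothetical CSV export with `id`, `title` and `year` columns. The column names, sample rows and table name are illustrative only and do not describe an actual IMDb file:

```python
import csv
import io
import sqlite3

# Hypothetical CSV export: one id column plus the relevant data.
SAMPLE_CSV = """id,title,year
tt0000001,Example Movie,1999
tt0000002,"Another Movie, Part 2",2004
"""

def load_titles(csv_text, conn):
    """Load rows from CSV text into a 'titles' table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS titles "
        "(id TEXT PRIMARY KEY, title TEXT, year INTEGER)"
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [(r["id"], r["title"], int(r["year"])) for r in reader]
    conn.executemany("INSERT INTO titles VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)

conn = sqlite3.connect(":memory:")  # swap in a file path for a persistent database
count = load_titles(SAMPLE_CSV, conn)
title = conn.execute("SELECT title FROM titles WHERE id = 'tt0000002'").fetchone()[0]
print(count, title)
```

SQLite is used here only because it ships with Python; the same rows could be loaded into whichever database suits the user.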

JSON is another option, but there is no need to get fancy. Target the lowest common denominator. Keep it simple. CSV please.

If you want to get fancier with CSV, offer a dynamic export that includes the columns required by the user.

Maybe also consider .xz compression.

For the interactive api, maybe some people would like that, but I prefer downloading the full dumps. An api would be an appropriate place for JSON.

Thanks for providing the data. I appreciate your consideration.

Joe

  • 3 Posts
  • 0 Reply Likes
aka-titles.list hasn't been updated on the ftp sites since April! This is likely a simple bug -- can you please, please check into it and get it working again? Thank you!

1. The current model has a barrier to entry but works well after overcoming that. If that's the price of data that's usable under the current license, so be it.
2. Yes, I access other data sources.
3. You have primary keys now (the imdb-style title). Having primary keys that are the IMDB ID (e.g., tt0000000) would be nice.
4. An API would be horrible. I want to analyze the data offline -- all of the data. Bulk transfer of the whole database is the way to go.
5. Of course, how I use the data impacts how I want it delivered. I analyze ALL of it at once, so an API would not help me.
6. JSON would be fine. I'm arguing in 4 and 5 for a bulk transfer method. I don't care about the format after the bulk transfer.
7. Yes. I would love it if you could consider something more open, like one of the CC licenses.

Alex Bigelow

  • 2 Posts
  • 0 Reply Likes
1. I REALLY like the ability to download flat files - they're totally a manageable size, so people should be able to build their own databases / etc if they need features you don't provide.

I agree with the comments to move toward some kind of standard (CSV, JSON, XML, whatever)... the files are tricky to parse in their current formats.

2. Yes and no... I've linked with Rotten Tomatoes data before, and I'm currently looking into linking with MovieLens.

3. I guess it depends on what "primary key" refers to (movies? users? actors? roles? all of the above?)... for my purposes, I'd probably just generate my own IDs anyway, so I don't think it would matter too much. But I'm sure there is some use case where they could be a big help.

4. An API would certainly be nice, but I don't think it's really that necessary - the files aren't that big, so people should be able to build their own query tools. And I'd definitely not want an API if it meant we could no longer download flat files.

5. Definitely! This is a standard rule across all data analysis - your tasks guide the structure you choose. Even with an API, people are going to probably end up doing a lot of custom reshaping anyway.

6. JSON is absolutely sufficient.

7. This page is pretty straightforward: http://www.imdb.com/help/show_leaf?usedatasoftware

Maybe some wording could be clarified, or maybe examples would help. E.g. does "individual personal use" mean that I can use IMDB as a test dataset locally on my machine for a research project, as long as I don't expose the IMDB data to the public? What if my research project is commercial - but I'm selling a system, not the data, and only using the data to demonstrate the system?

This is a really crazy corner case, but information about how to cite IMDB in academic publications would also be useful.

Simo Tuokko

  • 4 Posts
  • 1 Reply Like
1. I find document-formats such as JSON or XML bad for really big data-sizes, assuming the whole file would be one big document.

However, I could go with a format where each line is a separate JSON object: actor, movie, role, etc.

This way it would be easy to parse; you could even do it in parallel. You could handle it with Unix command-line tools.
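A minimal sketch of reading such a one-object-per-line file. The `type` field and the sample records are assumptions made for illustration; no such IMDb file exists in this thread:

```python
import json

# Hypothetical sample: one JSON object per line, each tagged with its type.
SAMPLE_LINES = """\
{"type": "movie", "id": "m1", "title": "Example Movie"}
{"type": "actor", "id": "a1", "name": "Example Actor"}
{"type": "role", "movie": "m1", "actor": "a1", "character": "Hero"}
"""

def parse_lines(text):
    """Parse each non-empty line independently and group records by type."""
    by_type = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        by_type.setdefault(record["type"], []).append(record)
    return by_type

records = parse_lines(SAMPLE_LINES)
print({kind: len(items) for kind, items in records.items()})
```

Because every line is a complete document on its own, a large file can be split on line boundaries and handed to several workers, or filtered first with line-oriented tools.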

3. IDs (and/or primary keys) are a must to link objects of different types to each other. However, I assume IMDB left these out of the free data on purpose, so as not to make the data "too useful". Having the correct IDs would allow you to link your custom app to the IMDB website, for example.

---

I don't think the file sizes as they are now are a problem. Storing the data gzipped is also OK, as you can really easily unpack it just by adding GZIPInputStream (in Java) into the mix. I hope it's as easy in other languages as well.
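For comparison, a minimal Python sketch of the same idea using the standard `gzip` module. The payload is built in memory so the example needs no real file, and the UTF-8 encoding is an assumption rather than a statement about the actual .list.gz files:

```python
import gzip
import io

def count_lines(gz_bytes):
    """Decompress gzip data as a stream and count its text lines."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", encoding="utf-8") as handle:
        return sum(1 for _ in handle)

# Build a small in-memory gzip payload to stand in for a downloaded file.
payload = gzip.compress("first line\nsecond line\nthird line\n".encode("utf-8"))
line_count = count_lines(payload)
print(line_count)
```

For a file on disk, `gzip.open` also accepts a path in place of the in-memory buffer, and the lines are decompressed as they are read rather than all at once.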

Having big files would also make it possible to publish updates as .diff files, so that one wouldn't have to download the big files all over again each time there are changes.

Alex Bigelow

  • 2 Posts
  • 0 Reply Likes
Yeah, JSON is trickier to treat as a stream than CSV... but generally we really only care about that if stuff is "really big"... and IMDB isn't "really big." Put another way, if IMDB doesn't have difficulty dumping it as JSON, we're not going to have much trouble parsing it as JSON.

I'm not saying IDs are a bad idea, of course - it probably would help a lot of things. Including URLs as IDs in the data dump, for example, could really help when it comes to linking external sources (I know at least MovieLens already uses IMDB URLs).

All I meant to say was that IDs aren't important if they're going to take time and effort - if they're easy, then, yes, of course, include them! But if they have to go through and figure out things like "Duchovny, David" == "David Duchovny" in order to give us good IDs, I'd rather they just gave us the messy data (sans IDs) in nicer formats (CSV, JSON, whatever) sooner rather than later.
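A minimal sketch of the trivial part of that matching problem, namely flipping "Last, First" into "First Last". The function name is hypothetical, and it deliberately ignores the hard cases (namesakes, alternate spellings) that make good IDs expensive:

```python
def normalize_name(name):
    """Turn 'Last, First' into 'First Last'; leave other forms unchanged."""
    name = " ".join(name.split())  # collapse repeated whitespace
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        return f"{first} {last}"
    return name

same_person = normalize_name("Duchovny, David") == normalize_name("David Duchovny")
print(same_person)
```

Two different people who share a name would still collide under this scheme, which is where the real time and effort would go.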

OldskoolOrion

  • 2 Posts
  • 0 Reply Likes
JSON combined with a NoSQL makes it far easier.
You both should try it: instead of parsing, you "just" load it and it's ready to analyze and use. These unstructured, complex-looking recordsets are hellish in something as strict as an SQL server; every change in record layout costs you a lot of time and effort.

I'm not saying an RDBMS is worse or better than any NoSQL DB, nor the other way around. I use both a lot - whatever makes my life easier: that's what automating crap is for, right? :)

DavidAH_Ca, Champion

  • 3259 Posts
  • 2880 Reply Likes
3. IDs (and/or primary keys) are a must to link objects of different types to each other. However, I assume IMDB left these out of the free data on purpose, so as not to make the data "too useful". Having the correct IDs would allow you to link your custom app to the IMDB website, for example.
Just an historic note.

At the time the lists were created, IMDb did not actually have ID constants; for many years the Primary Name and Primary Title were used as the identifiers for People and projects respectively. This is where the use of Roman numerals began, as it is essential that each primary key be unique.

The addition of the various key constants is a relatively recent addition, and IMDb has not modified the lists to include them (or any other additions).

AbhiA

  • 1 Post
  • 0 Reply Likes
1. There is no coherent link across the files. For example each entity has an identifier like movies have ttXXXX. But if I look at countries.list, the movies there have no id. So one has to do a name search which is inefficient.

2. Not yet, but might need to unless IMDB can provide easier api and syndicated content

3. I don't think a single data set is the answer though maintaining ids across data sets would be useful. It would be good to have a more API like model.

4. A more queryable api that provides access to data would definitely be more useful.

5. yes

6. json is sufficient.

7. no.
(Edited)
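Point 1 above comes down to the difference between a keyed lookup and a name search. A minimal sketch, assuming (hypothetically) that both files carried the same `id` field; the records below are made up for illustration:

```python
# Hypothetical records from two separate files that share an "id" field.
movies = [
    {"id": "tt0000001", "title": "Example Movie"},
    {"id": "tt0000002", "title": "Another Movie"},
]
countries = [
    {"id": "tt0000002", "country": "Denmark"},
    {"id": "tt0000001", "country": "USA"},
]

def join_on_id(left, right):
    """Merge records that share an id using a dictionary index."""
    index = {row["id"]: dict(row) for row in left}
    for row in right:
        if row["id"] in index:
            index[row["id"]].update(row)
    return index

joined = join_on_id(movies, countries)
print(joined["tt0000002"])
```

With a shared key the join is one dictionary lookup per row; without it, each row needs a title comparison that can fail on formatting differences.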

Mansour Behabadi

  • 2 Posts
  • 3 Reply Likes
OK. I finally decided to roll the sleeves up and do it myself. This script takes some/all of the .list.gz files and converts them to JSON:

https://github.com/oxplot/imdb2json

MattDMo

  • 1 Post
  • 0 Reply Likes
1. What works/doesn’t work for you with the current model?
Getting started from scratch - downloading all the necessary files (which can be quite large), processing them, then importing them into a DB or large data structure like a dataframe before you can even start querying.

2. Do you access entertainment data from other sources in addition to IMDb?
Yes, I've also used OMDb.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
It could be useful in that the processing and importation steps I mentioned above would be minimized/removed altogether, but I can imagine the downloads would be huge, and depending on how it's structured you may be getting tables that you don't want/need.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
Absolutely. Depending on the project I'm working on, I may only need a rather limited set of data per query, and a well-structured and flexible API would be fantastic. The cap rate would need to be reasonable (perhaps a free tier at X requests/sec or whatever, then paid tiers above that), and one would need unlimited access to the data (i.e., if a query returns 5000 results, have the ability to acquire all 5000, perhaps 100 or so at a time).

Personally, I'm much more comfortable working with JSON (preferred) or XML than I am with SQL, so an API would be greatly beneficial to me. On the downside, it would make things like natural language analysis more difficult if the current model were to be discontinued, so I think there's definitely a market for both means of access.
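The "5000 results, 100 at a time" access pattern described above can be sketched as follows. `fetch_page` is a hypothetical stand-in that fabricates results locally; it is not an IMDb endpoint, and the offset/limit parameters are an assumption about how such an API might page:

```python
def fetch_page(offset, limit):
    """Hypothetical stand-in for an API call; returns one page of fake results."""
    total = 5000  # pretend the query matched 5000 titles
    end = min(offset + limit, total)
    return [f"result-{i}" for i in range(offset, end)]

def fetch_all(page_size=100):
    """Collect every result by requesting pages until one comes back short."""
    results = []
    offset = 0
    while True:
        page = fetch_page(offset, page_size)
        results.extend(page)
        if len(page) < page_size:
            break
        offset += page_size
    return results

all_results = fetch_all()
print(len(all_results))
```

Against a rate-limited service the loop would also need a pause between requests; that is left out here because the limits are unknown.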

5. Does how you plan on using the data impact how you want to have it delivered?
Yes, see above. I can imagine use cases where an API would be much more effective, and others where having the full dataset immediately accessible would be beneficial.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
Ideally, the requester should be able to specify which format they'd like the data to be delivered in. It should definitely be RESTful so developers don't need to completely rewrite existing code designed for other services.

7. Are our T&Cs easy for you to understand and follow?
As far as I've experienced, yes.

llamswerdna

  • 1 Post
  • 0 Reply Likes
A straightforward way of getting the data into SQL Server would be great. It would make it much easier to write complex queries against the database.

OldskoolOrion

  • 2 Posts
  • 0 Reply Likes
If you would use a NoSQL instead of insisting on SQL-servers, you wouldn't have such a hard time.
This data is awesome and is made for MongoDB or the likes.

API would be sweet as extra tool, but I'd be ok with direct access to your Mongod, with reading auth :) Just saying hehe :)

gar37bic

  • 1 Post
  • 0 Reply Likes
I'm just starting to look at the situation - we haven't used IMDB prior to this.  We're using Drupal, for which there are some IMDB-related modules, but I haven't tested them.  Since I don't have enough information yet to answer your questions reliably, I'll first describe how I would like to incorporate IMDB data into our existing movie page listings.  (see https://integratedspaceanalytics.com/cms/movies)

First, we are only interested in movies related to space, space exploration, etc.  This might be fiction or documentary, etc.  We already have a substantial database of relevant titles, with pictures and summary information, along with user-provided data.  I would like to add selected data from IMDB (not sure what yet).  The IMDB section of the page would be linked directly to the IMDB page, so people can get further information if they want.

Now, to the questions:

1. What works/doesn’t work for you with the current model?
I don't have any answer for this yet.

2. Do you access entertainment data from other sources in addition to IMDb?
Our existing database has been generated internally, and from our users with some data manually collected from Wikipedia.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
I think we could work either way. Pulling the data once per day puts less load on our servers as well as yours.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
We haven't used the existing FTP data yet.

5. Does how you plan on using the data impact how you want to have it delivered?
Not at this time.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
I believe that either JSON or TTL ('turtle' RDF) would be OK.

7. Are our T&Cs easy for you to understand and follow?
I haven't read them yet! :D  My expectation would be that in addition to the visible citation to IMDB, we would certainly intend to link to the IMDB site, either to the relevant page in IMDB, or if you prefer, to the main IMDB page.  We strongly believe in accurate, reliable reference information, including the date of retrieval.  We generally also cache data we collect, to maintain referential integrity, and in case a remote service is not available.  We would assume/hope that we could continue to publish that cached data under the same constraints as the original.  We certainly appreciate the hard work that IMDB has put in to supplying the data, and want to assure that our users are aware of our sources.

If desired, if you do make an api we would also consider supporting a return data channel, such as potential corrections to your data, reviews, or votes.

Niels G. W. Serup

  • 3 Posts
  • 3 Reply Likes
Hello!

Thanks for the initiative.

I can see that some of the requests have been focused on the particular format of your data.  I suppose this has its importance, but I don't care as long as it's machine-readable.  As a member of a foreign, somewhat small country (Denmark), what I do care about is the availability of data related to non-US usage.

For example, your FTP archive does contain some foreign-language AKA titles, but I could only find German and Italian ones.  Meanwhile, the standard IMDb web interface shows AKA titles for lots of countries.

In general, my request is to include more foreign data in the public archives. :)

--
Niels

ljdoncel, Champion

  • 564 Posts
  • 821 Reply Likes
Hi, Niels:

Ignore the files german-aka-titles.list and italian-aka-titles.list and simply download aka-titles.list. It includes all the AKA titles (Italian and German too...).

Agradable

Niels G. W. Serup

  • 3 Posts
  • 3 Reply Likes
Hi!

Yes, I noticed that list as well.  However, it does not actually contain all the AKA titles present in the IMDb web interface.  All non-English AKA titles that I managed to find in that list are AKA titles of the same language as the film, typically working titles and such.

The list does not seem to contain any translated AKA titles.  For example, the film Sister Act http://www.imdb.com/title/tt0105417/ has many AKA titles on http://www.imdb.com/title/tt0105417/releaseinfo#akas but only German and Italian ones in the aka-titles.list text file.

I'm guessing this might have something to do with not including too much unverified data in the public archives, or maybe just keeping the archives simple.  Still, it's useful data and would be nice to have included. :)

ljdoncel, Champion

  • 564 Posts
  • 821 Reply Likes
it does not actually contain all the AKA titles present in the IMDb web interface
You're right! I hadn't noticed it until now. I've checked an old version of aka-titles.list (dated 20 Nov 2015) and it, too, contained only those two AKAs for Sister Act, so this problem has existed for a long time.
I'm guessing this might have something to do with not including too much unverified data in the public archives, or maybe just keeping the archives simple
As happened once with the movie-links.list file, I hope this is just a situation IMDB isn't aware of, and that it'll be fixed as soon as they read your post above. aka-titles.list is one of the smallest archives in the repository (the current file contains 465,441 AKA titles; for comparison, actors.list includes over 13,000,000 credits), so I don't think simplicity is the issue.

Thank you very much for reporting this problem!

Agradable

Niels G. W. Serup

  • 3 Posts
  • 3 Reply Likes
Thanks for following up so quickly!

Reginald Parker

  • 1 Post
  • 0 Reply Likes
Is there an update on the missing aka-titles? So far nothing has changed.

Mohamed Oun

  • 1 Post
  • 0 Reply Likes
Hello, I have a couple of questions:
1. I'd like to ask if there's a way to get the accurate votes breakdown for each movie, for example like this one: http://www.imdb.com/title/tt0111161/ratings?ref_=tt_ov_rt
I know there's a distribution column in the ratings.list file, but it's not very accurate.
2. Is there a way to turn the files into an SQL database?

Sanjuro Ouaneup

  • 11 Posts
  • 18 Reply Likes
I'm late, sorry:

1. What works/doesn’t work for you with the current model?

The only thing right now that is bothering me is that it seems we can't get the Imdb ID of an item (like tt0050783) from those lists. Am I missing something? That means I can't create links to Imdb simply by using the lists, I need to query the server to find the ID?

2. Do you access entertainment data from other sources in addition to IMDb?

Not really. Wikipedia and, rarely, Mubi.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?

Less useful! Way too much data to handle, most of it useless.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?

If it gave access to Imdb IDs, yes it would. Also useful for making small queries. But keep it simple. In the end, I think you should have both models.

5. Does how you plan on using the data impact how you want to have it delivered?

Naturally.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?

I don't really care about JSON; CSV files like the downloadable ratings list would be fine too.

7. Are our T&Cs easy for you to understand and follow?

I don't think I've checked them, frankly.

Jim Murphy

  • 1 Post
  • 0 Reply Likes
In the T&Cs it makes it clear the interfaces are for non-commercial use only. How about commercial use? I work for a media company that would love to use some of this data for feeding predictive models.  We regularly license data and would love to find out if IMDB licenses data for commercial purposes or we could get permission to use the interface data for commercial purposes. I'd love to talk more with someone from IMDB if you're open to that. Thanks, Jim

Bob Posert

  • 1 Post
  • 0 Reply Likes
First off, here's my use case:
My wife is a huge classic movie buff. She majored in film studies, and has seen literally thousands of movies of the 30's, 40's and early 50's. One entire wall of our bedroom is covered with bookshelves about classic movies, stars, studios, directors, authors, etc. I'm interested in finding her movies to watch from various online sources: streaming, etc. Also to keep track of when they're showing on TCM  etc. So I need to build some kind of list of classic movies, keep track of those that she's seen, and see if they're available.
I've been doing software development for decades, starting with C, and now moving on to more modern languages. I'll probably build a little app or just use Excel.

My preferred solution would be to get the data into Excel, and I could just work with it there.
In particular, this would be _ideal_: Document the query parameters of advanced search (http://www.imdb.com/search/title) and allow the results to be returned in JSON, CSV, etc. 

Here are answers to your questions:
1. What works/doesn’t work for you with the current model?
Difficult to parse. Hard to join across the different files. Not clear if it's a complete dump (appears not to be).

2. Do you access entertainment data from other sources in addition to IMDb?
Don't

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
Much more useful. I could load it into some DB and then access the fields I want, filter, etc. Basically, if I chose to use the current access model, the first thing I would do is build exactly this.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
More useful. I never really need the entire data set, and I'm not doing any BI across it. What I need is to be able to query and see results. An API would provide this.

5. Does how you plan on using the data impact how you want to have it delivered?
I guess so. If I were doing BI (e.g. how do ratings of movies change with respect to the actors' ages) then I would want the entire data set.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
JSON would be fine. XML or CSV would be better for my use case.

7. Are our T&Cs easy for you to understand and follow?
To be brutally honest, I'm just using the data for personal use, so I figure it's fine and didn't read them.
(Edited)

Dan Developer

  • 1 Post
  • 0 Reply Likes

1. What works/doesn't work for you with the current model?

OMG, the comments on this page are up to 3 years old.  When was it you mentioned that you were actually going to deliver on this?

In my opinion, JSON is one of the worst formats for data.  No one can use it except developers.  Sure, they can read it.  What fool idiot is intending to read your JSON files?  Then each file must have its own parsing code because none of your files are the same.  Databases have been established since the early days of computers; certainly there is a real database format for your abundance of data that would be far superior.  A TAB-separated document would be a good example.  Equally simple, readable, and USABLE.  Secondly, the idea of adding all the documentation into the documents is a joke.  Data is data; the documentation should be separate and intelligently structured.  Your ratings.list actually seems like at least three different databases in one, PLUS documentation...  Come on, if you’re going to be so kind as to allow the world to use your data, at least release it in a usable format.  Not many people are going to learn how to develop and write separate code for each .list, just to appreciate your data.  Most of the people on the planet would be happy if they could get their equation to work in Excel, and how do you ever expect them to be able to use your data without tremendous expense?

All the guys that have commented on this structure are all experts and developers.  Combined they may have billions of hours of experience making JSON work, but most people can’t use the garbage.  No XML is not the solution, it too has issues.

IMDB has numbers on every film, why are those not used as keys???

How hard or what disadvantage could a two field TAB DB file be (or any other normal “DataBase” file)???

IMDB_Number Release-Date Done and finished, EASY for the planet to read and USE, not just developers that have spent a long time writing code to parse the data for each .list file (dozens!!!).

Not much, the files are in horrific structure and have no continuity, similarity, or keys.

It is very difficult to read the files as anything other than text on a Macintosh.  JSON fails miserably for data delivery.

It has taken an enormous amount of time to code and set up a way to parse the data so that it is in record form.

It seems impossible ever to be able to use the plot.list and others like it, as it is not in a record form.  IE: Field one, delimiter, field two, etc...

The documents in the plot.list format vary per other similar .list files.  To be able to use the plot.list, you write a lot of code to change it into a usable source.  What a lame waste of time.

Why are the IMDB Ratings embedded half way through the data, AND WITH NO DELIMITERS?  On this file, one must parse the Rating and the Title.  How can such a poor format ever have been effective???
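A minimal sketch of the kind of parsing such a whitespace-aligned line requires. The layout assumed here (distribution code, vote count, rating, then title, separated by runs of spaces) and the sample line are assumptions made for illustration; they are not taken from the ratings.list documentation:

```python
import re

# Assumed layout: distribution code, vote count, rating, then the title,
# separated by runs of spaces. The sample line below is made up.
LINE_PATTERN = re.compile(r"^\s*(\S+)\s+(\d+)\s+(\d+\.\d)\s+(.+?)\s*$")

def parse_rating_line(line):
    """Split one whitespace-aligned line into named fields, or return None."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    distribution, votes, rating, title = match.groups()
    return {
        "distribution": distribution,
        "votes": int(votes),
        "rating": float(rating),
        "title": title,
    }

parsed = parse_rating_line("      0000001222     1234   8.5  Example Movie, The (1999)")
print(parsed)
```

Lines that do not fit the pattern, such as headers or embedded documentation, come back as `None` and can be skipped.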

One contributor mentioned the simple fact that the actors.list and actresses.list are not even continuous data.  They must have a lot of coding, and reformatting into a real format just to use the data.  Why the lousy game?



2. Do you access entertainment data from other sources in addition to IMDb?

I’m starting what I thought was a simple project, and yes other sources provide far more usable API’s and Data.

I assume most people are looking for the easiest and least expensive way to utilize Data.  IMDB sure has solid data, just it’s so poorly formatted!!!

It would be easier and less expensive coding to parse the data out of your web pages rather than read it from the provided .list files!


3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?

Any data format with intelligence would be far better than what has been provided.  See above, you have millions of records.  JSON is an expensive and lousy way to deliver them.

Databases should be reliable and easy to work, yes keys would bring in a lot of accuracy, far more than using the long title names or other serialization method.

Please contact me, I’d be happy to help you set up an intelligent format.  It’s the 21st century, it’s not that hard to have reasonable data sets.


4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?

I don’t care if I use static files or a server API, although a Server would be far easier than re-establishing a new Database to parse from for every update.

The principal idea I hope would be to provide a viable solution for the planet, not just a sub-set of elite developers that can actually figure out how to use your data.


5. Does how you plan on using the data impact how you want to have it delivered?

No; the rules allow no online use. Looking up specific films would be far easier through an API than maintaining an entire field of data files that each need to be individually and specially parsed.

Basic files are fine; it’s the format that is horrific. It takes gigabytes of space to accommodate, let alone the time to process it all. I’m not sure why anyone would want to process thousands of titles over and over for a single request. It seems that most uses would be one title at a time. A server request would be far superior for constant requests of individual records.


6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?

As for files in JSON, no: it is the worst, most horrible format ever. Using the data is very expensive!

Any real database format would be great. I suggest tab-separated rather than comma-separated values, since your titles have commas in them. With your current structure, that seems far easier. Do provide a common key, though.

(Let’s stop the closet dweller who set this nasty thing up from creating CSV files; using comma separators would turn out as poorly as the current system. The titles have commas in them!)
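To illustrate the comma problem described above, here is a minimal Python sketch. The identifiers and the three-column layout are made up for illustration and are not taken from any IMDb file.

```python
import csv
import io

# Hypothetical records: (identifier, title, year). The IDs are invented.
rows = [
    ("tt0000001", "The Good, the Bad and the Ugly", "1966"),
    ("tt0000002", "Crouching Tiger, Hidden Dragon", "2000"),
]

# Naive comma joining: the commas inside the title become extra fields.
naive_line = ",".join(rows[0])
naive_fields = naive_line.split(",")

# Tab joining: each record keeps exactly three fields.
tsv_text = "\n".join("\t".join(r) for r in rows)
parsed = list(
    csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
)

print(len(naive_fields))               # more than the 3 fields that were written
print([len(rec) for rec in parsed])    # field count per tab-separated record
```

A CSV writer that quotes fields would also survive embedded commas, but then every consumer needs a quoting-aware parser. With tabs a plain split is enough, as long as no title contains a tab.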

Actually, I’d like to see it in a JSON reader; I wonder how it looks. Not a reader that I have written special code into, as I’ve already done that. As a usable format, JSON is garbage.


7. Are our T&Cs easy for you to understand and follow?

No, most of what has been provided seems like an insult to humanity, intended to make us hate your data. Basically, it’s a huge “go get lost, everyone” statement. Maybe it’s a developers-only clique, but how about providing something usable?

Many things in our world are easily and intelligently documented and formatted. Why did IMDb go so far out of its way to provide such an expensive, lousy system?

The idea of a daily dump is great, but keeping current would be far easier in an API from a Server.

Daily updates could be delivered in kilobytes, instead of making everyone download gigabytes. Oh, and then you start again from scratch and have to set up all the files again... What an inefficient, complete waste of time!!!

It takes a considerable amount of time to clean up the horrific data, then convert it, then adjust it, etc., every time an update is desired. It would be a nightmare if I tried that daily. And what an expense!!!
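The kilobytes-instead-of-gigabytes idea can be sketched as a keyed delta between two daily snapshots. This is only an illustration under stated assumptions: every record has a stable ID, both snapshots fit in memory as dictionaries, and the names `compute_delta` and `apply_delta` are hypothetical, not anything IMDb provides.

```python
def compute_delta(old, new):
    """Return (changed, removed): records to add or overwrite, and IDs to drop."""
    changed = {k: v for k, v in new.items() if k not in old or old[k] != v}
    removed = sorted(k for k in old if k not in new)
    return changed, removed


def apply_delta(old, changed, removed):
    """Rebuild the new snapshot from the old one plus a delta."""
    updated = dict(old)
    for key in removed:
        updated.pop(key, None)
    updated.update(changed)
    return updated


# Made-up snapshots keyed by an invented ID: (title, rating).
yesterday = {
    "id1": ("Title A", "7.9"),
    "id2": ("Title B", "6.1"),
    "id3": ("Title C", "5.0"),
}
today = {
    "id1": ("Title A", "8.0"),   # rating changed
    "id2": ("Title B", "6.1"),   # unchanged, so not part of the delta
    "id4": ("Title D", "7.2"),   # new record
}

changed, removed = compute_delta(yesterday, today)
rebuilt = apply_delta(yesterday, changed, removed)
print(changed, removed)
```

Only the changed and removed records would need to travel over the wire; a consumer holding yesterday's copy can rebuild today's without reprocessing the full dump.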

How about keeping the documentation separate from the data sets? The idea of adding all the notes in the leading lines is cute, but it is just pollution as far as data goes (although the first-line timestamp could stay, for accuracy).


NOTE: I'm sorry if this seems unappreciative; it really isn't. I have thought a little more about my loud remarks. IMDb has opted to share a great database with the world, but they have limited its value to the smallest group available.

To access the data, you must build arrays etc. or convert it to a "usable form". Database data should not be so hard to use. It's a data set.

Most of the planet appreciates your gift, but has no way to use it without spending a large sum of cash to code something. My perception is that if you're going to share it, then do so. Please don't lock it up in some buffoon format that only skilled, trained developers can use.

The world would love to use your data set, but is likely insulted that it is provided in such a hostile form. Sure, JSON is cool, but it is only used by a very, very small group of humans.

If you're going to gift the planet, how about doing so in something the whole planet can use??? Please think about it. Oh sure, JSON is the best of the best for a very small group of people. I have read many of the replies. I'm glad I'm not the only one who has noticed it's ridiculous and mostly a complete waste of time.

I thought about that 'plot.list' file again, since it must be extracted by array or takes a billion hours to turn into usable data. How obviously nasty it is. The text is even cut into multiple single lines instead of one flowing text field.

PLEASE PLEASE PROVIDE USABLE DATA. WE LOVE YOU (well, at least your data). It would be really beneficial if people could use it.

They say there are few Women in tech, so basically your data can't be used by half of the planet, and not much of the other half.  Why was it made so difficult to use?

If you're going to do something, wouldn't you feel it much more rewarding to do it right and kill this small elitist format and provide something real for the real world?

We would really really appreciate something usable.

The only people who might not like you are those who had to go through the extensive effort of parsing your current data set. But next time, maybe it will take them minutes instead of many expensive hours. I mean, really, was TAB too difficult? The whole planet can use it; it's the same text, just with intelligence.

I hope IMDb hears me and the other not-so-thrilled replies. Thank you.

(One side note: sure, IMDb has done a great job and has a great web site. Thank you very much for all the effort and service. Asking for something reasonably usable for the masses seems a just cause to save the planet. Please consider it SOON.)


Marcel Korpel

  • 5 Posts
  • 5 Reply Likes
First, thanks for your data dump!

As far as I am concerned, I support most of what is already said above. What I am mostly missing in the dumped data are connections (references).

Can you say anything about the changes we can expect and when we can expect them? Thanks again.

A Scott

  • 1 Post
  • 0 Reply Likes
It has been a very long time since this question was originally asked. Is there any progress toward releasing the data in a more usable and sensible format? I fear there is none.

If anyone on this forum is interested I have been developing an application that parses the ratings.list and genres.list into a local XML data file. The files are automatically downloaded and parsed. It allows the user to perform advanced searches of the movie data.

https://github.com/adscott1982/IMDbQuery

Doug Park

  • 1 Post
  • 0 Reply Likes
Hi. I have used IMDb data daily for 3-5 years now. I switched to omdbapi.com to query video information. The key was to search by year and title for movies, or by an index such as "tt0795421". I would get everything about that movie back as quasi-JSON name-value pairs, much like an email header, and could easily parse it with Perl. I downloaded the complete FTP dump and haven't yet found the information in any useful form that lets me query the details for each movie. I'm particularly interested in the movie posters, which I download automatically. This has become more difficult with the recent changes.

They provided a simple API so I could query by index or by title, year, etc.
Here is a sample of what I got back from OMDb for a given index; I found this to be a very useful format:

{"Title":"Citizen Kane","Year":"1941","Rated":"APPROVED","Released":"5 Sep 1941","Runtime":"119 min","Genre":"Drama, Mystery","Director":"Orson Welles","Writer":"Herman J. Mankiewicz (original screen play), Orson Welles (original screen play)","Actors":"Joseph Cotten, Dorothy Comingore, Agnes Moorehead, Ruth Warrick","Plot":"Following the death of a publishing tycoon, news reporters scramble to discover the meaning of his final utterance.","Language":"English","Country":"USA","Awards":"Won 1 Oscar. Another 8 wins & 10 nominations.","Poster":"http://ia.media-imdb.com/images/M/MV5BMTQ2Mjc1MDQwMl5BMl5BanBnXkFtZTcwNzUyOTUyMg@@._V1_SX300.jpg&quo...}
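A minimal Python sketch of how a reply like the one above could be turned into typed fields. The `sample` string is an abridged copy of a few fields from that reply, and `parse_record` is a hypothetical helper name; it assumes the "Year" and "Runtime" values always look like the ones shown, which real replies may not.

```python
import json

# Abridged copy of the reply above (only a few of its fields).
sample = (
    '{"Title":"Citizen Kane","Year":"1941","Runtime":"119 min",'
    '"Genre":"Drama, Mystery",'
    '"Actors":"Joseph Cotten, Dorothy Comingore, Agnes Moorehead, Ruth Warrick"}'
)


def parse_record(text):
    """Convert the string-only JSON reply into numbers and lists."""
    raw = json.loads(text)
    return {
        "title": raw["Title"],
        "year": int(raw["Year"]),
        # "119 min" -> 119; assumes the "<number> min" shape seen above.
        "runtime_minutes": int(raw["Runtime"].split()[0]),
        "genres": [g.strip() for g in raw["Genre"].split(",")],
        "actors": [a.strip() for a in raw["Actors"].split(",")],
    }


record = parse_record(sample)
print(record["title"], record["year"], record["genres"])
```

Every value in the reply is a string, so a consumer has to do this kind of conversion itself before sorting or filtering by year or runtime.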

sv, Official Rep

  • 31 Posts
  • 18 Reply Likes
Thanks to the feedback and suggestions in this thread, we now have an improved and more robust interface for accessing IMDb data. 
  • Datasets are available in Amazon S3 and are refreshed daily
  • IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
  • The files are in tab separated values (TSV) format with column headers
For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.
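As a rough sketch of what tab-separated files with column headers and shared identifiers make possible, the Python below joins two small in-memory tables on their ID column. The column names (`tconst`, `primaryTitle`, `startYear`, `averageRating`, `numVotes`), the sample rows, and the helper names are assumptions for illustration; check www.imdb.com/interfaces for the actual file layout.

```python
import csv
import io

# Stand-ins for two downloaded files. Column names and rows are assumed.
basics_tsv = (
    "tconst\tprimaryTitle\tstartYear\n"
    "tt0000001\tExample Title, Part One\t1999\n"
    "tt0000002\tAnother Example\t2005\n"
)
ratings_tsv = (
    "tconst\taverageRating\tnumVotes\n"
    "tt0000001\t7.5\t1200\n"
)


def read_tsv(text):
    """Read header-led TSV text into a list of dicts keyed by column name."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def join_on_id(basics, ratings, key="tconst"):
    """Left-join ratings onto basics using the shared identifier column."""
    ratings_by_id = {row[key]: row for row in ratings}
    joined = []
    for row in basics:
        merged = dict(row)
        merged.update(ratings_by_id.get(row[key], {}))
        joined.append(merged)
    return joined


joined = join_on_id(read_tsv(basics_tsv), read_tsv(ratings_tsv))
for row in joined:
    print(row)
```

Because the identifier is present in every file, the join needs no title matching, and the comma inside the first title stays in a single field.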

If you have any further questions, please see https://getsatisfaction.com/imdb/topics/imdb-data-now-available-in-amazon-s3

 Thank you for your continued support.