API/Bulk Data Access

  • 19
  • Idea
  • Updated 2 months ago
  • Implemented
Hi!

We’re in the process of reviewing how we make our data available to the outside world with the goal of making it easier for anyone to innovate and answer interesting questions with the data. If you use our current ftp solution to get data [http://www.imdb.com/interfaces] or are thinking about it, we’d love to get your feedback on the current process for accessing data and what we could do to make it easier for you to use in the future. We have some specific questions below, but would be just as happy hearing about how you access and use IMDb data to make a better overall experience.

1. What works/doesn’t work for you with the current model?
2. Do you access entertainment data from other sources in addition to IMDb?
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
5. Does how you plan on using the data impact how you want to have it delivered?
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
7. Are our T&Cs easy for you to understand and follow?


Thanks for your time and feedback!

Regards,

Aaron
IMDb.com

Aaron, Employee

  • 3 Posts
  • 1 Reply Like

Posted 3 years ago


Alan Hall

  • 1 Post
  • 6 Reply Likes
I think it's great that you have a flat-file listing of all the data; however, I want to read it with a Perl program, and parsing your format is not a simple task. It would be much simpler if the fields were delimited by a '|' or something similar, with a code indicating movie, TV show, etc. I am looking at the actor/actress files specifically and haven't looked at the others yet. I don't suppose you already have Perl scripts available that would read the file and insert the data into MySQL?
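In case it helps, here is a rough sketch of that kind of converter in Python (rather than Perl), assuming the layout of the actors/actresses lists: the first credit line carries the name before a run of tabs, and follow-up credits start with tabs only.

```python
def parse_performer_lines(lines):
    """Yield (name, title) pairs from an IMDb performer list.

    Assumes the actors.list/actresses.list layout: the first credit
    of each person carries the name before a run of tabs; follow-up
    credits start with tabs only; a blank line ends the block.
    """
    name = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            name = None           # blank line: next person starts fresh
            continue
        if line.startswith("\t"):
            title = line.strip()  # continuation credit, name carries over
        else:
            name, _, rest = line.partition("\t")
            title = rest.strip()
        if name and title:
            yield name, title

# Pipe-delimit the pairs, as suggested above:
sample = [
    "Hanks, Tom\t\tBig (1988)  [Josh Baskin]",
    "\t\t\tForrest Gump (1994)  [Forrest Gump]",
    "",
]
for who, credit in parse_performer_lines(sample):
    print(who + "|" + credit)
```

From there, inserting each pair into MySQL is a single parameterized INSERT per record.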


Simo Tuokko

  • 4 Posts
  • 1 Reply Like
I just wrote a simple Java program that does just that. It doesn't put the records into a DB, but it outputs a '|'-separated list that can then be MapReduce-processed.

I could clean it up a bit and put it up on GitHub or similar if anyone else is interested?

Robert Ștefan Stănescu

  • 1 Post
  • 0 Reply Likes
Now this is something I would really like to see. If you would be so kind to share it with us... much appreciated.

Laurent Lepinay

  • 2 Posts
  • 0 Reply Likes
I've looked at all the dump files, but I can't find how you figure out whether a title is a film or a TV show. The way I do it right now is by using the running-times.list file: if the running time is over 60 minutes I say it's a film, otherwise it's a TV show. Is there a better way?

DavidAH_Ca, Champion

  • 3249 Posts
  • 2849 Reply Likes
The type is defined by the formatting of the title.

A TV series is wrapped in quotes, e.g. "Fresh Point" (1997).
A TV episode has a complex title, e.g. "Fresh Point" (1997) {Are You Ready for Sex? (#1.79)}

Note that this means that all TV series and episodes sort to the top of the list.

See Submission Guide: Title Formats for the way to recognize the other types.
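A minimal sketch of a classifier following that convention. It covers only the three types discussed here; the Submission Guide's markers such as (TV), (V) and (VG) would be needed for the rest.

```python
def title_type(title):
    """Classify a title line by its formatting, per the convention
    above: series are quoted, episodes add a {...} part. Only the
    three types discussed here are covered; markers like (TV), (V)
    and (VG) distinguish the remaining types."""
    if title.startswith('"'):
        return "tv episode" if "{" in title else "tv series"
    return "movie"

title_type('"Fresh Point" (1997)')                                   # 'tv series'
title_type('"Fresh Point" (1997) {Are You Ready for Sex? (#1.79)}')  # 'tv episode'
title_type('Forrest Gump (1994)')                                    # 'movie'
```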


Laurent Lepinay

  • 2 Posts
  • 0 Reply Likes
Wow, thank you! I was not aware of this guideline. Really helpful.

Steve Marshall

  • 1 Post
  • 0 Reply Likes
As a member of /dev/fort, it's really useful for us that you publish full dumps of all the data: we regularly go to locations without an internet connection to build new things, and we wouldn't be able to use IMDb data without access to a full, local mirror of it.

DavidAH_Ca, Champion

  • 3249 Posts
  • 2852 Reply Likes
1) I usually use a database (MS-Access) to manipulate data.

a) For this to work well I need to have all the data on each record. In some of the lists the primary key is missing from most of the records: e.g. the Actress list omits the name in all but the first record for each person. This requires me to write a routine that steps through the created table, inserting the name into each blank field.
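That fill-down routine is short but pure busywork; a sketch:

```python
def fill_down(rows):
    """Copy the last seen name into records where it was omitted
    (the 'step through the created table' routine described above).
    Each row is a (name, credit) pair; a blank name inherits the
    previous one."""
    last = None
    out = []
    for name, credit in rows:
        if name:
            last = name
        out.append((last, credit))
    return out

fill_down([("Hanks, Tom", "Big (1988)"), ("", "Forrest Gump (1994)")])
```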

b) The supplied records need to have defined fields, possibly as a .csv or .tsv file.  The use of a unique character such as the pipe (|) would be acceptable, although one of the standard formats would be easier to use.

c) The inclusion of the unique keys (t-const, n-const, etc) would be a tremendous help. While the full name can be used as a key, the consts will be easier to use and faster to process.

3) A single full database would be good to have, but it would also mean downloading a large data set even if I was only interested in one aspect.

4) A good API might be even better than the ftps, provided it is reasonably easy to use and the returned dataset meets the above criteria (all data on each record, unique keys, defined fields).

5) As noted above, I want to load this into a database; therefore a flat file with defined fields is the best option for me.

6) I have not worked with JSON and currently have no software to process such a dataset; as noted above a .tsv file would be the most useful to me.

2) I occasionally get data from other sources (such as tv.com) but IMDb is my primary source.

RyanG

  • 8 Posts
  • 5 Reply Likes
1. Once a parser is built for the data files, parsing them basically just works, with the one major exception that there are no primary keys, so there's a lot of babysitting the data over time.
2. Sure. FreeDB, Wikipedia, Rotten Tomatoes, Amazon, among other sources.
3. Primary keys, did you say?
4. I like being able to pull all the data down at once and make minimal calls to the server, but an API in addition would not be unwelcome.
5. Sure.
6. JSON would be fine.
7. T&Cs could use some updating/examples/clarification in this mashup/integration/Web2 era.


Tanner Netterville

  • 1 Post
  • 0 Reply Likes
I agree

Laurie Crist

  • 1 Post
  • 0 Reply Likes
Still agree with this comment.

Dávid Nemeskey

  • 2 Posts
  • 3 Reply Likes
I think the current format leaves a lot to be desired.

1. What works/doesn’t work for you with the current model?
Mostly nothing works:
  • The format is undocumented, difficult to parse, and differs from file to file
  • I checked e.g. actors, but there appears to be no way to get the IMDb URL associated with an actor. Same with movies, etc.
  • If I want to get some information from the DB, finding the right file for it is mostly guesswork. It should be documented which file contains what.
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
Way more useful:
  • Having primary keys is always better. If the data has that, it does not even have to be in a single file.
  • Having everything in-place makes parsing easier.
  • Please use the IMDB URL (e.g. http://www.imdb.com/name/nm0000158/ for Tom Hanks) as the primary key.
  • I am not sure about the importance of a single large file; I think at least people (actors, directors, etc.) and works (movies, series, etc.) should be in separate files.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
They are complementary. APIs are good for occasionally issuing a few queries; bulk data models are good for working with large amounts of data. Having both is the best.

5. Does how you plan on using the data impact how you want to have it delivered?
See above.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
Please. JSON. Everybody can read JSON, it is the obvious choice. The current format mess is not really helpful.

Mansour Behabadi

  • 2 Posts
  • 3 Reply Likes
I agree. JSON is a must. If carefully done, a Unix-friendly flat file is also great (i.e. one record per line, strictly delimited).
Still, JSON takes priority, because it has structure and can be extended without breaking existing parsers. Using JSON also eliminates encoding issues (strings must use UTF-8).

Benjamin D Wakefield

  • 2 Posts
  • 0 Reply Likes
Do you really want a long string as a primary key? That doesn't seem like a good idea, especially if you want to join two datasets: joining on long strings is slower than it needs to be.

Dustin Rodriguez

  • 2 Posts
  • 0 Reply Likes
I forgot about the weird encoding the data files use. UTF-8 would definitely be a very good upgrade. In my applications, one of the first things I do is convert the strings to UTF-8 encoding.

nietaki

  • 2 Posts
  • 0 Reply Likes
I just ran into this issue. Anyone care to share what encoding they're using?

edit: ISO-8859-1 (or 2); the chardet tool on Linux is brilliant :)
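Once the encoding is known, the conversion is a one-liner. Note that every byte sequence decodes under ISO-8859-1, so a wrong guess fails silently rather than raising an error, which is why running detection (e.g. chardet) first is worth it:

```python
def to_utf8(raw: bytes, source_encoding: str = "iso-8859-1") -> bytes:
    """Re-encode a list file's bytes as UTF-8. If the file is really
    ISO-8859-2, accented characters will silently come out wrong,
    so detect the encoding before converting."""
    return raw.decode(source_encoding).encode("utf-8")

to_utf8(b"Ren\xe9e")  # b'Ren\xc3\xa9e'
```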

Marcel Korpel

  • 2 Posts
  • 1 Reply Like
Not URLs but IMDb ids would be nice as primary keys, e.g. "nm0000158". And please describe their structure (characters, numbers, max length, etc.).
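For illustration, a guessed pattern for these ids, based only on observed values like "nm0000158" (names) and "tt0109830" (titles): a two-letter type prefix plus at least seven digits. This is an assumption, not a documented spec, which is exactly why having the structure described officially would help.

```python
import re

# Guessed id shape: two-letter prefix ("nm" for names, "tt" for
# titles) followed by at least seven digits. An assumption from
# observed ids, not an official specification.
IMDB_ID = re.compile(r"^(nm|tt)\d{7,}$")

IMDB_ID.match("nm0000158")   # matches
IMDB_ID.match("nm158")       # no match: too few digits
```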

Shahab Yunus

  • 2 Posts
  • 0 Reply Likes
I will follow up with answers to the questionnaire soon. Meanwhile, I want to know the answer to a very basic question. The introduction to the Alternative Interfaces mentions that a 'subset' of data is provided. What exactly is not being provided?

Are we not getting all of the individual entities, or are we getting all of the IMDb entities but with incomplete data for some or all of them, or both? Thanks.

Dirkmb

  • 1 Post
  • 0 Reply Likes
I'm using your data for a local movie database. For the movie thumbnails I'm using a different server.
Your data is very hard to parse, but I think there was documentation somewhere on the FTP.
First of all, I'm missing a unique ID. I would be glad if you chose to use your ID (the one shown in the URL).
Also, it would be much easier to parse and use your data if you used a standardized format (JSON would be OK, or an SQL dump would be fine, too).
I have a very fast internet connection, so I personally do not care whether you provide one very big file with all the data or several smaller files. But for most users, I think multiple files with different content types would be better.
An API would not improve the current situation for me. Sometimes I'm looking for a particular movie and perform complex search queries, because I only remember a few small details about it. I don't think that would be possible with an API, and a local search would be faster.
Thanks for providing the data. I'm looking forward to your changes.

symmetric

  • 1 Post
  • 0 Reply Likes
1. The format of the current model is a bit tedious to work with:

i) The instructions and data are mixed, which means getting to the machine-readable part of a list requires manual (and brittle) code to skip the header. The archive should include a README file with the instructions, and then a machine-readable file with just the raw data.
ii) Parsing the data is a pain; TSV would be a much preferable format.

2. No
3. Probably more useful, assuming it was easy to manipulate, e.g. in SQLite. Currently, a lot of effort is spent on parsing the dataset.
4. Potentially more useful, as long as it was possible to get the full dataset through the API as well
5. -
6. Yes
7. Yes
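The brittle header-skipping mentioned in (1.i) typically looks something like this sketch. It assumes the data section starts after a divider line made entirely of '=' characters; the real marker varies per file, which is exactly the brittleness being complained about.

```python
def skip_header(lines, marker="="):
    """Yield only the data portion of a list file, assuming the
    prose header ends at the first divider line made entirely of
    the marker character. The real divider varies per file, so
    this has to be tuned per list."""
    in_data = False
    for line in lines:
        if in_data:
            yield line
        elif line.strip() and set(line.strip()) == {marker}:
            in_data = True

list(skip_header(["THE MOVIES LIST", "=========", "row 1", "row 2"]))
```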

Andrew Jackman

  • 1 Post
  • 0 Reply Likes
1. What works/doesn’t work for you with the current model?
Using a common delimiter makes so much more sense for your flat-file dumps. Having a variable number of parameters with a variety of delimiters makes my head hurt and my code slow.
2. Do you access entertainment data from other sources in addition to IMDb?
No.
3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
I personally find using third-party (not internal) primary keys to be brittle.
4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
Right now I'm using the IMDB for instruction and performance testing. If IMDB were to provide a JSON+REST API, it would allow me to teach more data access techniques with the same dataset. Right now I simply don't have an application for an IMDB API.
5. Does how you plan on using the data impact how you want to have it delivered?
For real-time applications, having access to a live IMDb API would be awesome. Caching could be done with an internal policy (possibly influenced by IMDb) instead of simply waiting for the FTP dump to be updated. Depending on load and usage, a live API has the potential to use less bandwidth for both me and IMDb than the FTP dump.
6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
I very much prefer JSON as most languages that I use have (near) standard libraries for parsing JSON. It's less verbose (read: smaller) and easier to read than XML. As a maintainer of such a large, plain text data set, I'd still consider keeping a delimited (CSV or equivalent) version if the size difference was noticeable.
7. Are our T&Cs easy for you to understand and follow?
I've never had a problem with your T&Cs.

William

  • 2 Posts
  • 1 Reply Like
This is sweet music to my ears. At least half a dozen times I have sat down determined to write a modern parser for the IMDb text data in .NET and ultimately given up because (a) the data is infuriatingly hard to work with and (b) it doesn't provide any way of interacting with personalised data such as my watchlist and ratings. As you must be aware there are several unofficial APIs to your data, all of which are either poor quality, unreliable or incomplete. Competitors like the Open Movie Database and Rotten Tomatoes APIs simply can't compete in terms of data quality or completeness, so a modern IMDb API would be a dream come true and I would start using it from launch day.

1. What works/doesn’t work for you with the current model?

Quite frankly the current model is horrendous. The raw data is presented in a format so proprietary and unstructured it's almost as if it was done deliberately to discourage uptake. I appreciate that probably isn't the case and it's just very old. I got relatively far with a parser once but found I was hitting error after error whilst hunting for specific text patterns rather than parsing a known format.

More importantly though, I have no need for or interest in locally storing the entire database. If you were to start presenting the data in say, XML or JSON as flat files I would most definitely consider using it, but I would MUCH prefer an API that allowed me to retrieve individual records or sets of records rather than parsing an enormous volume of data from which I might only ever use a tiny fraction.

2. Do you access entertainment data from other sources in addition to IMDb?

I've investigated using the Rotten Tomatoes API, the Open Movie Database and various unofficial IMDb APIs but none of them really do what I want.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?

Yes, because it would be easier to construct a usable database from. Any modern application consuming this data is going to want to construct a relational or NoSQL database from the data, for which primary keys are essential. However, a single large dataset would be far less useful to me than a random access API.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?

MUCH more useful. Having a complete copy of the database locally may have its advantages for some applications, but it would be so large that it wouldn't make much sense for a mobile app, for example, and would hence necessitate constructing an intermediate database and web service API from the dumped data files, which would be less reliable and less current than an API provided directly by IMDb. Most apps (certainly the kind I have in mind) would want to perform searches and maybe cache a small number of records locally for performance, but have no need for fully offline access.

5. Does how you plan on using the data impact how you want to have it delivered?

Yes. The use case I have in mind right now means I would want to be able to access a user's watchlist and ratings (with their authorisation of course using OAuth or similar) and access key data about the movies that user has expressly registered their interest in through the watchlist and ratings features. That use case would be ruled out by a static data dump.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?

JSON would be my first choice of data format, although I'd argue that a choice of JSON or XML is best practice for a modern web API design. Ideally the API would adhere to REST best practice and employ content negotiation for format selection.

7. Are our T&Cs easy for you to understand and follow?

Yes, no issues in that regard.

Peter Smith

  • 1 Post
  • 0 Reply Likes
I didn't know that the data was even available -- my own use for it would be to hook up to my off-line dictionary; the potential feature is: "type in a word, and see it in the context of movie tag lines and plots".

1. What works/doesn’t work for you with the current model?
Answer: it's a slightly weird-looking format, but nothing that looks too hard to parse.

2. Do you access entertainment data from other sources in addition to IMDb?
Answer: don't know; I have only a potential feature

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
Answer: no

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
Answer: absolutely not.  My dictionary program is strictly off-line for speed.  I would only be interested in bulk data

5. Does how you plan on using the data impact how you want to have it delivered?
Answer: not really; I would probably simply bundle the appropriate text files into my app

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
Answer: JSON would be superior but not essential.

7. Are our T&Cs easy for you to understand and follow?
Answer: no, they are not. There's a set on the T&C page that doesn't really match the conditions in the files themselves. Also, as an FYI: the "email us for T&C" technically only applies if I want express written consent. If I want to use implied consent, then I don't have to email you for alternative terms. But the terms for implied consent are pretty vague.

pvginkel

  • 1 Post
  • 0 Reply Likes
I would really love an SQLite database file. For simple use cases, this would be enough already. For more advanced use cases, creating a script that moves/syncs the data to a "real" database becomes a lot easier because the full relational schema is already there.

Benjamin D Wakefield

  • 2 Posts
  • 0 Reply Likes
There is some good feedback here already. While I have an interest in the IMDb data, I am not currently using it and don't have any specifics in mind.

The best thing you can do is provide a documented REST API with a STRONGLY documented data structure. Be clear with this part and everything else kind of falls into place.

If you want to provide data via FTP, why not offer full/incremental database backups? It doesn't have to be a dump of your full DB, just the fields you want to provide. Then anyone who wants to use it can either load and transform it as needed or just restore the backup and use the DB as is. Maybe consider a data-warehouse-type export? That would let sites that use your data have fast, read/report-style access (since I assume modifying it is not very useful).

Definitely include primary keys! If you have lots of surrogate keys (good) make sure there is some kind of unique identifier for each record.

If I was to build an application making use of the data I would prefer a readonly model that I could refresh from your source as needed. If I wanted to store additional information with the IMDB data I would have a metadata database/table structure that could be referenced via the IMDB PK fields. This way I can refresh that data at will and not lose any of my additional data.

paolosg

  • 1 Post
  • 0 Reply Likes
Dear Aaron and IMDb staff,

I am a researcher at University of Verona, Italy.
With two colleagues from CEA - Saclay in Paris, we are doing scientific research based on IMDb data, with the aim of identifying the determinants of movie success and their evolution over time.

To this purpose, we need to revert the database to some previous state in time, but unfortunately we noticed that some of the diff files are missing or empty (list below).
Is there any way to recover/obtain them?

I would really appreciate your help on this matter, which is important for carrying out our research on the movie industry successfully.
Moreover, this could be an occasion to fix the consistency of a database that, apart from the improvements discussed on this page, is actually great to have publicly available.

I look forward to hearing from you.
Best regards,

Paolo Sgrignoli

---
Missing diff files:
diffs-140207.tar.gz
diffs-140131.tar.gz
diffs-130621.tar.gz
diffs-090313.tar.gz
diffs-060707.tar.gz
diffs-060630.tar.gz
diffs-060623.tar.gz
diffs-060616.tar.gz
diffs-050422.tar.gz
diffs-041022.tar.gz
diffs-040625.tar.gz
diffs-030516.tar.gz
diffs-030103.tar.gz
diffs-021213.tar.gz
diffs-000609.tar.gz
diffs-000602.tar.gz
diffs-981113.tar.gz
diffs-981106.tar.gz

Empty diff files:
diffs-140613.tar.gz
diffs-140214.tar.gz
diffs-100514.tar.gz
diffs-100507.tar.gz
diffs-050819.tar.gz
diffs-050812.tar.gz
diffs-050506.tar.gz

Aaron, Employee

  • 3 Posts
  • 1 Reply Like
Paolo,

Thanks for the feedback. Unfortunately, we no longer keep historical diffs, so we are unable to provide them.

Regards,

Aaron

Keshav Prasad Meda

  • 1 Post
  • 0 Reply Likes
It would be best if the database were designed in a way that it can easily be:
1. Maintained
2. Modified
3. Distributed
4. Accessed

The first three are outside my expertise, but for accessing all the data in the best way, you could have:

Relational databases with primary keys and foreign keys
SAS or Access datasets would be good
All of the data should be accessible by using joins
For example:

A Movies table might have, say:

Movie | Year | Length | Rating | No of Ratings | Cost | Earnings | No of weeks in theaters | Awards | etc.

And an Actors table can have:

Actor | Movie/ (TV Series) | Character name | rating | Awards

A second Actors table can have:

Actor | Age | Sex | No of movies | No of TV shows | No of Awards

Now all of the data can be accessed via joins. This structure avoids redundancy, and so reduces space consumption.
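As a sketch of that join-based access, using sqlite3 with illustrative table and column names (not IMDb's real schema):

```python
import sqlite3

# Illustrative schema only: table and column names are made up.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE movies  (movie_id INTEGER PRIMARY KEY, title TEXT, year INTEGER);
    CREATE TABLE credits (actor TEXT,
                          movie_id INTEGER REFERENCES movies(movie_id),
                          role TEXT);
    INSERT INTO movies  VALUES (1, 'Big', 1988);
    INSERT INTO credits VALUES ('Hanks, Tom', 1, 'Josh Baskin');
""")

# With keys in place, a single join recovers the combined view:
rows = con.execute("""
    SELECT c.actor, m.title, m.year
    FROM credits AS c JOIN movies AS m ON m.movie_id = c.movie_id
""").fetchall()
print(rows)
```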

Dustin Rodriguez

  • 2 Posts
  • 0 Reply Likes
I am very thankful that IMDB makes their data available in bulk format.  I'm a big fan of film and have had fun playing around with the IMDB data.

That being said, the format is really terrible to work with. When I first started using it, I figured it must actually be formatted that way to discourage people from using it. Creating a regular expression to parse even just the title is quite tricky, with so many optional fields and different delimiters that may or may not be present.

Some here have recommended delimited files, but I think that would be a mistake.  While delimited files are perfect for tabular data, many of the files do not contain tabular data but instead blocks of data.  (Such as a set of AKA titles for each movie title)  A much superior solution, which would be usable by everyone very easily, would be to use JSON.  A JSON file can easily include additional fields for some records and fewer for others.  There are also robust, fast parsers for pretty much every language and platform out there.  A big benefit to IMDB would be that changing the format would be very easy.  If you wanted to include a new bit of information for some records, the data could be added without disrupting any existing parsers.  It would just be a new field that they aren't looking for anyway.
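A small illustration of that forward compatibility, with made-up field names (not IMDb's actual schema): a reader that touches only the keys it knows keeps working when a new field appears in later dumps.

```python
import json

# Two records with different optional fields; names are illustrative.
lines = [
    '{"title": "Big", "year": 1988, "akas": ["Quisiera ser grande"]}',
    '{"title": "Forrest Gump", "year": 1994, "budget": 55000000}',
]

# The new "budget" field is simply ignored by a parser that wasn't
# written to look for it; missing "akas" falls back to a default.
titles = [(rec["title"], rec.get("akas", [])) for rec in map(json.loads, lines)]
print(titles)
```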

As some have mentioned, having a sort of "primary key" to cross-relate the files would be extremely useful. IMDb already has such a key, of course, in the ID used in the site's URLs. I imagine that bit of information might be left out of the files to discourage screen scraping, but if so, it's not very effective: it just makes people looking to scrape info from your site hit your search engine before grabbing the movie's page.

I find your Terms of Service very reasonable.  I've never wanted to do anything that conflicts with them for my own playing around. 

Whatever changes are made, I sincerely hope that you do not do away with the ability to get the data in bulk.  Many APIs are designed such that they just service specific requests, but for the things I do at least (such as analyzing the differences between the ratings assigned to movies in different countries) such would be useless.  My worst-case-scenario would be IMDB developing an API, doing away with the bulk data download, and limiting the number of requests per day.  For applications that just let people see information about movies on a personal wishlist or watching for new productions from a favorite director or the like that would be sufficient, but for any statistical analysis it'd be useless.

Henk Jansen

  • 1 Post
  • 0 Reply Likes
1. What works/doesn’t work for you with the current model?
Parsing the 50 files is cumbersome and error-prone due to the different formats and errors in the data. It's also a pain to recombine the information into the separate entities, persons and movies.

Having the data is great though.

2. Do you access entertainment data from other sources in addition to IMDb?
No.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
See my answer to 1. Merging the information from separate files is annoying because there are about half a dozen different file formats. Having just one big data set would make all of these problems go away.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
No, I prefer having the data in one download. APIs bring usage restrictions and force me to use it in a very specific and as such limited way and I want to be free in how I query the data.

5. Does how you plan on using the data impact how you want to have it delivered?
Kind of. Like I said in answer 4, having a data set I can download is preferable to an API. I want to be able to run queries and combine the IMDb data with other data sets.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
JSON would be completely sufficient.

7. Are our T&Cs easy for you to understand and follow?
Yes.

Andreas Born

  • 1 Post
  • 0 Reply Likes
Hello IMDb Team,

at first, I'd like to thank you very much for making the data available at all! I really appreciate it, because I know how much work it is to create and maintain such a project.

But as you already know, the current data format is really hard to handle for most uses. I am an IT developer and entrepreneur, creating and running databases with hundreds of millions of entries. From my point of view, your kind of data is best suited for use within SQL databases, containing primary keys and foreign-key relations. That way, an SQL dump would be awesome for anyone who uses SQL for querying the data.

But let me first answer your questions:

1. What works/doesn’t work for you with the current model?
A: The current model is not uniform and not scalable; it's hard to build an index on top of it, which kills performance. Complex searches across different fields are not possible, simply because there are no (delimited) fields. To use this data in a convenient manner, I have to convert it to a standard database format that allows joining datasets using primary- and foreign-key relations and performing logarithmically scaled searches on B-tree indexes for best performance. But even a simple conversion to SQL is not made easy. Even more problematic: information is stored not only within the data itself, but also within the data structure, which is, at the least, an absolute no-go when dealing with data structures. NEVER store information within your structure!

2. Do you access entertainment data from other sources in addition to IMDb?
A: I am considering it, yes. But this requires a clear type-to-subdata attribution. That means there should be a strong relationship model between a data entry and its object entity. So a name should be a film name OR a series' episode name, but not both. However, I know that changing a data model is not as easy as just defining a different data format. I guess we have to live with that.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
A: I'm not really sure I understand this question. Primary keys are most relevant, yes, but "a single large dataset"? A single SQL dump would be okay, because it contains several tables.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
A: Less useful, because an API offers only limited capability for querying the data. It also decreases performance on large-scale queries, and it's OS- and language-dependent.

5. Does how you plan on using the data impact how you want to have it delivered?
A: It depends. ;-) I always transform all source data into an overall standard structure like CSV, SQL, RDF, or XML. This standard structure can then be optimized for the way it will be accessed. Therefore, it would, of course, be nice to have this data already stored in such a standard format. But I don't really mind which standard format you choose. I cannot think of any situation in which anybody profits from the data being stored in a nonstandard format like the current one. Some projects may require nonstandard formats, but the probability that any of them matches a nonstandard format like yours is rather low.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
A: It would still be much better than the current format, because it can easily be loaded and converted to SQL and so on. But JSON lacks an explicit format declaration; SQL, XML and RDF have one. And JSON tends to reach a complexity in its depth that cannot easily be guessed. On the positive side, there are a lot of JSON parsers that make it easy to load the data, and the charset is UTF-8 throughout, so it's well suited to "gather" all relevant information, but not to "find" data using complex searches. JSON is optimal for websites using JavaScript/ECMA and Ajax to asynchronously query data from the web server; it's fine for serving data small enough to fit into memory, even a thousand times at once. JSON is quite useful for data exchange when the data format is well known, but not for database dumps. In any case, users need a way to query data without loading large data files into memory. With JSON, you'd need to load all the data or transfer it to another format. But when you need a conversion anyway, why not offer the data in the target format in the first place?

7. Are our T&Cs easy for you to understand and follow?
A: Yes, so far. But I might contact you next year to clarify some special issues. You should probably think about updating your T&Cs for Web 2.0 use cases. We are currently working internally on Web 3.0 features, and one major problem is that many old T&Cs are barely Web 2.0 compliant, and so not Web 3.0 compliant at all.

----

I've got some questions, too. Are the flat files really the ones you work with manually, e.g. when adding and modifying data? Or do you already use a database for that? If so, which kind? At the very least your web platform will use a database for performance reasons, so why not offer the data in exactly the format you use yourselves?

Nowadays data exchange formats should be well defined and standards compliant. That ensures all data can be imported into totally different DBMSs without disturbing the data model itself, or even data integrity. XML is only suitable for small amounts of data, and SQL dumps naturally target only SQL-type databases. Whether SQL is the best format comes down to the question of whether there is anyone who doesn't use SQL; let's assume there is, and move on.

RDF is a subset of XML but more complex than plain XML: it introduces the possibility of storing and building up a data graph. A data graph is, simply put, a set containing two types of elements: nodes and edges. Each node is an object, and each edge is a relation between two objects. The difference from standard XML, and even SQL, is that it allows storing objects and relations that need not be predefined; a graph can record data without modifying the data model. It would be great for IMDb-like data, but I suspect it is too complicated for simple usage.

So CSV and JSON remain. To avoid language dependence entirely, I would prefer CSV over JSON: UTF-8 charset, predefined field names, field delimiters, and field enclosures with C-style escapes. Field names should be unique across different files, each file should contain one dataset per line, and each dataset should contain a numeric primary key. If these primary keys are unique across all files, users could still set up a graph on top of the data. Using JSON instead of CSV would be fine if it still keeps one dataset per line, but packing all data into one JSON object has no advantage over standard CSV in my opinion. I would rather not have all data in one file, except for an SQL dump, which could come in three or four files (films, series, actors/actresses, other data). For CSV and JSON it might be better to offer one file per table.
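To illustrate the CSV proposal above: with a header row of unique field names and a numeric primary key per row, a consumer can index every table by key in a few lines. The file layout and field names here are invented for illustration, and Python's `csv` module handles quoting with doubled quotes (RFC 4180 style) rather than the C-style escapes proposed, so treat this as a sketch of the idea rather than the exact format:

```python
import csv
import io

# Hypothetical films table: UTF-8, header row, numeric primary key.
films_csv = io.StringIO(
    'film_id,title,year\n'
    '1,"Amélie",2001\n'
    '2,"The ""Example"" Film",1999\n'  # doubled quotes escape a quote
)

films = {}
for row in csv.DictReader(films_csv):
    films[int(row["film_id"])] = row   # index each dataset by its key

print(films[2]["title"])  # -> The "Example" Film
```

If the same `film_id` values appeared in the other tables as well, those tables could be joined, or a graph built on top, without ever comparing title strings.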

If I can be of any help, please contact me using my email address imdb -at- abotech.de.

If I fully understand your data model and format, I might be able to provide server space and different formats for download and/or access as a contribution to your project. Of course, per your T&Cs, this would require your explicit permission first.


Best regards,
Andy

PS: sorry for my bad English :)
Pierre Kan
Hi Andy,

Do you have the IMDb data available as RDF files? I'm working on a project, and this format would save me a lot of time!

Many thanks for your help in advance,
Fred
William
Any chance of an update on this from IMDb staff? The question was posted months ago now, and it would be nice to know what if anything you are working on in this area.
nietaki
1. What works/doesn’t work for you with the current model?

The data is somewhat structured, but the format isn't documented and is very difficult to parse. Having any structured format would be helpful, especially correctly delimited CSV, since the files might be a little too big for more sophisticated formats.

2. Do you access entertainment data from other sources in addition to IMDb?

I'm using filmweb.pl data and looking into Rotten Tomatoes.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?

While I'm not sure whether a single data file would be more helpful, having primary keys for each of the categories (movies, directors, actors, genres), used consistently across all the files, would be very useful. Performing joins on titles that are strings tens of characters long is inefficient and problematic. IMDb already uses compact ids in its URLs (say, tt0330243), so there shouldn't be any problem with exposing the ids.
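A toy sketch of the point above, with made-up ids and titles: once every file shares the same numeric key, a join is a cheap lookup, whereas joining on titles means normalising and comparing long, ambiguous strings for every row.

```python
# Hypothetical tables keyed by a shared numeric movie id.
movies = {330243: "Example Title"}       # movie_id -> title
ratings = [(330243, 7.4)]                # (movie_id, rating)

# Joining on the integer key is a constant-time dict lookup per row...
joined = [(movies[mid], score) for mid, score in ratings]

# ...whereas a title-based join has to reconcile variants like
# "The Matrix (1999)" vs "Matrix, The (1999)" before it can match at all.
print(joined)  # -> [('Example Title', 7.4)]
```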

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?

It would complement the current bulk-data approach. An API would be good for lookups and smaller operations, but would be taxing to use for statistical/bulk purposes. You cannot replace bulk access with an API; on the other hand, it's easy to build an API when you have access to flat files, so I wouldn't axe them.

5. Does how you plan on using the data impact how you want to have it delivered?

Yes, but different use-cases would call for very similar formats.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?

JSON and flat CSV files would be sufficient.

7. Are our T&Cs easy for you to understand and follow?

They aren't too easy to find.
Robbie
Hi, I have been using the flat files off and on for a long time and here are my comments:

1. What works/doesn’t work for you with the current model?
I think what you have works for the most part; I use Perl, and any textual data is pretty easy to read. My number one big, big, big complaint is that there is no mapping of titles and names to the unique IMDb id number, so there is no (supported) way to create a link back to the IMDb page. When I create an application, I generally use the downloaded data to show some subset of information, but I still want to create a link back to the IMDb page. I don't want to recreate the entire IMDb page, and some information, like the user's own personal data (title rating, lists the title is on, etc.), just isn't available unless you go back to the IMDb site.
Another small nit: the files are not very well documented.
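For context on why the id mapping matters so much: IMDb title pages are addressed by the "tt" ids already visible in the site's URLs, so if the flat files exposed that id for each record, linking back would be a one-liner. A minimal sketch (the URL pattern is taken from IMDb's public site URLs; the example id is the one mentioned elsewhere in this thread):

```python
def imdb_title_url(tt_id: str) -> str:
    """Build a link back to an IMDb title page from its id, e.g. 'tt0330243'."""
    return f"https://www.imdb.com/title/{tt_id}/"

print(imdb_title_url("tt0330243"))  # -> https://www.imdb.com/title/tt0330243/
```

Without the id in the downloaded data, an application has no supported way to construct this link, which is exactly the complaint above.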

2. Do you access entertainment data from other sources in addition to IMDb?
Currently using TCM. I was using the Netflix API until they shut it down.

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
Either one. A large data set is appropriate for showing lots of data; smaller data sets are appropriate for showing small subsets of data. It depends on the application.

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?
I think an API would be more efficient: it is better to query for just what you want than to keep downloading blocks of data you may largely never use.

5. Does how you plan on using the data impact how you want to have it delivered?
No

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?
JSON would be ok.

7. Are our T&Cs easy for you to understand and follow?
Yes
mx dog
1. What works/doesn't work for you with the current model?
  
Updating with diff files, or completely reloading, is a pain, to say the least. The IMDb title number should be the number one index across all tables, to at least pretend the database is relational. What does work is the availability. In my perfect world I would just be able to load a title set and query it from your servers, but I do understand what a load that would put on you, so a local copy of the data works, even if it is a bit cumbersome to maintain.

2. Do you access entertainment data from other sources in addition to IMDb?
 
Not usually; you do have the best database, but TV shows could use some work, and segregation of foreign films could use some work (from a US perspective).

3. Would a single large data set with primary keys be more or less useful to you than the current access model? Why?
YES. Ease of maintenance and simplicity; again, that word: relational!

4. Would an API that provides access to IMDb data be more or less useful to you than the current access model? Why?

YES. Direct access to your databases would let me pick and choose the titles and fields I want instead of wading through it all. For instance, my personal collection has about 2,600 movies and TV episodes and 20,400 albums and songs in many different formats. I am after a monolithic database to store the info on those (yes, I know there are a ton of them out there) and populate it with the data I choose, so querying for just those titles and that information programmatically would make life much easier. It would probably also free up a ton of load on your end by eliminating the scraping that most people and apps rely on now; just think of the tens of millions of searches and page loads that would go away with a simple API. And I would think your actual views would stay about the same: when I look up a movie I'm watching, that traffic would remain constant.

5. Does how you plan on using the data impact how you want to have it delivered?
YES. I am a small user, so occasional online access would be perfect: relatively small data sets, with an occasional larger set, as in an initial population.

6. Is JSON format sufficient for your use cases (current or future) or would additional format options be useful? Why?

JSON is fine, but XML and delimited text would be useful too, just because of the common tools I have available and for ease of hand editing if needed, bearing in mind I'm using this data for a personal database and not to populate web pages.

7. Are our T&Cs easy for you to understand and follow?

Hehe, they must not be, because I'm not sure what T&Cs are.

Thanks for keeping this mostly open. You are my first choice when it comes to movie and TV information, and you have done a great job through the years.
Dries
I work for QlikView, and I often build my own models at home with real-life data that's not boring like bank or mine production data. I've been thinking about building a QlikView app for some time now, and I can't explain how delighted I was to find out that the data is actually available outside of the website and mobile app. I'm going to start now, and I will keep you updated on my progress and on how I find your current data structure.

It's awesome that you are willing to share everyone's hard work and passion.

Watch this space...
Hank Storeo
Hi,

I work with a few BI tools, and it's fantastic that this data is available. That being said, at least as a start, it would be extremely helpful to delimit the data. It's technically possible to do a lot of parsing on this structure, but it's not really sustainable, and simple delimiters would remove a lot of the complexity. At least as a first iteration, it would start to move this project forward! :)