How can I get the IMDB data dump with unique IDs, or in another format like XML?

  • Question
  • Updated 7 years ago
  • Answered
Archived and Closed

This conversation is no longer open for comments or replies and is no longer visible to community members. The community moderator provided the following reason for archiving: Old thread

Hi folks, I just noticed the IMDB data dump from this page:

http://www.imdb.com/interfaces

Very cool. The plain-text format I downloaded is neat, but it has no unique identifiers for movies, so how can I link movie X to the actors in it?

I was curious if IMDB would be willing to produce these dumps as XML rather than an old-school plain-text format, and if they would include some form of unique identifier for each movie/tv/person so I can link them together for my own analysis. Right now the data is nice but it's incomplete.

MS

  • 3 Posts
  • 0 Reply Likes
  • frustrated

Posted 7 years ago


Dan Dassow, Champion

  • 16294 Posts
  • 18160 Reply Likes
The Oracle of Bacon [http://oracleofbacon.org/help.php] uses the text files mentioned in http://www.imdb.com/interfaces . The Oracle uses the data from the Internet Movie Database and can give you the shortest path from every actor and actress that can be connected to Kevin Bacon.

You may wish to contact Patrick Reynolds, who rebuilt the Oracle in 1999 and has maintained it since.
http://oracleofbacon.org/contact.php

The author of the Java Movie Database (JMDB) recently posted regarding this data. He may also be able to help you.
https://getsatisfaction.com/imdb/topi...
https://getsatisfaction.com/people/ju...

MS

  • 3 Posts
  • 0 Reply Likes
How do they get around the lack of primary keys or any form of unique identifiers? It seems it would be rather easy to dump this data as a simple XML output, rather than providing almost all of the data but not the essential metadata to link the raw data.

DavidAH_Ca, Champion

  • 3263 Posts
  • 2925 Reply Likes
For years, the Name (for persons) and the Title (for Films) were the unique keys. I believe that much of the system still uses these.

That is why every primary Name and every primary Title must be unique, which is the reason for the Roman numerals to distinguish people with the same Name (or Titles with the same name and year).

It was only a few years ago that the n-consts and t-consts were introduced, and I believe that IMDb has not updated the plain-text file formats since well before that time. And since IMDb currently has a large number of outstanding issues, I rather doubt that they will be willing to divert manpower to changing these any time soon.

(These are my personal comments, and not IMDb policy.)
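
To illustrate the point about Roman numerals: the sketch below treats the full name or title string as the key and only splits off the disambiguator for inspection. The exact shapes "Bacon, Kevin (I)" and "Crash (2004/I)" are assumptions about how the plain-text files write these keys, and the helper names are hypothetical, so check them against the actual files before relying on this.

```python
import re

# Assumed key shapes (not confirmed by this thread):
#   person key: "Bacon, Kevin (I)" -> optional Roman numeral in parentheses
#   title key:  "Crash (2004/I)"   -> year, optional "/numeral", in parentheses
NAME_RE = re.compile(r"^(?P<base>.*?)(?:\s+\((?P<numeral>[IVXLCDM]+)\))?$")
TITLE_RE = re.compile(
    r"^(?P<base>.*)\s+\((?P<year>\d{4})(?:/(?P<numeral>[IVXLCDM]+))?\)$"
)


def split_name_key(key):
    """Return (base name, numeral or None), or None if the key does not parse."""
    match = NAME_RE.match(key.strip())
    if match is None:
        return None
    return match.group("base"), match.group("numeral")


def split_title_key(key):
    """Return (base title, year, numeral or None), or None without a year suffix."""
    match = TITLE_RE.match(key.strip())
    if match is None:
        return None
    return match.group("base"), int(match.group("year")), match.group("numeral")


if __name__ == "__main__":
    print(split_name_key("Bacon, Kevin (I)"))
    print(split_title_key("Crash (2004/I)"))
```

When linking records, compare the whole string, numeral included. The split is only useful for display or for finding people who share a base name.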

MS

  • 3 Posts
  • 0 Reply Likes
If they used (name,title) tuples that seems like a fair (but antiquated) form of unique identification. I understand they are busy, but seriously this is the 21st century and they are putting out plain-text dumps of a beautifully well-structured database? I can't complain too much because it's very nice of them to offer the data set, but as someone who deals with data all of the time, it's our duty as computer scientists to offer up data in clean, developer-friendly formats. XML is super easy to build and is the de facto standard of data transmission on the internet (along with JSON I guess).
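
For anyone who wants XML today, a conversion can be done locally once the credits have been parsed into (person, title) pairs. This is only a minimal sketch: the element names and the credits_to_xml helper are made up for illustration and are not an IMDb schema, and the parsing of the plain-text files is assumed to have happened already.

```python
import xml.etree.ElementTree as ET


def credits_to_xml(credits):
    """Build an XML string from (person key, title key) pairs.

    The element names are illustrative only, not an official schema.
    """
    root = ET.Element("credits")
    for person, title in credits:
        credit = ET.SubElement(root, "credit")
        ET.SubElement(credit, "person").text = person
        ET.SubElement(credit, "title").text = title
    # encoding="unicode" returns a str instead of bytes
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(credits_to_xml([("Bacon, Kevin (I)", "Apollo 13 (1995)")]))
```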

Dan Dassow, Champion

  • 16294 Posts
  • 18160 Reply Likes
From what I recall, the plain text files correspond to Col Needham's (IMDb founder and CEO) files that he created for the USENET group rec.arts.films. Col also wrote scripts to parse these plain text files before the IMDb web site was established. As DavidAH_Ca points out, the name and title fields are unique keys for their respective data tables. The Oracle of Bacon is built to a large degree with the assumption that these two fields are unique.

Col is welcome to correct my recollection if I am in error.
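
As a rough illustration of how far unique name and title strings get you, here is a sketch that links people through shared titles and finds a shortest connection with a breadth-first search. It is not the Oracle of Bacon's actual implementation; the function names are hypothetical, and the input is assumed to be (person key, title key) pairs already parsed out of the plain-text files.

```python
from collections import defaultdict, deque


def build_index(credits):
    """Index (person key, title key) pairs in both directions."""
    cast = defaultdict(set)         # title key -> person keys
    filmography = defaultdict(set)  # person key -> title keys
    for person, title in credits:
        cast[title].add(person)
        filmography[person].add(title)
    return cast, filmography


def shortest_path(credits, start, goal):
    """Return [person, title, person, ...] from start to goal, or None."""
    cast, filmography = build_index(credits)
    if start not in filmography or goal not in filmography:
        return None
    # previous[person] = (earlier person, shared title); None marks the start.
    previous = {start: None}
    queue = deque([start])
    while queue:
        person = queue.popleft()
        if person == goal:
            break
        for title in filmography[person]:
            for costar in cast[title]:
                if costar not in previous:
                    previous[costar] = (person, title)
                    queue.append(costar)
    if goal not in previous:
        return None
    # Walk back from the goal, then reverse into start-to-goal order.
    path = [goal]
    step = previous[goal]
    while step is not None:
        person, title = step
        path.extend([title, person])
        step = previous[person]
    path.reverse()
    return path
```

The only join condition here is string equality on the name and title keys, which is why the approach depends on every primary name and title being unique.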
