IMDb flat database files could be made more database-friendly

  • Idea
  • Updated 1 year ago
  • Implemented
Hi, I have just started to use (for personal use only) the IMDb database files that can be downloaded via FTP. I really appreciate this service, because it lets users keep a copy of the IMDb data files on their own PC at home. That makes it possible to search the data with custom keywords, which helps in making better selections. I am now in the process of building a relational database for the IMDb data in MS Access.

@Giancarlo Cairella:
I have one complaint, however, of a technical kind. Some of the files are really hard to load into a database, notably the actors and actresses flat files. Including some extra data in these files, such as line numbers and an actor/actress ID on every line, would be very helpful in preparing them for a relational DBMS. It is possible to insert line numbers in front of each line with a VBScript operating on the file, but inserting actor/actress IDs is a lot more challenging.

On close inspection, there is some ambiguity in these files about the identity of the actors/actresses. Even though great care seems to have been taken to ensure that each name is unique in the database and belongs to exactly one person, in several cases identical names still appear to belong to different people. Conversely, in some cases the same person seems to be listed under slightly different names.

I also noticed that there are blank lines in the files. In the actors and actresses files, for instance, I take these blank lines to be separators between two different actors/actresses, but I doubt that the separator is always correctly placed: in several instances I found one that clearly was not. Correcting all of these mistakes would be time-consuming and not always possible. There are many such ambiguities, which makes it very hard to create fully normalized tables.
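To illustrate the workaround described above (numbering the lines and assigning one ID per blank-line-separated block), here is a minimal Python sketch. It rests on assumptions that are not confirmed by the files themselves: that each block starts with a line of the form `Name<TAB>Credit`, that continuation lines start with a tab, and that a blank line ends the block. The names `PersonRecord` and `parse_blocks` are hypothetical, and the IDs are synthetic counters, not official IMDb identifiers.

```python
from dataclasses import dataclass, field


@dataclass
class PersonRecord:
    person_id: int   # synthetic counter, NOT an official IMDb identifier
    name: str        # empty if the block started with a continuation line
    first_line: int  # 1-based line number where the block starts
    credits: list = field(default_factory=list)  # (line_no, credit) pairs


def parse_blocks(lines):
    """Group blank-line-separated blocks into records with synthetic IDs.

    Assumed layout (hypothetical): 'Name<TAB>Credit' opens a block,
    lines starting with a tab continue it, a blank line closes it.
    """
    records = []
    current = None
    for line_no, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line.strip():
            current = None  # a blank line closes the current block
            continue
        name, _, credit = line.partition("\t")
        credit = credit.strip()
        if current is None:
            # A new block; an empty name here hints at a misplaced separator.
            current = PersonRecord(len(records) + 1, name.strip(), line_no)
            records.append(current)
        if credit:
            current.credits.append((line_no, credit))
    return records
```

A record whose `name` is empty started with a continuation line, which is one way to spot the misplaced separators mentioned above. The sketch does nothing about identical names belonging to different people; that cannot be resolved from the block structure alone.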

Also, these two files in particular seem to have been laid out for reading rather than for use in a database system. Yet documents like these aren't very practical to read either, as they contain millions of lines.

So in the end, I'm very happy to be allowed to use these data files for my personal use, but if the files had been given a more database-friendly format, that would have made me even happier.
Ron Springwater

Posted 6 years ago

sv, Official Rep

The new datasets that are available in S3 include the IMDb identifiers for titles and persons. You can now uniquely identify the entities and construct links back to IMDb. For more details, see http://www.imdb.com/interfaces
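As a small illustration of constructing links from identifiers, here is a hedged Python sketch. It assumes that title identifiers start with `tt` and person identifiers with `nm`, and that pages live under `/title/` and `/name/` on www.imdb.com; the helper name `imdb_url` is hypothetical and not part of any IMDb tooling.

```python
def imdb_url(identifier: str) -> str:
    """Build a link back to IMDb from an identifier.

    Assumption: 'tt...' marks a title and 'nm...' marks a person.
    """
    if identifier.startswith("tt"):
        return f"https://www.imdb.com/title/{identifier}/"
    if identifier.startswith("nm"):
        return f"https://www.imdb.com/name/{identifier}/"
    raise ValueError(f"unrecognized identifier prefix: {identifier!r}")
```

Because the identifier is stable, it can also serve as the primary key in a relational table, which avoids relying on names to tell people apart.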