What file encoding is used for the plain text files?

  • Question
  • Updated 7 years ago
  • Answered
Archived and Closed

This conversation is no longer open for comments or replies and is no longer visible to community members. The community moderator provided the following reason for archiving: No longer relevant

I'm attempting to read the MPAA ratings reasons text file (mpaa-ratings-reasons.list) using Python 3.3. Python 3.x needs to know which encoding the file is in so it can convert the bytes to its internal representation. I tried the UTF-8 and windows-1255 encodings, but both give tracebacks when reading some lines (they fail on different lines, but both fail).
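For anyone trying to narrow this down, here is a minimal sketch of how one might list the lines that fail to decode under a candidate encoding. The helper name `find_undecodable_lines` and the sample bytes are made up for illustration; they are not taken from the IMDb file.

```python
def find_undecodable_lines(raw_bytes, encoding):
    """Return (line_number, error_message) pairs for lines that fail to decode."""
    failures = []
    # Split on raw bytes first, so one bad line does not abort the whole read.
    for number, raw_line in enumerate(raw_bytes.splitlines(), start=1):
        try:
            raw_line.decode(encoding)
        except UnicodeDecodeError as exc:
            failures.append((number, str(exc)))
    return failures


# Illustrative sample only: byte 0xE9 is "é" in ISO-8859-1,
# but on its own it is not a valid UTF-8 sequence.
sample = b"plain ascii\ncaf\xe9\n"
utf8_failures = find_undecodable_lines(sample, "utf-8")
latin1_failures = find_undecodable_lines(sample, "iso-8859-1")
```

With the real file, you would read it in binary mode (`open(path, "rb").read()`) and pass the bytes in, then compare which lines fail under each encoding you suspect.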

What encoding were the files created with?

Thanks!


Dan Stromberg


Posted 7 years ago


DavidAH_Ca, Champion

I can't say for sure, but when IMDb started it supported only 7-bit ASCII, and it later upgraded to Latin-1.

As they stated in the 1996 Help Add Full manual:
Character Sets
--------------
At the present time, the IMDb only supports
7-bit ASCII owing to limitations in the mail
system and other complications. Please do not
submit extended 8 bit characters as they will
unfortunately be silently discarded by the
mail-server. Instead, please use whatever the
local convention is for mapping extended
characters back to 7-bit ASCII (eg: u umlaut
-> ue in German). We will be rolling out full
support for the ISO-Latin-1 character set
later in the year.
I have never heard of any change from that, so I expect that it is still the character set for most of the database. (There are exceptions for a couple of the AKA titles, e.g. Russian titles.)




Dan Stromberg

The ratings reasons file has characters that are not 7-bit ASCII. I can read the file using the ISO-8859-1 encoding, though.
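For reference, a minimal sketch of reading a file that way in Python 3. The helper name `read_latin1_lines` and the temporary sample file are assumptions for illustration only; the sample text is made up and does not reproduce the layout of mpaa-ratings-reasons.list.

```python
import os
import tempfile


def read_latin1_lines(path):
    """Return the lines of a text file decoded as ISO-8859-1, without newlines."""
    with open(path, encoding="iso-8859-1") as handle:
        return [line.rstrip("\n") for line in handle]


# Build a small stand-in file containing one non-ASCII Latin-1 character.
sample = "Amélie (2001)\nsecond line\n".encode("iso-8859-1")
with tempfile.NamedTemporaryFile(delete=False, suffix=".list") as tmp:
    tmp.write(sample)
demo_path = tmp.name

demo_lines = read_latin1_lines(demo_path)
os.remove(demo_path)
```

One caveat: ISO-8859-1 maps every possible byte value to a character, so decoding with it never raises an error. A clean read shows only that nothing failed, not that the file was actually written as Latin-1, so it is worth spot-checking a few accented names in the result.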

DavidAH_Ca, Champion

I phrased the final paragraph badly. By "any change from that" I meant "any change from the ISO-Latin-1 character set" that they were about to update to.

They did change to Latin-1 but have not, to the best of my knowledge, moved on from there.

I understand that part of the problem is that the system uses some characters (like the pipe, "|") as controls, and there are worries that some UTF-8 characters might cause serious problems as the data goes through the various sub-systems. IMDb is in the midst of rewriting the various sub-systems; I expect (and hope) that they are making sure that the new programs will support Unicode (or UTF-8), but I don't think we will see any significant change until they have all sub-systems switched over.

(I rather hope I am shown to be wrong on that final statement.)

