Typos in actor-list data

  • Question
  • Updated 4 years ago
  • Answered
Hello, I am practicing big-data programming and used your Plain Text Data files as training data.

I found the following typos in the actors.list data (the same typos are visible on imdb.com as well):

---

# Extra "]" inside character name
Ablow, Keith|"The O'Reilly Factor" (1996) {(2008-06-06)} [Himself]] <6>

# Loose "25"; possibly a billing position missing its <>
Brown, Tyler Michael|The Wedding Ringer (2015) 25 [Batting Cage Kid]

# Loose "director of photography", missing either () or []
Case, Robert|Drug Wars (2014) director of photography

# Extra "]" inside character name
Gahan, Oscar|The Mysterious Pilot (1937) (uncredited) [Lumberjack Gorman]/Musician]

# Extra "]" inside character name
Lagner, Pavel|Knoflíkári (1997) [Zákazník/úchyl na ulici c 1]] <12>

# Loose "Thomas Dupont", missing either () or []
Wickman, Billy|Seventh Son (2014/I) Thomas Dupont [Young Guard] <47>


---
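
Stray text like the cases above can be found mechanically. What follows is only a minimal sketch, assuming the pipe-joined layout of the sample lines in this post (Name|Title (Year) {episode} (comment) [character] <billing>); the function name `loose_text` and both patterns are illustrative, not part of any IMDb tooling. It removes every well-formed attribute that follows the year and reports whatever is left over.

```python
import re

# Assumed layout, inferred only from the sample lines in this thread:
#   Name|Title (Year) {episode} (comment) [character] <billing>
YEAR = re.compile(r"\(\d{4}(?:/[IVXLC]+)?\)")
# One well-formed attribute: {...}, (...), [...] or <digits>.
TOKEN = re.compile(r"\{[^{}]*\}|\([^()]*\)|\[[^\[\]]*\]|<\d+>")


def loose_text(line: str) -> str:
    """Return whatever follows the year that is not a well-formed attribute."""
    role = line.partition("|")[2] or line
    year = YEAR.search(role)
    if year is None:
        return role.strip()  # no recognizable (year): flag the whole role
    rest = role[year.end():]
    # Blank out every well-formed attribute, then collapse the whitespace.
    return " ".join(TOKEN.sub(" ", rest).split())


samples = [
    'Ablow, Keith|"The O\'Reilly Factor" (1996) {(2008-06-06)} [Himself]] <6>',
    "Case, Robert|Drug Wars (2014) director of photography",
    "Gahan, Oscar|The Mysterious Pilot (1937) (uncredited) [Lumberjack Gorman]/Musician]",
    "Wickman, Billy|Seventh Son (2014/I) Thomas Dupont [Young Guard] <47>",
]
for sample in samples:
    problem = loose_text(sample)
    if problem:
        print(f"{problem!r} in: {sample}")
```

An empty return value means nothing suspicious was found after the year. The check does not look inside the title itself, so a title that happens to contain a parenthesized four-digit number would need extra handling.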

Thank you.

PS. The reason I had to parse the data in the first place is that the actors.list format is very cumbersome and not suitable for MapReduce processing, for example. I would suggest a CSV-like format such as:

ActorName | Title | Year/distinguishNo | Type[M,V,E,TV,VG] | EpisodeName | Season.EpisodeNo | CharacterName(s) | Comment | BillingPosition

Duplicating the actor name on each role would make the file a bit larger, but the improvement in processing performance (and developer comfort) would be ten-fold!
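
To make the suggested layout concrete, here is a rough sketch of turning one well-formed line into the nine proposed fields. It assumes the pipe-joined input seen in the samples in this thread, not the raw actors.list file. The names `ROLE`, `EPISODE` and `to_row` are made up for illustration, and the Type rule (a (V), (TV) or (VG) marker if present, else E when an episode is given, TV for a quoted title, M otherwise) is a guess at what the codes mean.

```python
import re
from typing import List, Optional

# Assumed input layout (from the samples in this thread, not an IMDb spec):
#   Name|Title (Year) (type) {episode} (comment) [character] <billing>
ROLE = re.compile(r"""
    ^(?P<title>.+?)\s+
    \((?P<year>\d{4}(?:/[IVXLC]+)?)\)          # (1997) or (2014/I)
    (?:\s+\((?P<type>V|TV|VG)\))?              # optional type marker
    (?:\s+\{(?P<episode>[^{}]*)\})?            # optional {episode}
    (?P<comments>(?:\s+\([^()]*\))*)           # zero or more (comments)
    (?:\s+\[(?P<character>[^\[\]]*)\])?        # optional [character]
    (?:\s+<(?P<billing>\d+)>)?                 # optional <billing>
    \s*$
    """, re.VERBOSE)
# Splits "Full Exposure (#2.14)" into a name and a season.episode number.
EPISODE = re.compile(r"^(?P<name>.*?)\s*(?:\(#(?P<number>\d+\.\d+)\))?$")


def to_row(line: str) -> Optional[List[str]]:
    """Return the nine proposed fields, or None if the line is malformed."""
    name, sep, role = line.partition("|")
    m = ROLE.match(role) if sep else None
    if m is None:
        return None
    title = m["title"]
    ep = EPISODE.match(m["episode"] or "")
    # Guessed meaning of the Type codes; see the note above the block.
    kind = m["type"] or ("E" if m["episode"] else "TV" if title.startswith('"') else "M")
    comments = re.findall(r"\(([^()]*)\)", m["comments"])
    return [name, title.strip('"'), m["year"], kind, ep["name"],
            ep["number"] or "", m["character"] or "",
            "; ".join(comments), m["billing"] or ""]


row = to_row('Ambrosavage, Mary|"The Ethical Slut" (2013) {Full Exposure (#2.14)} [Principal Keller]')
print(" | ".join(row))
```

Lines with stray text, such as the typos listed above, come back as None rather than being guessed at. For real output, `csv.writer` with `delimiter="|"` would be the safer choice, because it quotes any field that itself contains the delimiter.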

Simo Tuokko

Posted 4 years ago

Dan Dassow, Champion

Hi Simo Tuokko,

Thank you for reporting the problem.

The data set format dates back to the early 1990s when file size was a significant concern.

CSV format would be nice and, as you point out, duplicating the actor name would make the files much easier to parse.

It is really up to IMDb to determine whether these data sets see enough use to justify the cost of reformatting the data as you suggest. If IMDb does reformat the data, they may wish to retain the original format for legacy users of the data.

Note: I'm not an IMDb employee or staff member.

Simo Tuokko

I have now processed actresses.list as well, and it had the following typos:

# Extra "]" inside character name
Ambrosavage, Mary|"The Ethical Slut" (2013) {Full Exposure (#2.14)} [Principal Keller]]

# Loose "1985-1992"
Hall, London|"One Life to Live" (1968) 1985-1992 [waitress/model/hotel patron]

# Loose "Guest"
Jordan, Bianca|"Joonas Hytönen show" (1999) {(2000-04-26)} Guest [Herself]

# Extra ")"
Sigmund, Monika|Asphalt (1951) ) [Helli] <28>

DavidAH_Ca, Champion

The lists are duplicates of the data in the main database, so any error on the main pages will be reflected in the lists. The data should be corrected via the main Update system (starting with the Edit page button on the Name or Title page). Once the corrections are accepted, they should be reflected in the lists; however, I am not sure how often the lists are updated, so there might be a delay before this happens.

As Dan Dassow noted, this is a very old layout, created for conditions that no longer obtain, so IMDb is looking at changing the format of these lists to make them easier to use. You might want to check out API/Bulk Data Access and leave a reply.