Incomplete datasets?

  • Question
  • Updated 7 months ago
  • Answered
Hi,

I downloaded the IMDb datasets as of 30th March 2019, timestamped around noon GMT.
I understand title.basics to be the backbone of the sets, listing all titles.
But I notice that in this version the last one is tt9916896, when I am positive that IMDb crossed the 10,000,000 threshold in February, or maybe March.

Is IMDb aware of this?

Vincent Fournols

  • 2901 Posts
  • 4891 Reply Likes

Posted 7 months ago


ljdoncel, Champion

  • 844 Posts
  • 1791 Reply Likes
Hey, Vincent:

Note that although tconsts contain a numerical part, they are sorted alphabetically...
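
The effect of alphabetical ordering can be illustrated with a minimal Python sketch. It uses three IDs that appear in this thread and compares plain string sorting with sorting on the numeric part. The helper name `tconst_number` is made up for the example, and the list is not meant to represent the real title.basics file.

```python
# Three tconsts mentioned in this thread, used purely as sample values.
tconsts = ["tt9916896", "tt10136648", "tt2404814"]

def tconst_number(tconst: str) -> int:
    """Return the numeric part of a tconst such as 'tt9916896' (hypothetical helper)."""
    return int(tconst[2:])

# Plain string sort: characters are compared one by one, so '9...' sorts after '1...'.
alphabetical = sorted(tconsts)

# Numeric sort: compare the integer values instead of the text.
numerical = sorted(tconsts, key=tconst_number)

print("last when sorted as text:   ", alphabetical[-1])
print("last when sorted as numbers:", numerical[-1])
```

Sorted as text, the seven-digit tt9916896 ends up last even though an eight-digit ID has the larger number, which is why the final row of the file is not necessarily the newest title.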





MAthePA

  • 2063 Posts
  • 3494 Reply Likes
grazie mille

Vincent Fournols

Silly of me, and now that I think of it, you had already pointed that out!
Nevertheless, it seems there is a big gap between 9,916,896 and 10,000,000.
Anyway, muchas gracias ;)

ljdoncel, Champion

"...there is a big gap between 9,916,896 and 10,000,000"
You're right. Currently that's the largest gap (84,103 blank pages) between two tconsts.

As of this afternoon, the highest tconst was tt10136648, whereas there were 5,769,583 unique titles, so around 43% of the tconsts are unoccupied. That's only partially true, because some of them can actually redirect to a higher tconst after a merge, so they are not really "empty". However, since all titles added after a certain date some years ago have even tt numbers (from tt2404814 onwards), most of them are in fact blank title pages. Moreover, the most frequent distance between two tconsts is 2 (gap=1; 56.7%), while only 2,212,421 tconsts are consecutive to the previous one (gap=0; 38.3%).
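
For anyone who wants to reproduce this kind of bookkeeping, here is a rough Python sketch. It assumes that a gap is the count of unused numbers strictly between two neighbouring tconsts, and that the unoccupied share is one minus the number of titles divided by the highest number. The function name `gap_stats` and the five-item sample are invented for illustration; they are not the script or the data behind the figures above.

```python
from collections import Counter

def gap_stats(tconsts):
    """Summarise gaps between numerically sorted tconsts (illustrative helper)."""
    numbers = sorted({int(t[2:]) for t in tconsts})
    # gap = how many unused numbers lie strictly between two neighbours
    gaps = [hi - lo - 1 for lo, hi in zip(numbers, numbers[1:])]
    # Assumption: numbering starts at 1, so the highest number is the slot count.
    unoccupied_share = 1 - len(numbers) / numbers[-1]
    return Counter(gaps), unoccupied_share

# Tiny made-up sample, not real IMDb data.
sample = ["tt0000001", "tt0000002", "tt0000004", "tt0000006", "tt0000010"]
gap_counts, share = gap_stats(sample)
print(gap_counts)
print(share)

# The same share formula applied to the two totals quoted in the post.
thread_share = 1 - 5_769_583 / 10_136_648
print(f"{thread_share:.1%}")
```

On the real data, the input would be the tconst column of title.basics rather than a hand-written list.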

The even-only numbering also explains why gaps larger than 1 between tconsts (the result of delete/merge processes) are mostly odd numbers (even[hi] - even[lo] - 1 = odd), as can be seen in the following graph (note the log-10 scale on the Y-axis):

[Graph of gap sizes between tconsts; Y-axis on a log-10 scale]

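
The parity argument can be checked with a short Python sketch: for any two even numbers, the count of integers strictly between them is odd. The range below is an arbitrary run of even numbers starting at the number of tt2404814, chosen only as test values, and `gap` is a hypothetical helper.

```python
def gap(lo: int, hi: int) -> int:
    """Count of integers strictly between lo and hi."""
    return hi - lo - 1

# Fifty consecutive even numbers; arbitrary test values, not IMDb data.
evens = range(2404814, 2404914, 2)

# Every ordered pair of distinct even numbers from that run.
pairs = [(lo, hi) for lo in evens for hi in evens if hi > lo]

# even - even is even, and subtracting 1 makes it odd.
all_odd = all(gap(lo, hi) % 2 == 1 for lo, hi in pairs)
print(all_odd)
```

An even gap can therefore only appear when at least one of the two neighbouring tconsts is odd, which is consistent with even gaps being rare among newer titles.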

Owen Rees

  • 223 Posts
  • 334 Reply Likes
The gap leading up to the switch to eight-digit numbers may be because the sequence that allocates numbers was updated manually at a time when the relevant staff would be around to make sure there were no issues when the tconsts became longer. I suspect that the software team and the data editors had a carefully planned event where the sequence was updated and then the flow of incoming titles was watched very carefully to ensure that everything was working correctly.