IMDb Data – Now available in Amazon S3

  • Announcement
  • Updated 3 hours ago
  • (Edited)
This is an announcement for customers of the IMDb bulk data available via FTP.

We are pleased to announce that, starting today, IMDb datasets are available in Amazon S3. Using the new interface, customers can bulk-access IMDb title and name data. This more robust and reliable solution will replace the IMDb FTP sites, which will be retired on November 7, 2017 September 10, 2017.


For details on the S3 solution, file format and access guidelines, see www.imdb.com/interfaces.

In our continued effort to best serve our Contributors, we are streamlining the datasets and making them available in a more useful and structured format in S3. Notably:

  • Data refresh frequency is now daily (previously weekly). 
  • IMDb title and name identifiers are included in all the files for ease of matching and linking back to IMDb.
  • The files are in tab separated values (TSV) format.
  • The sets of data we provide are updated to only include the essential ones that help with matching and linking to an IMDb title or name.
  • To access the IMDb S3 buckets, users will need an Amazon AWS account.
As part of housekeeping on the FTP site, we are deleting files we no longer update and changing the directory structure. This will take effect next week. The list data files will be available at two new locations (see below) until November 7, 2017. Please note that the directory structure is subject to change without notice. We strongly encourage FTP site users to switch to the S3 solution as soon as possible to ensure their applications continue to work without interruption.

 If you are not an IMDb Contributor and wish to obtain IMDb content for commercial use, we offer a content license.  The license grants you access to our content via an XML web service, plus the right to use the content in your product or service.  If that interests you, please email licensing@imdb.com.

 If you have any questions or concerns, please share your feedback in this thread.

 Thank you for your continued support.

sv, Official Rep

  • 25 Posts
  • 11 Reply Likes

Posted 3 months ago


dgranger

  • 863 Posts
  • 197 Reply Likes
In English, what does that mean? Does that mean that the IMDB is going to totally pay per use? Does that mean that the regular free site is shutting down?
(Edited)

sv, Official Rep

  • 25 Posts
  • 11 Reply Likes
No, this announcement is for users of the IMDb ftp data only.  No other IMDb services are affected. 

The FTP sites will be retired on Sept 10, 2017. The S3 bucket that replaces the FTP sites is configured as requester-pays: the requester accessing data from this bucket is responsible for the data transfer and request costs. For details on the charges, please refer to https://aws.amazon.com/s3/pricing/.

After Sept 10th, the AWS S3 bucket will be the only location for bulk access to IMDb datasets for non-commercial use.
(Edited)

dgranger

  • 863 Posts
  • 197 Reply Likes
So this will be a pay per use site. Not a good idea. 
(Edited)

sv, Official Rep

  • 25 Posts
  • 11 Reply Likes
No, IMDb remains a free site as always.

Just to clarify, the requester-pays configuration is only for bulk accessing IMDb datasets from S3 and not for the www.imdb.com website.
(Edited)

matiss.zz

  • 2 Posts
  • 0 Reply Likes
Any idea when alternative titles will be available as a dataset on S3, like the aka-titles.list.gz on the old ftp servers? I don't see it on S3 and the current aka-titles.list file doesn't appear to be complete if I compare the data in that file with data on the actual IMDb site.

ddb

  • 2 Posts
  • 2 Reply Likes
I don't see any file that shows either all cast for a title or all titles a cast member is in. I need full xref like the old actors/actresses.list files.

sv, Official Rep

  • 24 Posts
  • 9 Reply Likes
The datasets in S3 are focused on the contributor usecases for matching names and titles and linking back to IMDb. The S3 tables title.principals and title.crew provide the principal cast and crew information respectively for all the titles. The name.basics table has the known-for titles for all the names along with their top-3 professions.

To help us better understand your usecase, please share details on how you use/plan to use cast & crew data.
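
As an illustration of the matching use case described above, here is a minimal Python sketch that turns title-to-principals rows into lookups in both directions. It assumes each row carries a title identifier and a comma-separated string of name identifiers, with `\N` meaning "no value"; that layout and the sample identifiers are assumptions, not a confirmed description of title.principals. It can only cross-reference the principal cast that the file lists, not full credits.

```python
from collections import defaultdict


def build_cross_reference(rows):
    """Map each title to its listed names and each name to its titles.

    `rows` is an iterable of (tconst, principal_cast) pairs, where
    principal_cast is a comma-separated string of name identifiers
    (assumed layout); the two-character marker \\N means "no value".
    """
    cast_by_title = {}
    titles_by_name = defaultdict(list)
    for tconst, principal_cast in rows:
        names = [] if principal_cast == "\\N" else principal_cast.split(",")
        cast_by_title[tconst] = names
        for nconst in names:
            titles_by_name[nconst].append(tconst)
    return cast_by_title, dict(titles_by_name)


# Made-up identifiers, for illustration only.
sample = [
    ("tt0000001", "nm0000001,nm0000002"),
    ("tt0000002", "nm0000002"),
    ("tt0000003", "\\N"),
]
cast_by_title, titles_by_name = build_cross_reference(sample)
```

Looking up `titles_by_name["nm0000002"]` then gives every sample title that lists that name among its principals.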

ddb

  • 2 Posts
  • 2 Reply Likes
My usecase is for tracking my personal collection (currently about 35,000 entries) and wish lists.  Since these files have been available for personal use for years, I normally download the actors/actresses.list once a year. Each one of these I then process into 4 files (movie title / TV episode / cast with all their movie titles [S3 not available] / cast with all their TV episodes [S3 not available]).  I track my collection on the first 2 lists, keeping track of what type of media it is on, the alternative title if mine is different than imdb's, etc.  Then I programmatically update this info on the much larger second 2 lists.  If at some point I decide to add actress Jane Doe to my wish list, anything she was in that I already have is already marked for me.   About once a year, I find that I need to start adding my new titles manually so it is then time to download again.   It takes a couple weeks after each download to verify changes, etc., to get good updated lists (you'd probably be surprised how many times you change release year on some titles, etc.).  This has been very convenient for my purposes.

 The S3 files appear to contain all titles and all cast but no way to get more than a subset of the xref between them (i.e., no way to list all the cast for a title, or all the titles a cast member is in).  At this point it looks like I may as well take my latest file set, strip it down to only those entries I actually own, and copy/paste updates in as my collection grows.  Like some of the other posts, I just thought going to the cloud would mean more data available, not less.  Not really interested in learning enough java or something to just get  a few downloads once in a while either.  Thanks for all the help you've been to me through the years.

matiss.zz

  • 2 Posts
  • 0 Reply Likes
This reply was created from a merged topic originally titled Getting an up-to-date alternate title dataset, like the old aka-titles.list.gz.

I'm trying to get the full and up-to-date dataset for alternative movie titles. I first tried the archive from the ftp servers - aka-titles.list.gz - but some titles appear to not have all the information.

Example:
http://www.imdb.com/title/tt0090557/releaseinfo?ref_=tt_dt_dt#akas

That has 17 alt. titles, while aka-titles.list contains only 3.

I figured the new S3 datasets would be better, but those don't even have alt. titles available, just the most basic movie information.

Any idea how this could be resolved?

sv, Official Rep

  • 25 Posts
  • 11 Reply Likes
The alternative versions of titles in S3 are limited to the primary title and original title in title.basics table. 

To help us better understand your usecase, please share details on how you use/plan to use the alternate titles data. 

clcdpc

  • 2 Posts
  • 3 Reply Likes
Will there be any way of accessing the certificates.list file? It doesn't appear to be part of the current S3 bucket.

sv, Official Rep

  • 24 Posts
  • 9 Reply Likes
Certificates data is not part of the S3 dataset. Could you please share details on how you use this data?

clcdpc

  • 2 Posts
  • 3 Reply Likes
We're using that information in a tool that helps public libraries better classify their material when they make it available for the public to use. This information helps streamline our workflows and also allows the parents we serve to make more informed decisions.

JW

  • 1 Post
  • 3 Reply Likes
You must be joking. Some sort of convoluted user-registration and user-pays is somehow going to be better than ftp? Seems very stupid and totally unnecessary if you ask me. And AWS would have to be the most convoluted and non-user-friendly system anyone has ever invented.
(Edited)

Marcel Korpel

  • 5 Posts
  • 5 Reply Likes
Thinking about this for a day, now.

So, in short, it seems that there is less data (that is easier to parse, I assume) and you actually have to pay and jump through hoops setting up (an) account(s)* with AWS and S3 to be able to access the bulk data using an API I have to learn to use. If I am correct, you have to pay for the bandwidth, so are there even diff files provided to lessen that burden, like on the FTP sites?

All in all, I am sad to say this doesn't sound like an improvement to me.

* Using a credit card, which is not that common in several countries in the world (in the Netherlands, for instance, debit cards are far more common)

sv, Official Rep

  • 24 Posts
  • 9 Reply Likes
Accessing the S3 buckets will require an AWS account and an AWS SDK or the Command Line Interface. We have provided a couple of sample codes for accessing the S3 bucket using the AWS Java SDK - http://www.imdb.com/interfaces.

If a simpler GUI-based access is required, there are some third-party S3 clients that provide that as well. 

With the requester-pays S3 bucket, users pay for the data transfer and network cost. That said, the files hosted on the S3 bucket are gzipped and the largest dataset as of today is 151MB. 

As part of the AWS Free Usage Tier, new AWS customers currently receive 5 GB of Amazon S3 storage in the Standard Storage class, 20,000 GET requests, 2,000 PUT requests, and 15 GB of data transfer out each month for one year. So if your request pattern falls within those limits, you are unlikely to incur any data transfer or request costs for the first year. After that, according to the Amazon S3 pricing chart, the first 1 GB per month of data transfer out to the Internet is free, and standard GET requests cost $0.01 per 10,000 requests.
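
To make the arithmetic above concrete, here is a rough Python sketch for estimating a month of requester-pays usage. The request price default is the figure quoted in this post; the total download size, the file count and especially the per-GB transfer price are placeholders chosen for illustration, so check the pricing page before relying on any result.

```python
def estimate_monthly_cost(total_mb_per_refresh, files_per_refresh,
                          refreshes_per_month, free_transfer_gb,
                          price_per_gb, price_per_10k_requests=0.01):
    """Estimate monthly transfer volume and cost for a download schedule."""
    transfer_gb = total_mb_per_refresh * refreshes_per_month / 1024
    # Only the volume above the free allowance is billed.
    billable_gb = max(0.0, transfer_gb - free_transfer_gb)
    requests = files_per_refresh * refreshes_per_month
    request_cost = requests / 10_000 * price_per_10k_requests
    return {
        "transfer_gb": transfer_gb,
        "billable_gb": billable_gb,
        "requests": requests,
        "cost_usd": billable_gb * price_per_gb + request_cost,
    }


# Hypothetical schedule: 5 files totalling 500 MB, fetched once a month,
# with 1 GB of free transfer. The 0.10 per-GB price is a placeholder,
# not a quoted AWS price.
monthly = estimate_monthly_cost(500, 5, 1, free_transfer_gb=1, price_per_gb=0.10)

# The same files fetched every day for 30 days.
daily = estimate_monthly_cost(500, 5, 30, free_transfer_gb=1, price_per_gb=0.10)
```

Under these assumed numbers the monthly schedule stays inside the free transfer allowance, while the daily one does not; changing `free_transfer_gb` to 15 models the first-year free tier described above.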

Hope this helps address your concerns. 

Gardner von Holt

  • 9 Posts
  • 14 Reply Likes
My use case is similar to some above.  I have perhaps two to three times a year refreshed a series of about 8 files (actor, actress, movie, director, aka titles, genres, mpaa, and taglines), so that I would have complete cast, title, director, and genre data for my personal movie collection.

While I accept that things change over the years, the main thing that I would desire to have restored is complete actor and actress credits for future movies.

I have never used the data for anything other than my personal movie collection, and despite being a software developer have never offered this software publicly to date.

I have invested significantly over the years in developing my internal movie database oriented software to enhance my enjoyment of movies, and have likewise for many years bought most of my movies from amazon.com, .de, or .co.uk.  If I gave in on this I would be reducing years of development of my movie tracking to wasted effort.

You can measure my length of use of the data and imdb to my userid (gvh), one I could no longer acquire today, and that I have had an account with amazon for about 20 years.

Additionally, I believe Amazon made representations when you purchased IMDB to continue to make it available to the public, and have for many many years provided this data for non-commercial purposes.  I have invested very significantly in this software and am frustrated to see access withdrawn to data that has always been there.

What would make me happy? Any of the following:

* Ability to download full data for movies you purchased from any amazon business, or view on amazon prime movies (I get data for those movies I pay for). I'm totally OK with limiting my access to those movies I have a commercial relationship with amazon about.

* Some API  (rate limited would be ok) to return the full cast for an individual movie
This would allow me to add new entries for newly purchased movies, a vast majority of which I buy from amazon.  And limiting me to a handful of queries a day or week would be acceptable.  (I get 5 queries a day plus 5 queries for each movie I purchase at amazon, for example)

* Full data limited to recently released movies and dvds (so that when I buy a movie I can add new data). I rarely purchase new movies that are back catalog.  Almost everything is purchased when the dvd goes open for sale, so any clever way of limiting me to newly released movies would solve much of this need.

* I'll even pay a token amount per set of queries if you want some way to guard against being spammed with accounts.  Or you can charge my AWS account if that's possible.

S3 as a data source and tabbed text are fine for me, as would be any sort of API or method you could offer. I'll write whatever software is necessary to access the data, and as I said, this is only necessary for new titles, so I'm totally OK with rate limiting my access (given that there is some way to test and develop with sample data or something in a non-rate-limited way).

Thanks, and I can be reached at the email associated with this account for further discussion

Vincent Fournols

  • 5 Posts
  • 4 Reply Likes
I am in a very similar situation to the use cases presented above: more than 20 years of contributing to the IMDb under "fourvin", and twice-yearly downloads of IMDb data to update my personal DB through a self-developed interface in VBA (I cannot invest in new languages and technologies anymore).

But no way to restrict it only to recent releases and DVD: I watch a lot of movies from on-demand catalogues, whatever year they were created, and I wish to get the full data for the targeted films or people around the movies I have seen.

I also feel cheated, having contributed freely to the IMDb, to see access to this data (e.g. aka-titles) made unavailable.

Andrew Gallant

  • 5 Posts
  • 2 Reply Likes
Thanks for doing this! Parsing TSV files will be much easier than parsing the old format.

I did run across one small technical issue. In `title.basics.tsv`, there are a few records that appear to be malformed. For example:

tt2222222	tvEpisode	"Hollywood Regency Meets Country Club Chic	"Hollywood Regency Meets Country Club Chic	0	2011	\N	\N	Reality-TV

In this record, the `primaryTitle` and `originalTitle` fields appear to begin with a double quote, but there is no corresponding closing quote. The actual name of the title does start with a double quote, so I think the correct format would be:

tt2222222	tvEpisode	"""Hollywood Regency Meets Country Club Chic"	"""Hollywood Regency Meets Country Club Chic"	0	2011	\N	\N	Reality-TV

Since the CSV/TSV format escapes quotes by doubling them.

Thanks!
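
Until the quoting is fixed, one possible workaround is to read the file with quote processing switched off, so a leading double quote is kept as an ordinary character. The sketch below uses Python's standard `csv` module and the record from this post; treating `\N` as a missing value is an assumption based on how that marker appears in the sample record.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text without any quote processing.

    Double quotes are kept as literal characters, and the two-character
    marker \\N is converted to None (assumed to mean "no value").
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    return [[None if field == "\\N" else field for field in row]
            for row in reader]


# The record from the post above, rebuilt with explicit tab separators.
title = '"Hollywood Regency Meets Country Club Chic'
record = "\t".join(["tt2222222", "tvEpisode", title, title,
                    "0", "2011", "\\N", "\\N", "Reality-TV"])
rows = read_tsv(record)
```

With `csv.QUOTE_NONE` the record splits into nine fields and both title fields keep their leading quote. A reader that applies the usual CSV quoting rules may instead treat that quote as the start of a quoted field and merge neighbouring columns.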

sv, Official Rep

  • 24 Posts
  • 9 Reply Likes
Thanks for reporting the issue. We are looking into it.

Ron

  • 57 Posts
  • 7 Reply Likes
I am not a "customer of the IMDb bulk data", but as a contributor, I do download the raw data sets on occasion in order to parse through them and submit corrections for obvious errors.

For example, mis-spellings in release date attributes.  I take it this data is disappearing forever, given the very limited imdb-datasets items shown on the interfaces page.

Perhaps it might be better to list what data is remaining available, and what data is going away?
Thanks.

Marcel Korpel

  • 5 Posts
  • 5 Reply Likes
I agree, the old datasets contained way more data. E.g., I'd like to do statistical analyses with technical data, let's say aspect ratio used through the years; or list movies with alternative runtimes, or several uses of color (b/w and color in one movie).

There were so many other (statistical) use cases I had with the bulk data and it's sad that those are no longer possible.

Col Needham, Official Rep

  • 3814 Posts
  • 1324 Reply Likes
For the release date attribute (and similar cases), the attribute field browser inside the submissions interface may help. 

All the live attributes can be browsed and filtered via https://contribute.imdb.com/updates/field/release_dates/attr 

Ron

  • 57 Posts
  • 7 Reply Likes
Yes, that is where the issues are first noticed (attribute field browser).  Without the raw data being available, how does one find which film any attribute is attached to?  I looked through the Advanced Title Search, but didn't find a way to search the attributes.  I can use Google (or similar) to try and find them.  Hopefully there's a way (I haven't found yet) to continue using this data to make IMDb better.  Thanks.

Here are some examples of Los Angeles release date attributes that I couldn't fix because Google couldn't find them.

34      Los Angeles, CA
6      Los Angeles, Ca
1      Los Angeles,California

(they should all be 'Los Angeles, California')
(Edited)

GP

  • 1 Post
  • 1 Reply Like
Just checked out AWS. As long as there is no "bill capping" feature implemented (which is -funny as it is- requested by a lot of people for almost a decade), there is no way to set your bill to a hard limit.
Admittedly, most cases of hacked accounts were the result of unintentionally published private keys on github. Still, for me, using such a service would give me sleepless nights, as I could never be sure not to wake up with a multi-thousand dollar debt.
This attitude is irresponsible on Amazon's part, and it also means that IMDb no longer has a reasonable publicly available data set.

Nobody

  • 1443 Posts
  • 683 Reply Likes
... there is no "bill capping" feature ....
Interestingly, that issue is mentioned in Wikipedia's article about S3.
According to that article:

"... AWS does not provide native bandwidth limiting and as such
users have no access to automated cost control.  This can lead to
users on the 'free-tier' S3 or small hobby users amassing dramatic bills. ..."
(Edited)

sv, Official Rep

  • 25 Posts
  • 11 Reply Likes
Thank you for your feedback regarding the launch of IMDb datasets in S3.

We will be sure to take all the feedback into consideration moving forward as we consider further updates to this product offering.

Gardner von Holt

  • 9 Posts
  • 14 Reply Likes
So in English, "we are going ahead with our plans to cut access, despite the negative feedback" and "Nothing changes in the short term, data will be cut off"

Do I have that right?

Scott

  • 1 Post
  • 3 Reply Likes
For my use cases, which is a personal TV listings application, I am missing aka-titles.list, ratings.list and language.list.

I use aka-titles.list to match TV listing data to IMDB info, because the TV listing data often uses an alternative name for movies.

For ratings.list, unless I am missing something, the top 250 information is not available in the new format, which I use to highlight those movies in the TV listings as well as to track my progress in watching those movies.

I use language.list to show the original language of movies as opposed to the language being broadcasted.

Luca Canali

  • 2 Posts
  • 3 Reply Likes
For my personal TV application the aka-titles.list is really a requirement. Consider that outside USA you really need to use the translated titles, as the English and original titles are just unknown to most people.

The keyword.list data is also required, as it makes it possible to categorize movies.

Also please continue to share the data file with FTP/HTTP. Most people won't start using and paying S3 for their personal projects just to obtain such data.

You are just forcing people to start scraping imdb.com to obtain the same information that before was easily and freely obtainable from FTP/HTTP.

Gardner von Holt

  • 9 Posts
  • 14 Reply Likes
It is sad to see the end of what Col Needham started.

I wonder what he would think of this direction for his project, it certainly flies in the face of the entire nature of IMDB's history and values.

Those that continue to need this sort of access, might want to check out https://www.themoviedb.org

Luca Canali

  • 2 Posts
  • 3 Reply Likes
Even more frustrating is that the reason for doing this is not clear.

Really, why force people to use S3 instead of a normal FTP/HTTP download? And why remove so much useful info?

Would be interesting to know if Col Needham is really aware of this...

Vincent Fournols

  • 5 Posts
  • 4 Reply Likes
I am afraid so, as he replied to a post above...

Gardner von Holt

  • 9 Posts
  • 14 Reply Likes
I missed that, thanks. I'll note that he didn't really say anything supporting the move, mostly just how to do something.  That he didn't announce this, and hasn't voiced any support for it to date, speaks volumes.
After so long it seems a shame to move to another service; IMDb used to be synonymous with the best source of movie data.  Guess it's now a marketing thing.
What a waste of decades of people contributing to make it better, and now it's something to sell Amazon goods and services.

TheObviousOne

  • 1 Post
  • 7 Reply Likes
This seems to negate the bidirectional, symbiotic nature of IMDb. It seems to purposefully ignore that IMDb wouldn't be the "same" if it weren't for both the contributors and for ... IMDb itself.  One without the other(s) and the IMDb data (what "IMDb" is at its core) of today would be of many times "poorer" quality. It is true that there is a fundamental asymmetry in how IMDb works, but that has never been the problem, more like its strength: the contributors who contribute the most certainly aren't the ones who profit from/consume the most data; contributors give time, IMDb gives time and all the (necessary) rest.

This "move" from "IMDb" seems unnecessary because there are other models that would work in greatly reducing operational costs and data consumption without the need to subject the bona fide data consumers to this convoluted, Rube Goldberg-esque, nonsensical process: the first that comes to mind, without putting too much thought into it, and with trivial impact (for both interested parties), would be to maintain the FTP process but, instead of being anonymous, tie it to a previously assigned user:password (maybe the same as the IMDb login credentials?) and to a specific time-based quota (X files/bytes could be downloaded daily/monthly/yearly/whatever per FTP/IMDb user). That would separate the "abusers" from the "little guy" (it's seemingly ironic how the ones that would be most damaged by the "new" process would be the ones that "consume" the least and, probably, contribute the most...; this "move" equates to shooting birds with a cannon, not the right tool for the job: if the aim is to deter and fight abuse [I infer, as no "plausible" motive was forwarded, the other alternative being some misguided corporate bottom-line measure that simply didn't account for all the factors in their equation -- meaning, there's more to be lost by possibly alienating contributors than to be gained] why does it "burn" all the rest -- incidental users: the vast majority, I surmise -- at the same time? So unfair and a whole lot of "bad-Karma points"...)

Plus, this "move" from "IMDb" also seems shortsighted because it fails to understand the fundamental reason why IMDb is, well, IMDb. It wouldn't be the same quality-wise (meaning data quality) and quantity-wise without the army(s) of busy bees (AKA contributors) constantly mending/tending/pruning/trimming/grafting, 24/7, every square inch of their "digital property". Try to convert that (if you can) to the army(s) of employees that would have to be paid (plus benefits, plus ...) to work fewer hours and with lesser effectiveness (a labor of love isn't really labor, and it can't really be matched). That would be the scenario of having IMDb without happy/motivated contributors (which is where "moves" just like this one would lead, eventually, down the road).

And if IMDb considers that this isn't a problem, because it's "too big to fail" right now and it can do as it pleases just ... because, I'll leave you with this 'food for thought': if (hopefully not) for some unfortunate reason IMDb would cease to exist today, and all the data would be lost, a new IMDb would rise tomorrow (let's call it, for argument's sake, NewIMDb), the (existing) contributors would flock over to it and in a short amount of time something resembling (Old)IMDb would be there, standing proud and tall -- (Old)IMDb nothing but a fading memory... If the exact opposite were to happen (meaning, all the contributors ceased, for some unwanted, unforeseen reason, supporting, visiting and contributing every minute to the collective hive that is IMDb), the same (i.e., IMDb) would slowly -- but surely -- wither and "die" (vanish).

The longevity (and quality) of IMDb is due to its openness and two-way street nature, not in spite of it.

Final disclaimer: I love IMDb and want it to outlive the planet itself, but I couldn't just watch it doing this to itself and stand idle. I also know this isn't a "conversation" (a conversation doesn't start: "we are doing this, what do you think?"; a conversation starts with: "what do you think of the possibility of changing from this to that?") but I hope smarter (smartest) heads will prevail. For IMDb's sake and success. Including in that all of IMDb's ecosystem. Thank you.



PS: this is what I would also have written if, all of a sudden, Wikipedia started erecting some kind of fences/barriers to some type of its content. NewWikipedia would be fast in its place, without missing a beat...
(Edited)

Ivan

  • 3 Posts
  • 1 Reply Like
This reply was created from a merged topic originally titled Alternative interfaces old format is not available anymore?.

Hi guys!

Sometimes I used to download files like actors.list, movies.list and so on from Alternative Interfaces. I used them mostly for some machine learning educational purposes. But a couple of weeks ago I discovered that these files are not available anymore. Only an interface that connects to an AWS database is offered, which is paid, and honestly speaking I'm not very familiar with all this AWS stuff. Just plain text files were more than enough for me.

Is there any way to get these files up to date, or are they deprecated and gone forever?

sv, Official Rep

  • 24 Posts
  • 9 Reply Likes
The .list files continue to be available for download and use from the FTP sites. But they will be retired on September 10th. We strongly encourage customers to switch to the S3 solution for uninterrupted access.

Here is the link to the FTP sites:
ftp://ftp.funet.fi/pub/mirrors/ftp.im...
ftp://ftp.fu-berlin.de/pub/misc/movie...

sv, Official Rep

  • 25 Posts
  • 11 Reply Likes
We greatly appreciate our customers sharing their use cases and concerns about the new IMDb Datasets. After careful consideration, we will be adding TitleAKAs to the datasets in S3. This will be available in the coming weeks in the S3 bucket: imdb-datasets. We will update this thread and the www.imdb.com/interfaces page once we have more details.

Thank you for your continued support.
(Edited)

Ivan

  • 3 Posts
  • 1 Reply Like
Thank you so much!

Nick

  • 3 Posts
  • 7 Reply Likes
I am horrendously disappointed by this change. I actually legitimately cried when my IMDB database updater threw a 404.
I get that you guys want money and are promoting S3, but it's kind of a dirty tactic to force me to pay for something when I have spent almost two years developing and creating my own little in-house 'Netflix' service for my personal collection of movies and TV, built around all of the data that you guys so graciously provided us with before.
My system was already set up to prevent strain on bandwidth on your end: only once a month would it check for updates and then re-download updates/diffs, because I use it all...
I use the titles and all the akas; keywords/taglines/writers/quotes/ratings-reasons/locations/genres/composers/actresses/actors/directors/countries/distributors are all intricately tied into my searching; running-times, ratings, release dates, and various other files are all meshed into a distributed MySQL database, allowing me near real-time data draw.
To know I am going to HAVE to re-build, re-structure, and re-engineer the entire way my system is built around your data truly means I will probably have to scrap and abandon the entire project.
Albeit it was just for my own personal fun, it saddens me to see you guys make a change that breaks two years of my fun-time work.

Gardner von Holt

  • 9 Posts
  • 14 Reply Likes
Very well said.
Given that the clients are now paying the cost of transmitting the data, It is totally unclear why imdb is so focused on withholding data.  
This feels non-technical in nature, the use of this alternate data cannot possibly be hurting the revenue model for IMDB, and with "requestor pays" there is no cost to transmit the data.
It is all very curious, and I wish imdb(amazon) would more openly explain why they are doing this.

Nick

  • 3 Posts
  • 7 Reply Likes
I agree as well. Am very curious as to why or what motivation they have to make this change. I mean its been that way for 20 years now, so why the sudden change. I get and am totally all for improving/innovation, but to entirely cut people out seems backwards.

Emmanuel

  • 1 Post
  • 0 Reply Likes
Hello,

Will the distributors data still be available through S3?

Guillaume

  • 2 Posts
  • 2 Reply Likes
Hi,

May I ask that you distribute the locations.list file please? I'm using it to get localized movie information of places around the world. 

Also, an example using the AWS CLI would be much appreciated. I'm having issues accessing the new endpoint. Am I doing something wrong?

aws s3 cp s3://imdb-datasets/documents/v1/current/name.basics.tsv.gz name.basics.tsv.gz
A client error (403) occurred when calling the HeadObject operation: Forbidden

Andrew Gallant

  • 5 Posts
  • 2 Reply Likes
The files in s3 are configured so that the requester is forced to pay, which requires sending an extra config knob to AWS to acknowledge that you're willing to pay for the request. The `aws` CLI tool doesn't completely support this knob. It's available when using `aws s3 ls` but not `aws s3 cp`. For example, this works (assuming your AWS credentials are setup properly):

aws s3 ls s3://imdb-datasets/documents/v1/current/ --request-payer requester

While `aws s3 cp` doesn't support the requester pays option, the `s3cmd` tool does. So to copy all of the current IMDb datasets, you can do this:

mkdir imdb
s3cmd sync s3://imdb-datasets/documents/v1/current ./imdb --requester-pays

And that's pretty much it!
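
For anyone who prefers a script to a command-line tool, the same request can be sketched in Python with the third-party `boto3` SDK (swapped in here for illustration; the official samples mentioned earlier in this thread use the AWS Java SDK). The bucket name and key prefix are taken from the commands above. The destination directory is an arbitrary choice, and the code assumes AWS credentials are already configured.

```python
import os

BUCKET = "imdb-datasets"          # bucket name as given in this thread
PREFIX = "documents/v1/current/"  # key prefix as given in this thread


def build_download_plan(file_names, dest_dir, prefix=PREFIX):
    """Return (s3_key, local_path) pairs for the files to fetch."""
    return [(prefix + name, os.path.join(dest_dir, name))
            for name in file_names]


def download_all(file_names, dest_dir, bucket=BUCKET):
    """Download each file, acknowledging requester-pays on every request."""
    import boto3  # third-party; imported here so the planning helper works without it

    s3 = boto3.client("s3")
    os.makedirs(dest_dir, exist_ok=True)
    for key, local_path in build_download_plan(file_names, dest_dir):
        # RequestPayer tells S3 that the caller accepts the transfer charges.
        s3.download_file(bucket, key, local_path,
                         ExtraArgs={"RequestPayer": "requester"})


# Planning only; nothing is downloaded until download_all() is called.
plan = build_download_plan(["name.basics.tsv.gz"], "imdb")
```

Calling `download_all(["name.basics.tsv.gz"], "imdb")` would then perform the actual transfer, billed to the account whose credentials are in use.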

Gardner von Holt

  • 8 Posts
  • 13 Reply Likes
Some of the 3rd party ftp clients also do not natively support sending this flag.  I have approached the makers of the market leading Mac client "transmit" to add this feature, but for now it remains unsupported there as well.

Andrew Gallant

  • 5 Posts
  • 2 Reply Likes
Note that this is for s3, which isn't FTP.

Gardner von Holt

  • 8 Posts
  • 13 Reply Likes
Oops, my bad, I should have been clearer. Transmit supports s3: protocol as well as many others (google cloud, one drive, etc).  I should have been more clear in the category, this category originated as ftp clients but they all support many protocols now :)

Guillaume

  • 2 Posts
  • 2 Reply Likes
Thanks a ton for the help Andrew!

josh

  • 1 Post
  • 4 Reply Likes
Have you considered also making the new files available via FTP on a less frequent basis?

For example, you can have daily updates on S3 and monthly updates on FTP.

In this way you can both promote S3, and at the same time don't disrupt personal and open source projects. S3 is just not an option for a lot of us.

Having keywords info would be also very good.

Thanks

Jeorj Euler

  • 126 Posts
  • 16 Reply Likes
This does not improve anything for bulk data users, and thus far, it serves to discontinue the availability of certain information, as pointed out earlier by others submitting remarks to this GS topic.

Back in 2014 or so, Netflix (which uses S3) discontinued public access to its API altogether and even limited existing access to a select portion of established users. We shouldn't even be surprised by this move. Things like this seem to be a theme of the 2010s. (For that, Amazon does not deserve all of the blame.)

O, and if there is to be the remark "don't knock it until you've tried it", I'm none too pleased that I have to present credit card information or any information about my identity and "billing address" (often not separate from mailing address) at all in order to "try" it, even with the time-limited free access (as newcomers may not necessarily be ready to make the best time-limited use of all the things that Amazon Web Services offer). I'm sure countless other users feel the same way as I do. Hopefully, there will be some S3 users who will broker out the information stored in IMDb's "data sets" S3 bucket, on agreeable terms, and hopefully diff files will be made available (to spare S3 and its Internet Service Providers of their bandwidth burdens).