Crawl time
If we limit ourselves to one request per second to Wikipedia, crawling 3M pages takes more than 800 hours (3,000,000 seconds is about 833 hours, or roughly 35 days). A static dump of Wikipedia is available at http://static.wikipedia.org/ that we could use instead: rather than fetching each page from the web, we read it from the file system. All of the later steps stay the same: we still parse the page, extract the links, and follow them.
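To make the estimate and the proposal concrete, here is a rough Python sketch. It assumes the dump is unpacked as plain HTML files under some local directory; the names `crawl_hours`, `read_page`, and `extract_links` are made up for illustration and are not part of any existing crawler.

```python
from html.parser import HTMLParser
from pathlib import Path


def crawl_hours(pages, requests_per_second=1.0):
    """Hours needed to fetch `pages` pages at a fixed request rate."""
    return pages / requests_per_second / 3600


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in an HTML string, in document order."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def read_page(dump_root, relative_path):
    """Read a page from the local dump instead of the web.

    The directory layout under `dump_root` is an assumption here.
    """
    path = Path(dump_root) / relative_path
    return path.read_text(encoding="utf-8", errors="replace")
```

Only `read_page` differs from a live crawl; the parsing and link-following code would be shared either way.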
I'm aware of the static dump that Wikipedia makes available, but this is an exercise in actually "doing a crawl". Finding the palindromes and so on makes it a little more fun, but that isn't really the point; the point is dealing with the details of actually trying to crawl the web.
So the short answer is no: please crawl the live version of Wikipedia.
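One of those details is keeping to the request rate mentioned above. A minimal sketch of a throttle, assuming a single-threaded crawler; the `Throttle` class and its parameters are hypothetical, and the clock and sleep functions are injectable only so the behaviour can be checked without really waiting.

```python
import time


class Throttle:
    """Enforces a minimum interval between consecutive requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous request, if any

    def wait(self):
        """Block until at least `min_interval` seconds have passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

A crawler would call `throttle.wait()` immediately before each fetch, so time spent parsing a page counts toward the one-second gap rather than being added on top of it.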