
Crawl time

If we limit ourselves to at most 1 request per second to Wikipedia, crawling 3M pages takes more than 800 hours (3,000,000 seconds is about 833 hours, or roughly 35 days). A static dump of Wikipedia is available at http://static.wikipedia.org/ that we can use instead: rather than fetching each page from the web, we read it from the file system. All of the remaining steps stay the same: we parse the page, extract the links, and follow them.
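To make the idea concrete, here is a rough sketch in Python. It assumes the dump has been unpacked into a local directory of HTML files that link to each other with relative paths; I have not checked the actual layout of the dump, so treat that as a guess. The names `crawl_hours`, `LinkExtractor`, `local_links` and `crawl` are made up for this example.

```python
from collections import deque
from html.parser import HTMLParser
from pathlib import Path
from urllib.parse import unquote, urldefrag, urlparse


def crawl_hours(pages, requests_per_second=1.0):
    """Hours needed to fetch `pages` pages at the given request rate."""
    return pages / requests_per_second / 3600


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def local_links(page_path, root):
    """Return the files under `root` that `page_path` links to."""
    parser = LinkExtractor()
    parser.feed(page_path.read_text(encoding="utf-8", errors="replace"))
    root = root.resolve()
    found = []
    for href in parser.links:
        href, _fragment = urldefrag(href)
        # Skip empty links and anything that points to the web.
        if not href or urlparse(href).scheme or href.startswith("//"):
            continue
        # Assumption: links are relative paths, possibly percent-encoded.
        target = (page_path.parent / unquote(href)).resolve()
        if target.is_file() and root in target.parents:
            found.append(target)
    return found


def crawl(start, root):
    """Breadth-first crawl over local files; returns the set of pages seen."""
    start = start.resolve()
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in local_links(page, root):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    print(round(crawl_hours(3_000_000), 1))  # about 833.3 hours at 1 request/second
```

Only the "get the page" step changes here (`read_text` instead of an HTTP request); parsing and link following are the same as for a web crawl, and no rate limit is needed because nothing is sent to Wikipedia.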
