Getting WebSphinx to work on Wikipedia
WebSphinx _does_ work on Wikipedia. However, some pages exceed WebSphinx's default maximum page size of 100 KB; both the Irvine, California and Bubonic Plague pages do. In the WebSphinx UI, click the "advanced" button in the top right, go to the "limits" tab, and increase the limit. In your code, use the setDownloadParameters method to increase the maximum page size:
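import websphinx.Crawler;
import websphinx.DownloadParameters;

public class WikipediaCrawler extends Crawler {
    public WikipediaCrawler() {
        // Raise the maximum page size from the default of 100 KB to 1000 KB.
        setDownloadParameters(DownloadParameters.DEFAULT.changeMaxPageSize(1000));
    }
}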
Thanks Chris. There is also a setting to turn off robots.txt compliance in the same place. I also recommend that you guys find a way to throttle your requests so that you don't get banned.
Does anyone have ideas on that?
Also, the Crawler Workbench (the GUI) is fine for exploring WebSphinx, but assignment 3 should be written as your own program.
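On the throttling question, one possible approach is to enforce a minimum gap between requests. The sketch below is only an illustration: RequestThrottle, reserve, and acquire are hypothetical names that are not part of WebSphinx, and the delay value is whatever you choose. It does not show where to hook it into the crawler.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical helper (not part of WebSphinx): spaces requests at least minDelayMillis apart. */
final class RequestThrottle {
    private final long minDelayMillis;
    // Earliest time (in ms) at which the next request may start.
    private long nextAllowedMillis = Long.MIN_VALUE;

    RequestThrottle(long minDelayMillis) {
        if (minDelayMillis < 0) {
            throw new IllegalArgumentException("minDelayMillis must be >= 0");
        }
        this.minDelayMillis = minDelayMillis;
    }

    /**
     * Reserves the next request slot and returns how many milliseconds the
     * caller must wait, given the current time. Synchronized so that several
     * crawler threads sharing one throttle each get a distinct slot.
     */
    synchronized long reserve(long nowMillis) {
        long start = Math.max(nowMillis, nextAllowedMillis);
        nextAllowedMillis = start + minDelayMillis;
        return start - nowMillis;
    }

    /** Blocks the calling thread until its reserved slot arrives. */
    void acquire() throws InterruptedException {
        long waitMillis = reserve(System.nanoTime() / 1_000_000L);
        if (waitMillis > 0) {
            TimeUnit.MILLISECONDS.sleep(waitMillis);
        }
    }
}
```

To use it, create one shared instance, for example new RequestThrottle(1000) for at most one request per second, and call acquire() before each page fetch. Where that call belongs inside a WebSphinx crawler is something to check against the WebSphinx documentation, since the crawler may download on several threads.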