Can WebSphinx process http://en.wikipedia.org/wiki/Stonehenge?
This is related to the shortest path problem. Is anybody else having problems processing http://en.wikipedia.org/wiki/Stonehenge? If I set that page as the only seed, the crawler stops immediately. If I try to reach the page "indirectly", e.g. via http://en.wikipedia.org/wiki/Stonehenge_(disambiguation), WebSphinx still does not process it (it passes shouldVisit but never enters doVisit).
Sure, I can find the shortest path by "manual" BFS but that can hardly be the point of the assignment...
Can anyone help me?
WebSphinx _does_ work on Wikipedia. However, some pages exceed WebSphinx's default maximum page size of 100 KB. In the WebSphinx UI, click the "advanced" button in the top right, go to the "limits" tab, and increase the limit. In your code, use the setDownloadParameters method to increase the page size:
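import websphinx.Crawler;
import websphinx.DownloadParameters;

public class WikipediaCrawler extends Crawler {
    public WikipediaCrawler() {
        // Raise the page size limit to 1000 KB (default is 100 KB) so larger pages are not skipped.
        setDownloadParameters(DownloadParameters.DEFAULT.changeMaxPageSize(1000));
    }
}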
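If you want to cross-check the crawler's result, the "manual" BFS mentioned in the question can be sketched roughly as below. This is only an illustration and does not use WebSphinx: the class name LinkGraphBfs, the shortestPath method, and the tiny in-memory link map are all made up for the example, and a real run would need the links extracted from the downloaded pages.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical helper, not part of WebSphinx.
class LinkGraphBfs {
    // Returns the pages on a shortest link path from start to goal,
    // or an empty list if goal cannot be reached.
    static List<String> shortestPath(Map<String, List<String>> links, String start, String goal) {
        Map<String, String> parent = new HashMap<>(); // page -> page it was discovered from
        Queue<String> queue = new ArrayDeque<>();
        parent.put(start, null);
        queue.add(start);
        while (!queue.isEmpty()) {
            String page = queue.remove();
            if (page.equals(goal)) {
                // Walk the parent links back to the start, then reverse.
                List<String> path = new ArrayList<>();
                for (String p = goal; p != null; p = parent.get(p)) {
                    path.add(p);
                }
                Collections.reverse(path);
                return path;
            }
            for (String next : links.getOrDefault(page, List.of())) {
                if (!parent.containsKey(next)) { // visit each page at most once
                    parent.put(next, page);
                    queue.add(next);
                }
            }
        }
        return List.of();
    }

    public static void main(String[] args) {
        // Made-up link structure, for illustration only.
        Map<String, List<String>> links = Map.of(
            "Stonehenge_(disambiguation)", List.of("Stonehenge", "Other"),
            "Stonehenge", List.of("Wiltshire"),
            "Other", List.of());
        System.out.println(shortestPath(links, "Stonehenge_(disambiguation)", "Wiltshire"));
    }
}
```

Because BFS expands pages in order of distance from the start, the first time it reaches the goal it has found a path with the fewest links.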