How do I make my crawl faster?
I can only crawl at a max of 10 pages per second. My average is about 7 or 8 (in the beginning, too!)
Currently each crawler maintains its own counts until I ask it to save and aggregate. Parsing is done in each individual crawler thread (inside the Crawler class), and the 10 largest lipograms/palindromes/rhopalics are kept in each crawler until I ask for them.
This aggregation is performed every 2 minutes. Would it be faster to move the parsing into another thread somehow? What other tips do you have for making it faster (besides checking over the algorithms we use to parse)?
EDIT:
Oh god, it's gotten much worse after just 30 minutes of crawling. Now the pages-fetched-per-second figure just reads like binary... 0's and 1's...
-
- Each crawler is a separate thread, so you don't need another thread for parsing.
- How do you aggregate data during the crawl? Crawler4j doesn't allow communication between crawlers during a crawl.
- What is the 2-minute period for? If you dump data every 2 minutes, that interval is too short.
-
Before you updated crawler4j to 1.04, we built our own mechanism for aggregating data. Here's how:
- In the constructor of each "Crawler" class, each crawler "registers" itself in a master list of crawlers. A separate thread iterates through each crawler every 2 minutes, collecting new data.
Should we go back and download 1.04 and use those features instead?
And yeah, we are working on finding out what is wrong with our palindromes.
-
This is not a good practice. Why do you need aggregation every two minutes? Let the crawlers work separately so they can run as fast as possible. You can aggregate your data at the end of the crawl, and if a crash happens you still have the partial dumps.
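
To make the "aggregate at the end" idea concrete, here is a minimal sketch in plain Java. It does not use crawler4j at all: the names EndOfCrawlAggregation, LocalStats, count, offer and mergeAll are made up for illustration, and the worker threads only stand in for real crawlers. Each worker writes to its own stats object, keeps just its 10 longest candidates in a small min-heap, and everything is merged once after all workers have finished.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these names come from crawler4j.
class EndOfCrawlAggregation {

    /** Per-crawler state. Only the owning crawler thread touches it, so no locking is needed. */
    static class LocalStats {
        static final int TOP_N = 10;
        final Map<String, Long> counts = new HashMap<>();
        // Min-heap ordered by length: the shortest kept candidate is evicted first.
        final PriorityQueue<String> longest =
                new PriorityQueue<>(Comparator.comparingInt(String::length));

        void count(String key) {
            counts.merge(key, 1L, Long::sum);
        }

        void offer(String candidate) {
            longest.add(candidate);
            if (longest.size() > TOP_N) {
                longest.poll(); // drop the shortest so at most TOP_N remain
            }
        }
    }

    /** Merges all per-crawler stats in one pass; call it only after every crawler has stopped. */
    static LocalStats mergeAll(List<LocalStats> perCrawler) {
        LocalStats total = new LocalStats();
        for (LocalStats s : perCrawler) {
            s.counts.forEach((k, v) -> total.counts.merge(k, v, Long::sum));
            for (String candidate : s.longest) {
                total.offer(candidate);
            }
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        List<LocalStats> perCrawler = new ArrayList<>();
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            LocalStats stats = new LocalStats();
            perCrawler.add(stats);
            final int id = i;
            // Stand-in for a crawler: it records only into its own LocalStats.
            Thread t = new Thread(() -> {
                for (int page = 0; page < 1000; page++) {
                    stats.count("pages");
                    stats.offer("crawler" + id + "-page" + page);
                }
            });
            threads.add(t);
            t.start();
        }
        for (Thread t : threads) {
            t.join(); // wait for the "crawl" to end before reading any stats
        }
        LocalStats total = mergeAll(perCrawler);
        System.out.println("pages counted: " + total.counts.get("pages"));
        System.out.println("candidates kept: " + total.longest.size());
    }
}
```

Because nothing is shared while the workers run, there is no locking on the hot path. If you also want partial dumps in case of a crash, each crawler could write out its own LocalStats from its own thread now and then, rather than having a separate thread reach into every crawler.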