Work Queue and DbState
Hey everyone.
Our group has been getting some interesting errors. We've tried two crawls: one started last night that hit the errors this morning, and another that we started this morning and stopped this afternoon.
Crawl 1 data:
crawl-log.txt
nohup.out
Crawl 2 data:
crawl-log.txt
nohup.out
For both we used...
Crawler4j 1.0.5
openlab.ics.uci.edu
the nohup script from Yasser
with 20 crawlers.
-
I don't know how you're dumping your data, but something is clearly going wrong in your dumpMyData() function. This exception in your logs shows that multiple threads are modifying the same data structure (probably an ArrayList) at the same time:
java.util.ConcurrentModificationException
at java.util.AbstractList$Itr.checkForComodification(AbstractList.java:449)
at java.util.AbstractList$Itr.next(AbstractList.java:420)
at ir.assignment03.MyCrawler.dumpMyData(MyCrawler.java:93)
at ir.assignment03.MyCrawler.onBeforeExit(MyCrawler.java:74)
at edu.uci.ics.crawler4j.crawler.CrawlController.start(CrawlController.java:123)
at ir.assignment03.Controller.main(Controller.java:20)
Also, it looks to me like you have a stop.txt file in your crawl folder, which stops the crawlers sooner than expected. If that is the case, you should only put it there when you're sure there is nothing more to do.
-
Ah, I didn't notice that when I glanced at it. We'll check our code and make sure different threads don't try to write to the same files. We're dumping all of our data to text files every 5000 pages. We use the id of the crawler in the filename so that each thread writes to its own text files.
I assume that putting the stop.txt file in right away caused the DbState to be closed rather than open. Do you have any idea how we got the error "Error while puting the url in the work queue."?
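The per-crawler file scheme described above could look roughly like the sketch below. The class name PerCrawlerDump, the file name pattern, and the flushEvery parameter are all hypothetical and are not part of Crawler4j or of the poster's code. The buffer is meant to be owned by one crawler thread only, and CREATE_NEW makes the write fail loudly if another process (for example a leftover crawl) has already created the same file.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: one instance per crawler thread, never shared.
public class PerCrawlerDump {
    private final int crawlerId;
    private final Path dir;
    private final int flushEvery;                          // e.g. 5000 pages
    private final List<String> buffer = new ArrayList<>(); // touched by one thread only
    private int dumpCount = 0;

    public PerCrawlerDump(int crawlerId, Path dir, int flushEvery) {
        this.crawlerId = crawlerId;
        this.dir = dir;
        this.flushEvery = flushEvery;
    }

    // Buffer one line and write a new file once flushEvery lines have accumulated.
    public void add(String line) throws IOException {
        buffer.add(line);
        if (buffer.size() >= flushEvery) {
            flush();
        }
    }

    // Write the buffered lines to a file whose name contains the crawler id.
    // CREATE_NEW throws FileAlreadyExistsException instead of overwriting.
    public Path flush() throws IOException {
        Path out = dir.resolve("crawler-" + crawlerId + "-dump-" + dumpCount + ".txt");
        Files.write(out, buffer, StandardCharsets.UTF_8,
                StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
        dumpCount++;
        buffer.clear();
        return out;
    }
}
```

With this layout, two threads only collide if they are given the same crawler id and directory, which is also what happens when an old crawl is still running in the same folder.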
-
"Error while puting the url in the work queue." happens when Berkeley DB cannot write the new URL to its storage. That probably happened because the database had already been closed. If you delete the stop.txt file, there should be no problem.
-
Thanks Michael. If you are having problems with multiple threads concurrently accessing the same data structure, you can take a look at the suggestion listed here:
http://java.sun.com/j2se/1.4.2/docs/a...
After "Note that this implementation is not synchronized"
If multiple threads are accessing the same data structure, it needs to be synchronized or crazy things can happen.
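As a minimal sketch of that suggestion, a shared ArrayList can be wrapped with Collections.synchronizedList. The class and method names below (SharedListSketch, record, snapshot) are made up for illustration and are not taken from the poster's MyCrawler. Note that the wrapper only makes individual calls such as add() safe; iterating still needs an explicit synchronized block on the list, otherwise a ConcurrentModificationException like the one in the log can still occur.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SharedListSketch {
    // Hypothetical buffer shared by all crawler threads.
    // The wrapper makes each single call (add, size, ...) thread-safe.
    public static final List<String> visited =
            Collections.synchronizedList(new ArrayList<String>());

    public static void record(String url) {
        visited.add(url);
    }

    // Iteration (here: the copy constructor) is not atomic by itself,
    // so hold the list's lock for the duration of the copy.
    public static List<String> snapshot() {
        synchronized (visited) {
            return new ArrayList<String>(visited);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            final int id = i;
            workers[i] = new Thread(() -> {
                for (int j = 0; j < 1000; j++) {
                    record("crawler-" + id + "/page-" + j);
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            t.join();
        }
        System.out.println(snapshot().size()); // prints 4000
    }
}
```

A dump routine would then iterate over the copy returned by snapshot() rather than over the shared list itself.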
-
Thanks for all the info!
Hopefully our program itself won't try to access the same data structures or text files from different threads.
By the way, another thing that might have affected us is that we ran the nohup java process more than once. This could also have caused concurrent access to the text files we were writing, because some of our old crawls that we thought we had stopped may actually still have been running. If anybody hits the same problem, here's what I did:
- After logging into openlab with PuTTY, I entered "ps -u <icsid>", which listed all the processes I was still running. Ex: for me, it was "ps -u mlavaves".
- I looked for all the process ids (PIDs) that were associated with java.
- I killed all those old processes using "kill <pid>". Sometimes they wouldn't disappear right away. When that happened, I think I tried "kill -KILL <pid>". Even after that, it may still take a while. Ex: if your pid is 20239, "kill 20239" or "kill -KILL 20239".
Make sure to check every server your process could be running on.
I had to do this on mothra, rodan, and godzilla. You can choose which one to log into by giving its hostname instead of using "openlab.ics.uci.edu". A list of them can be found here.
-
I ran into the same problem. I found it easier to just write down the name of the server you run on; otherwise you can easily lose track of where your process is running! (Alternatively, helpdesk can kill it for you if you lose it, but they're probably closed this weekend.)