Recent activity
Michael L. replied on March 15, 2009 18:19 to the question "Making the Binary File" in LUCI:
Michael L. replied on March 15, 2009 02:05 to the question "Making the Binary File" in LUCI:
Michael L. marked one of Paul Purtell's replies in LUCI as useful. Paul Purtell replied to the question "Making the Binary File".
Michael L. replied on March 14, 2009 00:15 to the question "Making the Binary File" in LUCI:
Michael L. asked a question in LUCI on March 13, 2009 21:32:
Making the Binary File
About how long should it take to make the binary file for the 2.8GB posting list? Maybe we're just being inefficient or something, but our program is taking forever.
Michael L. replied on March 12, 2009 18:32 to the question "What's a good way to create a binary file from Java?" in LUCI:
It's the RandomAccessFile. Yasser used it in discussion:
http://www.ics.uci.edu/~djp3/classes/...
(Last slide.)
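For anyone who can't get to the slides, here is a minimal sketch of writing and reading a binary file with RandomAccessFile. The record layout (an int document id followed by an int term frequency), the class name, and the file name are my own assumptions for illustration, not the format from the assignment or from Yasser's slide.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

public class PostingFileSketch {
    // Hypothetical record layout (an assumption): int docId + int termFrequency = 8 bytes.
    static final int RECORD_BYTES = 8;

    /** Writes (docId, termFrequency) pairs as fixed-width binary records, replacing old content. */
    static void writePostings(String path, int[][] postings) {
        try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
            raf.setLength(0); // truncate anything left over from an earlier run
            for (int[] p : postings) {
                raf.writeInt(p[0]); // 4 bytes, big-endian
                raf.writeInt(p[1]);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Seeks straight to record number index and reads it back. */
    static int[] readPosting(String path, long index) {
        try (RandomAccessFile raf = new RandomAccessFile(path, "r")) {
            raf.seek(index * RECORD_BYTES);
            return new int[] { raf.readInt(), raf.readInt() };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String path = "postings.bin"; // example file name
        writePostings(path, new int[][] { {1, 3}, {7, 1}, {42, 5} });
        int[] third = readPosting(path, 2);
        System.out.println(third[0] + " " + third[1]); // prints "42 5"
    }
}
```

Fixed-width records are what make the seek arithmetic work. Note that RandomAccessFile does no buffering of its own, so every writeInt goes to the operating system, which can be slow for a very large file.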
Michael L. replied on March 02, 2009 20:23 to the question "Sorting problems" in LUCI:
I think, if you used bin/hadoop dfs -copyToLocal output output, all your "part-xxxxx" files would be in the output folder, not output/output. Just check inside the output folder to see where those files are.
Also, even if it is output/output, I think that extra / in front is trying to do sort -m on /output/output/* instead of /extra/ugrad_space/jdauz/hadoop.../output/output/*
And don't forget to move any directories, such as _logs, out of there before running the sort.
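If it helps to see what the merge step is doing, here is a rough Java sketch of the same idea as sort -m: a k-way merge over files that are each already sorted, skipping anything that is not a regular file (such as the _logs directory). This is a swapped-in illustration, not what sort itself runs, and the class name, the part-00000 style file names, and the sample lines are made up for the demo.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class PartFileMerge {
    /** One open part file plus the line currently at its head. */
    static final class Source {
        final BufferedReader reader;
        String line;
        Source(BufferedReader reader, String line) { this.reader = reader; this.line = line; }
    }

    /** Merges every already-sorted regular file in dir into out (out must live outside dir). */
    static void mergeSortedParts(Path dir, Path out) {
        PriorityQueue<Source> heap = new PriorityQueue<>((a, b) -> a.line.compareTo(b.line));
        List<BufferedReader> open = new ArrayList<>();
        try (Writer w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            try (DirectoryStream<Path> parts = Files.newDirectoryStream(dir)) {
                for (Path p : parts) {
                    if (!Files.isRegularFile(p)) continue; // skips directories such as _logs
                    BufferedReader r = Files.newBufferedReader(p, StandardCharsets.UTF_8);
                    open.add(r);
                    String first = r.readLine();
                    if (first != null) heap.add(new Source(r, first));
                }
            }
            while (!heap.isEmpty()) {
                Source s = heap.poll();          // smallest head line across all files
                w.write(s.line);
                w.write('\n');
                s.line = s.reader.readLine();    // advance that file
                if (s.line != null) heap.add(s);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            for (BufferedReader r : open) {
                try { r.close(); } catch (IOException ignored) { }
            }
        }
    }

    /** Builds a tiny made-up output folder, merges it, and returns the merged lines. */
    static List<String> demo() {
        try {
            Path dir = Files.createTempDirectory("output");
            Files.createDirectory(dir.resolve("_logs"));
            Files.write(dir.resolve("part-00000"), Arrays.asList("apple", "cherry"), StandardCharsets.UTF_8);
            Files.write(dir.resolve("part-00001"), Arrays.asList("banana", "date"), StandardCharsets.UTF_8);
            Path merged = Files.createTempFile("outputMerged", ".txt");
            mergeSortedParts(dir, merged);
            return Files.readAllLines(merged, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [apple, banana, cherry, date]
    }
}
```

The merge only gives a sorted result if every part file is already sorted in the same order the comparator uses. String.compareTo is not necessarily the order your locale's sort uses, so treat this as an illustration of the idea rather than a drop-in replacement.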
Michael L. replied on March 01, 2009 19:28 to the question "Server down?" in LUCI:
Michael L. asked a question in LUCI on February 28, 2009 21:25:
Non-ASCII characters in Posting List
Should we be including non-ASCII characters in our posting list? The reason I ask is my computer is able to read Asian characters. However, in my part-* files and outputMerged.txt, I still get things like
oï¿½uï¿½ï¿½3ï¿½wï¿½ï¿½vï¿½ï¿½rcï¿½ï¿½ï¿½ï¿½
as terms.
(This was run on the 200 url input.)
To see for yourself (assuming your computer can read Asian characters):
http://www.ics.uci.edu/~mlavaves/outputMerged.txt
To my knowledge, outputMerged.txt is saved in UTF-8 format, so if there were Asian characters in there to begin with, they should display properly.
I know I would still get that junk text for other languages I don't have installed on my computer (like Arabic), but I don't see any Asian characters at all. There should at least be a few from some of the pages that offer translations in those languages...
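One possible explanation, and this is only a guess on my part: the junk terms keep repeating a three-character group that starts with "ï¿". That is what the UTF-8 bytes of the Unicode replacement character U+FFFD (EF BF BD) look like when something displays them as ISO-8859-1. If so, the original bytes were already lost when they were decoded as UTF-8 somewhere earlier in the pipeline. The sketch below just demonstrates the two steps; the class and method names are hypothetical and it is not a claim about what Hadoop or the crawler actually did.

```java
import java.nio.charset.StandardCharsets;

public class ReplacementCharDemo {
    /** Encodes text as UTF-8, then (wrongly) decodes those bytes as ISO-8859-1. */
    static String utf8ReadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    /** Decodes raw bytes as UTF-8; malformed sequences are replaced with U+FFFD. */
    static String decodeUtf8(byte[] raw) {
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Step 1: bytes that are not valid UTF-8 (for example, text in another encoding).
        byte[] notUtf8 = { 'o', (byte) 0xE4, 'u' };
        String lossy = decodeUtf8(notUtf8);
        System.out.println("contains U+FFFD: " + (lossy.indexOf('\uFFFD') >= 0));

        // Step 2: U+FFFD saved as UTF-8, then viewed as ISO-8859-1.
        String seen = utf8ReadAsLatin1("\uFFFD");
        for (char c : seen.toCharArray()) {
            System.out.printf("U+%04X ", (int) c); // code points of what the viewer shows
        }
        System.out.println();
    }
}
```

Step 2 yields the three characters U+00EF, U+00BF, U+00BD. If that matches what you see, the fix would have to be where the page bytes are first turned into strings, not in how outputMerged.txt is saved.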
Michael L. replied on February 19, 2009 00:07 to the question "error running jar on openlab" in LUCI:
Michael L. replied on January 30, 2009 19:37 to the question "Work Queue and DbState" in LUCI:
Thanks for all the info!
Hopefully our program itself won't try to access the same data structures or text files from different threads.
By the way, another thing that might have affected us is that we ran the nohup java process more than once. This could also have caused concurrent access to the text files we were making, because some of our old crawls which we thought we had stopped may actually still have been running. If anybody hits the same problems, here's what I did:
- After logging into openlab on putty, I entered "ps -u <icsid>". That gave me all the processes I was still running. Ex: For me, it was "ps -u mlavaves".
- I looked for all the process ids (PIDs) that were associated with java.
- I killed all those old processes using "kill <pid>". Sometimes, they wouldn't disappear right away. When that happened, I think I tried "kill -KILL <pid>". Even after that, it may still take a while. Ex: if your pid is 20239, "kill 20239" or "kill -KILL 20239".
Make sure to check every server your process could be running on.
I had to do this on mothra, rodan, and godzilla. You can choose which one to log into by giving its hostname directly, instead of using "openlab.ics.uci.edu". A list of them can be found here.
Michael L. replied on January 30, 2009 08:18 to the question "Work Queue and DbState" in LUCI:
Ah, I didn't notice that when I glanced at it. We'll check our stuff and try to make sure different threads don't try to write to the same files. We're dumping all of our data in text files every 5000 pages. We use the id of the crawler in the filename to try to have each thread writing to its own text files.
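Roughly, the one-file-per-crawler idea looks like the sketch below. It is a simplified stand-in, not our actual crawler code: the file naming scheme, the class and method names, and the plain threads standing in for crawlers are all assumptions for illustration.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PerCrawlerDump {
    /** Hypothetical naming scheme: one dump file per crawler id. */
    static Path dumpFileFor(Path dir, int crawlerId) {
        return dir.resolve("crawler-" + crawlerId + "-dump.txt");
    }

    /** Appends lines to the file owned by this crawler id only. */
    static void dump(Path dir, int crawlerId, List<String> lines) {
        try (BufferedWriter w = Files.newBufferedWriter(dumpFileFor(dir, crawlerId),
                StandardCharsets.UTF_8, StandardOpenOption.CREATE, StandardOpenOption.APPEND)) {
            for (String line : lines) {
                w.write(line);
                w.newLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Starts one thread per crawler id, each dumping to its own file; returns the file count. */
    static int demo(int crawlers) {
        try {
            Path dir = Files.createTempDirectory("crawl");
            List<Thread> threads = new ArrayList<>();
            for (int id = 0; id < crawlers; id++) {
                final int crawlerId = id;
                threads.add(new Thread(() ->
                        dump(dir, crawlerId, Collections.singletonList("page seen by crawler " + crawlerId))));
            }
            for (Thread t : threads) t.start();
            for (Thread t : threads) t.join();
            return dir.toFile().list().length;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(4) + " dump files written");
    }
}
```

This keeps threads inside one java process from sharing a file, since no two ids map to the same name. It does not help if two separate java processes use the same ids and the same folder, because they would still append to the same files.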
I assume that putting the stop.txt file in right away caused the DbState to be closed rather than open. Do you have any idea how we got the error "Error while puting the url in the work queue."?
Michael L. asked a question in LUCI on January 30, 2009 02:04:
Work Queue and DbState
Hey everyone.
Our group has been getting some interesting errors. We've tried two crawls. One last night that hit the errors this morning, and another we started this morning and stopped this afternoon.
Crawl 1 data:
crawl-log.txt
nohup.out
Crawl 2 data:
crawl-log.txt
nohup.out
For both we used...
Crawler4j 1.0.5
openlab.ics.uci.edu
the nohup script from Yasser
with 20 crawlers.
Michael L. asked a question in LUCI on January 28, 2009 04:55:
Nohup Questions.
I'm unfamiliar with "nohup" and have a couple of questions. I'm planning to use it to run my team's code from openlab.
1. I understand that if I disconnect for a while, my java program will keep running. What happens if I turn off my laptop? How does it know when to stop?
2. If we can't enter in data after the crawler stops, how can we get the total data as in the example code from 1.0.4? (This input would be from the "should i stop crawling?" question at the end.)
