Recent activity
infection0 replied on February 16, 2009 19:07 to the question "For some reason I can't change Java version on Putty." in LUCI:
infection0 asked a question in LUCI on February 16, 2009 18:59:
For some reason I can't change Java version on Putty.
I am using Putty on Windows, using SSH to port 22 on simpsons.ics.uci.edu and/or family-guy.ics.uci.edu.
Here is my console:
login as: ******
Using keyboard-interactive authentication.
Password: ******
Last login: Sun Feb 15 21:25:48 2009 from dhcp-063055.mob
Sun Microsystems Inc. SunOS 5.10 Generic January 2005
> module avail java
module: Command not found.
> bash
bash-3.00$ module avail java
bash: module: command not found
Any ideas on what is wrong?
infection0 replied on February 05, 2009 23:44 to the idea "Picking your java version on openlab" in LUCI:
infection0 replied on February 02, 2009 05:47 to the question "How do we run our crawler from openlab?" in LUCI:
infection0 replied on February 02, 2009 05:38 to the question "Crawler taking up massive amounts of space on my HDD." in LUCI:
I entered this filter before the en.wikipedia.org/wiki check... still no change in the program's behavior. I have not modified my code otherwise.
if (href.substring(5).contains(":"))
{
return false;
}
My frontier folder has already grown to 4 GB since I last posted, and my crawl has slowed to 5 pages per second again.
infection0 replied on February 02, 2009 02:24 to the question "Crawler taking up massive amounts of space on my HDD." in LUCI:
infection0 replied on February 02, 2009 01:52 to the question "Crawler taking up massive amounts of space on my HDD." in LUCI:
infection0 replied on February 02, 2009 01:25 to the question "Crawler taking up massive amounts of space on my HDD." in LUCI:
infection0 replied on February 02, 2009 01:15 to the question "Crawler taking up massive amounts of space on my HDD." in LUCI:
I am using the filters supplied on the example website and the stock shouldVisit() method given (with the URL prefix edited to "http://en.wikipedia.org/wiki/"). I have a 16GB crawl folder. The crawler returns results only from Wikipedia pages, so I am pretty sure it is not visiting other sites.
It is not our own files; our logs take up a total of less than 1MB. It is the /craw/frontier folder that is large.
This has been the case for every crawl we have done. In addition, over time the crawl would slow to 5 pages per second. I've noticed several other groups with this problem... I wonder what it could be?
EDIT: I should also note that our CPU usage is very low, so the efficiency of our palindrome detection is probably not the bottleneck. It's more likely the massive amount of reading/writing to the hard drive, which happens for a reason I can't explain... but right now it's too late to restart the crawl anyway. We will end up with about 1/3 of Wikipedia crawled...
infection0 replied on February 01, 2009 08:36 to the question "Don't know how to use the script..." in LUCI:
It's a no-go. Same exception:
Exception in thread "main" java.lang.NoClassDefFoundError: com.sleepycat.je.latch.Latch$JEReentrantLock
at com.sleepycat.je.latch.Latch.<init>(Latch.java:40)
at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:271)
at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:147)
at com.sleepycat.je.Environment.<init>(Environment.java:210)
at com.sleepycat.je.Environment.<init>(Environment.java:150)
at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:58)
at ir.assignment03.Controller.main(Controller.java:18)
Is it because the Java on these machines (family-guy.ics.uci.edu) is weird? Is it because it's 64-bit? Is it because it's a different OS? I don't get it... it runs perfectly fine on the other machines. For now I am using those, but I would like to know if it is possible to use these machines in the future.
infection0 replied on February 01, 2009 04:59 to the question "Don't know how to use the script..." in LUCI:
infection0 asked a question in LUCI on February 01, 2009 04:39:
Don't know how to use the script...
Can you provide a small tutorial on how to get the shell script to run (I am using an ICS CentOS server remotely)? I have done the chmod step, but crawl-log.txt contains the exception shown below the script.
This is the script I am trying to use:
#!/bin/bash
cp="."
for f in $(ls lib/*); do
cp=$cp:$f
done
nohup java -Xmx2048M -classpath $cp ir.assignment03.Controller $1 $2 > crawl-log.txt &
Exception in thread "main" java.lang.NoClassDefFoundError: ir.assignment03.Controller
at gnu.java.lang.MainThread.run(libgcj.so.7rh)
Caused by: java.lang.ClassNotFoundException: ir.assignment03.Controller not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:./,file:lib/commons-codec-1.3.jar,file:lib/commons-httpclient-3.1.jar,file:lib/commons-logging-1.1.jar,file:lib/crawler4j-1.0.5.jar,file:lib/dsiutils-1.0.7.jar,file:lib/fastutil-5.1.5.jar,file:lib/je-3.3.74.jar,file:lib/log4j-1.2.15.jar], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
at java.net.URLClassLoader.findClass(libgcj.so.7rh)
at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
at java.lang.ClassLoader.loadClass(libgcj.so.7rh)
at gnu.java.lang.MainThread.run(libgcj.so.7rh)
I exported a runnable .jar file in Eclipse (output was 15MB) and sent it to the server. It runs fine on Solaris machines, but it throws exceptions on these machines. I am unsure if I need to copy over dependencies or what, but I have no idea where to put them.
----------------------------------------
ALTERNATIVE SCRIPT:
Using this script:
#!/bin/bash
cp="."
for f in $(ls lib/*); do
cp=$cp:$f
done
nohup java -Xmx2048M -classpath .:epiphany.jar ir.assignment03.Controller $1 $2 > crawl-log.txt &
I get:
Exception in thread "main" java.lang.NoClassDefFoundError: com.sleepycat.je.latch.Latch$JEReentrantLock
at com.sleepycat.je.latch.Latch.<init>(Latch.java:40)
at com.sleepycat.je.dbi.EnvironmentImpl.<init>(EnvironmentImpl.java:271)
at com.sleepycat.je.dbi.DbEnvPool.getEnvironment(DbEnvPool.java:147)
at com.sleepycat.je.Environment.<init>(Environment.java:210)
at com.sleepycat.je.Environment.<init>(Environment.java:150)
at edu.uci.ics.crawler4j.crawler.CrawlController.<init>(CrawlController.java:58)
at ir.assignment03.Controller.main(Controller.java:22)
Which is what another student got.
Where do I put the dependencies?
-------------------
So basically, which script is correct and where should I put dependencies?
infection0 replied on January 31, 2009 05:52 to the question "Crap. Of all the things to time out on..." in LUCI:
infection0 asked a question in LUCI on January 31, 2009 05:25:
Crap. Of all the things to time out on...
I happened to see this in console after a particularly long crawl:
ERROR [Crawler 19] (PageFetcher.java:93) - Fatal transport error: Read timed out while fetching http://en.wikipedia.org/wiki/Barack_O...
...what do I do now? Am I screwed out of his docID? Can I just tell you guys the code I would have used to find his docID (it works when I test it)? I cannot afford to restart my crawl at this point.
infection0 replied on January 31, 2009 02:37 to the question "How do I kill my own program????" in LUCI:
infection0 replied on January 31, 2009 02:15 to the question "How do I kill my own program????" in LUCI:
infection0 replied on January 30, 2009 21:50 to the question "How do I kill my own program????" in LUCI:
infection0 asked a question in LUCI on January 30, 2009 21:50:
How do I kill my own program????
It grew into a monster.
I used nohup to run it and now it just WON'T STOP!!! I've tried screwing with it by deleting its frontier queue and files it saves to, but nothing. I don't know how to use the shell to kill it.
infection0 replied on January 30, 2009 19:17 to the question "How do I make my crawl faster?" in LUCI:
Before you updated crawler4j to 1.04, we built our own mechanism for aggregating data. Here's how:
-In the constructor of each "Crawler" class, each crawler "registers" itself in a master list of crawlers. A separate thread iterates through the registered crawlers every 2 minutes, collecting new data.
Should we go back and download 1.04 and use those features instead?
And yeah, we are working on finding out what is wrong with our palindromes.
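The registration scheme described above might look roughly like the sketch below. It is an illustration only, not the code from this post: CrawlerRegistry, CrawlerStats, recordPage, collect, and startCollector are hypothetical names, and the page counter stands in for whatever data each crawler actually gathers.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical master list of crawlers plus a periodic collector. */
public class CrawlerRegistry {
    // Thread-safe list, because crawlers register from their own threads.
    private static final List<CrawlerStats> REGISTERED = new CopyOnWriteArrayList<>();
    private static final AtomicLong TOTAL_PAGES = new AtomicLong();

    /** Hypothetical stand-in for the data one crawler keeps locally. */
    public static class CrawlerStats {
        private final AtomicLong pages = new AtomicLong();

        public CrawlerStats() {
            // Registers itself in the constructor, as the post describes.
            // Publishing "this" here is only safe because all fields are
            // initialized before this line runs.
            REGISTERED.add(this);
        }

        /** Called by the crawler thread after each page it processes. */
        public void recordPage() {
            pages.incrementAndGet();
        }

        /** Returns the count gathered since the last drain and resets it. */
        long drain() {
            return pages.getAndSet(0);
        }
    }

    /** Pulls new data from every registered crawler; returns the running total. */
    public static long collect() {
        for (CrawlerStats stats : REGISTERED) {
            TOTAL_PAGES.addAndGet(stats.drain());
        }
        return TOTAL_PAGES.get();
    }

    /** Starts the separate collector thread, e.g. startCollector(2, TimeUnit.MINUTES). */
    public static ScheduledExecutorService startCollector(long period, TimeUnit unit) {
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        executor.scheduleAtFixedRate(CrawlerRegistry::collect, period, period, unit);
        return executor;
    }
}
```

Draining each counter with getAndSet(0) means a crawler thread never waits on the collector, so the aggregation itself should add very little overhead to the crawl.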
infection0 asked a question in LUCI on January 30, 2009 13:59:
How do I make my crawl faster?
I can only crawl at a max of 10 pages per second. My average is about 7 or 8 (in the beginning, too!).
Currently our structure is to have each crawler maintain its own counts until I ask for them to save and aggregate. Our parsing is done in each individual crawler thread and the 10 largest lipogram/palindrome/rhopalics are kept in each crawler until I ask for them.
This aggregation is performed every 2 minutes. Our parsing is done inside the Crawler class. Would it be faster to move it into another thread somehow? What other tips do you have for making it faster (besides checking over the algorithms we use to parse)?
EDIT:
Oh god, it's gotten much worse after just 30 minutes of crawl. Now the pages fetched per second are just returning binary... 0's and 1's...
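For reference, keeping the 10 largest items per crawler and merging them on request could be sketched as below. This is an assumption about one way to do it, not the code used in this crawl: TopN, offer, mergeFrom, and longestFirst are hypothetical names, and "largest" is taken to mean longest string.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical bounded collection that keeps only the N longest strings seen. */
public class TopN {
    private static final Comparator<String> BY_LENGTH =
            Comparator.comparingInt(String::length);

    private final int capacity;
    // Min-heap on length: the shortest kept string sits at the head,
    // so it is the first one evicted when a longer candidate arrives.
    private final PriorityQueue<String> heap = new PriorityQueue<>(BY_LENGTH);

    public TopN(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Offers one candidate; costs O(log N) at most, independent of pages crawled. */
    public void offer(String candidate) {
        if (heap.size() < capacity) {
            heap.add(candidate);
        } else if (candidate.length() > heap.peek().length()) {
            heap.poll();
            heap.add(candidate);
        }
    }

    /** Folds another crawler's results into this one (the aggregation step). */
    public void mergeFrom(TopN other) {
        for (String s : other.heap) {
            offer(s);
        }
    }

    /** Returns the kept strings, longest first. */
    public List<String> longestFirst() {
        List<String> out = new ArrayList<>(heap);
        out.sort(BY_LENGTH.reversed());
        return out;
    }
}
```

Each crawler thread would own one TopN(10) and call offer as it parses. The aggregating thread would call mergeFrom on a master instance, which needs external synchronization if crawlers are still writing at that moment.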

