Recent activity
Paul Purtell replied on March 14, 2009 00:35 to the question "Making the Binary File" in LUCI:
Paul Purtell replied on March 14, 2009 00:22 to the question "Missing keyword in the Wikipedia-Top500000.txt file" in LUCI:
It really can't be, since going to http://en.wikipedia.org/wiki/ redirects to the main page. I believe it to be an error, but maybe they threw it in "to build character"...
Paul Purtell replied on March 13, 2009 19:58 to the question "Demos on Monday" in LUCI:
Paul Purtell asked a question in LUCI on March 13, 2009 19:55:
50,000 occurrence and/or 10% of documents + code submission question
A few clarifications: I remember hearing something like "don't include terms that occur in more than 10% of documents", but then I read in the specs not to use terms that occur more than 50,000 times. Should we use both of those in conjunction, or just the 10% rule, since 10% of 500,000 is 50,000?
And for code submission, do you also want the files that we used to generate the .data and offset files, or just the actual data and offset files? I ask because they are quite large. I have a feeling that you would want a workable project, and this point is confusing...
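One way to read the two cutoffs in the question above is as separate tests that a term must both pass. The sketch below is only an illustration of that reading, not the course's required method; `TermFilter`, `TermStats`, and the field names are hypothetical, and the 10% and 50,000 thresholds are simply the figures quoted in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TermFilter {
    /** Per-term statistics; the class and field names are hypothetical. */
    static final class TermStats {
        final long docFrequency;      // number of documents that contain the term
        final long totalOccurrences;  // occurrences summed over the whole collection

        TermStats(long docFrequency, long totalOccurrences) {
            this.docFrequency = docFrequency;
            this.totalOccurrences = totalOccurrences;
        }
    }

    /** Keeps a term only if it passes both the document-fraction and the occurrence cutoff. */
    static boolean keep(TermStats s, long totalDocs, double maxDocFraction, long maxOccurrences) {
        boolean tooCommonByDocs = s.docFrequency > maxDocFraction * totalDocs;
        boolean tooCommonByCount = s.totalOccurrences > maxOccurrences;
        return !tooCommonByDocs && !tooCommonByCount;
    }

    /** Applies the thresholds quoted in the question (10% of documents, 50,000 occurrences). */
    static List<String> filter(Map<String, TermStats> stats, long totalDocs) {
        List<String> kept = new ArrayList<String>();
        for (Map.Entry<String, TermStats> e : stats.entrySet()) {
            if (keep(e.getValue(), totalDocs, 0.10, 50000L)) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

Note that the two tests are not interchangeable: a term found in 40,000 documents (8%) three times each passes the 10% test but has 120,000 occurrences. In the other direction, a term cannot appear in more documents than it has occurrences, so with 500,000 documents anything over the 10% line is also over 50,000 occurrences.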
Paul Purtell replied on March 12, 2009 01:49 to the question "url file" in LUCI:
Paul Purtell asked a question in LUCI on March 11, 2009 00:31:
url file
Could you post the file that has the 500,000 URLs and their associated docids?
Paul Purtell replied on February 19, 2009 00:06 to the question "yet another dfs problem" in LUCI:
Someone in class had recommended using the /tmp directory on the local machine, so my temp dir was set to that instead of /extra/ugrad_space/.... I'm not sure why that would cause a problem, but reverting it to what was originally suggested in the example hadoop-site.xml file has resolved this issue.
Paul Purtell replied on February 18, 2009 03:25 to the question "yet another dfs problem" in LUCI:
No, I didn't change the replication property value from 1.
Here is a snippet from my log file. It seems that the datanode(s) may not be starting up in the first place, but I'm still no closer to solving the problem:
2009-02-17 17:59:07,286 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /extra/ugrad_space/ppurtell/hadoop-0.19.0/tmp/hadoop-ppurtell/dfs/data: namenode namespaceID = 372424651; datanode namespaceID = 1899825713
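The log line above reports that the namenode and datanode disagree about `namespaceID`. As a rough way to confirm such a mismatch by hand, the sketch below compares the IDs from two storage metadata files. It assumes each storage directory keeps a `VERSION` file in Java properties format with a `namespaceID` key; that layout is an assumption here, and `NamespaceIdCheck` and its methods are hypothetical helpers, not part of Hadoop.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

class NamespaceIdCheck {
    /** Extracts namespaceID from the text of a VERSION file (assumed properties format). */
    static String namespaceId(String versionFileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(versionFileContents));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string.
            throw new IllegalStateException(e);
        }
        return props.getProperty("namespaceID"); // null if the key is absent
    }

    /** True only when both files carry the same, non-missing namespaceID. */
    static boolean sameNamespace(String nameNodeVersion, String dataNodeVersion) {
        String nn = namespaceId(nameNodeVersion);
        String dn = namespaceId(dataNodeVersion);
        return nn != null && nn.equals(dn);
    }

    public static void main(String[] args) {
        // The two IDs are the ones reported in the log line; other file fields are omitted.
        String nameNode = "namespaceID=372424651\n";
        String dataNode = "namespaceID=1899825713\n";
        System.out.println("same namespace: " + sameNamespace(nameNode, dataNode));
    }
}
```

In practice the two strings would be read from the namenode's and datanode's own storage directories; where those live depends on the local hadoop-site.xml settings.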
Paul Purtell asked a question in LUCI on February 18, 2009 02:13:
yet another dfs problem
The dfs goes crazy when I try to copy over the input file:
bin/hadoop dfs -copyFromLocal input input
09/02/17 18:10:08 WARN hdfs.DFSClient: NotReplicatedYetException sleeping /user/ppurtell/input retries left 1
09/02/17 18:10:12 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/ppurtell/input could only be replicated to 0 nodes, instead of 1
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1270)
at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892)
at org.apache.hadoop.ipc.Client.call(Client.java:696)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216)
at $Proxy0.addBlock(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2815)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2697)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:1997)
at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2183)
A comment on the question "How do I kill my own program????" in LUCI:
Sorry, to exit prstat press q; otherwise you will be unable to enter commands until you do. – Paul Purtell, on January 31, 2009 02:25
Paul Purtell replied on January 31, 2009 02:23 to the question "How do I kill my own program????" in LUCI:
Paul Purtell replied on January 29, 2009 07:38 to the question "Program slowing down over time..." in LUCI:
Paul Purtell asked a question in LUCI on January 29, 2009 06:50:
Program slowing down over time...
My program was processing about 27 pages per second in the beginning. I think this is a reasonable rate, but after running for about 12 hours it is now running at about 4 pages per second, and has been for a while. When I look at all the processes, I find that there are no competing processes and no real reason why the program would slow down. My approach doesn't use any sort of large data structure that might get bogged down with size. I am noticing that in the process list, the CPU usage of my program is now very low, about 1% or lower, and the process is "sleeping" a lot. CPU usage used to be quite high. Is anyone else getting a similar problem?
Is it possible that when a thread dies it is recreated but stops working so that eventually all threads are doing nothing?
The crawler keeps sending messages to the console that basically say "the crawl is finished, type yes to quit". I ignore these because I can see that there are still valid links that haven't been visited yet.
Using crawler4j 1.0.2
on openlab.ics.uci.edu
with "nohup java -Xmx2048M Controller > status &"
there are 20 crawlers specified....
A comment on the question "shell script not working. `$' unexpected" in LUCI:
I'm pretty sure the limiting factor (at home) for most people is going to be the bandwidth. Most broadband connections are around 1.5 Mbit/s, which is easily saturated by this crawler. – Paul Purtell, on January 29, 2009 06:35
Paul Purtell asked a question in LUCI on January 16, 2009 05:55:
different openlab servers, same program, fails to run on one, but runs fine on other
I noticed that different servers on openlab seem to have different settings or something like that. I tested my program, and it runs fine on "godzilla.ics.uci.edu", but when I tried the exact same program on "family-guy.ics.uci.edu" it failed to run, so I thought I should make you aware of this anomaly. Also, I had to make sure it's compatible with Java 1.5, at least for godzilla, because they don't have 1.6 installed on openlab yet.
This is Paul Purtell
Is there a problem with thread safety if we are not careful in our coding, given that it is running on multiple cores?
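To make the thread-safety concern concrete, here is a minimal sketch of state shared between worker threads, protected with `AtomicInteger`. `PageCounter` and its methods are hypothetical names for illustration and are not taken from crawler4j or any course code. A plain `int` field incremented with `++` from several threads can lose updates, which is the kind of bug the question is asking about.

```java
import java.util.concurrent.atomic.AtomicInteger;

class PageCounter {
    // AtomicInteger makes each increment a single indivisible operation.
    private final AtomicInteger pages = new AtomicInteger();

    void recordPage() {
        pages.incrementAndGet();
    }

    int total() {
        return pages.get();
    }

    /** Starts the given number of threads, each recording pagesPerThread pages, and returns the total. */
    static int countWithThreads(int threads, final int pagesPerThread) {
        final PageCounter counter = new PageCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(new Runnable() {
                public void run() {
                    for (int j = 0; j < pagesPerThread; j++) {
                        counter.recordPage();
                    }
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join(); // wait so that every increment is visible before reading the total
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return counter.total();
    }
}
```

With the atomic counter, `PageCounter.countWithThreads(20, 10000)` returns exactly 200000 on every run; swapping in an unsynchronized `int` would make the result vary from run to run on a multi-core machine.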
