Questions And Answers for Crawling Assignment
The following are questions from one of your classmates, with my answers:
> I have been checking my partial results since my crawler crashed last
> night, and I have some questions about them:
>
> - I found a "palindrome" which is 403 a's, but when I checked the article,
> it wasn't present since it was edited last Tuesday after my crawler took
> the page. Same has happened with some other articles which have/had a
> bunch of letters without a meaning (232 j's or 157 l's). Are they
> considered correct?
These are examples of vandalism on Wikipedia, which usually gets reverted very quickly. Although they match the rules for palindrome extraction, we are looking for interesting palindromes, not trivial ones. You can report these, but it is better to filter out those made up of only a few distinct letters and report more interesting ones.
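One way to do that filtering is to count the distinct letters in each candidate. The following is only a minimal sketch: the class name `PalindromeFilter` and the threshold of three distinct letters are assumptions made for illustration, not part of the assignment specification.

```java
import java.util.HashSet;
import java.util.Set;

final class PalindromeFilter {
    // Hypothetical threshold: fewer than 3 distinct letters counts as trivial.
    static final int MIN_DISTINCT_LETTERS = 3;

    // Counts distinct letters, ignoring case, digits, spaces and punctuation.
    static int distinctLetters(String s) {
        Set<Character> seen = new HashSet<Character>();
        for (int i = 0; i < s.length(); i++) {
            char c = Character.toLowerCase(s.charAt(i));
            if (Character.isLetter(c)) {
                seen.add(c);
            }
        }
        return seen.size();
    }

    // True for candidates such as a run of 403 a's.
    static boolean isTrivial(String palindrome) {
        return distinctLetters(palindrome) < MIN_DISTINCT_LETTERS;
    }
}
```

With this threshold, a string of 403 a's is rejected, while "A man, a plan, a canal: Panama" (six distinct letters) is kept.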
>
> - I found some rhopalics which are just one very long word. They are also
> unreadable (the longest one is a word which is jg and 119 h's, and it is
> still present in the article
> http://en.wikipedia.org/wiki/Wildlife... ). Some of them are not
> present now in the articles, but they were some days ago. Should I remove
> them and pick the following ones?
As explained in the assignment specification, for rhopalics we count the number of words, not the length of the string.
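To illustrate scoring by word count, here is a rough sketch. It assumes a rhopalic is a run of words in which each word is exactly one letter longer than the previous one, and it splits words on whitespace only; both are simplifications of mine, so check them against the specification. The class and method names are hypothetical.

```java
final class RhopalicScore {
    // Returns the number of words in the longest run where each word is
    // exactly one character longer than the word before it.
    static int longestRhopalicWordCount(String text) {
        String[] words = text.trim().split("\\s+");
        if (words.length == 1 && words[0].isEmpty()) {
            return 0; // no words at all
        }
        int best = 1;
        int run = 1;
        for (int i = 1; i < words.length; i++) {
            if (words[i].length() == words[i - 1].length() + 1) {
                run++;
            } else {
                run = 1; // the chain is broken, start again from this word
            }
            best = Math.max(best, run);
        }
        return best;
    }
}
```

Under this scoring, one very long word scores 1 no matter how many characters it has, while "I am the best" scores 4.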
>
> Regardless of this, I am wondering if I should restart my crawler since it
> crashed. I got 260.000 pages in 3.5 days. If I start again, I'm going to
> get almost the same pages, and I can't keep track of the statistics. What
> do you recommend me to do?
I recommend saving your current statistics and trying one more time. If the next results are better, report them; otherwise, report the first results.
>
> By the way, I was wondering how you could crawl the whole Wikipedia in 10
> hours. According to my calculations, you must parse more than 100 pages
> per second.
Yes, I processed between 90 and 120 pages per second.
> I think my algorithms are pretty efficient (the lipogram and
> rhopalic function run in O(n), which I think is the minimum; the
> palindrome function is my bottleneck, but I think it couldn't be much more
> efficient), but it takes almost 2 or 3 seconds to parse some large
> documents (the small ones are parsed in some milliseconds).
On my old Pentium 4, the Palindrome Extractor takes about 0.4 seconds on the largest articles. But the 10-hour record was set on a much faster machine!
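For comparison, a common baseline for palindrome extraction is the expand-around-center technique. The sketch below is not the Palindrome Extractor mentioned above, only an illustration; it assumes the text has already been normalized (for example lowercased, with spaces and punctuation stripped, if the assignment's rules ignore them), and the class name is hypothetical.

```java
final class LongestPalindrome {
    // Returns the longest palindromic substring of s.
    // Each of the 2n-1 centers (on a character or between two characters)
    // is expanded outward while the two ends match.
    static String longest(String s) {
        int bestStart = 0;
        int bestLen = 0;
        for (int center = 0; center < 2 * s.length() - 1; center++) {
            int left = center / 2;
            int right = left + center % 2;
            while (left >= 0 && right < s.length()
                    && s.charAt(left) == s.charAt(right)) {
                left--;
                right++;
            }
            int len = right - left - 1; // both ends overshot by one
            if (len > bestLen) {
                bestLen = len;
                bestStart = left + 1;
            }
        }
        return s.substring(bestStart, bestStart + bestLen);
    }
}
```

This runs quickly on ordinary prose because most centers stop expanding after a character or two. Its worst case is quadratic, and it is triggered by exactly the kind of vandalism discussed earlier (hundreds of identical letters). Manacher's algorithm brings the worst case down to linear time if that matters.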
-
Another question:
>Does the String text = page.getText() method read the text files in
>UTF-8 format? I've been getting some palindrome and rhopalic results
>which contain a lot of question marks, and when I go to that page on
>Wikipedia, that line of text is, for example, in Russian. However,
>when I copy and paste those lines from Wikipedia to read them from
>file and test this with my Lipogram program from Assignment 2, the
>correct lipogram is found (without those Russian or question mark
>characters). Any ideas?
Yes, page.getText() returns the text as a Unicode string. But I guess you're dumping the text to a file without saving it in UTF-8 format. That converts every non-ASCII character to a question mark when saving.
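If you do want the dumped files to keep such characters, name the encoding explicitly when opening the file. This is a minimal sketch using only standard java.io classes; the class `Utf8Dump` and its method names are made up for illustration.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;

final class Utf8Dump {
    // Writes text as UTF-8 instead of the platform default encoding.
    static void write(File file, String text) throws IOException {
        Writer out = new OutputStreamWriter(new FileOutputStream(file), "UTF-8");
        try {
            out.write(text);
        } finally {
            out.close();
        }
    }

    // Reads the whole file back, decoding it as UTF-8.
    static String read(File file) throws IOException {
        Reader in = new InputStreamReader(new FileInputStream(file), "UTF-8");
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } finally {
            in.close();
        }
    }
}
```

A plain FileWriter uses the platform default encoding, which is typically where the question marks come from.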
Anyway, palindromes and rhopalics containing non-ASCII characters don't match our rules and are not accepted, so you don't need to worry about encodings.
-
Great questions!
Feel free to post questions directly on Get Satisfaction (I presume these were emailed to Yasser).
-
I've encountered this error while attempting to run my
jar'd webcrawler, but ONLY on /extra/ugrad_space. When I run the jar on my
local machine I do not encounter any errors.
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
    at java.lang.ClassLoader.defineClass1(Native Method)
    at java.lang.ClassLoader.defineClass(ClassLoader.java:620)
    at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
    at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
    at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
    at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:268)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
Any ideas why I might be getting this error? I changed my Java compiler to several different versions in an attempt to match the open lab system, but that did not work, while the jar still runs on my home system. It is not a matter of declaring the proper location for the crawl; I change it for each location I run it on.
-
Open lab machines need JDK 1.5. I guess you have JDK 1.6 on your home machine. You can download 1.5 and configure Eclipse to compile and build your jar file with that version.
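To check which version a class was compiled for, you can read the header of the .class file: after the 4-byte magic number 0xCAFEBABE come a 2-byte minor and a 2-byte major version, where major 49 means Java 5 (1.5) and 50 means Java 6 (1.6). The sketch below is illustrative only; the class name `ClassVersion` is hypothetical.

```java
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;

final class ClassVersion {
    // Reads the major version from the start of a .class file stream.
    // 49 = Java 5 (1.5), 50 = Java 6 (1.6).
    static int majorVersion(InputStream raw) throws IOException {
        DataInputStream in = new DataInputStream(raw);
        if (in.readInt() != 0xCAFEBABE) {
            throw new IOException("not a Java class file");
        }
        in.readUnsignedShort();        // minor version, not needed here
        return in.readUnsignedShort(); // major version
    }
}
```

Pass it a FileInputStream opened on one of the .class files extracted from your jar. If it returns 50, the classes were built for Java 6, and a 1.5 JVM will refuse to load them with the error above.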
-
There is also a setting in the Eclipse project settings to build for 1.5.
I don't think that requires downloading anything new.
-
Thank you!
After installing the entire JDK 1.5 and rebuilding the project, I successfully crawled on openlab.
I'm happy
-
Prof. Patterson is right: an easier way is to go to Project > Properties, select Java Compiler on the left side, then set the JDK compiler compliance level to 1.5. I ran into the same problem and that fixed it.