
Questions And Answers for Crawling Assignment

The following are questions from one of your classmates, with my answers:

> I have been checking my partial results since my crawler crashed last
> night, and I have some questions about them:
>
> - I found a "palindrome" which is 403 a's, but when I checked the article,
> it wasn't present since it was edited last Tuesday after my crawler took
> the page. Same has happened with some other articles which have/had a
> bunch of letters without a meaning (232 j's or 157 l's). Are they
> considered correct?

These are examples of vandalism on Wikipedia, which tends to be reverted very quickly. Although they satisfy the rules for palindrome extraction, we are looking for interesting palindromes, not trivial ones. You may report these, but it is better to filter out palindromes made up of only a few distinct letters and report more interesting ones.
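
One possible way to do this filtering is sketched below. The function names and the threshold of three distinct letters are my own assumptions for illustration, not part of the assignment specification, so adjust them to whatever you consider "trivial".

```python
def is_trivial_palindrome(text, min_distinct_letters=3):
    """Return True if the palindrome uses fewer distinct letters than the threshold.

    The default threshold of 3 is an assumed value, not taken from the
    assignment specification.
    """
    # Compare letters case-insensitively and ignore spaces and punctuation.
    letters = {ch.lower() for ch in text if ch.isalpha()}
    return len(letters) < min_distinct_letters


def filter_interesting(palindromes, min_distinct_letters=3):
    """Keep only the palindromes that are not trivial, preserving their order."""
    return [
        p for p in palindromes
        if not is_trivial_palindrome(p, min_distinct_letters)
    ]


if __name__ == "__main__":
    candidates = ["a" * 403, "j" * 232, "Was it a car or a cat I saw"]
    print(filter_interesting(candidates))
```

With a check like this, runs such as 403 a's or 232 j's are dropped because they contain a single distinct letter, while ordinary sentence palindromes pass through.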

>
> - I found some rhopalics which are just one very long word. They are also
> unreadable (the longest one is a word which is jg and 119 h's, and it is
> still present in the article
> http://en.wikipedia.org/wiki/Wildlife... ). Some of them are not
> present now in the articles, but they were some days ago. Should I remove
> them and pick the following ones?

As explained in the assignment specification, for rhopalics we count the number of words, not the length of the string.
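
To illustrate the difference, here is a minimal sketch that ranks already-extracted rhopalic candidates by word count. The helper names are hypothetical, and splitting on whitespace is an assumption about tokenization; the assignment specification remains the authority on what counts as a word.

```python
def word_count(candidate):
    """Count words in a candidate string (assumes whitespace-separated words)."""
    return len(candidate.split())


def rank_by_word_count(candidates):
    """Sort candidates so that those with the most words come first.

    String length is deliberately ignored: one very long word still
    counts as a single word.
    """
    return sorted(candidates, key=word_count, reverse=True)


if __name__ == "__main__":
    long_word = "jg" + "h" * 119      # one word, 121 characters
    sentence = "I am the best"        # four words, 13 characters
    for c in rank_by_word_count([long_word, sentence]):
        print(word_count(c), c[:20])
```

Under this ranking, a single 121-character word scores 1 and falls below even a short four-word rhopalic.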

>
> Regardless of this, I am wondering if I should restart my crawler since it
> crashed. I got 260.000 pages in 3.5 days. If I start again, I'm going to
> get almost the same pages, and I can't keep track of the statistics. What
> do you recommend me to do?

I recommend saving your current statistics and trying one more time. If the next results are better, report them; otherwise, report the first results.

>
> By the way, I was wondering how you could crawl the whole wikipedia in 10
> hours. According to my calculations, you must parse more than 100 pages
> per second.

Yes, I processed between 90 and 120 pages per second.
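
For comparison, the arithmetic behind these rates can be checked with a few lines. This is only a back-of-the-envelope sketch using the figures quoted in this thread; the function names are made up for illustration.

```python
SECONDS_PER_HOUR = 3600


def pages_per_second(pages, days):
    """Average crawl rate for a run that fetched `pages` pages in `days` days."""
    return pages / (days * 24 * SECONDS_PER_HOUR)


def pages_crawled(rate, hours):
    """Total pages processed at a steady `rate` (pages per second) for `hours` hours."""
    return rate * hours * SECONDS_PER_HOUR


if __name__ == "__main__":
    # Figures from the question: 260,000 pages in 3.5 days.
    print(round(pages_per_second(260_000, 3.5), 2))
    # Figures from the answer: 90 to 120 pages per second for 10 hours.
    print(pages_crawled(90, 10), pages_crawled(120, 10))
```

At 90 to 120 pages per second, a 10-hour run covers roughly 3.2 to 4.3 million pages, whereas 260,000 pages in 3.5 days works out to a little under one page per second.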

> I think my algorithms are pretty efficient (the lipogram and
> rhopalic function run in O(n), which I think is the minimum; the
> palindrome function is my bottleneck, but I think it couldn't be much more
> efficient), but it takes almost 2 or 3 seconds to parse some large
> documents (the small ones are parsed in some milliseconds).

On my old Pentium 4, the Palindrome Extractor takes about 0.4 seconds on the largest articles. The 10-hour record, however, was set on a much faster machine!