Questions about assignment 3
Hi, I have some questions regarding the third assignment. Maybe they will be covered next Monday during discussion, but I'd like to spend the weekend doing this assignment. My questions are the following:
- We should crawl the pages that match the regular expression, but should we also check that they are inside Wikipedia, i.e. that the URL starts with http://en.wikipedia.org?
- Is there a method in the libraries to get the size of the content downloaded from a site? And the size of the text? My idea is simply to count one byte for each character in the text (as well as in the HTML).
- How can we restore the process if the crawling fails?
- Do we have to save all the doc ids? I know we only need the one for the article about Obama, but you may want more for the fourth assignment.
- Do we have to change anything in the way we measure the length of palindromes, lipograms and rhopalics?
- You said we will receive a grade according to the longest known sequence. Found by whom? By you or by our peers? If it's your sequence, could you give us those lengths so that we know when we may stop crawling?
So far, I think those are all my questions. Thanks in advance.
- The regular expression for assignment 03 would be something like http://en.wikipedia.org/wiki/* (so you don't need a separate check).
- Just count the bytes.
- I will explain this on Monday.
- No, just save the one for Obama page.
- No
- The longest known palindrome will be determined after the crawls are finished, based on submissions by your friends and on those found in our previous crawls. The goal is for you to crawl all of Wikipedia, not to stop the crawl when you find a long palindrome.
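For the first two answers, here is a minimal sketch of what the URL check and the byte count could look like in plain Java. The class and method names (`CrawlChecks`, `isEnglishWikipediaArticle`, `byteLength`) are made up for illustration, the pattern is only one reading of "something like http://en.wikipedia.org/wiki/*", and UTF-8 is an assumption about the encoding. Note that one byte per character only holds for ASCII text, so counting encoded bytes is safer than counting characters.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper class; not part of crawler4j or the assignment code.
final class CrawlChecks {
    // Assumed interpretation of the pattern in the answer: any URL under
    // http://en.wikipedia.org/wiki/ (the dots are escaped so they match literally).
    private static final Pattern EN_WIKI =
            Pattern.compile("^http://en\\.wikipedia\\.org/wiki/.*");

    // True if the URL is under the English Wikipedia /wiki/ path.
    static boolean isEnglishWikipediaArticle(String url) {
        return EN_WIKI.matcher(url).matches();
    }

    // Size of a string in bytes, assuming it is encoded as UTF-8.
    // Non-ASCII characters take more than one byte each.
    static int byteLength(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // prints "true"
        System.out.println(isEnglishWikipediaArticle("http://en.wikipedia.org/wiki/Barack_Obama"));
        // prints "5": three ASCII letters plus a two-byte e-acute
        System.out.println(byteLength("caf\u00e9"));
    }
}
```

If the library already hands you the downloaded content as a byte array, the length of that array is the size of the page as downloaded, and a helper like `byteLength` is only needed for the extracted text.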
Restoring the process requires telling crawler4j to pick up where it left off, but also periodically saving your own statistics. Yasser will clarify on Monday.
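Until that is clarified, here is a rough sketch of the "periodically saving your own statistics" half, using only the Java standard library. Everything in it is an assumption for illustration: the class name `CrawlStats`, the two counters, and the properties-file format are not from the assignment. On the crawler side, crawler4j's `CrawlConfig` has a `setResumableCrawling(true)` option as far as I know, but check the documentation of the version you are using.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Properties;

// Hypothetical statistics holder that can be checkpointed to disk.
final class CrawlStats {
    long pagesVisited;
    long totalBytes;

    // Called once per crawled page; synchronized because crawlers
    // typically run several worker threads.
    synchronized void record(int pageBytes) {
        pagesVisited++;
        totalBytes += pageBytes;
    }

    // Write the counters to a temporary file, then move it over the real
    // one, so a crash during the write does not corrupt the last checkpoint.
    synchronized void save(Path file) throws IOException {
        Properties p = new Properties();
        p.setProperty("pagesVisited", Long.toString(pagesVisited));
        p.setProperty("totalBytes", Long.toString(totalBytes));
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (OutputStream out = Files.newOutputStream(tmp)) {
            p.store(out, "crawl statistics checkpoint");
        }
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING);
    }

    // Load the last checkpoint, or start from zero if none exists.
    static CrawlStats load(Path file) throws IOException {
        CrawlStats s = new CrawlStats();
        if (Files.exists(file)) {
            Properties p = new Properties();
            try (InputStream in = Files.newInputStream(file)) {
                p.load(in);
            }
            s.pagesVisited = Long.parseLong(p.getProperty("pagesVisited", "0"));
            s.totalBytes = Long.parseLong(p.getProperty("totalBytes", "0"));
        }
        return s;
    }
}
```

The idea would be to call `CrawlStats.load(...)` once at startup, `record(...)` for each visited page, and `save(...)` every few hundred pages, so that a restarted crawl continues from the last saved counts instead of from zero.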