Question for URLs
For assignment06, we need to submit the URLs of the pages that contain one member's first name. Do we need to post the URL or just the docid? The URLs are stored on the Hadoop server, so to get them I would have to either dump all of input500000 to my local machine and process it there, or rerun the job to get the URLs for the matched docids.
Thanks,
Art
-
See, I did it like this:
if(URL.toLowerCase().contains(FIRSTNAME.toLowerCase()))
...blahblah,
But, I can't save the file on the dfs :(
So I have no idea how to do it.
I’m confused
-
Oh. Stupid me. It saves into my /extra/ugrad_space/ folder.
-
If the requirement is to find the URLs that contain the first name, then I have a huge problem right now, since I didn't store any record of the URLs that contain my first name. I thought the requirement was to list the URLs whose contents contain the first name :(. Actually, I have a weird first name, so if I search for URLs that contain it, I can guarantee there are none :(
To get the result, I wrote another program that analyzes the output of the Hadoop process, which is in term, cf, df, docids format, and generates report.txt and postinglist.txt as the requirements specify. I am not sure if this is correct, but it might at least give you some idea.
Art
I’m unsure
-
HMM I am also unsure now of what they meant:
"List of (at most 1000) URLs containing first name of member 1 of your team"
Do they mean "/wiki/Barack_Obama" if my name was Barack, or do they mean the text...
-
It means searching the content of the page at that URL (Isn't it obvious? Otherwise there was no reason for crawling and indexing!).
You can also report only the docid.
-
Oh, so basically we just find the spot in our input file that has our name as a term? OH. Wow, I did a lot of unnecessary stuff.
-
I am not sure if I explained it clearly! So again, this is the normal process:
1) Crawl the pages.
2) Create the posting lists.
3) In the generated posting lists, find the line associated with the term that is your name and report the list of docids in that line.
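For step 3, here is a rough sketch of the lookup in Java. The line layout (term, cf, df, then comma-separated docids, all tab-separated) is only an assumption based on the format mentioned earlier in this thread, and the class and method names (PostingLookup, docIdsFor) are made up for illustration, so adjust them to whatever your own job actually writes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class PostingLookup {

    // Assumed (hypothetical) line layout: term<TAB>cf<TAB>df<TAB>docid,docid,...
    // Returns at most `limit` docids for the first line whose term matches,
    // ignoring case; returns an empty list if the term is not in the index.
    static List<String> docIdsFor(List<String> postingLines, String term, int limit) {
        for (String line : postingLines) {
            String[] fields = line.split("\t");
            if (fields.length < 4 || !fields[0].equalsIgnoreCase(term)) {
                continue; // malformed line or a different term
            }
            List<String> docIds = new ArrayList<>();
            for (String id : fields[3].split(",")) {
                String trimmed = id.trim();
                if (!trimmed.isEmpty() && docIds.size() < limit) {
                    docIds.add(trimmed);
                }
            }
            return docIds;
        }
        return new ArrayList<>();
    }

    public static void main(String[] args) {
        // Made-up sample lines, not real assignment output.
        List<String> sample = Arrays.asList(
                "art\t3\t2\t12,57",
                "barack\t5\t3\t4,9,31");
        System.out.println(docIdsFor(sample, "Barack", 1000));
    }
}
```

The limit argument is there because the assignment text asks for at most 1000 entries. In practice you would read the lines from your posting list output file instead of a hard-coded list.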
2 people say
this answers the question
-
Oh yeah, that's what I meant. :) I typed "input" instead of "output" for some reason.