Keeping count...
How do I keep track of the CF, DF, and number of times each term occurs in a document? I'm not sure where to put it, or if I need a separate class. I am thinking of having a separate class with a hashmap containing keys of terms and counts, but that seems like a huge bottleneck.
-
You don't need a separate class.
Just count them in your reduce function.
The output of reduce is a string which has all the data that you collected in it.
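To make "count them in your reduce function" concrete, here is a minimal plain-Java sketch of the counting that one reduce call could do for a single term. It assumes CF means collection frequency (total occurrences) and DF means document frequency (distinct documents), and that reduce receives one docID per occurrence. `PostingCounter`, `summarize`, and the output format are made up for illustration and are not part of Hadoop or the assignment.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of Hadoop: the counting a reduce call could do
// for one term, given every docID that map emitted for that term.
class PostingCounter {
    static String summarize(List<String> docIDs) {
        // Occurrences of the term per document, sorted by docID for stable output.
        Map<String, Integer> perDoc = new TreeMap<>();
        int cf = 0; // assumed meaning: total occurrences across all documents
        for (String docID : docIDs) {
            perDoc.merge(docID, 1, Integer::sum);
            cf++;
        }
        int df = perDoc.size(); // assumed meaning: number of distinct documents

        StringBuilder sb = new StringBuilder();
        sb.append("CF=").append(cf).append(" DF=").append(df).append(" [");
        boolean first = true;
        for (Map.Entry<String, Integer> e : perDoc.entrySet()) {
            if (!first) {
                sb.append(", ");
            }
            sb.append(e.getKey()).append(" (").append(e.getValue()).append(")");
            first = false;
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // Prints: CF=5 DF=2 [24 (2), 48 (3)]
        System.out.println(summarize(Arrays.asList("24", "48", "24", "48", "48")));
    }
}
```

In a real job the same loop would run over the values iterator inside reduce, and the resulting string would be collected once per term.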
-
Hmm, my problem is that I don't have a good idea of where or how I should collect the data. My current methods are really "hacky" and cause my program to fail.
-
In a general sense - not in your particular case.
-
Map class - Downloads the text, parses it into tokens,
output.collect(token, new Text(docID + ""));
I think I have Map down and I am just messing with Reduce.
I used to filter out duplicate words in the map function so that only the first occurrence of a term in a document is added to the output collector.
We are considering removing that and counting the number of duplicate docIDs in the Reduce function... but for some reason I am having an ungodly amount of trouble with that. You want us to count the number of occurrences of a term for each docID, right? I am trying to tack that number onto the end of the docID before collection, but I think that is screwing everything up when it calls reduce again.
I have no idea why but my output looks like this:
Term (0) (0) [ ]
when without the changes (before I tried to add this stuff) it looks like:
Term [24, 48, ...]
This is how I'm trying to format the final output:
Term (3 occurrences) [24 (2), 48(3), ...]
I am doing this by adding the CF to the Term text, and adding the DF into each docID collected. I think this is a weird way of doing it, but I don't know what else to do.
Here's a small code snippet from Reduce that illustrates the current state of my program:
ArrayList<String> docIDOutputs = new ArrayList<String>();
String docIDPlusCount = docID + " (" + documentTermCount + ")";
docIDOutputs.add(docIDPlusCount);
String outputString = (iterating through the list of docIDs and adding them to the string one by one)
String termPlusCorpusCount = term.toString() + corpusTermCount;
output.collect(new Text(termPlusCorpusCount), new Text(outputString));
I know I'm doing something wrong, I just don't know how to do it right...
-
For starters, you should output all of the tokens that you parse in map as <term, docID> pairs.
That way you can count them in reduce.
This is different than assignment 5 where all we cared about was the existence of a word. Now we care about how many there are.
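Secondly, make sure you aren't setting a Combiner class
(don't do this: job.setCombinerClass(MyJob.MyReducer.class);)
Although a combiner can be used as an optimization, it's too confusing for a first-time user of Hadoop. When the reducer doubles as the combiner, its output is fed back into reduce as input, so reduce would have to accept its own output format.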
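A small plain-Java simulation of why reusing the reducer as a combiner can mangle the counts. No Hadoop classes are involved; `CombinerDemo` and its `reduce` method are hypothetical stand-ins that count values and emit "value (count)" strings, roughly like the snippet above. Running the same logic twice models a combiner pass followed by the real reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java simulation only; this is not the Hadoop API.
class CombinerDemo {
    // Hypothetical reduce logic: count how often each value occurs and
    // emit one "value (count)" string per distinct value.
    static List<String> reduce(List<String> values) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String v : values) {
            counts.merge(v, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            out.add(e.getKey() + " (" + e.getValue() + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> mapOutput = Arrays.asList("24", "24", "48");
        // No combiner: reduce sees raw docIDs.
        // Prints: [24 (2), 48 (1)]
        System.out.println(reduce(mapOutput));
        // Reducer reused as combiner: reduce sees already-formatted strings
        // and counts each of them as a new "docID".
        // Prints: [24 (2) (1), 48 (1) (1)]
        System.out.println(reduce(reduce(mapOutput)));
    }
}
```

The second line shows the formatted strings being treated as fresh docIDs, which is the kind of doubled-up output described earlier in the thread. A combiner is only safe when its output has the same form as the map output that reduce expects.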
-
Thanks! That job.setCombinerClass() was the thing that was doing me in. It was in the code example so I used it not thinking about what it did. Now to see if I can start this thing...