I made a diagram.
For the stupid people like me. Actually it's more like a cheat sheet.
It would be nice if someone could verify this is the correct idea. :)
-
Even more readable version:
query: 'hot dog'
tf_idf(word) = tf * log(N/df)
q = query vector: [tf_idf(hot), tf_idf(dog)] = [2, 3]
d = doc vector: [tf_idf(hot), tf_idf(dog)] = [8, 12]
similarity score = ((q['hot'] * d['hot']) + (q['dog'] * d['dog'])) = ((2*8)+(3*12)) = 52
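In case it helps to see the arithmetic run, here is a rough Python sketch of the example above. The function names `tf_idf` and `dot` are made up for illustration, the natural log is an assumption (the formula above does not say which base), and the weights 2, 3, 8, 12 are simply copied from the example rather than computed from real counts.

```python
import math

def tf_idf(tf, df, n_docs):
    """tf * log(N/df), as in the formula above (natural log is an assumption)."""
    return tf * math.log(n_docs / df)

def dot(q, d):
    """Sum of q[term] * d[term] over the terms that appear in both vectors."""
    return sum(weight * d[term] for term, weight in q.items() if term in d)

# Weights copied straight from the example, not derived from real counts.
q = {"hot": 2, "dog": 3}    # query vector
d = {"hot": 8, "dog": 12}   # doc vector

score = dot(q, d)
print(score)  # 52, i.e. (2*8) + (3*12)
```

Storing the vectors as dicts keyed by term means a term missing from the document just contributes nothing to the score.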
Hope this is a) right and b) helps someone.
-
Is that right? That's not how I thought I was going to do it. The book and slides are so confusing with their equations. :/
-
Yes, I agree, figuring out how to actually do the scoring was a pain. How I explained it is (if I remember correctly) how we implemented it, and the scoring is actually useful in that the most relevant documents come out on top.
If I am wrong and confusing you even more, sorry. Taking the same example, how were you going to do it?
-
Mmk, here is how I thought things were going to go:
For each term in the query...
    CALCULATE: IDF of the term.
    for each <docid>=<tf> in the postings list...
        CALCULATE: TF-IDF of the term/document.
        SUM: Score(docid) += TF-IDF
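The loop above could be sketched in Python roughly like this. It is only a sketch under assumptions: `score_query`, `postings`, and `n_docs` are made-up names, the index is a toy dict mapping term -> {docid: tf}, df is taken to be the length of the term's postings list, and the log is the natural log.

```python
import math
from collections import defaultdict

def score_query(query_terms, postings, n_docs):
    """Return (docid, score) pairs, highest score first.

    postings maps term -> {docid: tf} (hypothetical toy index layout).
    """
    scores = defaultdict(float)
    for term in query_terms:
        plist = postings.get(term)
        if not plist:
            continue  # term is not in the index, nothing to add
        # IDF of the term; df assumed to be the postings list length
        idf = math.log(n_docs / len(plist))
        for docid, tf in plist.items():
            # TF-IDF of the term/document, summed into the doc's score
            scores[docid] += tf * idf
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy index, invented for illustration only.
postings = {
    "hot": {1: 2, 2: 1},
    "dog": {1: 3},
}
ranking = score_query(["hot", "dog"], postings, n_docs=4)
print(ranking)  # doc 1 comes out ahead of doc 2
```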
So Score(docid) would get added to once for each term in the query, and at the end the docid with the highest score would be at the top?
Is that right? Or am I missing this whole big thing with vectors?
Oh and how do I calculate a query vector and document vector?
"The score of a document d is the
sum, over all query terms, of the number of times each of the query terms
occurs in d. We can refine this idea so that we add up not the number of
occurrences of each query term t in d, but instead the tf–idf weight of each
term in d."
I’m regretting ditching math
-
Oh lol, the score is the sum, over the query terms, of the product of the term's TF-IDF in the document AND its TF-IDF in the query.
SUM: Score(docid) += TF-IDF(term in doc) * TF-IDF(term in query).
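Written out as a formula, that running sum looks like the following. The symbols $w_{t,q}$ and $w_{t,d}$ are my own shorthand (not taken from the book or slides) for the TF-IDF weight of term $t$ in the query and in the document; the second line just plugs in the numbers from the 'hot dog' example.

```latex
\[
\mathrm{score}(q, d) = \sum_{t \in q} w_{t,q} \, w_{t,d},
\qquad
w_{t,x} = \mathrm{tf}_{t,x} \cdot \log\frac{N}{\mathrm{df}_t}
\]
\[
\mathrm{score}(\text{hot dog}, d) = (2 \cdot 8) + (3 \cdot 12) = 52
\]
```

So adding to Score(docid) once per query term gives the same number as taking the dot product of the query vector and the doc vector.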
Man that was easy. Stupid me. Thank you Nathan for helping me out here. You were absolutely right. I just had to stare at it long enough to get it.
I’m stupid