2020. 7. 31.

In this lecture, we will learn about TF-IDF.




As we learned in the last lecture, the Vector Space Model (aka Bag-of-Words) works as follows.

Vector Space Model

A document (datapoint) is a vector of counts, one per word (feature).

Vd is just a histogram over words.


n( · ) counts the number of occurrences.
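
To make the count vector concrete, here is a minimal sketch. The function name count_vector, the whitespace tokenizer, and the toy vocabulary are assumptions for illustration and are not taken from the lecture.

```python
from collections import Counter


def count_vector(document, vocabulary):
    """Return [n(w_1, d), ..., n(w_T, d)] for the words in `vocabulary`."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    counts = Counter(document.lower().split())
    # Counter returns 0 for words that never occur in the document.
    return [counts[word] for word in vocabulary]


vocabulary = ["cat", "dog", "the"]
v_d = count_vector("the cat saw the dog", vocabulary)
print(v_d)  # [1, 1, 2]
```

Words outside the vocabulary ("saw" here) are simply ignored, which is why the vector is a histogram over the chosen words only.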





What is the similarity between two documents?

We can use any distance, but the cosine distance is fast to compute.

cosine distance
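
As a rough sketch of this comparison, the function below computes the cosine of the angle between two count vectors (their dot product divided by the product of their norms). The name cosine_similarity and the choice to return 0.0 for an all-zero vector are assumptions made here. If an actual distance is needed, one common convention is 1 minus this value.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Assumption: treat an empty document (all-zero vector) as similar to nothing.
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Orthogonal vectors share no words, so their similarity is zero.
print(cosine_similarity([1, 0], [0, 1]))  # 0.0
```

Because the value depends only on the angle, a document and a copy of it repeated twice point in the same direction and get a similarity of 1 (up to floating-point rounding).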


But not all words are created equal.






TF-IDF: Term Frequency Inverse Document Frequency


We weigh each word by a heuristic.




So we can weight the word counts by TF-IDF.
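
The sketch below shows one common variant of this weighting, assumed here because the exact formula is not spelled out in the text: each count n(w, d) is multiplied by log(D / df(w)), where D is the number of documents and df(w) is the number of documents containing w. The function name tfidf_vectors, the whitespace tokenizer, and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents, vocabulary):
    """Weight each word count by log(D / df(w)), one common TF-IDF variant."""
    # Assumption: lowercase and split on whitespace as a stand-in tokenizer.
    tokenized = [doc.lower().split() for doc in documents]
    num_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = {w: sum(1 for tokens in tokenized if w in tokens) for w in vocabulary}
    # Assumption: a word that appears in no document gets weight 0.
    idf = {
        w: math.log(num_docs / doc_freq[w]) if doc_freq[w] else 0.0
        for w in vocabulary
    }
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[w] * idf[w] for w in vocabulary])
    return vectors


docs = ["the cat", "the dog", "the the bird"]
vectors = tfidf_vectors(docs, ["the", "cat"])
print(vectors[0])
```

In this toy corpus "the" appears in every document, so its weight is log(3/3) = 0 and it drops out of every vector, while "cat" appears in only one document and keeps a weight of log(3).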

