Reference: TF-IDF

October 11, 2020

Term frequency-inverse document frequency (TF-IDF) is a means of assigning weight to a search term when comparing individual documents within a corpus. TF-IDF is an improvement on the bag-of-words model in that it considers the relative rarity of a term within a larger corpus.

TF-IDF consists of two parts, the first of which is term frequency, denoted $tf_{t,d}$, where $t$ is a term and $d$ is a document. By calculating $tf_{t,d}$ for a single term across all documents, one can get a basic sense of which documents are most associated with the term. However, some terms carry little relative meaning because they appear consistently across all documents in the corpus. As a result, term frequency alone can be of limited use.
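As a rough illustration of that limitation, here is a minimal sketch that assumes term frequency is a raw count of whitespace-separated, lowercased tokens. The corpus, the document names, and the `term_frequency` helper are all hypothetical and exist only for this example.

```python
from collections import Counter

# Hypothetical toy corpus; names and contents are illustrative only.
corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "the bird sang",
}


def term_frequency(term, document):
    """Raw count of `term` among the lowercased, whitespace-split tokens of `document`."""
    return Counter(document.lower().split())[term]


# Term frequency of one term across every document in the corpus.
tf_cat = {name: term_frequency("cat", text) for name, text in corpus.items()}
tf_the = {name: term_frequency("the", text) for name, text in corpus.items()}

print(tf_cat)  # {'doc1': 1, 'doc2': 1, 'doc3': 0}
print(tf_the)  # {'doc1': 2, 'doc2': 2, 'doc3': 1}
```

In this sketch "the" scores at least as high as "cat" in every document even though it says little about what any document is about.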

The second element of TF-IDF is inverse document frequency. The purpose of this measure is to assess the importance of a term in a particular document rather than simply its frequency. The document frequency, $df_t$, is the number of documents in a collection that contain the term $t$ at least once. The inverse document frequency is set on a logarithmic scale such that the $idf_t$ of a particularly rare term is high and the $idf_t$ of a common term is low. $idf_t$ is calculated with the formula below, where $N$ is the number of documents in the corpus.

$$idf_t = \log\left(\frac{N}{df_t}\right)$$
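
As a quick worked example, assume a base-10 logarithm (the formula itself does not fix a base) and a hypothetical corpus of $N = 1000$ documents. A rare term that appears in only 10 documents and a common term that appears in all 1000 would then score:

$$idf_{\text{rare}} = \log_{10}\left(\frac{1000}{10}\right) = 2, \qquad idf_{\text{common}} = \log_{10}\left(\frac{1000}{1000}\right) = 0$$

A term present in every document therefore contributes no weight at all, whatever the base of the logarithm.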

The two parts of TF-IDF combine such that the term frequency is multiplied by the inverse document frequency:

$$tf\text{-}idf_{t,d} = tf_{t,d} \times idf_t$$

As a measure of weight relating to the importance of a term in a particular document, $tf\text{-}idf_{t,d}$ is low when the term is common across all documents or occurs very few times in a particular document, and high when the term is rare across all documents but occurs frequently in a particular document.
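
The behaviour described above can be sketched in a few lines of Python. This is only one possible reading of the formulas: it assumes raw token counts for term frequency, a base-10 logarithm, and a weight of zero for terms that appear nowhere in the corpus. The `tf_idf` function and the toy corpus are hypothetical, not part of any library.

```python
import math
from collections import Counter


def tf_idf(term, doc_name, corpus):
    """tf-idf of `term` in corpus[doc_name], using raw counts and a base-10 log."""
    tokenized = {name: text.lower().split() for name, text in corpus.items()}
    # Term frequency: occurrences of the term in the chosen document.
    tf = Counter(tokenized[doc_name])[term]
    # Document frequency: number of documents containing the term at least once.
    df = sum(1 for tokens in tokenized.values() if term in tokens)
    if df == 0:
        # Assumption: a term absent from the corpus gets weight 0 (idf is undefined).
        return 0.0
    idf = math.log10(len(corpus) / df)
    return tf * idf


# Hypothetical toy corpus; names and contents are illustrative only.
corpus = {
    "doc1": "the cat sat on the mat",
    "doc2": "the dog chased the cat",
    "doc3": "the bird saw another bird",
}

common = tf_idf("the", "doc1", corpus)    # appears in every document
shared = tf_idf("cat", "doc1", corpus)    # appears in two of three documents
rare = tf_idf("bird", "doc3", corpus)     # appears twice, in one document only

print(common, shared, rare)
```

With this corpus, "the" receives a weight of zero despite its high count, while "bird" receives the largest weight because it is both rare in the corpus and repeated within its document.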

Severin Perez
