Term frequency-inverse document frequency (TF-IDF) is a means of assigning weight to a search term when comparing individual documents within a corpus. TF-IDF is an improvement on the bag-of-words model in that it considers the relative rarity of a term within a larger corpus.
TF-IDF consists of two parts, the first of which is term frequency, denoted $\mathrm{tf}_{t,d}$, where $t$ is a term and $d$ is a document. By calculating $\mathrm{tf}_{t,d}$ for a single term across all documents, one can get a basic sense of which documents are most associated with the term. However, some terms carry little relative meaning because they appear consistently across all documents in the corpus. As a result, term frequency alone can be of limited use.
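The limitation described above can be seen in a minimal Python sketch. The three-document corpus, the whitespace tokenization, and the helper name `term_frequency` are all assumptions made for illustration; term frequency is taken here to be the raw count of a term in a document, which is only one of several conventions.

```python
from collections import Counter

# Hypothetical toy corpus (not from the source text).
corpus = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "the bird sang",
}

def term_frequency(term, document):
    """Raw count of `term` among the lowercased, whitespace-split tokens."""
    return Counter(document.lower().split())[term]

# Term frequency of one term across every document in the corpus.
tf_the = {name: term_frequency("the", text) for name, text in corpus.items()}
tf_cat = {name: term_frequency("cat", text) for name, text in corpus.items()}

print(tf_the)
print(tf_cat)
```

In this toy corpus "the" scores at least as high as "cat" in every document even though it says nothing about what any document is about, which is the weakness the second part of TF-IDF is meant to address.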
The second element of TF-IDF is the measure of inverse document frequency. The purpose of this measure is to assess the importance of a term in a particular document rather than simply its frequency. The document frequency $\mathrm{df}_t$ is the number of documents in a collection that contain a term $t$ at least once. The inverse document frequency is set on a logarithmic scale such that the $\mathrm{idf}_t$ of a particularly rare term will be high and the $\mathrm{idf}_t$ of a common term will be low. $\mathrm{idf}_t$ is calculated with the formula below, where $N$ is the number of documents in the corpus.

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$
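As a rough illustration, the sketch below computes document frequency and inverse document frequency for a small corpus, assuming the common definition in which the inverse document frequency is the logarithm of the number of documents divided by the document frequency. The corpus, the function names, the whitespace tokenization, and the choice of the natural logarithm are assumptions; other bases and smoothed variants are also in use.

```python
import math

# Hypothetical toy corpus (not from the source text).
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]

def document_frequency(term, documents):
    """Number of documents that contain `term` at least once."""
    return sum(1 for doc in documents if term in doc.lower().split())

def inverse_document_frequency(term, documents):
    """log(N / df), using the natural logarithm (an assumed choice of base)."""
    df = document_frequency(term, documents)
    if df == 0:
        # Undefined for a term absent from the corpus; smoothing is a common fix.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(documents) / df)

for term in ("the", "cat", "bird"):
    print(term, document_frequency(term, corpus),
          round(inverse_document_frequency(term, corpus), 3))
```

A term found in every document gets a value of zero, while a term found in only one document gets the largest value available in the corpus.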
The two parts of TF-IDF combine such that the term frequency is multiplied by the inverse document frequency:

$$\mathrm{tf\text{-}idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$
As a measure of weight relating to the importance of a term in a particular document, $\mathrm{tf\text{-}idf}_{t,d}$ is low when the term is common across all documents or occurs very few times in a particular document, and high when the term is rare across all documents but occurs frequently in a particular document.
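The low and high cases can be checked with a short Python sketch that multiplies a raw term count by a log-scaled inverse document frequency. This is an illustrative sketch under stated assumptions, not a reference implementation: the corpus, the function name `tf_idf`, the whitespace tokenization, the natural logarithm, and the choice to return zero for a term absent from the corpus are all hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy corpus (not from the source text).
corpus = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "the bird saw the bird",
}

def tf_idf(term, doc_name, documents):
    """Raw term count in one document times log(N / document frequency)."""
    tokens = {name: text.lower().split() for name, text in documents.items()}
    tf = Counter(tokens[doc_name])[term]
    df = sum(1 for toks in tokens.values() if term in toks)
    if df == 0:
        return 0.0  # assumed convention for terms absent from the corpus
    return tf * math.log(len(documents) / df)

common = tf_idf("the", "d1", corpus)     # frequent in d1, but in every document
shared = tf_idf("cat", "d1", corpus)     # once in d1, in two of three documents
distinct = tf_idf("bird", "d3", corpus)  # twice in d3, in no other document

print(common, round(shared, 3), round(distinct, 3))
```

Here "the" receives a weight of zero despite its high count, while "bird" receives the largest weight because it is both frequent in its document and absent from the others.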