Reference: Bag-of-words
October 11, 2020
The bag-of-words (BoW) model represents a document as a fixed-length vector of token counts, with one entry for each token in the vocabulary. BoW does not take the order of the words in the document into account. BoW models are useful in text classification, topic modeling, and other machine learning tasks.
BoW is a relatively simple model and requires two elements. The first is a vocabulary, which is the set of all unique tokens that appear across a corpus. The second is a count of how often each vocabulary token occurs in a given document. Using the vocabulary, each document is converted into a vector whose length equals the size of the vocabulary, and each element of the vector holds the number of times the corresponding token occurs in the document. Although BoW ignores word order within a document, the order of the vocabulary itself must stay fixed, because the index of a count in a vector corresponds to the index of a token in the vocabulary.
For example, consider the short sentences below and their corresponding vectors.
doc1 = "the man ran from the dog"
doc2 = "my dog is the best dog"
doc3 = "dog is man's best friend"
vocab = ['best', 'dog', 'friend', 'from', 'is',
         'man', "man's", 'my', 'ran', 'the']
vec1 = [0, 1, 0, 1, 0, 1, 0, 0, 1, 2]
vec2 = [1, 2, 0, 0, 1, 0, 0, 1, 0, 1]
vec3 = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
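The vocabulary and vectors above can be reproduced with a few lines of Python. This is a minimal sketch rather than a definitive implementation: the helper names tokenize, build_vocab, and vectorize are illustrative, and it assumes plain whitespace tokenization and an alphabetically sorted vocabulary.

```python
from collections import Counter

docs = [
    "the man ran from the dog",
    "my dog is the best dog",
    "dog is man's best friend",
]

def tokenize(text):
    # Assumption: split on whitespace only, with no case folding or stemming.
    return text.split()

def build_vocab(documents):
    # Sort the unique tokens so every vector shares the same index order.
    return sorted({token for doc in documents for token in tokenize(doc)})

def vectorize(text, vocab):
    # Count each token, then read the counts off in vocabulary order.
    counts = Counter(tokenize(text))
    return [counts[token] for token in vocab]

vocab = build_vocab(docs)
vectors = [vectorize(doc, vocab) for doc in docs]

for vec in vectors:
    print(vec)
```

Because Counter returns 0 for tokens it has not seen, tokens that are in the vocabulary but missing from a document fall out as zeros without any special handling.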
When preparing a BoW model, it is important to consider tokenization. For example, should a capitalized word be counted separately from a lowercase version of the same word? Should lemmatization or stemming be used? In the above example, "man" and "man's" are considered separate tokens, which may not be the intended result. Each of these considerations will change the vector representation of individual documents.
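To illustrate how tokenization choices change the result, the sketch below compares whitespace tokenization with a hypothetical normalize_tokenize helper that lowercases the text and strips a trailing possessive 's. The possessive rule is only a crude stand-in for real stemming or lemmatization and is an assumption made for this example.

```python
import re

docs = [
    "the man ran from the dog",
    "my dog is the best dog",
    "dog is man's best friend",
]

def normalize_tokenize(text):
    # Assumption: lowercase the text, keep runs of letters and apostrophes,
    # then drop a trailing possessive 's from each token.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [re.sub(r"'s$", "", token) for token in tokens]

raw_vocab = sorted({token for doc in docs for token in doc.split()})
norm_vocab = sorted({token for doc in docs for token in normalize_tokenize(doc)})

print(len(raw_vocab))   # 10
print(len(norm_vocab))  # 9

print(normalize_tokenize("Dog is Man's best friend"))
# ['dog', 'is', 'man', 'best', 'friend']
```

With this normalization the two "man" tokens merge into one vocabulary entry, so every document vector shrinks from ten elements to nine.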
In general, the BoW model can be used to assess how important a token is in a given text. In the above example, the token "dog" appears twice in doc2 and only once in each of the other documents. However, the BoW model does not take into account how common a token is across the corpus as a whole. In this case, "dog" appears in every document, so it arguably does less to distinguish doc2 from the other documents than its raw count suggests. This problem is something that the TF-IDF model attempts to correct.
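The corpus-level information that raw counts leave out can be made concrete by counting how many documents contain each token. The sketch below uses a hypothetical document_frequency helper and whitespace tokenization. It does not implement TF-IDF itself, only the document count that such a weighting builds on.

```python
docs = [
    "the man ran from the dog",
    "my dog is the best dog",
    "dog is man's best friend",
]

def document_frequency(token, documents):
    # Number of documents that contain the token at least once.
    return sum(1 for doc in documents if token in doc.split())

print(document_frequency("dog", docs))  # 3
print(document_frequency("my", docs))   # 1
```

Here "dog" occurs in all three documents while "my" occurs only in doc2, so "my" says more about what sets doc2 apart even though its count in doc2 is lower.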