Reference: Stemming
August 23, 2020
In natural language processing (NLP), stemming is the process of reducing a word to its stem form. Typically, stemming is used as part of an NLP pipeline in order to reduce all words in a text to their stems so that they can be analyzed together. For example, when counting word frequency one might wish to count the words write and writing as a single frequency unit by using their stem form: write.
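As a rough sketch of the frequency-counting case, the snippet below groups tokens by their Porter stem using NLTK's `PorterStemmer`. The helper name `stem_frequencies` and the whitespace tokenization via `str.split` are assumptions made to keep the example free of extra data downloads. A real pipeline would likely use a proper tokenizer.

```python
from collections import Counter

from nltk.stem import PorterStemmer


def stem_frequencies(text):
    """Count tokens by Porter stem (hypothetical helper for illustration)."""
    stemmer = PorterStemmer()
    # Assumption: plain whitespace splitting stands in for real tokenization.
    return Counter(stemmer.stem(token) for token in text.split())


counts = stem_frequencies(
    "Writers write because they love writing and writers keep writing"
)
print(counts["write"])   # covers "write" and both occurrences of "writing"
print(counts["writer"])  # covers "Writers" and "writers"
```

Because the stemmer lowercases its input by default, "Writers" and "writers" end up in the same bucket without any extra normalization.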
Although stemming is a useful heuristic, it is generally seen as crude because it merely cuts off the end of a word without checking whether the result is a real word. Some algorithms handle this better than others. Compare this to lemmatization, which takes word structure into account in order to reduce a word to its lemma and thus retain more of the original meaning.
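To make the "cuts off the end of a word" point concrete, here is a deliberately naive suffix stripper in plain Python. It is not the Porter or Snowball algorithm. The suffix list, the minimum stem length of 3, and the function name `naive_stem` are all assumptions chosen only to illustrate why unguarded suffix removal yields non-words.

```python
# Hypothetical suffix list for illustration; real stemmers use ordered rule sets
# with conditions on the remaining stem.
SUFFIXES = ("ing", "ly", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least 3 characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["writing", "studies", "write"]:
    print(word, "->", naive_stem(word))
# writing -> writ
# studies -> studi
# write -> write
```

Note that "writing" and "write" end up with different stems here, which is exactly the kind of mismatch that more careful rule-based stemmers try to avoid.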
Stemming with Natural Language Toolkit
The Natural Language Toolkit (NLTK) provides several stemmers. Two of them, Porter and Snowball, employ slightly different algorithms. Note how they produce different results for a difficult word like “generally”.
import nltk
import pandas as pd
from nltk.stem import PorterStemmer, SnowballStemmer


def nltk_stem_df(text):
    """Return a DataFrame comparing Porter and Snowball stems for each token."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    rows = []
    # word_tokenize relies on NLTK's Punkt tokenizer data being downloaded.
    tokens = nltk.word_tokenize(text)
    for token in tokens:
        row = {
            "word": token,
            "porter": porter.stem(token),
            "snowball": snowball.stem(token),
        }
        rows.append(row)
    return pd.DataFrame(rows)

sentence = "Writers generally write because they love writing"
nltk_stem_df(sentence)
#         word  porter snowball
# 0    Writers  writer   writer
# 1  generally   gener  general
# 2      write   write    write
# 3    because  becaus   becaus
# 4       they    they     they
# 5       love    love     love
# 6    writing   write    write
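If only the disagreements between the two stemmers are of interest, a small filter can pull them out. The sketch below is one possible way to do that. The helper name `stemmer_disagreements` is hypothetical, and whitespace splitting is assumed in place of `nltk.word_tokenize` so that no tokenizer data is needed.

```python
from nltk.stem import PorterStemmer, SnowballStemmer


def stemmer_disagreements(text):
    """Return (word, porter_stem, snowball_stem) tuples where the stems differ."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    # Assumption: whitespace splitting instead of nltk.word_tokenize.
    triples = ((w, porter.stem(w), snowball.stem(w)) for w in text.split())
    return [t for t in triples if t[1] != t[2]]


diffs = stemmer_disagreements("Writers generally write because they love writing")
print(diffs)
```

For the sample sentence, this should leave only the row for "generally", matching the table above.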