August 23, 2020

In natural language processing (NLP), stemming is the process of reducing a word to its stem form. Stemming typically runs as one step of an NLP pipeline, reducing every word in a text to its stem so that related forms can be analyzed together. For example, when counting word frequency, one might wish to count the words “write” and “writing” as a single unit by using their shared stem: “write”.
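The frequency-counting idea can be sketched in a few lines. The helper below is an illustration rather than part of NLTK: `stem_frequencies` is a hypothetical name, and it accepts any function that maps a word to its stem, so it makes no assumption about which stemmer is used.

```python
from collections import Counter


def stem_frequencies(tokens, stem):
    """Count tokens after mapping each one to its stem.

    `stem` is any callable that takes a word and returns its stem,
    for example the `stem` method of an NLTK stemmer.
    """
    return Counter(stem(token.lower()) for token in tokens)


# Usage with NLTK (assumes nltk is installed):
#   from nltk.stem import PorterStemmer
#   tokens = "Writers generally write because they love writing".split()
#   stem_frequencies(tokens, PorterStemmer().stem)
```

Because inflected forms collapse to the same key, “write” and “writing” are tallied together instead of as two separate entries.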

Although stemming is a useful heuristic, it is generally seen as crude, because a stemmer merely cuts off the end of a word according to fixed rules, without checking whether the result is a real word. Some algorithms handle this better than others. Compare this to lemmatization, which takes word structure into account in order to reduce a word to its lemma and thus retain more of the original meaning.
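To make the “cuts off the end of a word” point concrete, here is a deliberately naive suffix stripper. It is not the Porter or Snowball algorithm, which apply far more careful rules; the suffix list and the minimum stem length of three characters are assumptions chosen purely for illustration.

```python
# Hypothetical suffix list, checked in order; real stemmers use
# much richer, condition-dependent rules.
SUFFIXES = ("ing", "ly", "es", "s", "e")
MIN_STEM_LENGTH = 3  # assumed cutoff so short words are left alone


def naive_stem(word):
    """Strip the first matching suffix, ignoring whether a real word remains."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[: -len(suffix)]
    return word


for word in ("writing", "write", "because", "they"):
    print(word, "->", naive_stem(word))
# writing -> writ
# write -> writ
# because -> becaus
# they -> they
```

The related forms do end up grouped under one key, but that key (“writ”, “becaus”) is not a dictionary word. That is the trade-off a lemmatizer is meant to avoid.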

## Stemming with Natural Language Toolkit

The Natural Language Toolkit (NLTK) provides several stemmers, including the Porter and Snowball stemmers, which employ slightly different algorithms. Note how the two produce different results for a difficult word like “generally”.

```python
import nltk
import pandas as pd
from nltk.stem import PorterStemmer, SnowballStemmer

# word_tokenize needs NLTK's tokenizer data; if it is missing, run
# nltk.download("punkt") once (newer NLTK releases may ask for "punkt_tab").


def nltk_stem_df(text):
    """Return a DataFrame comparing the Porter and Snowball stem of each token."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    rows = []
    for token in nltk.word_tokenize(text):
        rows.append({
            "word": token,
            "porter": porter.stem(token),
            "snowball": snowball.stem(token),
        })
    return pd.DataFrame(rows)


sentence = "Writers generally write because they love writing"
nltk_stem_df(sentence)

#    word       porter  snowball
# 0  Writers    writer  writer
# 1  generally  gener   general
# 2  write      write   write
# 3  because    becaus  becaus
# 4  they       they    they
# 5  love       love    love
# 6  writing    write   write
```
