Severin Perez

Reference: Stemming

August 23, 2020

In natural language processing (NLP), stemming is the process of reducing a word to its stem form. Typically, stemming is used as part of an NLP pipeline to reduce all words in a text to their stems so that related forms can be analyzed together. For example, when counting word frequency one might wish to count the words “write” and “writing” as a single frequency unit by using their shared stem: “write”.
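The frequency-counting idea can be sketched in a few lines. This is a minimal illustration rather than a full pipeline: it assumes NLTK is installed, splits on whitespace instead of using a proper tokenizer, and the helper name stem_frequencies is made up for this example.

```python
from collections import Counter

from nltk.stem import PorterStemmer


def stem_frequencies(text):
    """Count how often each stem occurs in whitespace-separated text."""
    stemmer = PorterStemmer()
    # Each token is replaced by its stem before counting, so inflected
    # forms of the same word fall into the same bucket.
    return Counter(stemmer.stem(token) for token in text.split())


counts = stem_frequencies("Writers generally write because they love writing")
# "write" and "writing" are counted together under the stem "write".
print(counts["write"])
```

Counting raw tokens instead would record “write” and “writing” as two separate entries with a frequency of one each.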

Although stemming is a useful heuristic, it is generally seen as crude because it merely cuts off the end of a word without considering whether the result is a real word. Some algorithms handle this better than others. Compare this to lemmatization, which takes word structure into account in order to reduce a word to its lemma and thus retain more of the original meaning.
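To see why plain suffix stripping is considered crude, consider a deliberately naive stemmer. This toy is not the Porter or Snowball algorithm; the suffix list and the function name naive_stem are invented for illustration, and real stemmers apply many more rules.

```python
# Hypothetical suffix list for illustration only; checked in this order.
SUFFIXES = ("ing", "ly", "es", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


for word in ["writers", "writing", "loves"]:
    print(word, "->", naive_stem(word))
# writers -> writer
# writing -> writ
# loves -> lov
```

The results “writ” and “lov” are not real words, which is exactly the weakness described above. A lemmatizer would instead aim to map such forms to dictionary entries like “write” and “love”.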

Stemming with Natural Language Toolkit

The Natural Language Toolkit (NLTK) provides several stemmers. The example below compares two of them, Porter and Snowball, which employ slightly different algorithms. Note how they produce different results for a difficult word like “generally”.

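import nltk
from nltk.stem import PorterStemmer, SnowballStemmer
import pandas as pd

# word_tokenize needs NLTK's Punkt tokenizer data, fetched once with
# nltk.download (the resource name depends on the NLTK version).

def nltk_stem_df(text):
    """Return a DataFrame comparing Porter and Snowball stems per token."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    rows = []
    tokens = nltk.word_tokenize(text)
    for token in tokens:
        # One row per token: the original word and both stems.
        rows.append({
            "word": token,
            "porter": porter.stem(token),
            "snowball": snowball.stem(token),
        })
    return pd.DataFrame(rows)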

sentence = "Writers generally write because they love writing"
nltk_stem_df(sentence)

#    word       porter  snowball
# 0  Writers    writer  writer
# 1  generally  gener   general
# 2  write      write   write
# 3  because    becaus  becaus
# 4  they       they    they
# 5  love       love    love
# 6  writing    write   write
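When comparing stemmers on a larger vocabulary, it can help to list only the words on which they disagree. The following is a small sketch along those lines; the helper name stemmer_disagreements is hypothetical, and the word list is simply the lowercased example sentence from above.

```python
from nltk.stem import PorterStemmer, SnowballStemmer


def stemmer_disagreements(words):
    """Return (word, porter_stem, snowball_stem) where the stems differ."""
    porter = PorterStemmer()
    snowball = SnowballStemmer("english")
    results = []
    for word in words:
        porter_stem = porter.stem(word)
        snowball_stem = snowball.stem(word)
        # Keep only the words the two algorithms treat differently.
        if porter_stem != snowball_stem:
            results.append((word, porter_stem, snowball_stem))
    return results


words = ["writers", "generally", "write", "because", "they", "love", "writing"]
disagreements = stemmer_disagreements(words)
for word, porter_stem, snowball_stem in disagreements:
    print(f"{word}: porter={porter_stem}, snowball={snowball_stem}")
```

For this word list, the table above suggests that “generally” is the word to watch, since Porter reduces it to “gener” while Snowball keeps “general”.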

© Severin Perez, 2021