Severin Perez

Reference: Lemmatization

August 21, 2020

Lemmatization is the process of reducing a word to its lemma (canonical form). In natural language processing (NLP), a lemmatizer may be used to reduce all of the words in a text to their lemmas, which makes comparative analysis possible on canonical forms, rather than comparing the many inflected forms of one word to one another. Unlike stemming, lemmatization aims to remove only the inflectional endings of a word, which leaves more of the original meaning intact. For example, a lemmatizer reduces both run and ran to the lemma run, whereas a typical stemmer, which simply strips common suffixes, leaves the irregular form ran unchanged.
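
The difference is easy to see side by side. Below is a minimal sketch comparing NLTK's PorterStemmer with its WordNetLemmatizer; the word list is arbitrary, chosen just for illustration:

from nltk.stem import PorterStemmer, WordNetLemmatizer

# requires: nltk.download("wordnet")
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

for word in ["run", "ran", "running", "studies"]:
    print(word, stemmer.stem(word), lemmatizer.lemmatize(word, pos="v"))

# run      run    run
# ran      ran    run    <- the stemmer cannot relate ran to run
# running  run    run
# studies  studi  study  <- the stemmer produces a non-word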

Building lemmatization tools can be challenging because they must have a greater awareness of language structure than stemmers do. Suffix-stripping rules are not enough: saw becomes see and was becomes be. In these examples, identifying the lemma is non-trivial; it requires both a dictionary of irregular forms and knowledge of each word's part of speech.
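
WordNet-based lemmatizers handle such irregular forms through exception lists, and the correct lemma often depends on the part of speech. A quick sketch with NLTK's WordNet lemmatizer:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("saw", pos="v"))  # see (irregular verb)
print(lemmatizer.lemmatize("saw", pos="n"))  # saw (the noun is already a lemma)
print(lemmatizer.lemmatize("was", pos="v"))  # be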

In modern NLP, there is some controversy as to whether lemmatization should be used when analyzing a text. Some practitioners and academics argue that lemmatization removes too much meaning from a text (inflections carry information such as tense and number), which can make analysis less effective.

Common Lemmatization Tools

Most modern natural language processing libraries provide lemmatization functionality, including the three described below. Note the subtle differences in each lemmatizer. For example, spaCy reduces ran to run, whereas NLTK, with its default settings, leaves ran as ran. spaCy also lemmatizes every pronoun to the placeholder -PRON- (a spaCy v2 convention). And Gensim does not return lemmas at all for articles, pronouns, or other function words.

spaCy

import spacy
import pandas as pd

def spacy_lemma_df(text):
    # note: loading the model on every call is slow; in practice, load it
    # once and reuse it (requires: python -m spacy download en_core_web_sm)
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    rows = []
    for token in doc:
        # token.lemma_ holds the string form of the token's lemma
        rows.append({"word": token.text, "lemma": token.lemma_})
    return pd.DataFrame(rows)

sentence = "The mice ran away when they saw the cats"
spacy_lemma_df(sentence)

#    word  lemma
# 0  The   the
# 1  mice  mouse
# 2  ran   run
# 3  away  away
# 4  when  when
# 5  they  -PRON-
# 6  saw   see
# 7  the   the
# 8  cats  cat
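
spaCy gets ran and saw right without any extra work because its pipeline assigns a part-of-speech tag to each token before lemmatizing. A minimal sketch of inspecting both attributes (output from spaCy v2; v3 drops the -PRON- placeholder and returns the pronoun text instead):

import spacy

nlp = spacy.load("en_core_web_sm")
for token in nlp("They saw the mice"):
    print(token.text, token.pos_, token.lemma_)

# They  PRON  -PRON-
# saw   VERB  see
# the   DET   the
# mice  NOUN  mouse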

Natural Language Toolkit

import nltk
from nltk.stem import WordNetLemmatizer
import pandas as pd

def nltk_lemma_df(text):
    # requires: nltk.download("punkt") and nltk.download("wordnet")
    rows = []
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    for token in tokens:
        # lemmatize() treats every word as a noun unless a pos argument is
        # supplied, which is why verbs like ran and saw come back unchanged
        row = {"word": token, "lemma": lemmatizer.lemmatize(token)}
        rows.append(row)
    return pd.DataFrame(rows)

sentence = "The mice ran away when they saw the cats"
nltk_lemma_df(sentence)

#    word  lemma
# 0  The   The
# 1  mice  mouse
# 2  ran   ran
# 3  away  away
# 4  when  when
# 5  they  they
# 6  saw   saw
# 7  the   the
# 8  cats  cat
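
NLTK's results improve considerably if you supply part-of-speech tags yourself. Below is a minimal sketch; the penn_to_wordnet helper is my own illustrative mapping, and the tagger requires nltk.download("averaged_perceptron_tagger"):

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # map Penn Treebank tags to the WordNet POS constants
    if tag.startswith("V"):
        return wordnet.VERB
    if tag.startswith("J"):
        return wordnet.ADJ
    if tag.startswith("R"):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
tokens = nltk.word_tokenize("The mice ran away when they saw the cats")
for token, tag in nltk.pos_tag(tokens):
    print(token, lemmatizer.lemmatize(token, pos=penn_to_wordnet(tag)))

With the tags supplied, ran lemmatizes to run and saw to see, matching spaCy's output.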

Gensim

from gensim.utils import lemmatize
import pandas as pd

def gensim_lemma_df(text):
    # note: gensim.utils.lemmatize depends on the pattern package and was
    # removed in Gensim 4.0, so this only works on Gensim 3.x
    rows = []
    words = text.split()
    for word in words:
        # lemmatize() returns byte strings like b"mouse/NN"; it only handles
        # nouns, verbs, adjectives, and adverbs, so other words yield nothing
        lemma_list = lemmatize(word)
        if len(lemma_list) == 0:
            lemma = ""
        else:
            lemma = lemma_list[0].decode("utf-8").split("/")[0]
        rows.append({"word": word, "lemma": lemma})
    return pd.DataFrame(rows)

sentence = "The mice ran away when they saw the cats"
gensim_lemma_df(sentence)

#    word  lemma
# 0  The
# 1  mice  mouse
# 2  ran   run
# 3  away  away
# 4  when
# 5  they
# 6  saw   see
# 7  the
# 8  cats  cat
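
Calling lemmatize on the whole sentence shows the raw word/POS byte strings and makes the filtering visible: by default only nouns, verbs, adjectives, and adverbs survive, which is why the articles and pronouns above come back empty. The expected output below is an approximation based on the per-word results:

from gensim.utils import lemmatize

lemmatize("The mice ran away when they saw the cats")
# roughly: [b'mouse/NN', b'run/VB', b'away/RB', b'see/VB', b'cat/NN']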
