Reference: Lemmatization
August 21, 2020
Lemmatization is the process of reducing a word to its lemma (canonical form). In natural language processing (NLP), a lemmatizer may be used to reduce all words in a given text to their lemmas, which makes comparative analysis possible on canonical forms rather than on the many inflected forms of the same word. Unlike stemming, lemmatization aims to remove only the inflectional endings of a word, which leaves more of the original meaning intact. For example, a lemmatizer reduces both running and ran to the lemma run, whereas a stemmer typically strips the suffix from running to produce run but leaves the irregular form ran unchanged.
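To make the contrast concrete, here is a minimal sketch using NLTK's PorterStemmer and WordNetLemmatizer (the inline outputs are what I would expect these tools to produce for these words):
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

stemmer.stem("mice")             # 'mice'  -- no suffix to strip, so the irregular plural survives
lemmatizer.lemmatize("mice")     # 'mouse' -- the lemma is a real dictionary form
stemmer.stem("studies")          # 'studi' -- suffix stripping can produce a non-word
lemmatizer.lemmatize("studies")  # 'study'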
Building lemmatization tools can be challenging because they must have a greater awareness of language structure than stemmers do. For example, saw becomes see and was becomes be. Irregular forms like these cannot be recovered by stripping suffixes, so identifying the lemma is non-trivial.
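As a rough illustration of why, WordNet-based lemmatizers resolve such irregular forms through exception lists keyed by part of speech, so the right lemma depends on how the word is being used. A small sketch with NLTK's WordNetLemmatizer (outputs as I would expect them):
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("saw", pos="v")  # 'see' -- past tense of the verb to see
lemmatizer.lemmatize("saw", pos="n")  # 'saw' -- the noun (the tool) is already a lemma
lemmatizer.lemmatize("was", pos="v")  # 'be'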
In modern NLP, there is some controversy as to whether lemmatization should be used when analyzing a text. Some practitioners and academics argue that lemmatization removes too much meaning from a text, which can make analysis less effective.
Common Lemmatization Tools
Most modern natural language processing libraries provide lemmatization functionality, including those described below. Note the subtle differences between the lemmatizers. For example, spaCy reduces ran to run, whereas NLTK leaves ran as ran, and Gensim does not return lemmas at all for function words such as articles and pronouns.
spaCy
import spacy
import pandas as pd
def spacy_lemma_df(text):
    # Load the small English pipeline (assumes en_core_web_sm has been
    # downloaded) and run it over the text
    nlp = spacy.load("en_core_web_sm")
    doc = nlp(text)
    # Collect each token alongside its lemma
    rows = []
    for token in doc:
        rows.append({"word": token.text, "lemma": token.lemma_})
    return pd.DataFrame(rows)
sentence = "The mice ran away when they saw the cats"
spacy_lemma_df(sentence)
# word lemma
# 0 The the
# 1 mice mouse
# 2 ran run
# 3 away away
# 4 when when
# 5 they -PRON-
# 6 saw see
# 7 the the
# 8 cats cat
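A note on the -PRON- entry above: that placeholder lemma for pronouns comes from spaCy 2.x; in spaCy 3 and later, as far as I know, the pronoun itself (they) is returned as the lemma.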
Natural Language Toolkit
import nltk
from nltk.stem import WordNetLemmatizer
import pandas as pd
def nltk_lemma_df(text):
    # WordNetLemmatizer needs the WordNet corpus and word_tokenize needs the
    # punkt tokenizer models (both available via nltk.download)
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.word_tokenize(text)
    rows = []
    for token in tokens:
        # lemmatize() treats each token as a noun unless a POS tag is supplied
        row = {"word": token, "lemma": lemmatizer.lemmatize(token)}
        rows.append(row)
    return pd.DataFrame(rows)
sentence = "The mice ran away when they saw the cats"
nltk_lemma_df(sentence)
# word lemma
# 0 The The
# 1 mice mouse
# 2 ran ran
# 3 away away
# 4 when when
# 5 they they
# 6 saw saw
# 7 the the
# 8 cats cat
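The ran → ran result above follows from WordNetLemmatizer treating every token as a noun unless told otherwise; supplying a verb POS tag recovers the lemma seen in the spaCy output:
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize("ran")           # 'ran' -- treated as a noun by default
lemmatizer.lemmatize("ran", pos="v")  # 'run'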
Gensim
from gensim.utils import lemmatize
import pandas as pd
def gensim_lemma_df(text):
    rows = []
    for word in text.split():
        # lemmatize() returns part-of-speech-tagged byte strings such as
        # b"mouse/NN"; for articles, pronouns, and other function words it
        # returns an empty list
        lemma_list = lemmatize(word)
        if len(lemma_list) == 0:
            lemma = ""
        else:
            # Keep only the lemma, dropping the "/POS" suffix
            lemma = lemma_list[0].decode("utf-8").split("/")[0]
        rows.append({"word": word, "lemma": lemma})
    return pd.DataFrame(rows)
sentence = "The mice ran away when they saw the cats"
gensim_lemma_df(sentence)
# word lemma
# 0 The
# 1 mice mouse
# 2 ran run
# 3 away away
# 4 when
# 5 they
# 6 saw see
# 7 the
# 8 cats cat
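A caveat on the Gensim example: gensim.utils.lemmatize depends on the optional pattern package and, to the best of my knowledge, was removed in Gensim 4.0, so the code above needs a Gensim 3.x installation with pattern available. Its raw return value is a list of part-of-speech-tagged byte strings, which is why the helper decodes each entry and splits on "/". For instance (outputs as I would expect them):
lemmatize("mice")  # [b'mouse/NN']
lemmatize("the")   # []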