Exploring Literature with Stanza
August 23, 2020
The Stanford NLP Group has long been an active player in natural language processing, particularly through their well-known CoreNLP Java toolkit. Until recently though, Stanford NLP has been a less well-known player in the Python community, which is a shame since many NLP practitioners work primarily in Python. But there’s good news! Stanford NLP’s Stanza Python library is coming into its own with the recent release of version 1.1.1!
The new Stanza version supports 66 different human languages (which is a big step forward, since NLP has long been very English-centric) and can carry out core NLP tasks like lemmatization and named entity recognition. Stanza is also customizable, which means that users can build their own pipelines and train their own models.
So, for all you Pythonistas out there, let’s take a look at Stanza and what it can do. We’ll start with a brief overview of core Stanza functionality and then we’ll use it to explore the characters in the classic novel, Moby Dick.
Pipeline
The stanza Pipeline can be configured with a variety of options to select the language model, processors, and so on. The language model must be downloaded before it can be used in a pipeline.
# load libraries
import stanza
import pandas as pd
# Download English language model and initialize the NLP pipeline.
stanza.download('en')
nlp = stanza.Pipeline('en')
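The default English pipeline runs every available processor. If you only need a subset of annotations, or want to control hardware use, the Pipeline constructor accepts options for that. A minimal configuration sketch (this assumes the English model from stanza.download('en') is already on disk):

```python
import stanza

# A lighter-weight pipeline: tokenization and part-of-speech tagging only,
# with GPU use explicitly disabled via the documented use_gpu option.
nlp_light = stanza.Pipeline('en', processors='tokenize,pos', use_gpu=False)
```

Trimming the processor list this way can speed things up considerably when you don't need the full set of annotations.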
For our exploratory project, let’s use the first paragraph from Moby Dick.
# Use the default pipeline to create a Document object
moby_dick_para1 = "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."
moby_p1 = nlp(moby_dick_para1) # return a Document object
Data Objects
Documents
Stanza Document objects include text, tokens, words, dependencies, and entities attributes. The tokens, words, dependencies, and entities attributes are lists, and individual items can be accessed by index.
def print_doc_info(doc):
    print(f"Num sentences:\t{len(doc.sentences)}")
    print(f"Num tokens:\t{doc.num_tokens}")
    print(f"Num words:\t{doc.num_words}")
    print(f"Num entities:\t{len(doc.entities)}")

print_doc_info(moby_p1)
# Num sentences: 8
# Num tokens: 222
# Num words: 222
# Num entities: 3
Sentences
Each Sentence object contains doc, text, dependencies, tokens, words, and entities attributes. Individual items can be accessed by indexing the appropriate list.
def print_sentence_info(sentence):
    print(f"Text: {sentence.text}")
    print(f"Num tokens:\t{len(sentence.tokens)}")
    print(f"Num words:\t{len(sentence.words)}")
    print(f"Num entities:\t{len(sentence.entities)}")

print_sentence_info(moby_p1.sentences[0])
# Text: Call me Ishmael.
# Num tokens: 4
# Num words: 4
# Num entities: 1
Tokens
Each Token
object includes text
, words
, start_char
, and end_char
attributes, among others. In cases where a token is a multi-word token, the words
attribute will contain each of the underlying words.
def print_token_info(token):
    print(f"Text:\t{token.text}")
    print(f"Start:\t{token.start_char}")
    print(f"End:\t{token.end_char}")

print_token_info(moby_p1.sentences[0].tokens[2])
# Text: Ishmael
# Start: 8
# End: 15
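A handy consequence of these offsets is that they index directly into the original document text, so a token's surface form can always be recovered by slicing. A quick sanity check in plain Python (no pipeline needed), using the offsets printed above:

```python
# start_char/end_char are offsets into the original text, so slicing
# the source string recovers the token's surface form.
text = "Call me Ishmael."
start, end = 8, 15  # the offsets Stanza reported for the "Ishmael" token
print(text[start:end])  # Ishmael
```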
Words
Each Word object includes various word-level annotations, as defined by the various processors (part-of-speech tagging, lemmatization, etc.), including text, lemma, upos, xpos, feats, and others.
def print_word_info(word):
    print(f"Text:\t{word.text}")
    print(f"Lemma: \t{word.lemma}")
    print(f"UPOS: \t{word.upos}")
    print(f"XPOS: \t{word.xpos}")

print_word_info(moby_p1.sentences[3].words[4])
# Text: growing
# Lemma: grow
# UPOS: VERB
# XPOS: VBG
def word_info_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame object with one row for each word in
      doc, and columns for text, lemma, upos, and xpos.
    """
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            row = {
                "text": word.text,
                "lemma": word.lemma,
                "upos": word.upos,
                "xpos": word.xpos,
            }
            rows.append(row)
    return pd.DataFrame(rows)

word_info_df(moby_p1)
# text lemma upos xpos
# 0 Call call VERB VB
# 1 me I PRON PRP
# 2 Ishmael Ishmael PROPN NNP
# 3 . . PUNCT .
# 4 Some some DET DT
Entities
Stanza includes a built-in named entity recognition (NER) module, with options for extension and customization. The default pipeline includes the built-in NERProcessor, which recognizes named entities for all token spans. Each Entity object includes attributes for text, tokens, type, start_char, end_char, and others.
def print_entity_info(entity):
    print(f"Text:\t{entity.text}")
    print(f"Type:\t{entity.type}")
    print(f"Start:\t{entity.start_char}")
    print(f"End:\t{entity.end_char}")

print_entity_info(moby_p1.entities[0])
# Text: Ishmael
# Type: PERSON
# Start: 8
# End: 15
Sentiment Analysis
Stanza includes a built-in sentiment analysis processor, which can be customized as needed. Each Sentence object in a Document includes a sentiment score, where 0 represents negative, 1 represents neutral, and 2 represents positive. To make this a bit more human-readable, we’ll convert the scores to a string descriptor.
def sentiment_descriptor(sentence):
    """
    - Parameters: sentence (a Stanza Sentence object)
    - Returns: A string descriptor for the sentiment value of sentence.
    """
    sentiment_value = sentence.sentiment
    if sentiment_value == 0:
        return "negative"
    elif sentiment_value == 1:
        return "neutral"
    else:
        return "positive"

print(sentiment_descriptor(moby_p1.sentences[0]))
# neutral
def sentence_sentiment_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame with one row for each sentence in doc,
      and columns for the sentence text and sentiment descriptor.
    """
    rows = []
    for sentence in doc.sentences:
        row = {
            "text": sentence.text,
            "sentiment": sentiment_descriptor(sentence)
        }
        rows.append(row)
    return pd.DataFrame(rows)

sentence_sentiment_df(moby_p1)
# text sentiment
# 0 Call me Ishmael. neutral
# 1 Some years ago—never mind how long precisely—h... neutral
# 2 It is a way I have of driving off the spleen a... neutral
# 3 Whenever I find myself growing grim about the ... negative
# 4 This is my substitute for pistol and ball. neutral
# 5 With a philosophical flourish Cato throws hims... neutral
# 6 There is nothing surprising in this. neutral
# 7 If they but knew it, almost all men in their d... neutral
Character Analysis
Now that we know a little bit about how to use Stanza, let’s use it to see if we can learn anything about the characters in Moby Dick. First, we’ll have to load up the full text. As many of you will remember, Moby Dick is a long novel, so putting it through the Stanza pipeline can take a while. If you happen to have access to GPUs though, Stanza is GPU-aware and the process will go much faster.
# load the full text and put it through the pipeline
def load_text_doc(file_path):
    with open(file_path) as f:
        txt = f.read()
    return txt

moby_path = "moby_dick.txt"
moby_dick_text = load_text_doc(moby_path)
moby_dick = nlp(moby_dick_text)
print_doc_info(moby_dick)
# Num sentences: 9966
# Num tokens: 253928
# Num words: 253928
# Num entities: 7955
Moby Dick Characters
Let’s use Stanza’s entity recognition function to identify all the characters in Moby Dick. We’ll do this by selecting only those entities that have the type PERSON. Since each entity points back to its containing sentence, we’ll go ahead and save the sentiment of that sentence for future use.
# select person entities
def select_person_entities(doc):
    return [ent for ent in doc.entities if ent.type == "PERSON"]

def person_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame with one row for each entity in doc
      that has a "PERSON" type, and columns for text, type, start_char,
      end_char, and the sentiment of the sentence in which the entity
      appears.
    """
    rows = []
    persons = select_person_entities(doc)
    for person in persons:
        row = {
            "text": person.text,
            "type": person.type,
            "start_char": person.start_char,
            "end_char": person.end_char,
            "sentence_sentiment": sentiment_descriptor(person._sent)
        }
        rows.append(row)
    return pd.DataFrame(rows)
characters = person_df(moby_dick)
display(characters.head())
# text type start_char end_char sentence_sentiment
# 0 Ishmael PERSON 29 36 neutral
# 1 Cato PERSON 890 894 neutral
# 2 Tiger-lilies PERSON 4226 4238 neutral
# 3 Jove PERSON 4988 4992 neutral
# 4 Narcissus PERSON 5080 5089 negative
Now that we have all of the characters from Moby Dick, we can start to analyze the data to see what we can learn about them. First, how many characters are there?
def num_unique_items(df, col):
    return len(df[col].unique())

num_unique_items(characters, "text")
# 699
Wow! 699 characters (or at least unique PERSON entities) is a lot. Most of those are mentioned just a single time, so perhaps we should take a look at just the major characters.
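One way to narrow the field is to drop the one-off mentions before digging further. Here’s a small sketch of that filtering step using pandas, on a hypothetical toy DataFrame standing in for our characters data:

```python
import pandas as pd

# Toy stand-in for the `characters` DataFrame built above.
df = pd.DataFrame({"text": ["Ahab", "Ahab", "Pip", "Cato", "Ahab"]})

# Keep only the names that appear more than once.
counts = df["text"].value_counts()
major = counts[counts > 1].index.tolist()
print(major)  # ['Ahab']
```

The same value_counts-based filter applied to the real data would give us a shortlist of recurring characters to focus on.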
Character Counts
With our character dataframe in hand, we can now check which characters appear in the text most often. This will give us some idea about which characters are the most important.
# Which characters appear most frequently?
def frequency_count(df, col, limit=10):
    return df[col].value_counts().head(limit)

frequency_count(characters, "text")
# Ahab 474
# Stubb 224
# Queequeg 184
# Starbuck 140
# Jonah 81
# Moby Dick 75
# Bildad 70
# Peleg 69
# Pip 68
# Pequod 60
Unsurprisingly for anyone who has read Moby Dick, Captain Ahab is the most-mentioned character in the book. Other members of his crew like Stubb, Queequeg, and Starbuck make appearances in the most-frequent list as well. And of course, Moby Dick himself is in the top 10.
Character Sentiment
Since each Entity also includes a pointer to its parent sentence, we can now use the sentence sentiment rating that we saved earlier to make a judgement about the overall character sentiment. We’ll do this by converting our sentiment descriptors to a value of -1 for “negative”, 0 for “neutral”, and 1 for “positive”. After that, we can group the various appearances of each character and sum the sentiment value for each sentence the character appears in. A negative sum indicates a negative overall character sentiment, and a positive sum the opposite. And the farther from 0 the sum is, the stronger the sentiment.
# What is the sentiment surrounding each character?
def sentiment_descriptor_to_val(descriptor):
    """
    - Parameters: descriptor ("negative", "neutral", or "positive")
    - Returns: -1 for "negative", 0 for "neutral", 1 for "positive"
    """
    if descriptor == "negative":
        return -1
    elif descriptor == "neutral":
        return 0
    else:
        return 1
def character_sentiment(df):
    """
    - Parameters: df (Pandas DataFrame)
        - df must contain "text" and "sentence_sentiment" columns.
    - Returns: A Pandas DataFrame with one row per unique character,
      sorted by the summed sentiment values of the sentences in which
      the character appears.
    """
    sentiment = df.copy()
    sentiment["sentence_sentiment"] = [
        sentiment_descriptor_to_val(s) for s
        in sentiment["sentence_sentiment"]
    ]
    sentiment = sentiment[["text", "sentence_sentiment"]]
    sentiment = sentiment.groupby("text").sum().reset_index()
    return sentiment.sort_values("sentence_sentiment")
sentiment_df = character_sentiment(characters)
print("Characters in the most negative settings.")
display(sentiment_df.head(5))
# Characters in the most negative settings.
# text sentence_sentiment
# 6 Ahab -42
# 508 Queequeg -24
# 317 Jonah -18
# 588 Stubb -11
# 468 Pequod -10
print("Characters in the most positive settings")
display(sentiment_df.tail(5))
# Characters in the most positive settings
# text sentence_sentiment
# 1 Abraham 2
# 424 Monsieur 3
# 218 Gabriel 3
# 401 Mary 4
# 93 Bunger 4
Phew. It would seem that Moby Dick is pretty grim! Almost no characters appear in majority positive sentences—and for those who do, the positivity is quite weak. As for Captain Ahab, his overall sentence sentiment sum is -42! Of course, we haven’t checked to see whether the sentiment is about Ahab, but merely the sentiment of sentences in which Ahab appears. Perhaps this is an indicator that Ahab lives a tortured and unhappy life—it would seem that he isn’t in a lot of happy sentences.
Next Steps
And that’s it for our quick look at Stanza! If you think Stanza could be a good fit for your needs, I highly encourage you to check it out—the documentation is excellent and has a good overview on usage. Perhaps you too can use it to explore your favorite novel. (And if you do, be sure to let us know the results!)