Exploring Literature with Stanza
August 23, 2020
The Stanford NLP Group has long been an active player in natural language processing, particularly through their well-known CoreNLP Java toolkit. Until recently though, Stanford NLP has been a less well-known player in the Python community, which is a shame since many NLP practitioners work primarily in Python. But there’s good news! Stanford NLP’s Stanza Python library is coming into its own with the recent release of version 1.1.1!
The new Stanza version supports 66 different human languages (which is a big step forward, since NLP has long been very English-centric) and can carry out core NLP tasks like lemmatization and named entity recognition. Stanza is also customizable, which means that users can build their own pipelines and train their own models.
So, for all you Pythonistas out there, let’s take a look at Stanza and what it can do. We’ll start with a brief overview of core Stanza functionality and then we’ll use it to explore the characters in the classic novel, Moby Dick.
Pipeline
The stanza Pipeline can be configured with a variety of options to select the language model, processors, and so on. The language model must be downloaded before it can be used in a pipeline.
# load libraries
import stanza
import pandas as pd
# Download English language model and initialize the NLP pipeline.
stanza.download('en')
nlp = stanza.Pipeline('en')
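The default English pipeline runs every available processor. If you only need a subset of annotations, or want to control hardware use, the Pipeline constructor accepts options for that. A minimal configuration sketch (this assumes the English model from stanza.download('en') is already on disk):

```python
import stanza

# A lighter-weight pipeline: tokenization and part-of-speech tagging only,
# with GPU use explicitly disabled via the documented use_gpu option.
nlp_light = stanza.Pipeline('en', processors='tokenize,pos', use_gpu=False)
```

Trimming the processor list this way can speed things up considerably when you don't need the full set of annotations.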
For our exploratory project, let’s use the first paragraph from Moby Dick.
# Use the default pipeline to create a Document object
moby_dick_para1 = "Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen and regulating the circulation. Whenever I find myself growing grim about the mouth; whenever it is a damp, drizzly November in my soul; whenever I find myself involuntarily pausing before coffin warehouses, and bringing up the rear of every funeral I meet; and especially whenever my hypos get such an upper hand of me, that it requires a strong moral principle to prevent me from deliberately stepping into the street, and methodically knocking people’s hats off—then, I account it high time to get to sea as soon as I can. This is my substitute for pistol and ball. With a philosophical flourish Cato throws himself upon his sword; I quietly take to the ship. There is nothing surprising in this. If they but knew it, almost all men in their degree, some time or other, cherish very nearly the same feelings towards the ocean with me."
moby_p1 = nlp(moby_dick_para1) # return a Document object
Data Objects
Documents
Stanza Document objects include text, tokens, words, dependencies, and entities attributes. The tokens, words, dependencies, and entities attributes are lists, and individual items can be accessed by index.
def print_doc_info(doc):
    print(f"Num sentences:\t{len(doc.sentences)}")
    print(f"Num tokens:\t{doc.num_tokens}")
    print(f"Num words:\t{doc.num_words}")
    print(f"Num entities:\t{len(doc.entities)}")

print_doc_info(moby_p1)
# Num sentences: 8
# Num tokens: 222
# Num words: 222
# Num entities: 3
Sentences
Each Sentence object contains doc, text, dependencies, tokens, words, and entities attributes. Individual items can be accessed by indexing the appropriate list.
def print_sentence_info(sentence):
    print(f"Text: {sentence.text}")
    print(f"Num tokens:\t{len(sentence.tokens)}")
    print(f"Num words:\t{len(sentence.words)}")
    print(f"Num entities:\t{len(sentence.entities)}")

print_sentence_info(moby_p1.sentences[0])
# Text: Call me Ishmael.
# Num tokens: 4
# Num words: 4
# Num entities: 1
Tokens
Each Token
object includes text
, words
, start_char
, and end_char
attributes, among others. In cases where a token is a multi-word token, the words
attribute will contain each of the underlying words.
def print_token_info(token):
    print(f"Text:\t{token.text}")
    print(f"Start:\t{token.start_char}")
    print(f"End:\t{token.end_char}")

print_token_info(moby_p1.sentences[0].tokens[2])
# Text: Ishmael
# Start: 8
# End: 15
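A handy consequence of these offsets is that they index directly into the original document text, so a token's surface form can always be recovered by slicing. A quick sanity check in plain Python (no pipeline needed), using the offsets printed above:

```python
# start_char/end_char are offsets into the original text, so slicing
# the source string recovers the token's surface form.
text = "Call me Ishmael."
start, end = 8, 15  # the offsets Stanza reported for the "Ishmael" token
print(text[start:end])  # Ishmael
```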
Words
Each Word object includes various word-level annotations, as defined by the various processors (part-of-speech tagging, lemmatization, etc.), including text, lemma, upos, xpos, feats, and others.
def print_word_info(word):
    print(f"Text:\t{word.text}")
    print(f"Lemma: \t{word.lemma}")
    print(f"UPOS: \t{word.upos}")
    print(f"XPOS: \t{word.xpos}")

print_word_info(moby_p1.sentences[3].words[4])
# Text: growing
# Lemma: grow
# UPOS: VERB
# XPOS: VBG
def word_info_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame object with one row for each word in
      doc, and columns for text, lemma, upos, and xpos.
    """
    rows = []
    for sentence in doc.sentences:
        for word in sentence.words:
            row = {
                "text": word.text,
                "lemma": word.lemma,
                "upos": word.upos,
                "xpos": word.xpos,
            }
            rows.append(row)
    return pd.DataFrame(rows)

word_info_df(moby_p1)
# text lemma upos xpos
# 0 Call call VERB VB
# 1 me I PRON PRP
# 2 Ishmael Ishmael PROPN NNP
# 3 . . PUNCT .
# 4 Some some DET DT
Entities
Stanza includes a built-in named entity recognition (NER) module, with options for extension and customization. The default pipeline includes the built-in NERProcessor, which recognizes named entities for all token spans. Each Entity object includes attributes for text, tokens, type, start_char, end_char, and others.
def print_entity_info(entity):
    print(f"Text:\t{entity.text}")
    print(f"Type:\t{entity.type}")
    print(f"Start:\t{entity.start_char}")
    print(f"End:\t{entity.end_char}")

print_entity_info(moby_p1.entities[0])
# Text: Ishmael
# Type: PERSON
# Start: 8
# End: 15
Sentiment Analysis
Stanza includes a built-in sentiment analysis processor, which can be customized as needed. Each Sentence object in a Document includes a sentiment score, where 0 represents negative, 1 represents neutral, and 2 represents positive. To make this a bit more human-readable, we’ll convert the scores to a string descriptor.
def sentiment_descriptor(sentence):
    """
    - Parameters: sentence (a Stanza Sentence object)
    - Returns: A string descriptor for the sentiment value of sentence.
    """
    sentiment_value = sentence.sentiment
    if sentiment_value == 0:
        return "negative"
    elif sentiment_value == 1:
        return "neutral"
    else:
        return "positive"

print(sentiment_descriptor(moby_p1.sentences[0]))
# neutral
def sentence_sentiment_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame with one row for each sentence in doc,
      and columns for the sentence text and sentiment descriptor.
    """
    rows = []
    for sentence in doc.sentences:
        row = {
            "text": sentence.text,
            "sentiment": sentiment_descriptor(sentence)
        }
        rows.append(row)
    return pd.DataFrame(rows)

sentence_sentiment_df(moby_p1)
# text sentiment
# 0 Call me Ishmael. neutral
# 1 Some years ago—never mind how long precisely—h... neutral
# 2 It is a way I have of driving off the spleen a... neutral
# 3 Whenever I find myself growing grim about the ... negative
# 4 This is my substitute for pistol and ball. neutral
# 5 With a philosophical flourish Cato throws hims... neutral
# 6 There is nothing surprising in this. neutral
# 7 If they but knew it, almost all men in their d... neutral
Character Analysis
Now that we know a little bit about how to use Stanza, let’s use it to see if we can learn anything about the characters in Moby Dick. First, we’ll have to load up the full text. As many of you will remember, Moby Dick is a long novel, so putting it through the Stanza pipeline can take a while. If you happen to have access to GPUs though, Stanza is GPU-aware and the process will go much faster.
# load the full text and put it through the pipeline
def load_text_doc(file_path):
    with open(file_path) as f:
        txt = f.read()
    return txt

moby_path = "moby_dick.txt"
moby_dick_text = load_text_doc(moby_path)
moby_dick = nlp(moby_dick_text)
print_doc_info(moby_dick)
# Num sentences: 9966
# Num tokens: 253928
# Num words: 253928
# Num entities: 7955
Moby Dick Characters
Let’s use Stanza’s entity recognition function to identify all the characters in Moby Dick. We’ll do this by selecting only those entities that have the type PERSON. Since each entity points back to its containing sentence, we’ll go ahead and save the sentiment of that sentence for future use.
# select person entities
def select_person_entities(doc):
    return [ent for ent in doc.entities if ent.type == "PERSON"]

def person_df(doc):
    """
    - Parameters: doc (a Stanza Document object)
    - Returns: A Pandas DataFrame with one row for each entity in doc
      that has a "PERSON" type, and columns for text, type, start_char,
      end_char, and the sentiment of the sentence in which the entity
      appears.
    """
    rows = []
    persons = select_person_entities(doc)
    for person in persons:
        row = {
            "text": person.text,
            "type": person.type,
            "start_char": person.start_char,
            "end_char": person.end_char,
            "sentence_sentiment": sentiment_descriptor(person._sent)
        }
        rows.append(row)
    return pd.DataFrame(rows)
characters = person_df(moby_dick)
display(characters.head())
# text type start_char end_char sentence_sentiment
# 0 Ishmael PERSON 29 36 neutral
# 1 Cato PERSON 890 894 neutral
# 2 Tiger-lilies PERSON 4226 4238 neutral
# 3 Jove PERSON 4988 4992 neutral
# 4 Narcissus PERSON 5080 5089 negative
Now that we have all of the characters from Moby Dick, we can start to analyze the data to see what we can learn about them. First, how many characters are there?
def num_unique_items(df, col):
    return len(df[col].unique())

num_unique_items(characters, "text")
# 699
Wow! 699 characters (or at least unique PERSON entities) is a lot. Most of those are mentioned just a single time, so perhaps we should take a look at just the major characters.
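One way to narrow the field is to drop the one-off mentions before digging further. Here’s a small sketch of that filtering step using pandas, on a hypothetical toy DataFrame standing in for our characters data:

```python
import pandas as pd

# Toy stand-in for the `characters` DataFrame built above.
df = pd.DataFrame({"text": ["Ahab", "Ahab", "Pip", "Cato", "Ahab"]})

# Keep only the names that appear more than once.
counts = df["text"].value_counts()
major = counts[counts > 1].index.tolist()
print(major)  # ['Ahab']
```

The same value_counts-based filter applied to the real data would give us a shortlist of recurring characters to focus on.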
Character Counts
With our character dataframe in hand, we can now check which characters appear in the text most often. This will give us some idea about which characters are the most important.
# Which characters appear most frequently?
def frequency_count(df, col, limit=10):
    return df[col].value_counts().head(limit)

frequency_count(characters, "text")
# Ahab 474
# Stubb 224
# Queequeg 184
# Starbuck 140
# Jonah 81
# Moby Dick 75
# Bildad 70
# Peleg 69
# Pip 68
# Pequod 60
Unsurprisingly for anyone who has read Moby Dick, Captain Ahab is the most-mentioned character in the book. Other members of his crew like Stubb, Queequeg, and Starbuck make appearances in the most-frequent list as well. And of course, Moby Dick himself is in the top 10.
Character Sentiment
Since each Entity also includes a pointer to its parent sentence, we can now use the sentence sentiment rating that we saved earlier to make a judgement about the overall character sentiment. We’ll do this by converting our sentiment descriptors to a value of -1 for “negative”, 0 for “neutral”, and 1 for “positive”. After that, we can group the various appearances of each character and sum the sentiment value for each sentence the character appears in. A negative sum indicates a negative overall character sentiment, and a positive sum the opposite. And the farther from 0 the sum is, the stronger the sentiment.
# What is the sentiment surrounding each character?
def sentiment_descriptor_to_val(descriptor):
    """
    - Parameters: descriptor ("negative", "neutral", or "positive")
    - Returns: -1 for "negative", 0 for "neutral", 1 for "positive"
    """
    if descriptor == "negative":
        return -1
    elif descriptor == "neutral":
        return 0
    else:
        return 1
def character_sentiment(df):
    """
    - Parameters: df (Pandas DataFrame)
        - df must contain "text" and "sentence_sentiment" columns.
    - Returns: A Pandas DataFrame with one row per unique character,
      sorted by the summed sentiment values of the sentences in which
      the character appears.
    """
    sentiment = df.copy()
    sentiment["sentence_sentiment"] = [
        sentiment_descriptor_to_val(s) for s
        in sentiment["sentence_sentiment"]
    ]
    sentiment = sentiment[["text", "sentence_sentiment"]]
    sentiment = sentiment.groupby("text").sum().reset_index()
    return sentiment.sort_values("sentence_sentiment")
sentiment_df = character_sentiment(characters)
print("Characters in the most negative settings.")
display(sentiment_df.head(5))
# Characters in the most negative settings.
# text sentence_sentiment
# 6 Ahab -42
# 508 Queequeg -24
# 317 Jonah -18
# 588 Stubb -11
# 468 Pequod -10
print("Characters in the most positive settings")
display(sentiment_df.tail(5))
# Characters in the most positive settings
# text sentence_sentiment
# 1 Abraham 2
# 424 Monsieur 3
# 218 Gabriel 3
# 401 Mary 4
# 93 Bunger 4
Phew. It would seem that Moby Dick is pretty grim! Almost no characters appear in majority positive sentences—and for those who do, the positivity is quite weak. As for Captain Ahab, his overall sentence sentiment sum is -42! Of course, we haven’t checked to see whether the sentiment is about Ahab, but merely the sentiment of sentences in which Ahab appears. Perhaps this is an indicator that Ahab lives a tortured and unhappy life—it would seem that he isn’t in a lot of happy sentences.
Next Steps
And that’s it for our quick look at Stanza! If you think Stanza could be a good fit for your needs, I highly encourage you to check it out—the documentation is excellent and has a good overview on usage. Perhaps you too can use it to explore your favorite novel. (And if you do, be sure to let us know the results!)