Severin Perez

All About Entropy

February 1, 2025
37 min read

In this article, we’re going to do a deep dive into the idea of entropy, one of the most important concepts in information theory, and a critical tool in text analytics. We’ll review how to calculate entropy, some nuances related to measuring entropy in texts, and of course, lots of code along the way.

What is Entropy?

Put simply, entropy is a measure of uncertainty. Imagine picking numbers from a lottery machine. If every ball in the machine had the number 4 written on it, then you would know with 100% certainty that no matter which ball you pick, it would be the number 4. In other words, the uncertainty would be zero, and the entropy would also be zero. As you add more balls, with different numbers, and in different quantities, the uncertainty goes up. And so does the entropy. To use slightly more technical terms, entropy measures the uncertainty in a probability distribution. If you make a random observation of the distribution, entropy tells you how likely you are to be able to guess the outcome ahead of time. With low entropy, it’s very easy to guess (as with the lottery machine that only has 4’s). With high entropy, it’s very difficult.

Although the term “entropy” goes back to the mid-19th century, when it was first used to describe disorder in thermodynamic systems, the term’s history in information theory is more recent. Mathematician Claude Shannon first developed the concept in 1948, in his paper titled A Mathematical Theory of Communication. Shannon was working on how to encode information for transmission (and secrecy!) and entropy was a key idea in understanding how to compress messages without losing any information.

Entropy relates to all kinds of information, but right from the start, Shannon recognized its applicability to language. In his 1951 paper, Prediction and Entropy of Printed English, Shannon used entropy to evaluate the redundancy of a language, and thus its predictability. For our purposes in text analytics, we can use entropy as a way of evaluating the complexity and information content of a text. A document that uses a limited vocabulary, with simple grammatical constructions, is a document with low entropy, because it is relatively predictable.

Consider the sentence “I like red apples, and I like yellow bananas.” If we masked the colors, as in “I like ____ apples, and ____ bananas”, most readers could predict them after a guess or two, especially once they identified the pattern of “I like (color) (fruit).” In this way, we can see that it’s a sentence with low entropy. Compare this to an unusual sentence like “Quixotic chrysanthemums tango beneath stars that smelled of purple.” Here we have rare words in unexpected combinations. If we masked the word “purple”, almost no one would guess it on the first try. By comparison to our first sentence, this one has much higher entropy.

Now that we know a bit about entropy, let's look at how to calculate it. Formally, entropy is defined for a discrete random variable X as:

H(X) = -\sum_{i=1}^n p(x_i) \log_2 p(x_i)

where:

  • p(x_i) is the probability of occurrence of event x_i;
  • n is the number of possible events; and,
  • The logarithm is base 2, meaning we’re measuring entropy in bits. (We could measure in other units, but for now let’s stick with bits.)

Let’s see if we can understand what is happening by implementing the formula in Python.

Character Entropy

In our first implementation, we’ll calculate entropy based on characters in a string, using the character-level probability distribution of the string itself as context. We’ll come back later to why the context matters.

First, let’s define a Python function to calculate entropy.

from collections import Counter
from math import log2

def char_entropy(text: str) -> float:
    """
    Calculates the character-level Shannon entropy of a text, in bits,
    using the probability distribution of characters in the text itself.

    Args:
        text: The input text.

    Returns:
        float: The Shannon entropy of the text.
    """
    # calculate the frequency of each character in the text
    freqs = Counter(text)
    # calculate the entropy using the formula -sum(p * log2(p)),
    # where p is the probability of each character
    total_chars = len(text)
    entropy = 0.0
    for freq in freqs.values():
        prob = freq / total_chars
        entropy -= prob * log2(prob)
    return entropy

Now that we have our function, let’s try it out on a few simple strings.

all_a = "aaaaaa"
ab_repeat = "ababab"
first_letters = "abcdef"
rare_letters = "jkqvxz"
sents = [all_a, ab_repeat, first_letters, rare_letters]
for s in sents:
    print(f"{s} (entropy: {char_entropy(s)})")
# ---- Output -----
# aaaaaa (entropy: 0.0)
# ababab (entropy: 1.0)
# abcdef (entropy: 2.584962500721156)
# jkqvxz (entropy: 2.584962500721156)

At the character level, our results show us that the string aaaaaa has an entropy of 0.0, meaning that it has no uncertainty, and thus contains no information. Imagine that you are randomly selecting letters to print, but you're selecting them from an alphabet that only consists of the letter a. Given that there is only a single outcome (the letter a being selected), we have zero uncertainty and can accurately predict the outcome every time.

By comparison, the string ababab has higher entropy (1.0) because its alphabet consists of two letters, and the probability of drawing each is 50% (we’ll assume a uniform distribution for now). This means that we have an element of uncertainty, because rather than ababab, we might well have ended up with aaabbb or aaabaa and many other combinations. And once we expand the alphabet to six letters, as in abcdef or jkqvxz, the entropy goes up even higher to 2.584.
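As a quick sanity check on these numbers: when every distinct character in a string appears in equal proportion (as in all four of our examples), the character-level entropy is simply log2 of the number of distinct characters. A tiny verification, reusing nothing but the standard library:

from math import log2

# entropy of a string whose k distinct characters appear in equal proportions is log2(k)
print(log2(2))  # 1.0, matching "ababab"
print(log2(6))  # 2.584962500721156, matching "abcdef" and "jkqvxz"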

There is, however, a problem with this approach. We know that in the real world, the alphabet consists of more than one, two, or six letters. The English alphabet has 26 letters, plus lowercase and capital forms, plus punctuation marks. What happens if we take the whole alphabet into consideration?

import re

import numpy as np

def char_entropy_with_dist(text: str, prob_dist: dict[str, float], verbose: bool = False) -> dict[str, float]:
    """
    Calculates the Shannon entropy of a text string, given a probability
    distribution for lowercase letters.

    Args:
        text: The input text string (letters only).
        prob_dist: A dictionary mapping lowercase letters to their
            probabilities.
        verbose: If True, print per-character details.

    Returns:
        A dictionary with the entropy and total information content of the text.
    """
    if re.search(r'[^A-Za-z]', text):  # validate input
        print("Error: Input contains non-letter characters")
        raise ValueError("Input text must contain only letters A-Z or a-z")
    text = text.lower()  # ensure lowercase
    if verbose:
        print(f"Input text (lowercase): {text} (len: {len(text)})")
    entropy = 0
    information = 0
    for char in text:
        if char in prob_dist:
            prob = prob_dist[char]
            char_info = -np.log2(prob)
            contribution = prob * char_info
            entropy = entropy + contribution
            information = information + char_info
            if verbose:
                print(f"Character: {char}, Probability: {prob:.6f}, Information: {char_info:.6f}, Contribution: {contribution:.6f}")
    if verbose:
        print(f"Final entropy: {entropy}")
        print(f"Total information: {information}\n")
    return {"entropy": entropy, "information": information}

In our new entropy function, we accept a probability distribution as an argument. The distribution allows us to provide context to the calculation, rather than assuming that the input text includes the entirety of available options. Now, when we calculate entropy of aaaaaa, we can do so knowing that a is just one letter out of 26 in the English alphabet. First, let’s assume that we have a uniform distribution of letter frequency in the alphabet, meaning each character is equally likely to be selected.

uniform_dist = {char: 1/26 for char in 'abcdefghijklmnopqrstuvwxyz'}
print("Uniform alphabet distribution:\n")
for s in sents:
    char_entropy_with_dist(s, uniform_dist, verbose=True)
Uniform alphabet distribution:
Input text (lowercase): aaaaaa (len: 6)
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Final entropy: 1.0847168580325597
Total information: 28.20263830884655
Input text (lowercase): ababab (len: 6)
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: b, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: b, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: b, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Final entropy: 1.0847168580325597
Total information: 28.20263830884655
Input text (lowercase): abcdef (len: 6)
Character: a, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: b, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: c, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: d, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: e, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: f, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Final entropy: 1.0847168580325597
Total information: 28.20263830884655
Input text (lowercase): jkqvxz (len: 6)
Character: j, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: k, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: q, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: v, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: x, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Character: z, Probability: 0.038462, Information: 4.700440, Contribution: 0.180786
Final entropy: 1.0847168580325597
Total information: 28.20263830884655

With a uniform distribution, we see that all of our strings have the same entropy! If each letter is equally likely to be selected, then each string of the same length is equally likely. This makes the uncertainty (or “surprise”) of each outcome the same. Every string still contains some information, given that there are 26^6 potential outcomes in a six-character string, but no single string is more likely than another.
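We can also check the “Total information” figure printed above: under the uniform distribution every character carries log2(26) bits, so any six-character string carries 6 × log2(26) = log2(26^6) bits in total. A quick check:

from math import log2

print(6 * log2(26))   # ≈ 28.2026, matching the total information above
print(log2(26 ** 6))  # same value, computed the other way around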

Before moving on, we should take a moment to break down the formula for entropy, as it contains a few important elements. The entropy is the sum of the weighted information content (aka uncertainty, or surprise) of each character in the string. The information content of a given character is -log2(p(x)). By intuition, we know that an outcome with a lower probability is more “surprising” and therefore should provide more information than an outcome with higher probability. The logarithm serves two purposes: first, for values between 0 and 1 (that is, all probabilities), it produces larger values for smaller inputs (matching our intuition for greater surprise with smaller probabilities); and second, since the logarithm function is non-linear, very rare events carry disproportionately more information than merely uncommon ones. Finally, we invert the sign because the logarithm of a value between 0 and 1 is always negative, and we want our information measure to be positive. With this, we can say that the information content I, measured in base b (usually 2, for bits), of a value x is:

I(x) = -\log_b(p(x))
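To make this concrete, here is a minimal sketch (just the formula above in code, with a throwaway helper) showing how information content grows as probability shrinks:

from math import log2

def information_content(p: float) -> float:
    """Information content (surprise) of an event with probability p, in bits."""
    return -log2(p)

for p in [0.5, 0.1, 0.01]:
    print(f"p={p}: {information_content(p):.3f} bits")
# p=0.5: 1.000 bits
# p=0.1: 3.322 bits
# p=0.01: 6.644 bits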

Going back to our uniform distribution, we observed that every letter had the same information content and contribution to entropy. But we know that's not how language works. In every language, certain letters are used more frequently than others, because languages have patterns and rules that dictate spelling. In English, for example, we know that q is a low frequency letter (hence the high score in Scrabble!) and e is very common. So what happens if we use a more realistic distribution, based on actual observations of the English language? Let's find out, using this handy real-world distribution provided by the University of Notre Dame.

observed_dist = {
'a': 0.084966, 'b': 0.020720, 'c': 0.045388, 'd': 0.033844,
'e': 0.111607, 'f': 0.018121, 'g': 0.024705, 'h': 0.030034,
'i': 0.075448, 'j': 0.001965, 'k': 0.011016, 'l': 0.054893,
'm': 0.030129, 'n': 0.066544, 'o': 0.071635, 'p': 0.031671,
'q': 0.001962, 'r': 0.075809, 's': 0.057351, 't': 0.069509,
'u': 0.036308, 'v': 0.010074, 'w': 0.012899, 'x': 0.002902,
'y': 0.017779, 'z': 0.002722
}
print("\nObserved distribution:")
for s in sents:
    char_entropy_with_dist(s, observed_dist, verbose=True)
Observed distribution:
Input text (lowercase): aaaaaa (len: 6)
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Final entropy: 1.8133293544228715
Total information: 21.341823251922786
Input text (lowercase): ababab (len: 6)
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: b, Probability: 0.020720, Information: 5.592832, Contribution: 0.115883
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: b, Probability: 0.020720, Information: 5.592832, Contribution: 0.115883
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: b, Probability: 0.020720, Information: 5.592832, Contribution: 0.115883
Final entropy: 1.2543151259398315
Total information: 27.449408186212164
Input text (lowercase): abcdef (len: 6)
Character: a, Probability: 0.084966, Information: 3.556971, Contribution: 0.302222
Character: b, Probability: 0.020720, Information: 5.592832, Contribution: 0.115883
Character: c, Probability: 0.045388, Information: 4.461545, Contribution: 0.202501
Character: d, Probability: 0.033844, Information: 4.884956, Contribution: 0.165326
Character: e, Probability: 0.111607, Information: 3.163501, Contribution: 0.353069
Character: f, Probability: 0.018121, Information: 5.786194, Contribution: 0.104852
Final entropy: 1.2438525366833209
Total information: 27.44599829714204
Input text (lowercase): jkqvxz (len: 6)
Character: j, Probability: 0.001965, Information: 8.991255, Contribution: 0.017668
Character: k, Probability: 0.011016, Information: 6.504256, Contribution: 0.071651
Character: q, Probability: 0.001962, Information: 8.993459, Contribution: 0.017645
Character: v, Probability: 0.010074, Information: 6.633220, Contribution: 0.066823
Character: x, Probability: 0.002902, Information: 8.428737, Contribution: 0.024460
Character: z, Probability: 0.002722, Information: 8.521117, Contribution: 0.023194
Final entropy: 0.22144159306730116
Total information: 48.07204347721594

As you can see, the distribution makes a huge difference in the entropy of our strings. Now, instead of them all being the same (as with the uniform distribution), each string has a different entropy. Counterintuitively though, we see that the string aaaaaa actually has the highest entropy, even though a is the second most common letter, and thus contains very little information. Why is this?

If we look back at our entropy formula, we can substitute I(x_i) for -log2(p(x_i)) to see how information content relates to entropy more clearly:

H(X) = \sum_{i=1}^n p(x_i) \, I(x_i)

As we can see, the entropy is not just the sum of information in the string, but the weighted sum, where the information content of each character is weighted by its probability. In other words, very high probability letters like a and e have greater weight than low probability letters like q or z.
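We can see the weighting at work by comparing a common letter with a rare one, using the observed distribution from above:

from math import log2

# 'q' is far more "surprising" than 'e', but its entropy contribution
# p * -log2(p) is much smaller because 'q' occurs so rarely
for letter in ['e', 'q']:
    p = observed_dist[letter]
    print(f"{letter}: info = {-log2(p):.4f} bits, contribution = {p * -log2(p):.6f}")
# e: info = 3.1635 bits, contribution = 0.353069
# q: info = 8.9935 bits, contribution = 0.017645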

At this point, we should clarify what the entropy of a string really means. Entropy is a measure of the average information across all events in a probability distribution. It answers the question, “if I draw from this distribution, on average, how much information should I expect to get out?” So, if we calculate the entropy of a string, what we’re really doing is saying something about that string’s relationship to the overall distribution, rather than something about the string itself.

This raises an important question: in text analytics, should we be measuring the entropy of a string at all? Isn't the total information content more important? Reasonable minds could argue this point, and different use cases probably prefer different calculations; however, it's worth noting that the two measurements, though related, tell us different things. The entropy of a text is primarily a question of uncertainty. On average, a text with high entropy is likely to have high information content. Total information, though, is a function of length and of information efficiency (how much information, on average, we pack into each symbol of a text). It's a nuanced difference, but a meaningful one.
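As a rough illustration of the difference, consider a short string and a longer repetition of it. Reusing the char_entropy function from earlier (which uses the string's own character distribution), the per-character entropy is identical, but the total information, here simply entropy times length, scales with the length of the text:

short_text = "abab"
long_text = "abab" * 10
for s in [short_text, long_text]:
    h = char_entropy(s)
    print(f"len={len(s)}: entropy={h:.3f} bits/char, total information = {h * len(s):.1f} bits")
# len=4: entropy=1.000 bits/char, total information = 4.0 bits
# len=40: entropy=1.000 bits/char, total information = 40.0 bits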

As an aside, and one beyond the scope of this article, we should acknowledge that our model still doesn't quite match the real world. Just as we know that real alphabets have non-uniform probability distributions for letter frequency, we also know that each letter is not independent of the last. In our calculations thus far, we have assumed independence. In reality, the probability of each letter should change based on previously observed letters. For example, we know that in English the letter q is very likely to be followed by the letter u, meaning that we should use a different probability for u in that setting than its independent frequency. This type of dependency is often modeled using tools like Markov chains or n-grams, which take into account the probabilities of sequences rather than isolated events. But that is a discussion for another day.
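Just to hint at what that looks like, here is a toy sketch (not part of the analysis in this article) that estimates conditional character probabilities from bigram counts, the simplest version of the Markov-chain idea:

from collections import Counter, defaultdict

def bigram_conditional_probs(text: str) -> dict[str, dict[str, float]]:
    """Estimate P(next_char | prev_char) from adjacent character pairs."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        counts[prev][nxt] += 1
    return {
        prev: {c: n / sum(following.values()) for c, n in following.items()}
        for prev, following in counts.items()
    }

toy_text = "the quick brown fox quietly quit squeaking"
cond_probs = bigram_conditional_probs(toy_text)
print(cond_probs["q"])  # in this toy sample, 'q' is always followed by 'u'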

Just to drive the point home, let’s plot the information content of each letter from each distribution.

import matplotlib.pyplot as plt

def plot_letter_information(distributions: dict | list[dict],
                            labels: list[str] = None,
                            title: str = "Information Content of Letters") -> None:
    """
    Plot the information content of letters given one or more probability distributions.

    Args:
        distributions: Single dictionary or list of dictionaries mapping letters to probabilities
        labels: List of labels for each distribution (only used if multiple distributions provided)
        title: Title for the plot
    """
    if isinstance(distributions, dict):
        distributions = [distributions]
        labels = ['Distribution']
    elif labels is None:
        labels = [f'Distribution {i+1}' for i in range(len(distributions))]
    plt.figure(figsize=(12, 6))
    width = 0.8 / len(distributions)
    for i, dist in enumerate(distributions):
        letters = list(dist.keys())
        probs = list(dist.values())
        info_content = [-log2(p) for p in probs]
        x = np.arange(len(letters))
        offset = width * (i - (len(distributions) - 1) / 2)
        plt.bar(x + offset, info_content, width, label=labels[i], alpha=0.7)
    plt.title(title)
    plt.xlabel("Letter")
    plt.ylabel("Information Content (bits)")
    plt.xticks(range(len(letters)), letters)
    if len(distributions) > 1:
        plt.legend()
    ymax = max(max(-log2(p) for p in dist.values()) for dist in distributions)
    plt.ylim(0, ymax * 1.1)
    plt.show()

plot_letter_information(
    [observed_dist, uniform_dist],
    labels=["Observed distribution", "Uniform distribution"],
    title="Information content of English letters")

Bar chart of information content per letter.

Here, we can clearly see that in the observed distribution (the one drawn from actual English texts), certain letters have far more information content than others. This shows us how important the probability distribution is to the entropy of a text, and the information content of its constituent parts.

And speaking of probabilities and information content, it’s worth taking a look at how the former affects the latter. We can do so by plotting our observed distribution frequencies against the information content for each letter.

import seaborn as sns

def plot_information_vs_probability(dist: dict) -> None:
    """Scatter plot of information content vs. probability for each letter."""
    letters = list(dist.keys())
    freqs = np.array(list(dist.values()))
    info_content = -np.log2(freqs)
    plt.figure(figsize=(10, 6))
    sns.scatterplot(x=freqs, y=info_content, color='b', alpha=0.6)
    for i, (letter, x, y) in enumerate(zip(letters, freqs, info_content)):
        offset = 10 if i % 2 == 0 else -15
        plt.annotate(letter, (x, y), xytext=(0, offset),
                     textcoords='offset points', fontsize=10, ha='center')
    plt.xlabel("Probability")
    plt.ylabel("Information Content (bits)")
    plt.title("Information Content vs. Probability of English Letters")
    plt.ylim(bottom=0, top=max(info_content) * 1.1)
    sns.despine()
    plt.show()

plot_information_vs_probability(observed_dist)

Information content vs. probability plot for English letters.

The above plot makes it clear that as probability goes up, information content goes down. This is exactly what we expect, given that information content is calculated as I(x) = -log2(p(x)), so it falls off logarithmically as probability increases.

What About Words?

Thus far we have been talking about entropy in the context of characters. Although useful for illustrative purposes, or to understand probability distributions within an alphabet, character-based entropy isn’t particularly interesting in most text analytics applications. In general, our main concern isn’t with the information contained in letters, but the information in words.

We know that entropy is calculated as the sum of weighted information content for each observation drawn from a random variable. In the context of character-based entropy, the range of possible observations was the lowercase alphabet, and the probability distribution was the frequency with which those letters appear in a language using that alphabet. Now, in the context of word-based entropy, each observation is a single word, drawn from a range consisting of all words in a given vocabulary, with probabilities given by the frequency of those words in some corpus of texts. Ideally, it would be nice to have the entropy of an entire language, but given how languages change over time, and the impracticality of observing every use of a word in any context, written or spoken, anywhere, over all of time, we have to settle for vocabularies and frequencies as defined by a particular collection of texts.

So, where do we get these vocabularies and probability distributions? Well, to start, we could make our own. Consider the following simple sentences.

import pandas as pd

fruit_sentences = [
    "I like red apples, and I like yellow bananas.",
    "I eat apples, oranges, or bananas, daily.",
    "I like bananas more than red apples or oranges.",
    "I sometimes pick my own apples and oranges.",
    "My dog Rex likes apples and bananas."
]
fruit_text = " ".join(fruit_sentences).lower()
# remove punctuation
fruit_text = re.sub(r'[^\w\s]', '', fruit_text)
# build vocabulary and frequency distribution
fruit_freqs = Counter(fruit_text.split())
# calculate probabilities
total_words = sum(fruit_freqs.values())
fruit_probs = {word: freq / total_words for word, freq in fruit_freqs.items()}
fruit_df = pd.DataFrame({
    "word": list(fruit_freqs.keys()),
    "freq": list(fruit_freqs.values()),
    "prob": list(fruit_probs.values()),
    "info": [-np.log2(p) for p in fruit_probs.values()],
    "entropy_contribution": [p * -np.log2(p) for p in fruit_probs.values()]
}).sort_values("freq", ascending=False).reset_index(drop=True)
fruit_df
word        freq   prob    info      entropy_contribution
i              5  0.125  3.000000              0.375000
apples         5  0.125  3.000000              0.375000
bananas        4  0.100  3.321928              0.332193
and            3  0.075  3.736966              0.280272
oranges        3  0.075  3.736966              0.280272
like           3  0.075  3.736966              0.280272
red            2  0.050  4.321928              0.216096
or             2  0.050  4.321928              0.216096
my             2  0.050  4.321928              0.216096
pick           1  0.025  5.321928              0.133048
rex            1  0.025  5.321928              0.133048
dog            1  0.025  5.321928              0.133048
own            1  0.025  5.321928              0.133048
daily          1  0.025  5.321928              0.133048
sometimes      1  0.025  5.321928              0.133048
than           1  0.025  5.321928              0.133048
more           1  0.025  5.321928              0.133048
eat            1  0.025  5.321928              0.133048
yellow         1  0.025  5.321928              0.133048
likes          1  0.025  5.321928              0.133048

In this corpus of five sentences, our vocabulary has 20 words, with words like I, apples, and bananas having high probability. If we randomly draw a word from this corpus, we wouldn’t be particularly surprised if we drew apples, because it is high frequency. So, in the context of this corpus, apples has relatively low information content. By comparison, the word dog only occurs a single time, and thus has much higher information content. Let’s use the same methodology as before to calculate entropy for this corpus.

from typing import Callable, Optional

def calculate_corpus_entropy(
    corpus: list[str],
    prob_func: Optional[Callable] = None
) -> dict:
    """
    Calculate entropy and information metrics for a corpus of text.

    Args:
        corpus: A list of strings representing the texts to analyze
        prob_func: A function to calculate the probability of a word given its frequency.
            If None, the probability is calculated as freq / total_words.

    Returns:
        dict containing:
            - entropy: The total entropy of the corpus
            - information_content: The total information content
            - total_words / total_vocab: word and vocabulary counts
            - word_data: DataFrame with per-word statistics
    """
    # join all texts and convert to lowercase
    full_text = " ".join(corpus).lower()
    # remove punctuation and split into words
    words = re.sub(r'[^\w\s]', '', full_text).split()
    # calculate frequencies
    freqs = Counter(words)
    total_words = sum(freqs.values())
    # calculate metrics for each word
    word_data = []
    total_entropy = 0
    total_information = 0
    for word, freq in freqs.items():
        prob = prob_func(word) if prob_func else freq / total_words
        information = -np.log2(prob)  # information content in bits
        entropy_contribution = prob * information
        word_data.append({
            'word': word,
            'frequency': freq,
            'probability': prob,
            'information': information,
            'entropy_contribution': entropy_contribution
        })
        total_entropy += entropy_contribution
        total_information += information
    word_df = pd.DataFrame(word_data).sort_values('frequency', ascending=False).reset_index(drop=True)
    return {
        'entropy': total_entropy,
        'information_content': total_information,
        'total_words': total_words,
        'total_vocab': len(freqs),
        'word_data': word_df
    }

fruit_data = calculate_corpus_entropy(fruit_sentences)
print(f"Total entropy: {fruit_data['entropy']:.4f}")
print(f"Total information content: {fruit_data['information_content']:.4f}")
print(f"Total words: {fruit_data['total_words']}")
print(f"Total vocabulary: {fruit_data['total_vocab']}")
fruit_data['word_data']
Total entropy: 4.0348
Total information content: 92.0398
Total words: 40
Total vocabulary: 20
word        frequency  probability  information  entropy_contribution
i                   5        0.125     3.000000              0.375000
apples              5        0.125     3.000000              0.375000
bananas             4        0.100     3.321928              0.332193
and                 3        0.075     3.736966              0.280272
oranges             3        0.075     3.736966              0.280272
like                3        0.075     3.736966              0.280272
red                 2        0.050     4.321928              0.216096
or                  2        0.050     4.321928              0.216096
my                  2        0.050     4.321928              0.216096
pick                1        0.025     5.321928              0.133048
rex                 1        0.025     5.321928              0.133048
dog                 1        0.025     5.321928              0.133048
own                 1        0.025     5.321928              0.133048
daily               1        0.025     5.321928              0.133048
sometimes           1        0.025     5.321928              0.133048
than                1        0.025     5.321928              0.133048
more                1        0.025     5.321928              0.133048
eat                 1        0.025     5.321928              0.133048
yellow              1        0.025     5.321928              0.133048
likes               1        0.025     5.321928              0.133048

Here, we see that our corpus of texts, fruit_sentences, has an entropy of 4.0348 and a total information content of 92.0398. As expected, the rare words like dog, rex, and yellow contribute more information content, but since entropy contributions are weighted by frequency, high probability words like apples and bananas end up contributing more total entropy.
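One way to put that 4.0348 in context: entropy is maximized when every word in the vocabulary is equally likely, in which case it equals log2 of the vocabulary size. Our little corpus comes in just below that ceiling, since a few words dominate. A quick check, reusing the fruit_data results from above:

from math import log2

max_entropy = log2(fruit_data['total_vocab'])  # uniform distribution over the 20-word vocabulary
print(f"Maximum possible entropy: {max_entropy:.4f} bits")    # ≈ 4.3219
print(f"Observed corpus entropy:  {fruit_data['entropy']:.4f} bits")  # ≈ 4.0348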

Of course, this is a very small corpus, with a very small vocabulary. Let's try calculating entropy for a corpus with a much larger vocabulary. To do this, let's take a look at the vocabulary of Crime and Punishment. After downloading the text, we'll run it through the same function.

import requests

def read_text_from_url(url: str, encoding: str = 'utf-8') -> str:
    """
    Reads text content from a URL with the specified encoding.

    Args:
        url (str): The URL of the text file to read
        encoding (str, optional): The encoding to use when reading the text. Defaults to 'utf-8'.

    Returns:
        str: The text content from the URL

    Raises:
        requests.RequestException: If the request fails
    """
    try:
        response = requests.get(url)
        response.raise_for_status()  # raise an exception for bad status codes
        response.encoding = encoding  # set the encoding
        return response.text
    except requests.RequestException as e:
        print(f"Error fetching text from URL: {e}")
        raise

# download text
crime_url = "https://www.gutenberg.org/ebooks/2554.txt.utf-8"
crime_text = read_text_from_url(crime_url)
# trim metadata
start_marker = "*** START OF THE PROJECT GUTENBERG EBOOK CRIME AND PUNISHMENT ***"
crime_text = crime_text[crime_text.find(start_marker) + len(start_marker):].strip()
# extract single chapter
crime_chp_1 = crime_text[crime_text.find("CHAPTER I"):crime_text.find("CHAPTER II")].strip()

Before diving into the whole text, let's try out our methodology on a single chapter. This will let us compare the calculations for a single chapter to those for the full text, and see how probability distributions affect information content and entropy for particular words.

crime_chp_1_data = calculate_corpus_entropy([crime_chp_1])
print(f"Total entropy: {crime_chp_1_data['entropy']:.4f}")
print(f"Total information content: {crime_chp_1_data['information_content']:.4f}")
print(f"Total words: {crime_chp_1_data['total_words']}")
print(f"Total vocabulary: {crime_chp_1_data['total_vocab']}")
print("\nWord data:")
crime_chp_1_data['word_data'].head(20)
Total entropy: 8.3847
Total information content: 11127.1207
Total words: 3328
Total vocabulary: 1013
word   frequency  probability  information  entropy_contribution
the          186     0.055889     4.161281              0.232572
and          114     0.034255     4.867550              0.166737
a            107     0.032151     4.958973              0.159438
he           100     0.030048     5.056584              0.151941
in            82     0.024639     5.342888              0.131646
of            77     0.023137     5.433653              0.125719
to            69     0.020733     5.591915              0.115938
was           60     0.018029     5.793549              0.104451
it            43     0.012921     6.274175              0.081067
his           39     0.011719     6.415037              0.075176
that          38     0.011418     6.452512              0.073677
at            34     0.010216     6.612977              0.067560
i             32     0.009615     6.700440              0.064427
as            30     0.009014     6.793549              0.061240
had           30     0.009014     6.793549              0.061240
on            29     0.008714     6.842459              0.059625
with          29     0.008714     6.842459              0.059625
but           26     0.007812     7.000000              0.054688
for           24     0.007212     7.115477              0.051314
all           24     0.007212     7.115477              0.051314

Looking at the first chapter only, we get a total entropy of 8.3847. Remember, this is the entropy of the probability distribution seen in this chapter only. We'll see in a moment how it changes when we look at the full text. But already we can see it's much higher than the entropy of our fruit sentences, because the vocabulary is much bigger and there is thus more uncertainty about which word will come out when we make a random selection.

It’s worth pointing out as well that, unsurprisingly, the most frequent words are so-called stopwords, that is, filler words that are important to the structure of a language, but not particularly original or unique. Let’s try sorting by information content.

crime_chp_1_data['word_data'].sort_values("information", ascending=False).head(20)
word           frequency  probability  information  entropy_contribution
waking                 1     0.000300    11.700440              0.003516
bustle                 1     0.000300    11.700440              0.003516
insufferable           1     0.000300    11.700440              0.003516
overwrought            1     0.000300    11.700440              0.003516
learned                1     0.000300    11.700440              0.003516
lying                  1     0.000300    11.700440              0.003516
together               1     0.000300    11.700440              0.003516
den                    1     0.000300    11.700440              0.003516
jack                   1     0.000300    11.700440              0.003516
giantkiller            1     0.000300    11.700440              0.003516
fantasy                1     0.000300    11.700440              0.003516
amuse                  1     0.000300    11.700440              0.003516
myself                 1     0.000300    11.700440              0.003516
maybe                  1     0.000300    11.700440              0.003516
airlessness            1     0.000300    11.700440              0.003516
plaster                1     0.000300    11.700440              0.003516
particularly           1     0.000300    11.700440              0.003516
scaffolding            1     0.000300    11.700440              0.003516
bricks                 1     0.000300    11.700440              0.003516
unable                 1     0.000300    11.700440              0.003516

Here we see some more interesting words, like giantkiller and fantasy. These words are very low probability, and thus have very high information content. Looking only at the first chapter, we can see that any word that appeared a single time had an information content of 11.700. We'll see how that changes with a larger vocabulary shortly.
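That 11.700 figure is no coincidence: a word that appears exactly once in the 3,328-word chapter has probability 1/3328, so its information content is -log2(1/3328) = log2(3328). A quick check, reusing crime_chp_1_data from above:

from math import log2

print(log2(crime_chp_1_data['total_words']))  # ≈ 11.7004 bits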

Now, let’s turn to the full text of Crime and Punishment and see how our numbers change, given the much larger vocabulary, and thus a different probability distribution.

crime_data = calculate_corpus_entropy([crime_text])
print(f"Total entropy: {crime_data['entropy']:.4f}")
print(f"Total information content: {crime_data['information_content']:.4f}")
print(f"Total words: {crime_data['total_words']}")
print(f"Total vocabulary: {crime_data['total_vocab']}")
print("\nWord data:")
crime_data['word_data'].head(20)
Total entropy: 9.3700
Total information content: 174062.3333
Total words: 206390
Total vocabulary: 10800
word   frequency  probability  information  entropy_contribution
the         7976     0.038645     4.693564              0.181384
and         6968     0.033761     4.888485              0.165042
to          5342     0.025883     5.271849              0.136451
he          4652     0.022540     5.471378              0.123324
a           4622     0.022394     5.480712              0.122738
i           3936     0.019071     5.712499              0.108941
of          3921     0.018998     5.718008              0.108631
you         3873     0.018765     5.735778              0.107634
in          3242     0.015708     5.992345              0.094129
it          2983     0.014453     6.112465              0.088345
that        2915     0.014124     6.145733              0.086801
was         2821     0.013668     6.193023              0.084648
his         2113     0.010238     6.609936              0.067672
at          2075     0.010054     6.636118              0.066718
her         1822     0.008828     6.823706              0.060239
but         1784     0.008644     6.854114              0.059246
not         1771     0.008581     6.864665              0.058905
with        1751     0.008484     6.881050              0.058378
for         1671     0.008096     6.948518              0.056257
she         1626     0.007878     6.987902              0.055053

Looking at the full text, our entropy has gone up a bit to 9.3700, but not nearly as much as the vocabulary, which went from 1,013 words in the first chapter alone to roughly ten times as many, 10,800, in the full text. So why didn't the entropy go up a lot more? The answer lies, at least in part, in the most frequent words. You'll note that the top words in the full text, as in the first chapter alone, are still all stopwords. In fact, the top two words, the and and, are the same in both cases! Even though the full text has many more rare words, since entropy contributions are weighted by the probability of a word, the stopwords still have the most influence.
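To quantify just how much the stopwords dominate, we can check what share of the total entropy comes from the twenty most frequent words alone (a quick sketch reusing the crime_data results from above):

# share of total entropy contributed by the 20 most frequent words
top_20 = crime_data['word_data'].head(20)
top_20_share = top_20['entropy_contribution'].sum() / crime_data['entropy']
print(f"Top 20 words account for {top_20_share:.1%} of total entropy")

In this case, those twenty words, out of a vocabulary of 10,800, account for roughly a fifth of the total entropy.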

Speaking of the rare words, let’s see how the information content changed for a few words from the first chapter, compared to the full text.

crime_data['word_data'].sort_values("information", ascending=False).head(20)
word           frequency  probability  information  entropy_contribution
newsletter             1     0.000005    17.655014              0.000086
punishing              1     0.000005    17.655014              0.000086
person_                1     0.000005    17.655014              0.000086
peaceably              1     0.000005    17.655014              0.000086
errors                 1     0.000005    17.655014              0.000086
lettres_               1     0.000005    17.655014              0.000086
peaceful               1     0.000005    17.655014              0.000086
despite                1     0.000005    17.655014              0.000086
dame                   1     0.000005    17.655014              0.000086
are                    1     0.000005    17.655014              0.000086
appropriately          1     0.000005    17.655014              0.000086
gold                   1     0.000005    17.655014              0.000086
novelreading           1     0.000005    17.655014              0.000086
directress             1     0.000005    17.655014              0.000086
maid                   1     0.000005    17.655014              0.000086
laundry                1     0.000005    17.655014              0.000086
novels                 1     0.000005    17.655014              0.000086
coughcough             1     0.000005    17.655014              0.000086
cleverer               1     0.000005    17.655014              0.000086
epilogue               1     0.000005    17.655014              0.000086

Recall that when we looked at the first chapter alone, words that appeared a single time had an information content of 11.700. Now, words with a single appearance have an information content of 17.655! (That figure is simply log2 of the total word count: log2(206390) ≈ 17.655.) In the context of the full text, with its vastly larger vocabulary, single-appearance words carry more information, as their probability went down significantly. For comparison, let's look at the words giantkiller and fantasy, which we saw in the first chapter.

words_of_interest = ["giantkiller", "fantasy"]
print("Interesting word values from chapter 1:")
chp_1_words = (
    crime_chp_1_data['word_data'][crime_chp_1_data['word_data']['word']
    .isin(words_of_interest)]
    .sort_values("word")
)
display(chp_1_words)
print("\nInteresting word values from full text:")
full_text_words = (
    crime_data['word_data'][crime_data['word_data']['word']
    .isin(words_of_interest)]
    .sort_values("word")
)
display(full_text_words)

Interesting word values from chapter 1:

word         frequency  probability  information  entropy_contribution
fantasy              1       0.0003     11.70044              0.003516
giantkiller          1       0.0003     11.70044              0.003516

Interesting word values from full text:

word         frequency  probability  information  entropy_contribution
fantasy              4     0.000019    15.655014              0.000303
giantkiller          1     0.000005    17.655014              0.000086

Here, we see that giantkiller was never used again through the rest of the text, so its probability went down, and its information content therefore went up. By comparison, fantasy appeared several more times. Its probability still went down in the context of the full text, but not by as much as giantkiller, so its information content did not rise as much.

More Realistic Vocabularies

Until now, we have been building our own vocabularies based on relatively small corpora. Crime and Punishment might be a long book, with a huge vocabulary, but the probability distribution in that vocabulary is hardly representative of the language as a whole. Thankfully, libraries like spaCy have already done the hard work of calculating probabilities of words.

spaCy tokens each have an attribute prob, which provides the smoothed log probability estimate of a word (or technically, of its lexeme). The smoothed log probability is drawn from a probability distribution estimated over a three-billion-word corpus (and then smoothed in order to avoid zero counts for words that don't appear in the corpus). This means that the probabilities are much more representative of the language as a whole, rather than probabilities in the context of our text alone. Let's update our function to use these probabilities.

import spacy
from spacy.lookups import load_lookups

def get_spacy_probability(word: str, nlp: spacy.Language) -> float:
    """
    Get the probability of a word in the spaCy language model.

    Args:
        word (string): The word to get the probability for
        nlp (spacy.Language): The spaCy language model

    Returns:
        The probability of the word (float)
    """
    # ensure word is in vocab
    _ = nlp.vocab[word]
    # spacy prob attribute is actually a natural logarithm of the probability
    ln_prob = nlp.vocab[word].prob
    # convert ln probability back to probability using e^x
    prob = np.exp(ln_prob)
    return prob

# load language model
nlp = spacy.load("en_core_web_md")
# in order to look up the probability, we need to load the lookup table
lookups = load_lookups("en", ["lexeme_prob"])
nlp.vocab.lookups.add_table("lexeme_prob", lookups.get_table("lexeme_prob"))
print("probabilities loaded...")

sample_probs = ["the", "and", "a", "fantasy", "giantkiller"]
for word in sample_probs:
    prob = get_spacy_probability(word, nlp)
    print(f"{word} ({prob:.6f})")
the (0.029341)
and (0.016357)
a (0.019648)
fantasy (0.000027)
giantkiller (0.000000)
crime_data_spacy = calculate_corpus_entropy(
    [crime_text],
    prob_func=lambda word: get_spacy_probability(word, nlp)
)
print(f"Total entropy: {crime_data_spacy['entropy']:.4f}")
print(f"Total information content: {crime_data_spacy['information_content']:.4f}")
print("\nWord data:")
crime_data_spacy['word_data'].head(20)
Total entropy: 5.7979
Total information content: 214857.3604
word   frequency  probability  information  entropy_contribution
the         7976     0.029341     5.090934              0.149374
and         6968     0.016357     5.933961              0.097061
to          5342     0.021152     5.563063              0.117670
he          4652     0.002653     8.557930              0.022708
a           4622     0.019648     5.669486              0.111393
i           3936     0.001245     9.649857              0.012012
of          3921     0.013900     6.168782              0.085745
you         3873     0.012603     6.310047              0.079528
in          3242     0.009862     6.663912              0.065719
it          2983     0.012425     6.330618              0.078658
that        2915     0.011510     6.440919              0.074138
was         2821     0.005235     7.577496              0.039671
his         2113     0.001396     9.484674              0.013239
at          2075     0.003140     8.314890              0.026111
her         1822     0.001115     9.808143              0.010941
but         1784     0.004786     7.706833              0.036888
not         1771     0.004831     7.693317              0.037170
with        1751     0.005283     7.564411              0.039963
for         1671     0.007596     7.040510              0.053481
she         1626     0.001006     9.956840              0.010019

With the spaCy probabilities, we can see that our entropy dropped significantly, from 9.3700 to 5.7979. The reason is that the probability distribution from the spaCy model has a significantly larger range, defined by its total vocabulary. The vocabulary from Crime and Punishment was 10,800 unique words, whereas the vocabulary in the en_core_web_md model is 710,657. As a result, each individual word has less information content, and a smaller entropy contribution.

Let’s go back to a few interesting words for direct comparisons.

words_of_interest = ["the", "a", "crime", "punishment", "giantkiller", "fantasy"]
pd.set_option('display.float_format', lambda x: '%.6f' % x)
print("\nInteresting word values with in-context probability:")
full_text_words = (
    crime_data['word_data'][crime_data['word_data']['word']
    .isin(words_of_interest)]
    .sort_values("word")
)
display(full_text_words)
print("\nInteresting word values with spaCy probability:")
full_text_words_spacy = (
    crime_data_spacy['word_data'][crime_data_spacy['word_data']['word']
    .isin(words_of_interest)]
    .sort_values("word")
)
display(full_text_words_spacy)

Interesting word values with in-context probability:

word         frequency  probability  information  entropy_contribution
the               7976     0.038645     4.693564              0.181384
a                 4622     0.022394     5.480712              0.122738
crime               52     0.000252    11.954574              0.003012
fantasy              4     0.000019    15.655014              0.000303
giantkiller          1     0.000005    17.655014              0.000086
punishment           4     0.000019    15.655014              0.000303

Interesting word values with spaCy probability:

word         frequency  probability  information  entropy_contribution
the               7976     0.029341     5.090934              0.149374
a                 4622     0.019648     5.669486              0.111393
crime               52     0.000050    14.293683              0.000712
fantasy              4     0.000027    15.179080              0.000409
giantkiller          1     0.000000    28.853901              0.000000
punishment           4     0.000020    15.626908              0.000309

Using the probabilities provided by spaCy, we can see significant changes compared to the vocabulary of Crime and Punishment alone. For example, the most common word from Crime and Punishment, the, went from a probability of 0.0386 to 0.0293, with a corresponding increase in information content from 4.693 to 5.090 bits, and a drop in entropy contribution from 0.181 to 0.149.

Final Thoughts

Over the course of this discussion, we’ve learned a lot about entropy and information content. But, we’ve also left a lot out! The biggest omission is probably the question of using conditional probability in our calculations. As with letters, we know that words in a language follow certain patterns, meaning that given a particular word, or a particular set of words, the probability distribution changes for what word is most likely to come next. Of course, this is the whole problem being solved by large language models! That’s obviously a much bigger topic than we can cover here.

The important point to take away from this discussion is that entropy is one of many tools we can use to understand the complexity of a text. It tells us something not only about the information content of the text, but also about our efficiency and effectiveness as communicators. So, next time you’re evaluating a text, think about calculating its entropy!