
Section 1.5: Perplexity — The Standard Evaluation Metric

Reading time: 12 minutes | Difficulty: ★★☆☆☆

Cross-entropy is the theoretically correct metric, but it's hard to interpret. "Our model has cross-entropy 4.2 bits" doesn't mean much intuitively.

Perplexity fixes this by converting cross-entropy into an interpretable number.

From Cross-Entropy to Perplexity

Definition: Perplexity is the exponential of cross-entropy:

\[\text{Perplexity} = 2^{H(P,Q)} = 2^{-\frac{1}{N}\sum_i \log_2 Q(x_i | \text{context})}\]

Or equivalently, if using natural logarithms:

\[\text{Perplexity} = \exp\left(-\frac{1}{N}\sum_i \ln Q(x_i | \text{context})\right)\]

Note on logarithm base: Both formulas give the same perplexity value! With log₂, we exponentiate with base 2. With ln, we use exp (base e). The choice only affects the intermediate cross-entropy value, not the final perplexity. In code, ln is typically used for numerical convenience.

Why exponentiate? To return from log-space to probability-space, giving us an interpretable number.
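A minimal check that the two formulas agree, using made-up per-token probabilities:

```python
import math

# Hypothetical probabilities the model assigned to three observed tokens
probs = [0.5, 0.2, 0.1]

# Base-2 route: cross-entropy in bits, then 2**H
h_bits = -sum(math.log2(p) for p in probs) / len(probs)
ppl_base2 = 2 ** h_bits

# Natural-log route: cross-entropy in nats, then exp(H)
h_nats = -sum(math.log(p) for p in probs) / len(probs)
ppl_base_e = math.exp(h_nats)

print(ppl_base2, ppl_base_e)  # both ≈ 4.64: same perplexity either way
```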

The Intuitive Interpretation

Perplexity = effective vocabulary size

If your model has perplexity K, it's as uncertain as if it were choosing uniformly among K options at each position.

Example interpretations:

| Perplexity | Interpretation |
|------------|----------------|
| 1 | Perfect prediction (always 100% confident in the correct answer) |
| 10 | Like choosing among 10 equally likely words |
| 100 | Like choosing among 100 equally likely words |
| 50,000 | Like random guessing over the entire vocabulary |

A good language model should have perplexity much lower than vocabulary size.

Deriving the "Effective Vocabulary" Interpretation

Let's prove that perplexity equals effective vocabulary size.

Consider a uniform distribution over K items: Each item has probability 1/K.

Cross-entropy of a uniform Q, measured against any P supported on the same K items:

\[H = -\sum_i P(x_i) \log_2 \frac{1}{K} = -\sum_i P(x_i) \cdot (-\log_2 K) = \log_2 K\]

Perplexity:

\[\text{PPL} = 2^{\log_2 K} = K\]

So a uniform distribution over K items has perplexity K.

The converse: If a model has perplexity K, it has the same average uncertainty as a uniform distribution over K items.
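A quick numerical sanity check of this result, with an arbitrary K:

```python
import math

K = 50                    # pretend-vocabulary of 50 equally likely items
probs = [1.0 / K] * 1000  # every observed token gets probability 1/K

h = -sum(math.log2(p) for p in probs) / len(probs)
print(h, math.log2(K))    # cross-entropy equals log2(K) ≈ 5.64
print(2 ** h)             # perplexity equals K = 50.0
```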

Computing Perplexity: A Concrete Example

Test sequence: "the cat" (pretend this is our entire test set)

Model predictions (bigram):

  • P("the" | START) = 0.2
  • P("cat" | "the") = 0.1
  • P(END | "cat") = 0.3

Step 1: Compute log-probabilities

  • log₂(0.2) = -2.32
  • log₂(0.1) = -3.32
  • log₂(0.3) = -1.74

Step 2: Average negative log-probability

\[H = -\frac{1}{3}(-2.32 - 3.32 - 1.74) = \frac{7.38}{3} = 2.46 \text{ bits}\]

Step 3: Exponentiate

\[\text{PPL} = 2^{2.46} = 5.5\]

Interpretation: On average, the model was as uncertain as choosing among 5.5 equally likely tokens.
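The same three steps in a few lines of Python (using the probabilities assumed above):

```python
import math

# Bigram probabilities assigned to the test sequence "the cat"
probs = [0.2,  # P("the" | START)
         0.1,  # P("cat" | "the")
         0.3]  # P(END | "cat")

# Steps 1-2: average negative log2-probability (cross-entropy in bits)
h = -sum(math.log2(p) for p in probs) / len(probs)  # ≈ 2.46 bits

# Step 3: exponentiate
ppl = 2 ** h                                        # ≈ 5.5
print(f"cross-entropy = {h:.2f} bits, perplexity = {ppl:.2f}")
```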

Perplexity vs. Accuracy

Why not just use accuracy (% of correct predictions)?

Problem: Accuracy ignores confidence.

Consider two models predicting "cat":

  • Model A: P("cat") = 0.51, P("dog") = 0.49
  • Model B: P("cat") = 0.99, P("dog") = 0.01

Both have 100% accuracy if "cat" is correct, but Model B is clearly better.

Perplexity (via log-probability) captures this:

  • Model A contribution: -log₂(0.51) = 0.97 bits
  • Model B contribution: -log₂(0.99) = 0.01 bits

Model B has much lower perplexity because it's more confident in correct answers.
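In code, with the two hypothetical models above:

```python
import math

# Probability each model assigned to the correct next word "cat"
p_a, p_b = 0.51, 0.99

bits_a = -math.log2(p_a)  # ≈ 0.97 bits of surprise
bits_b = -math.log2(p_b)  # ≈ 0.01 bits of surprise

# If every prediction looked like this, the per-token perplexities would be:
print(2 ** bits_a)  # ≈ 1.96
print(2 ** bits_b)  # ≈ 1.01  (same accuracy, far lower perplexity)
```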

Properties of Perplexity

1. Lower is better: Lower perplexity = better model

2. Bounded below by 1: Perplexity ≥ 1, with equality only for perfect prediction

3. Uniform guessing scores |V|: a uniform model over the vocabulary has PPL = |V|; any model that extracts information from context does better, and only a systematically miscalibrated model (confidently wrong) does worse

4. Infinite if any probability is 0: If the model assigns 0 probability to an observed token, perplexity = ∞

5. Geometric mean interpretation:

\[\text{PPL} = \left( \prod_{i=1}^N \frac{1}{Q(x_i | \text{context})} \right)^{1/N}\]

Perplexity is the geometric mean of the inverse probabilities.
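A quick check that the two routes agree, using the toy probabilities from the worked example above:

```python
import math

probs = [0.2, 0.1, 0.3]  # per-token probabilities from the "the cat" example
n = len(probs)

# Route 1: exponentiated cross-entropy
ppl_exp = 2 ** (-sum(math.log2(p) for p in probs) / n)

# Route 2: geometric mean of the inverse probabilities
ppl_geo = math.prod(1 / p for p in probs) ** (1 / n)

print(ppl_exp, ppl_geo)  # both ≈ 5.50, equal up to floating-point error
```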

Perplexity on Train vs. Test

Training perplexity: evaluate the model on the data it was trained on. Test perplexity: evaluate the model on held-out data, set aside before training and used only for evaluation, so the model never saw it during training.

Critical insight: Training perplexity always looks better (or equal).

| | What it measures |
|---|---|
| Train perplexity | How well the model fits the training data |
| Test perplexity | How well the model generalizes (this is what you want) |

Overfitting: When train perplexity << test perplexity, the model memorized training data but doesn't generalize.

The Relationship Hierarchy

Let's connect all the metrics we've defined:

Probability assigned to test data: P(test | model)
         ↓ take log
Log-likelihood: log P(test | model)
         ↓ negate and average
Cross-entropy: H = -(1/N) · log P(test | model)
         ↓ exponentiate
Perplexity: PPL = exp(H) (or 2^H if H is in bits)

All contain the same information, just different presentations:

  • Log-likelihood: raw sum (for optimization)
  • Cross-entropy: normalized (for comparison across corpus sizes)
  • Perplexity: intuitive (for human interpretation)
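In code, all three views come from the same per-token probabilities (a sketch; the function name and the choice of log₂ are arbitrary):

```python
import math

def all_views(probs):
    """probs: the probability the model assigned to each observed token."""
    n = len(probs)
    log_likelihood = sum(math.log2(p) for p in probs)  # raw sum, in bits
    cross_entropy = -log_likelihood / n                # normalized per token
    perplexity = 2 ** cross_entropy                    # back to "effective choices"
    return log_likelihood, cross_entropy, perplexity

print(all_views([0.2, 0.1, 0.3]))  # ≈ (-7.38, 2.46, 5.50)
```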

Perplexity in Practice: Real Numbers

What perplexity values are typical?

| Model | Perplexity | Dataset |
|-------|------------|---------|
| Unigram (word) | ~1000 | Typical English |
| Bigram (word) | ~150-300 | Typical English |
| Trigram (word) | ~100-150 | Typical English |
| Neural LM (LSTM) | ~50-80 | Penn Treebank |
| GPT-2 (small) | ~35 | Penn Treebank |
| GPT-3 | ~20 | Various |

Character-level models have different scales:

| Model | Perplexity | Dataset |
|-------|------------|---------|
| Unigram (char) | ~27 | English text |
| Order-3 Markov (char) | ~8-12 | English text |
| Order-5 Markov (char) | ~4-6 | English text (but often ∞ on test) |

The Overfitting Pattern

For Markov models, you'll observe this pattern as you increase order:

| Order | Train PPL | Test PPL | States |
|-------|-----------|----------|--------|
| 1 | 15 | 15 | 50 |
| 2 | 8 | 9 | 500 |
| 3 | 4 | 12 | 5,000 |
| 5 | 1.5 | ∞ | 50,000 |

What's happening:

  • Train PPL keeps improving (more context = better fit)
  • Test PPL improves initially (capturing real patterns)
  • Test PPL then explodes (model sees unseen n-grams, assigns probability 0)

This is the fundamental limitation of Markov models that we'll address with neural networks.
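Here is a minimal character-level sketch you could use to reproduce this pattern. It uses unsmoothed maximum-likelihood counts; the helper names and the train.txt / test.txt paths are placeholders, not a prescribed setup:

```python
from collections import defaultdict, Counter
import math

def fit_markov(text, order):
    """Count continuations: maps each length-`order` context to next-char counts."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        counts[text[i - order:i]][text[i]] += 1
    return counts

def char_perplexity(counts, text, order):
    """Per-character perplexity of unsmoothed counts; inf on any unseen n-gram."""
    bits, n = 0.0, 0
    for i in range(order, len(text)):
        ctx, ch = text[i - order:i], text[i]
        ctx_total = sum(counts[ctx].values())
        if ctx_total == 0 or counts[ctx][ch] == 0:
            return float("inf")  # probability 0 for an observed character
        bits += -math.log2(counts[ctx][ch] / ctx_total)
        n += 1
    return 2 ** (bits / n)

# Usage sketch: watch train PPL fall while test PPL eventually blows up.
train_text = open("train.txt").read()   # placeholder path
test_text = open("test.txt").read()     # placeholder path
for order in (1, 2, 3, 5):
    model = fit_markov(train_text, order)
    print(order,
          char_perplexity(model, train_text, order),
          char_perplexity(model, test_text, order))
```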

Connection to Modern LLMs

Perplexity remains THE standard metric for evaluating language models. When a lab reports that its model achieves perplexity X on benchmark Y, it is using exactly the formula we derived here.

Modern LLMs are still routinely evaluated by perplexity on held-out corpora such as WikiText, alongside task-specific metrics. A perplexity of 20 on WikiText means the model is, on average, as uncertain as choosing among 20 equally likely tokens, even though the vocabulary has 50,000+ tokens.

Historical Note: Shannon's Experiments (1951)

Claude Shannon, the father of information theory, conducted experiments measuring the entropy of English in his 1951 paper "Prediction and Entropy of Printed English." He had humans predict the next character and measured their error rate.

Shannon estimated English has about 1.0-1.3 bits per character of entropy. This corresponds to perplexity 2.0-2.5 per character—remarkably close to what modern character-level models achieve!

Common Mistake: Comparing Perplexities Across Different Tokenizations

You CANNOT directly compare perplexity between:

  • Character-level and word-level models
  • Models with different vocabularies
  • Different test sets

A character-level model with PPL = 5 and a word-level model with PPL = 100 are not comparable: the character model predicts one character at a time, while the word model predicts entire words. Always compare apples to apples.

Summary

| Concept | Formula | Interpretation |
|---------|---------|----------------|
| Perplexity | \(2^{H(P,Q)}\) | Effective vocabulary size |
| PPL = 1 | Perfect model | Always correct with 100% confidence |
| PPL = \|V\| | Random guessing | No information from context |
| PPL = ∞ | Model assigns P = 0 | Considered the observed token impossible |

Key takeaways:

  1. Perplexity is the standard metric for language models
  2. Lower is better
  3. It measures how "surprised" the model is on average
  4. Compare train vs. test to detect overfitting

Next: How do we generate text from our model? This requires understanding sampling and temperature.