Section 1.5: Perplexity — The Standard Evaluation Metric¶
Reading time: 12 minutes | Difficulty: ★★☆☆☆
Cross-entropy is the theoretically correct metric, but it's hard to interpret. "Our model has cross-entropy 4.2 bits" doesn't mean much intuitively.
Perplexity fixes this by converting cross-entropy into an interpretable number.
From Cross-Entropy to Perplexity¶
Definition: Perplexity is the exponential of cross-entropy:

\[
\text{PPL} = 2^{H}, \qquad H = -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid w_{<i})
\]

Or equivalently, if using natural logarithms:

\[
\text{PPL} = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \ln P(w_i \mid w_{<i})\right)
\]
Note on logarithm base: Both formulas give the same perplexity value! With log₂, we exponentiate with base 2. With ln, we use exp (base e). The choice only affects the intermediate cross-entropy value, not the final perplexity. In code, ln is typically used for numerical convenience.
Why exponentiate? To return from log-space to probability-space, giving us an interpretable number.
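To make the two formulas concrete, here is a minimal sketch in plain Python (the per-token probabilities are made up for illustration), showing that the base-2 and natural-log routes yield the same perplexity:

```python
import math

# Hypothetical probabilities the model assigned to the observed tokens
token_probs = [0.25, 0.1, 0.5]

# Route 1: cross-entropy in bits (log base 2), then exponentiate with base 2
cross_entropy_bits = -sum(math.log2(p) for p in token_probs) / len(token_probs)
ppl_base2 = 2 ** cross_entropy_bits

# Route 2: cross-entropy in nats (natural log), then exponentiate with e
cross_entropy_nats = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl_base_e = math.exp(cross_entropy_nats)

print(ppl_base2, ppl_base_e)  # same value (≈ 4.3), up to floating-point error
```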
The Intuitive Interpretation¶
Perplexity = effective vocabulary size
If your model has perplexity K, it's as uncertain as if it were choosing uniformly among K options at each position.
Example interpretations:

| Perplexity | Interpretation |
|------------|----------------|
| 1 | Perfect prediction (always 100% confident in correct answer) |
| 10 | Like choosing among 10 equally likely words |
| 100 | Like choosing among 100 equally likely words |
| 50,000 | Like random guessing over entire vocabulary |
A good language model should have perplexity much lower than vocabulary size.
Deriving the "Effective Vocabulary" Interpretation¶
Let's prove that perplexity equals effective vocabulary size.
Consider a uniform distribution over K items: Each item has probability 1/K.
Cross-entropy of the uniform model Q against any data distribution P supported on those K items:

\[
H(P, Q) = -\sum_{x} P(x)\,\log_2 Q(x) = -\sum_{x} P(x)\,\log_2 \frac{1}{K} = \log_2 K
\]

Perplexity:

\[
\text{PPL} = 2^{H(P, Q)} = 2^{\log_2 K} = K
\]
So a uniform distribution over K items has perplexity K.
The converse: If a model has perplexity K, it has the same average uncertainty as uniform distribution over K items.
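A quick numeric check of this derivation (a sketch; the choice of K and the number of test tokens are arbitrary):

```python
import math

K = 1000
# A model that assigns probability 1/K to every observed token
uniform_probs = [1.0 / K] * 500  # any number of test tokens works

cross_entropy = -sum(math.log2(p) for p in uniform_probs) / len(uniform_probs)
print(cross_entropy)       # log2(1000) ≈ 9.97 bits
print(2 ** cross_entropy)  # ≈ 1000, i.e. perplexity K
```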
Computing Perplexity: A Concrete Example¶
Test sequence: "the cat" (pretend this is our entire test set)
Model predictions (bigram):

- P("the" | START) = 0.2
- P("cat" | "the") = 0.1
- P(END | "cat") = 0.3
Step 1: Compute log-probabilities

- log₂(0.2) = -2.32
- log₂(0.1) = -3.32
- log₂(0.3) = -1.74
Step 2: Average negative log-probability

\[
H = \frac{2.32 + 3.32 + 1.74}{3} = \frac{7.38}{3} \approx 2.46 \text{ bits}
\]

Step 3: Exponentiate

\[
\text{PPL} = 2^{2.46} \approx 5.5
\]
Interpretation: On average, the model was as uncertain as choosing among 5.5 equally likely tokens.
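The same arithmetic as a short runnable sketch (the three probabilities are the bigram predictions above):

```python
import math

# Bigram model's probabilities for each observed token in "the cat"
probs = [0.2, 0.1, 0.3]  # P("the"|START), P("cat"|"the"), P(END|"cat")

neg_log_probs = [-math.log2(p) for p in probs]            # [2.32, 3.32, 1.74] bits
cross_entropy = sum(neg_log_probs) / len(neg_log_probs)   # ≈ 2.46 bits
perplexity = 2 ** cross_entropy                           # ≈ 5.5

print(cross_entropy, perplexity)
```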
Perplexity vs. Accuracy¶
Why not just use accuracy (% of correct predictions)?
Problem: Accuracy ignores confidence.
Consider two models predicting "cat":
- Model A: P("cat") = 0.51, P("dog") = 0.49
- Model B: P("cat") = 0.99, P("dog") = 0.01
Both have 100% accuracy if "cat" is correct, but Model B is clearly better.
Perplexity (via log-probability) captures this:
- Model A contribution: -log₂(0.51) = 0.97 bits
- Model B contribution: -log₂(0.99) = 0.01 bits
Model B has much lower perplexity because it's more confident in correct answers.
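A small sketch of the contrast: both models get the argmax right (so accuracy cannot separate them), but their per-token losses, which perplexity aggregates, differ sharply.

```python
import math

# Probability each model assigns to the correct token "cat"
model_a_p_correct = 0.51
model_b_p_correct = 0.99

# Accuracy: both predict "cat" as the most likely token, so both score 1.0.
# Log loss in bits is what perplexity is built from:
loss_a = -math.log2(model_a_p_correct)  # ≈ 0.97 bits
loss_b = -math.log2(model_b_p_correct)  # ≈ 0.01 bits

# If every prediction looked like this, the resulting perplexities would be:
print(2 ** loss_a)  # ≈ 1.96
print(2 ** loss_b)  # ≈ 1.01
```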
Properties of Perplexity¶
1. Lower is better: Lower perplexity = better model
2. Bounded below by 1: Perplexity ≥ 1, with equality only for perfect prediction
3. Uniform guessing gives vocabulary size: a uniform random model achieves PPL = |V|; a model that is confidently wrong can even exceed this
4. Infinite if any probability is 0: If the model assigns 0 probability to an observed token, perplexity = ∞
5. Geometric mean interpretation:

\[
\text{PPL} = \left( \prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{<i})} \right)^{1/N}
\]

Perplexity is the geometric mean of the inverse probabilities the model assigns to the observed tokens.
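The identity is easy to verify numerically (a sketch with arbitrary probabilities; requires Python 3.8+ for `math.prod`):

```python
import math

probs = [0.2, 0.1, 0.3]
N = len(probs)

# Via cross-entropy
ppl_exp = 2 ** (-sum(math.log2(p) for p in probs) / N)

# Via the geometric mean of inverse probabilities
ppl_geo = math.prod(1.0 / p for p in probs) ** (1.0 / N)

print(ppl_exp, ppl_geo)  # both ≈ 5.5
```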
Perplexity on Train vs. Test¶
Training perplexity: evaluate the model on the data it was trained on. Test perplexity: evaluate the model on held-out data (set aside before training and used only for evaluation), which the model never saw.
Critical insight: Training perplexity always looks better (or equal).
| Metric | What it measures |
|---|---|
| Train perplexity | How well the model fits the training data |
| Test perplexity | How well the model generalizes |
Overfitting: When train perplexity << test perplexity, the model memorized training data but doesn't generalize.
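A minimal sketch of the comparison (the per-token probabilities below are hypothetical; in practice they come from running the same model over each corpus):

```python
import math

def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to observed tokens."""
    n = len(token_probs)
    return 2 ** (-sum(math.log2(p) for p in token_probs) / n)

# Hypothetical per-token probabilities from the same model on two corpora
train_probs = [0.30, 0.25, 0.40, 0.35]  # data the model was fit on
test_probs  = [0.10, 0.05, 0.20, 0.08]  # held-out data

print(perplexity(train_probs))  # ≈ 3.1  (fits the training data well)
print(perplexity(test_probs))   # ≈ 10.6 (much worse: an overfitting signal)
```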
The Relationship Hierarchy¶
Let's connect all the metrics we've defined:
Probability assigned to test data: P(test | model)
↓ take log
Log-likelihood: log P(test | model)
↓ negate and average
Cross-entropy: H = -(1/N) · log P(test | model)
↓ exponentiate
Perplexity: PPL = exp(H)
All contain the same information, just different presentations:
- Log-likelihood: raw sum (for optimization)
- Cross-entropy: normalized (for comparison across corpus sizes)
- Perplexity: intuitive (for human interpretation)
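The chain can be written out directly (a sketch using natural logs, matching the exp in the diagram above; the probabilities are the "the cat" example):

```python
import math

probs = [0.2, 0.1, 0.3]  # model probabilities for the observed tokens
N = len(probs)

log_likelihood = sum(math.log(p) for p in probs)  # raw sum (optimization target)
cross_entropy = -log_likelihood / N               # normalized, in nats
perplexity = math.exp(cross_entropy)              # interpretable (≈ 5.5)

print(log_likelihood, cross_entropy, perplexity)
```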
Perplexity in Practice: Real Numbers¶
What perplexity values are typical?
| Model | Perplexity | Dataset |
|---|---|---|
| Unigram (word) | ~1000 | Typical English |
| Bigram (word) | ~150-300 | Typical English |
| Trigram (word) | ~100-150 | Typical English |
| Neural LM (LSTM) | ~50-80 | Penn Treebank |
| GPT-2 (small) | ~35 | Penn Treebank |
| GPT-3 | ~20 | Various |
Character-level models have different scales:
| Model | Perplexity | Dataset |
|---|---|---|
| Unigram (char) | ~27 | English text |
| Order-3 Markov (char) | ~8-12 | English text |
| Order-5 Markov (char) | ~4-6 | English text (but often ∞ on test) |
The Overfitting Pattern¶
For Markov models, you'll observe this pattern as you increase order:
| Order | Train PPL | Test PPL | States |
|---|---|---|---|
| 1 | 15 | 15 | 50 |
| 2 | 8 | 9 | 500 |
| 3 | 4 | 12 | 5,000 |
| 5 | 1.5 | ∞ | 50,000 |
What's happening:
- Train PPL keeps improving (more context = better fit)
- Test PPL improves initially (capturing real patterns)
- Test PPL then explodes (model sees unseen n-grams, assigns probability 0)
This is the fundamental limitation of Markov models that we'll address with neural networks.
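To see the failure mode concretely, here is a minimal sketch of an unsmoothed character bigram model (the function names and toy strings are made up for illustration): it scores its own training text fine, but returns infinite perplexity as soon as the test text contains a bigram it never saw.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text):
    """Unsmoothed character bigram model: P(next | prev) from raw counts."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        pair_counts[prev][nxt] += 1
    return {
        prev: {nxt: c / sum(counter.values()) for nxt, c in counter.items()}
        for prev, counter in pair_counts.items()
    }

def perplexity(model, text):
    total_log2, n = 0.0, 0
    for prev, nxt in zip(text, text[1:]):
        p = model.get(prev, {}).get(nxt, 0.0)
        if p == 0.0:
            return math.inf  # unseen bigram: probability 0, so PPL = infinity
        total_log2 += math.log2(p)
        n += 1
    return 2 ** (-total_log2 / n)

train_text = "the cat sat on the mat"
test_text = "the dog sat"  # contains bigrams never seen in training

model = train_bigram(train_text)
print(perplexity(model, train_text))  # small and finite
print(perplexity(model, test_text))   # inf (e.g. the bigram " d" was never seen)
```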
Connection to Modern LLMs
Perplexity remains THE standard metric for evaluating language models. When OpenAI reports "GPT-4 achieves X perplexity on benchmark Y," they're using exactly the formula we derived here.
Modern LLM leaderboards (like the Hugging Face Open LLM Leaderboard) report perplexity alongside other metrics. A perplexity of 20 on WikiText means the model is, on average, as uncertain as choosing among 20 equally likely tokens—even though the vocabulary has 50,000+ tokens.
Historical Note: Shannon's Experiments (1951)
Claude Shannon, the father of information theory, conducted experiments measuring the entropy of English in his 1951 paper "Prediction and Entropy of Printed English." He had humans predict the next character and measured their error rate.
Shannon estimated English has about 1.0-1.3 bits per character of entropy. This corresponds to perplexity 2.0-2.5 per character—remarkably close to what modern character-level models achieve!
Common Mistake: Comparing Perplexities Across Different Tokenizations
You CANNOT directly compare perplexity between:

- Character-level and word-level models
- Models with different vocabularies
- Different test sets
A character model with PPL=5 and word model with PPL=100 are not comparable. The character model predicts 1 character at a time; the word model predicts entire words. Always compare apples to apples.
Summary¶
| Concept | Formula | Interpretation |
|---|---|---|
| Perplexity | \(2^{H}\), where \(H\) is cross-entropy in bits | Effective vocabulary size |
| PPL = 1 | Perfect model | Always correct with 100% confidence |
| PPL = \|V\| | Uniform random guessing | No information from context |
| PPL = ∞ | Model assigns P = 0 | Model treated an observed token as impossible |
Key takeaways:
- Perplexity is the standard metric for language models
- Lower is better
- It measures how "surprised" the model is on average
- Compare train vs. test to detect overfitting
Next: How do we generate text from our model? This requires understanding sampling and temperature.