# Section 1.8: The Fundamental Trade-offs
We've built a complete language model. Now let's understand its fundamental limitations and why we need something better.
## The Context-Sparsity Trade-off
This is the central tension in Markov models:
More context = better predictions:
- A bigram model sees "the" and might predict "cat", "dog", "end"...
- A 5-gram model seeing "the cat sat on" can confidently predict "the"

More context = sparser observations:
- Every possible 5-gram needs to be observed in training
- Most valid 5-grams never appear in any finite corpus
## Quantifying the Trade-off
For character-level modeling with |V| = 27 (letters + space):
| Order | Possible contexts | Typical observed | Coverage |
|---|---|---|---|
| 1 | 27 | 27 | 100% |
| 2 | 729 | ~400 | 55% |
| 3 | 19,683 | ~5,000 | 25% |
| 5 | 14.3M | ~50,000 | 0.3% |
| 7 | 10.5B | ~100,000 | 0.001% |
At order 5, we've seen less than 1% of possible contexts!
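If you want to reproduce the "Typical observed" column on your own corpus, a minimal sketch like the one below counts the distinct contexts that actually occur and divides by |V|^order (the default vocab_size of 27 matches letters + space; the corpus string here is a toy placeholder, and real numbers depend entirely on your text):

```python
def context_coverage(corpus: str, order: int, vocab_size: int = 27) -> tuple[int, float]:
    """Count distinct length-`order` contexts in `corpus` and their share of |V|^order."""
    observed = {corpus[i:i + order] for i in range(len(corpus) - order + 1)}
    return len(observed), len(observed) / vocab_size ** order

# Toy example; substitute ~100k characters of real text for meaningful numbers.
corpus = "the cat sat on the mat and the dog lay on the rug " * 200
for order in (1, 2, 3, 5):
    n, frac = context_coverage(corpus, order)
    print(f"order {order}: {n} observed contexts, {frac:.4%} of possible")
```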
## The Experimental Evidence
Here's what happens with a typical corpus (100,000 characters of English text):
| Order | Train PPL | Test PPL | States | Unseen rate |
|---|---|---|---|---|
| 1 | 12.4 | 12.5 | 27 | 0% |
| 2 | 7.2 | 7.8 | 420 | 2% |
| 3 | 4.1 | 6.3 | 4,800 | 15% |
| 5 | 1.8 | ∞ | 48,000 | 62% |
| 7 | 1.1 | ∞ | 95,000 | 89% |
The pattern:
- Train perplexity keeps improving (memorization)
- Test perplexity improves, then explodes (overfitting)
- Eventually, test perplexity becomes infinite (unseen n-grams)
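To see why a single unseen n-gram makes test perplexity infinite, recall that perplexity is the exponential of the average negative log-probability, and an unseen n-gram contributes −log 0 = ∞. A minimal sketch with made-up probabilities (not the chapter's evaluation code):

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-probability assigned to each test token."""
    nll = [-math.log(p) if p > 0 else float("inf") for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.2, 0.5, 0.1]))       # finite: every token got some probability
print(perplexity([0.2, 0.5, 0.0, 0.1]))  # inf: one unseen n-gram dominates everything
```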
## Classical Solutions: Smoothing Techniques
The infinite test perplexity at high orders is caused by assigning probability 0 to unseen n-grams. Several classical solutions exist:
### Laplace (Add-One) Smoothing
Pretend we saw every n-gram one additional time:

P(w | c) = (count(c, w) + 1) / (count(c) + |V|)

where count(c, w) is how many times token w followed context c in training, count(c) is the total count for context c, and |V| is the vocabulary size.
Effect: No probability is ever zero, but adds bias (assumes all unseen n-grams are equally likely).
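A minimal sketch of the idea, assuming a counts table shaped like {context: {next_token: count}} (your own implementation's data structure may differ):

```python
def laplace_prob(counts, context, token, vocab):
    """Add-one estimate: (count(context, token) + 1) / (count(context) + |V|)."""
    token_counts = counts.get(context, {})
    return (token_counts.get(token, 0) + 1) / (sum(token_counts.values()) + len(vocab))

vocab = set("abcdefghijklmnopqrstuvwxyz ")
counts = {"th": {"e": 90, "a": 10}}            # toy bigram-context counts
print(laplace_prob(counts, "th", "e", vocab))  # ~0.72 instead of the raw 0.90
print(laplace_prob(counts, "th", "x", vocab))  # small but nonzero
print(laplace_prob(counts, "qz", "x", vocab))  # unseen context: uniform 1/27
```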
### Add-k Smoothing
A generalization of Laplace smoothing with a tunable parameter k (typically k < 1):

P(w | c) = (count(c, w) + k) / (count(c) + k·|V|)
Effect: Less bias than Laplace when k < 1, but k must be tuned on held-out data.
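Since k is a free parameter, the usual recipe is a small grid search that minimizes perplexity on held-out text. A sketch, reusing the same assumed {context: {token: count}} shape as above; counts, heldout, and vocab are placeholders for your own training/dev split:

```python
import math

def add_k_prob(counts, context, token, vocab, k):
    """Add-k estimate: (count(context, token) + k) / (count(context) + k*|V|)."""
    token_counts = counts.get(context, {})
    return (token_counts.get(token, 0) + k) / (sum(token_counts.values()) + k * len(vocab))

def heldout_perplexity(counts, heldout, order, vocab, k):
    """Perplexity of held-out text under the add-k smoothed order-`order` model."""
    nll = sum(-math.log(add_k_prob(counts, heldout[i - order:i], heldout[i], vocab, k))
              for i in range(order, len(heldout)))
    return math.exp(nll / (len(heldout) - order))

# Grid-search k on held-out data (counts, heldout, vocab come from your own split):
# best_k = min([0.01, 0.05, 0.1, 0.5, 1.0],
#              key=lambda k: heldout_perplexity(counts, heldout, 2, vocab, k))
```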
### Backoff (Katz Backoff)
Use the lower-order model when the higher-order model has no data:

P(w | c) = P*(w | c)          if count(c, w) > 0
P(w | c) = α(c) · P(w | c′)   otherwise
where P*(w | c) is a discounted estimate that reserves some probability mass for unseen tokens, c′ is the context with its oldest token dropped, and α(c) is a normalization factor that redistributes the reserved mass so the probabilities sum to 1.
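Full Katz backoff requires estimating the discounts and α(c). The sketch below shows only the backoff control flow, using a constant factor in place of α(c); this simpler variant is known as "stupid backoff" and does not yield a normalized distribution. It assumes a hypothetical counts_by_order table keyed by context length (e.g. counts_by_order[2]["th"]["e"]):

```python
def backoff_prob(counts_by_order, context, token, vocab, factor=0.4):
    """Back off to ever-shorter contexts until a nonzero count (or the uniform base case)."""
    token_counts = counts_by_order.get(len(context), {}).get(context, {})
    if token_counts.get(token, 0) > 0:
        return token_counts[token] / sum(token_counts.values())
    if not context:                      # no context left: fall back to uniform
        return 1 / len(vocab)
    # Proper Katz backoff would use alpha(context) here to keep the distribution normalized.
    return factor * backoff_prob(counts_by_order, context[1:], token, vocab, factor)
```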
### Interpolation (Jelinek-Mercer)
Weighted combination of all orders:

P(w | c) = λ₁·P₁(w | c) + λ₂·P₂(w | c) + ... + λₙ·Pₙ(w | c)
where Σₖ λₖ = 1 and Pₖ is the k-th order model.
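A sketch of a three-way mixture (trigram, bigram, unigram) for character-level string contexts, using the same hypothetical counts_by_order table as above; the λ values here are placeholders and would normally be tuned on held-out data:

```python
def interpolated_prob(counts_by_order, context, token, vocab, lambdas=(0.6, 0.3, 0.1)):
    """Jelinek-Mercer mixture over context lengths 2, 1, 0; lambdas must sum to 1."""
    p = 0.0
    for lam, ctx_len in zip(lambdas, (2, 1, 0)):
        ctx = context[-ctx_len:] if ctx_len > 0 else ""   # shorten the context per order
        token_counts = counts_by_order.get(ctx_len, {}).get(ctx, {})
        total = sum(token_counts.values())
        p += lam * (token_counts.get(token, 0) / total if total > 0 else 1 / len(vocab))
    return p
```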
### Why These Don't Fully Solve the Problem
Smoothing techniques help with the zero-probability problem, but they don't address the fundamental issue: no notion of similarity. "The cat sat" and "the dog sat" remain completely independent—learning from one doesn't help predict the other. This is what neural networks fundamentally fix.
## Why This Is Fundamental
This isn't a bug in our implementation—it's a fundamental limitation of the approach.
The problem: Markov models require exact pattern matching.
- Training: "the cat sat"
- Test: "the cat lay"
Even though "sat" and "lay" are syntactically similar (both verbs following "cat"), the trigram model treats them as completely unrelated. It has no notion of similarity.
What we need: A way to generalize from seen patterns to unseen but similar patterns.
This is exactly what neural networks provide.
## State Space Explosion
The number of states grows exponentially with order: an order-n model conditions on the previous n tokens, giving |V|ⁿ possible contexts.
For word-level models with |V| = 50,000:
| Order | States | Storage (4 bytes each) |
|---|---|---|
| 1 | 50K | 200 KB |
| 2 | 2.5B | 10 GB |
| 3 | 125T | 500 TB |
Even if we store only the non-zero entries, we still run into the sparsity problem: most contexts are never observed in any realistic corpus.
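As a quick sanity check on the table, the sketch below just recomputes |V|^order and the corresponding dense storage at 4 bytes per state:

```python
V = 50_000                      # word-level vocabulary size
for order in (1, 2, 3):
    states = V ** order
    print(f"order {order}: {states:.2e} states, {states * 4 / 1e12:.6g} TB at 4 bytes per state")
```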
## Long-Range Dependencies
Language has structure that spans far beyond what Markov models can capture:
Example 1: Subject-verb agreement (distance: 7 words)
"The cats that sat on the mat were sleeping."
A bigram model sees only "mat were"; it has no way to know that "were" must agree with "cats" several words earlier, because that dependency is not a local pattern.
Example 2: Coreference (distance: variable)
"John walked into the room. He looked around. His eyes..."
"He" and "His" must agree with "John" from sentences ago.
Example 3: Topic coherence (distance: a paragraph)
An article about physics should maintain physics vocabulary throughout.
Markov models are blind to all of this.
## What We Need: The Preview
To fix these limitations, we need:
| Limitation | Solution | Stage |
|---|---|---|
| No generalization | Learned representations (embeddings) | 4 |
| No gradient-based learning | Automatic differentiation | 2-3 |
| No flexible context | Attention mechanism | 7 |
| Fixed context size | Transformer architecture | 8 |
The key insight: Instead of storing explicit counts for every pattern, we learn a function that maps contexts to predictions. This function can generalize to unseen contexts.
## What Carries Forward
Despite the limitations, Markov models introduced concepts that underpin all modern LLMs:
| Concept | First Introduced | Where It Appears Later |
|---|---|---|
| Autoregressive factorization | Chain rule → Markov | GPT, LLaMA, Claude |
| Next-token prediction | MLE objective | All modern LLMs |
| Perplexity | Evaluation metric | Standard LLM benchmark |
| Temperature sampling | Generation control | ChatGPT, Claude, etc. |
| Cross-entropy loss | MLE ≡ min cross-entropy | All neural LM training |
These aren't simplified versions—they're the same concepts.
GPT-4 uses the same autoregressive factorization, the same cross-entropy objective, and the same temperature-controlled sampling. The only difference is how P(next | context) is computed.
## The Bias-Variance Perspective
From statistical learning theory:
Bias: error from the model's assumptions
- High-order Markov: low bias (few assumptions)
- Low-order Markov: high bias (strong assumptions about independence)

Variance: error from sensitivity to training data
- High-order Markov: high variance (sensitive to the exact patterns observed)
- Low-order Markov: low variance (stable estimates)
The trade-off: We can't minimize both simultaneously.
Neural networks navigate this differently:
- Flexible function class (low bias)
- Regularization controls variance
- Learning algorithm finds good solutions
## Exercises
1. Smoothing: Implement Laplace smoothing and show that it prevents infinite perplexity.
2. Order sweep: Plot train and test perplexity vs. order for a corpus of your choice. Find the "elbow" where test perplexity starts increasing.
3. Vocabulary comparison: Compare character-level and word-level models on the same corpus. How do the optimal orders differ?
4. State space visualization: For a trigram model, visualize the transition graph. Which states have the most outgoing edges?
5. Temperature analysis: Generate 100 samples at T = 0.5, 1.0, and 2.0. Compute the average perplexity of each set. What do you observe?
## Summary
What we built:
- A complete language model from first principles
- Training, evaluation, and generation
- Full mathematical derivations for everything
What we learned:
- Probability theory foundations
- Chain rule → autoregressive factorization
- MLE → counting is optimal
- Information theory → cross-entropy → perplexity
- Temperature → controlled sampling
Why it's limited:
- Context-sparsity trade-off is fundamental
- No generalization to similar-but-unseen patterns
- Exponential state space explosion
- Can't capture long-range dependencies
What's next: To build models that generalize, we need to learn representations. That requires gradients. And computing gradients efficiently requires automatic differentiation.
→ Stage 2: Automatic Differentiation
## Reflection: The Pólya Method
Looking back at this stage through Pólya's problem-solving framework:
1. Understand the problem
   - We want P(next token | context)
   - The space of sequences is exponentially large
   - We need both tractability and accuracy
2. Devise a plan
   - Use the chain rule to factorize
   - Make the Markov assumption to limit context
   - Estimate probabilities by counting (MLE)
3. Execute the plan
   - Implemented counting-based training
   - Added temperature-controlled sampling
   - Evaluated with perplexity
4. Reflect
   - The approach works but has fundamental limits
   - The context-sparsity trade-off can't be avoided
   - We need a different approach (neural networks) for better generalization
This reflection pattern—understand, plan, execute, reflect—will continue throughout the book. Each stage builds on the previous, and each reflection motivates the next stage.