Section 1.9: Why We Need Neural Networks¶
Reading time: 10 minutes | Difficulty: ★★☆☆☆
We've built a complete language model from first principles. It works. But it has fundamental limitations that no amount of clever engineering can fix. This section explains why we need neural networks: not just asserting that we do, but showing exactly where counting-based models break down.
The Generalization Gap¶
Consider this two-sentence training corpus: "the cat sat" and "the dog lay".
Now test on two sentences it has never seen: "the cat lay" and "the dog sat".
What Markov Models See¶
A bigram model treats each context-token pair as completely independent:
| Bigram | Count | P(next \| context) |
|---|---|---|
| (cat, sat) | 1 | 1.0 |
| (cat, lay) | 0 | 0.0 |
| (dog, lay) | 1 | 1.0 |
| (dog, sat) | 0 | 0.0 |
To the model, "cat lay" and "dog sat" are impossible—even though they're perfectly valid English.
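To make the failure mode concrete, here is a minimal bigram counter in plain Python, trained on the two toy sentences above (the names are just for illustration, not the chapter's full implementation):

```python
from collections import Counter, defaultdict

# Toy training corpus: the model only ever sees "cat sat" and "dog lay".
corpus = ["the cat sat", "the dog lay"]

counts = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1

def prob(context, token):
    """P(token | context) by pure counting; unseen pairs get exactly zero."""
    total = sum(counts[context].values())
    return counts[context][token] / total if total else 0.0

print(prob("cat", "sat"))  # 1.0 -- seen once, so treated as certain
print(prob("cat", "lay"))  # 0.0 -- valid English, but never counted
```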
What Humans See¶
Humans recognize:
- "cat" and "dog" are both animals
- "sat" and "lay" are both position verbs
- If "the dog lay" is valid, "the cat lay" should be too
The model has no way to make these connections. Each bigram is an island.
The Similarity Problem¶
The core issue: Markov models have no concept of similarity.
To a Markov model, "cat" is as different from "dog" as it is from "quantum" or "banana". There's no notion that some words are more related than others.
What We Need: Representations¶
Imagine if we could represent words as vectors where:
- Similar words have similar vectors
- Operations on vectors capture semantic relationships
```
# Hypothetical embeddings (what we'll build in Stage 3)
embed("cat") = [0.2, 0.8, 0.1, ...]   # furry, pet, small
embed("dog") = [0.3, 0.7, 0.2, ...]   # furry, pet, medium
embed("car") = [0.0, 0.0, 0.9, ...]   # metal, vehicle

# cat and dog are "close" in this space
distance(embed("cat"), embed("dog")) = 0.15   # small
distance(embed("cat"), embed("car")) = 0.95   # large
```
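To see that such distances are easy to compute, here is a runnable sketch using made-up 3-dimensional vectors and cosine distance (real embeddings are learned and have hundreds or thousands of dimensions):

```python
import math

# Made-up 3-d vectors for illustration only; real embeddings are learned.
embeddings = {
    "cat": [0.2, 0.8, 0.1],
    "dog": [0.3, 0.7, 0.2],
    "car": [0.0, 0.0, 0.9],
}

def cosine_distance(a, b):
    """1 - cosine similarity: near 0 for similar directions, near 1 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1 - dot / norm

print(cosine_distance(embeddings["cat"], embeddings["dog"]))  # small
print(cosine_distance(embeddings["cat"], embeddings["car"]))  # large
```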
With such representations, a model could learn:
"After animals, position verbs are likely"
This is a general rule that applies to all animals and all position verbs—including combinations never seen in training.
Connection to Modern LLMs
GPT-4, Claude, and LLaMA all use embeddings—learned vector representations of tokens. The embedding layer is literally the first thing the input passes through. A 50,000-token vocabulary becomes 50,000 vectors of dimension 4,096 or more.
These embeddings capture rich semantic relationships: "king - man + woman ≈ queen" is a famous example from word2vec that still works in modern embeddings.
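Here is a toy version of that analogy with hand-made 2-dimensional vectors, one axis loosely meaning "royalty" and the other "gender"; real embeddings learn such directions from data rather than having them assigned:

```python
# Hand-made 2-d vectors: [royalty, gender]. Purely illustrative.
words = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

# king - man + woman
target = [k - m + w for k, m, w in zip(words["king"], words["man"], words["woman"])]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

nearest = min(words, key=lambda w: dist(words[w], target))
print(nearest)  # queen
```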
The Memory Problem¶
Another fundamental limitation: Markov models store everything explicitly.
Counting vs. Learning¶
Markov approach: store a count for every observed n-gram.
- Storage: O(number of unique n-grams)
- For trigrams on a 50K vocabulary: up to 50,000³ = 125 trillion possible entries
- In practice storage is sparse, but it still grows with the amount of data

Neural approach: learn a function that maps context → probabilities.
- Storage: O(number of parameters), fixed regardless of data size
- GPT-2's 1.5 billion parameters handle an effectively unbounded set of patterns
- Parameters are shared across all contexts
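A rough back-of-the-envelope comparison of the two storage models (the neural dimensions below are illustrative, not any particular model's):

```python
vocab_size = 50_000

# Markov: the worst-case trigram table grows with the vocabulary (and, in
# practice, with the amount of training data, since more data means more
# unique n-grams).
possible_trigrams = vocab_size ** 3
print(f"{possible_trigrams:,}")  # 125,000,000,000,000 possible trigram entries

# Neural: the parameter count is fixed by the architecture, not by the data.
embed_dim, hidden_dim, context_size = 256, 1_024, 4
params = (vocab_size * embed_dim                    # embedding table
          + context_size * embed_dim * hidden_dim   # hidden layer weights
          + hidden_dim * vocab_size)                # output layer weights
print(f"{params:,}")  # tens of millions, regardless of corpus size
```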
A Concrete Example¶
Training data: 1 billion words of English.
Trigram model:
- Stores: ~500 million unique trigrams
- Each with counts
- Lookup table approach
Neural model:
- Learns: patterns like "adjective before noun", "verb after subject"
- Applies patterns to any input, including novel combinations
- Compresses the data into general rules
The Composition Problem¶
Language is compositional: meaning builds from parts. Consider the template "the [adjective] [animal] chased the [object]".
A Markov model must see each instantiation:
- "the big cat chased the mouse"
- "the small dog chased the ball"
- "the angry bird chased the worm"
A neural model can learn the template and fill it with any appropriate words.
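Here is a sketch of what learning the template buys: one pattern plus small word categories covers every combination, including ones never seen in training. The categories below are hand-written stand-ins for the similarity a neural model would learn.

```python
from itertools import product

# Hand-written categories standing in for learned similarity.
adjectives = ["big", "small", "angry"]
animals = ["cat", "dog", "bird"]
objects = ["mouse", "ball", "worm"]

# One template plus three small sets yields 27 sentences,
# including combinations never listed explicitly.
for adj, animal, obj in product(adjectives, animals, objects):
    print(f"the {adj} {animal} chased the {obj}")
```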
Syntactic Patterns¶
Consider subject-verb agreement across distance, as in "The cats that sat on the mat were sleeping": the verb "were" must agree with "cats", even though five words separate them.
A bigram model sees "mat were" and has no idea why "were" is correct. It can't see "cats" from that position.
A transformer with attention can directly connect "cats" to "were" regardless of distance.
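As a preview of the idea (attention is built properly in later stages), each position scores every other position and turns those scores into weights, so "were" can put most of its weight on "cats" no matter how many words sit between them. The scores below are made up by hand:

```python
import math

# Made-up attention scores from the position of "were" to every earlier token.
# In a real transformer these come from learned query/key projections.
tokens = ["the", "cats", "that", "sat", "on", "the", "mat"]
scores = [0.1, 4.0, 0.2, 0.5, 0.1, 0.1, 0.3]  # "were" attends strongly to "cats"

# Softmax turns scores into weights that sum to 1.
exp = [math.exp(s) for s in scores]
weights = [e / sum(exp) for e in exp]

for tok, w in zip(tokens, weights):
    print(f"{tok:>5}: {w:.2f}")
```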
What Neural Networks Provide¶
| Capability | Markov | Neural |
|---|---|---|
| Exact pattern recall | ✓ | ✓ |
| Generalization to similar patterns | ✗ | ✓ |
| Long-range dependencies | ✗ | ✓ |
| Compositional structure | ✗ | ✓ |
| Fixed memory footprint | ✗ | ✓ |
| Gradient-based optimization | ✗ | ✓ |
The Path Forward¶
To build models that generalize, we need:
- Continuous representations (embeddings)
    - Map discrete tokens to vectors
    - Similar tokens → similar vectors
    - Covered in Stage 3
- Differentiable functions
    - Learn from gradients, not counts
    - Requires automatic differentiation
    - Covered in Stage 2
- Flexible architectures
    - Neural networks that can capture complex patterns
    - Eventually: transformers with attention
    - Covered in Stages 3-8
A Preview: The Neural Language Model¶
Here's what we're building toward (Stage 3):
```python
class NeuralLM:
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_size):
        # Embedding: token → vector
        self.embed = Embedding(vocab_size, embed_dim)
        # Hidden layer: captures patterns
        self.hidden = Linear(embed_dim * context_size, hidden_dim)
        # Output: vector → probabilities
        self.output = Linear(hidden_dim, vocab_size)

    def forward(self, context):
        # Embed each token in context
        x = self.embed(context)       # [context_size, embed_dim]
        # Concatenate embeddings
        x = x.flatten()               # [context_size * embed_dim]
        # Transform through hidden layer
        x = relu(self.hidden(x))      # [hidden_dim]
        # Project to vocabulary
        logits = self.output(x)       # [vocab_size]
        # Convert to probabilities
        return softmax(logits)
```
The key insight: The same parameters are used for all contexts. Knowledge about "cat" informs predictions about "dog" because they have similar embeddings.
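A hypothetical usage sketch, assuming the Embedding, Linear, relu, and softmax building blocks we construct in Stage 3:

```python
# Hypothetical usage; Embedding, Linear, relu, and softmax are built in Stage 3.
model = NeuralLM(vocab_size=10_000, embed_dim=64, hidden_dim=256, context_size=3)

context = [42, 7, 1905]          # three token ids from the tokenizer
probs = model.forward(context)   # one probability for each of the 10,000 tokens

# The same embedding table and weight matrices serve every context, so whatever
# the model learns about one token also shapes predictions for similar tokens.
```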
But First: Gradients¶
Before we can train neural networks, we need to compute gradients—derivatives that tell us how to update parameters to reduce loss.
Computing gradients by hand for every operation would be tedious and error-prone. Instead, we'll build automatic differentiation: a system that computes gradients for any computation, automatically.
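As a taste of what a gradient buys us, here is a numerical approximation of the derivative of a toy one-parameter loss, used to nudge the parameter downhill. Stage 2 replaces this slow approximation with exact, automatic gradients:

```python
def loss(w):
    # A toy one-parameter "model": squared error against a target of 3.0.
    return (w * 2.0 - 3.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    """Finite-difference approximation of df/dw -- fine for a demo, far too
    slow and imprecise for models with millions of parameters."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)

w = 0.0
for step in range(5):
    grad = numerical_gradient(loss, w)
    w -= 0.1 * grad  # move against the gradient to reduce the loss
    print(f"step {step}: w = {w:.3f}, loss = {loss(w):.3f}")
```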
→ Stage 2: Automatic Differentiation
Exercises¶
1. Impossible Sentences: Create a list of 10 sentences that are valid English but would have zero probability under a bigram model trained on any reasonable corpus.
2. Similarity Matrix: For 10 common words, create a manual "similarity matrix" (1 = similar, 0 = different). How would you use this to improve a Markov model?
3. Compression Ratio: If we have 100 million words of training data and a neural model with 100 million parameters, what's the "compression ratio"? What does this imply about what the model learned?
Summary¶
| Limitation | Why It's Fundamental | Neural Solution |
|---|---|---|
| No generalization | Exact match required | Learned similarity |
| Exponential states | One state per n-gram | Shared parameters |
| Fixed context | Order-k assumption | Attention mechanism |
| No composition | Flat pattern matching | Hierarchical representations |
The key insight: Markov models memorize; neural networks generalize. Memorization works for patterns you've seen. Generalization works for the infinite space of patterns you haven't.
Language is infinite. We need models that can generalize.
→ Next: Stage 2 - Automatic Differentiation