Stage 1: The Simplest Language Model¶
Markov Chains — Where It All Begins
This stage builds a language model from absolute first principles. Every concept is derived, not stated. By the end, you'll understand not just what language models do, but why they work mathematically.
Reading Time
60-90 minutes | Prerequisites: Basic Python, high school algebra
Overview¶
We begin with the simplest possible language model: a Markov chain. Despite its simplicity, this model introduces concepts that underpin all modern LLMs:
- Autoregressive factorization — the same decomposition used by GPT, LLaMA, and Claude
- Maximum likelihood estimation — the same training objective
- Cross-entropy loss — the same loss function
- Temperature sampling — the same generation technique
- Perplexity — the same evaluation metric
The difference between a Markov chain and GPT-4 isn't in what they compute, but how they compute it.
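To make the autoregressive factorization concrete before the formal treatment, here is a minimal sketch with a hand-written toy bigram table (the probabilities are invented for illustration, not estimated from data, and this is not the stage's reference implementation):

```python
import math

# Toy bigram table: P(next_token | previous_token), values chosen arbitrarily
# for illustration only.
bigram = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a":   {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def sequence_log_prob(tokens):
    """log P(x_1..x_n) = sum_i log P(x_i | x_{i-1}) under a first-order Markov model."""
    log_p = 0.0
    prev = "<s>"
    for tok in tokens:
        log_p += math.log(bigram[prev][tok])
        prev = tok
    return log_p

# P("the cat sat") = P(the|<s>) * P(cat|the) * P(sat|cat) = 0.6 * 0.5 * 1.0
print(math.exp(sequence_log_prob(["the", "cat", "sat"])))  # ~0.3
```

GPT-style models use exactly this factorization; they only replace the lookup table with a neural network that produces the conditional distribution.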
What You'll Learn¶
- Probability Foundations — Kolmogorov axioms, conditional probability, chain rule (proved by induction)
- The Language Modeling Problem — Why exponential space makes direct modeling impossible
- Maximum Likelihood Estimation — Full Lagrangian derivation proving counting = optimal
- Information Theory — Entropy, cross-entropy, KL divergence derived from axioms
- Perplexity — The standard evaluation metric, with the "effective vocabulary" interpretation
- Temperature Sampling — From softmax to Boltzmann distribution
- Implementation — Complete ~150-line implementation with every design decision explained
- The Fundamental Trade-offs — Why we need neural networks
Key Formulas¶
| Concept | Formula | Intuition |
|---|---|---|
| Chain Rule | \(P(x_{1:n}) = \prod_i P(x_i \mid x_{<i})\) | Factor into conditionals |
| MLE | \(P(b\mid a) = \frac{\text{count}(a,b)}{\text{count}(a,\cdot)}\) | Counting is optimal |
| Cross-entropy | \(H(P,Q) = -\mathbb{E}_P[\log Q]\) | Average surprise |
| Perplexity | \(\exp(H)\) | Effective vocabulary |
| Temperature | \(P'(x) \propto P(x)^{1/T}\) | Control randomness |
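As a rough sketch of how the MLE, perplexity, and temperature rows translate into Python, the snippet below uses a made-up toy corpus and illustrative names; the reference implementation linked at the bottom of this page is the authoritative version:

```python
import math
import random
from collections import Counter, defaultdict

# MLE: P(b|a) = count(a,b) / count(a,.)  (counting is the optimal estimate)
corpus = "the cat sat on the mat the cat ran".split()  # toy data for illustration
counts = defaultdict(Counter)
for a, b in zip(corpus, corpus[1:]):
    counts[a][b] += 1

def p_mle(b, a):
    """Conditional probability of b following a, estimated by counting."""
    total = sum(counts[a].values())
    return counts[a][b] / total if total else 0.0

print(p_mle("cat", "the"))  # 2/3: "the" is followed by "cat" twice, "mat" once

# Cross-entropy and perplexity on a held-out phrase (all bigrams seen in training,
# so no zero probabilities appear in this toy example)
held_out = "the cat sat".split()
nll = -sum(math.log(p_mle(b, a)) for a, b in zip(held_out, held_out[1:])) / (len(held_out) - 1)
print(f"perplexity = {math.exp(nll):.2f}")  # exp of average negative log-likelihood

# Temperature: P'(x) proportional to P(x)^(1/T); T < 1 sharpens, T > 1 flattens
def sample_next(a, temperature=1.0):
    tokens = list(counts[a])
    weights = [p_mle(t, a) ** (1.0 / temperature) for t in tokens]
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next("the", temperature=0.5))  # low temperature favors the most likely next token
```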
The Central Insight¶
The Fundamental Trade-off
More context → better predictions
More context → sparser observations
We prove this trade-off quantitatively and back it up with experimental data. This limitation is why we need neural networks: they can generalize from similar patterns rather than requiring exact matches.
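If you want to see the sparsity side of the trade-off right away, the following sketch counts how often length-n contexts are observed exactly once; the vocabulary size and token count are arbitrary choices for illustration, not the experimental setup used later in the stage:

```python
import random
from collections import Counter

def singleton_fraction(tokens, n):
    """Fraction of distinct length-n contexts observed exactly once."""
    contexts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(1 for c in contexts.values() if c == 1) / len(contexts)

# Uniform random tokens are a crude stand-in for a corpus; natural language shows
# the same effect. With 10 token types and 5,000 tokens, short contexts repeat
# constantly, but a large share of 4- and 5-token contexts appear only once.
random.seed(0)
vocab = [f"w{i}" for i in range(10)]
tokens = [random.choice(vocab) for _ in range(5000)]
for n in range(1, 6):
    print(f"n={n}: {singleton_fraction(tokens, n):.2f} of contexts seen once")
```

Once most contexts are seen only once (or never), the counting estimate has nothing reliable to count, which is exactly the gap neural models fill.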
Begin Reading¶
Start with the foundations:
→ Section 1.1: Probability Foundations
Code & Resources¶
| Resource | Description |
|---|---|
| `code/stage-01/markov.py` | Reference implementation |
| `code/stage-01/tests/` | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |