flowchart LR
A["Token A"] --> PTH["Previous Token Head<br/>(Layer 1)"]
B["Token B"] --> PTH
PTH --> RS["Residual Stream:<br/>'B follows A'"]
RS --> IH["Induction Head<br/>(Layer 2)"]
A2["Token A<br/>(repeated)"] --> IH
IH --> OUT["Predict: B"]
19 Induction Heads
A complete case study
- How induction heads enable in-context learning (few-shot prompting)
- The two-layer circuit: previous-token head + induction head
- The phase transition: how induction heads emerge suddenly during training
- How to apply attribution, patching, and ablation to verify a circuit
Required: All of Arc II (Features, Superposition, Circuits) and Arc III (SAEs, Attribution, Patching, Ablation). This chapter synthesizes everything you’ve learned.
From Arc II & III, recall:
- Features are directions; circuits are how they compose (Chapters 5, 8)
- SAEs extract interpretable features from superposition (Chapter 9)
- Attribution finds candidates; patching tests causation; ablation tests necessity (Chapters 10-12)
We have theory and tools. Now we apply everything: reverse-engineer a complete circuit.
19.1 The Complete Picture
We’ve built a complete interpretability toolkit:
Theory (Arc II):
- Features as directions
- Superposition and sparsity
- Circuits as composable algorithms
Techniques (Arc III):
- SAEs for extracting features
- Attribution for finding correlations
- Patching for testing causation
- Ablation for identifying necessity
Now we apply everything to understand a single capability: in-context learning—the ability of language models to learn from examples within a prompt, without any gradient updates.
This chapter is a complete case study. We’ll reverse-engineer the circuit responsible for this capability, using every technique we’ve learned.
In-context learning: Given examples in a prompt, the model predicts continuations that follow the demonstrated pattern—without any training.
Example: “The cat chased the mouse. The dog chased the ___” → “bone” (following the “[animal] chased [prey/toy]” pattern)
How does this work? The answer is induction heads.
19.2 The Pattern
Induction heads detect and continue copying patterns.
The simplest case:
Input: [A] [B] ... [A] → ?
Output: [B]
When the model sees token \(A\) followed later by token \(B\), then encounters \(A\) again, it predicts \(B\).
19.2.1 Concrete Examples
Exact copying:
the quick brown fox jumped over the quick → brown
Pattern generalization:
When John went to the store, John bought milk. When Mary went to the store, Mary → bought
(The names differ, but the model mirrors the structure it saw the first time: “[Name] went to the store, [Name] bought…”)
In-context learning:
Q: France. A: Paris. Q: Germany. A: Berlin. Q: Spain. A: → Madrid
(The model learns the Q-A pattern from the examples in the prompt and retrieves the matching capital)
All three involve the same core mechanism: detecting a repeated token and retrieving what followed it previously.
Induction heads are the primary mechanism for few-shot learning in transformers. When you show GPT-4 three examples and it generalizes on the fourth, induction heads are doing much of the work. Understanding induction heads is understanding how transformers learn from context.
19.3 The Two-Layer Circuit
Induction heads aren’t a single component—they’re a circuit involving multiple attention heads across two layers.
In 2022, Catherine Olsson, Nelson Elhage, and collaborators at Anthropic were tracking how language models develop capabilities during training. They noticed something strange: at a specific point in training, both in-context learning ability and a distinctive attention pattern appeared simultaneously—within just a few thousand steps. When they investigated, they found a beautiful two-layer circuit. The first layer records “what follows each token.” The second layer uses that information to copy patterns. They called these “induction heads” because they perform induction—generalizing from examples. The discovery was published as “In-context Learning and Induction Heads” and remains one of the clearest examples of reverse-engineering a neural network capability.
19.3.1 The Algorithm
Layer 1: Previous Token Head
- Attention pattern: Each position attends to the previous token
- Effect: Writes information to the residual stream indicating “token \(B\) came after token \(A\)”
Layer 2: Induction Head
- Attention pattern: Looks for positions where the previous token matches the current token
- Effect: When it finds a match, it copies the token that followed
The composition:
1. At position \(i\), token is \(A\)
2. Previous token head at position \(i+1\) writes “\(B\) follows \(A\)” into the residual stream
3. Later, at position \(j\), token is again \(A\)
4. Induction head at position \(j\) searches for positions whose previous token was also \(A\)
5. The match is position \(i+1\), immediately after the earlier \(A\)
6. It attends there and copies the token at position \(i+1\), which is \(B\)
Previous token head: “Record what came before”
Induction head: “Find where this token appeared before and copy what followed”
Together: pattern matching and retrieval.
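To make the division of labor concrete, here is a toy, model-free sketch of the same algorithm in plain Python. Nothing in it is how the transformer actually computes (there is no dictionary in the residual stream); it simply mirrors the two roles: a first pass that records what follows each token, and a second pass that matches the current token and copies the recorded continuation.
def induction_predict(tokens):
    """Toy, non-neural sketch of the two-layer induction algorithm."""
    # "Previous token head": record what followed each token (latest occurrence wins)
    follows = {}
    for i in range(len(tokens) - 1):
        follows[tokens[i]] = tokens[i + 1]
    # "Induction head": match the current token against those records and copy
    current = tokens[-1]
    return follows.get(current)  # None if the current token hasn't appeared before

print(induction_predict(["the", "quick", "brown", "fox", "the"]))  # -> quick
In the real circuit, the “record” is the previous-token information the layer-1 head writes into each position’s residual stream, and the “lookup” is the layer-2 head’s query-key match.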
19.3.2 Interactive: Watch the Induction Algorithm
Step through the induction head algorithm to see how two layers work together. The Previous Token Head records what follows each token, and the Induction Head uses this to predict continuations when tokens repeat.
Code
viewof currentStep = Inputs.range([0, 5], {step: 1, value: 0, label: "Algorithm step"})
tokens = ["the", "quick", "brown", "fox", "the", "?"]
positions = [0, 1, 2, 3, 4, 5]
stepDescriptions = [
"Initial sequence: We see 'the quick brown fox the ?' — What should follow the second 'the'?",
"Layer 1: Previous Token Head processes the sequence, recording what follows each token.",
"At position 1, it records: 'quick follows the' into the residual stream.",
"Layer 2: When we reach position 4 (second 'the'), the Induction Head searches...",
"It finds position 0 where 'the' appeared before, and checks what followed (position 1).",
"Result: 'quick' followed 'the' before, so predict 'quick' again! ✓"
]
// Visualization
{
const width = 700;
const height = 380;
const svg = d3.create("svg")
.attr("viewBox", [0, 0, width, height])
.attr("width", width)
.attr("height", height);
// Token display area
const tokenY = 60;
const tokenSpacing = 90;
const startX = 80;
// Draw tokens
tokens.forEach((token, i) => {
const x = startX + i * tokenSpacing;
// Highlight based on step
let fillColor = "#f5f5f5";
let strokeColor = "#999";
let strokeWidth = 1;
if (currentStep >= 1 && i <= 1) {
// Recording phase - highlight first "the" and "quick"
if (currentStep === 2 && (i === 0 || i === 1)) {
fillColor = "#e3f2fd";
strokeColor = "#1976d2";
strokeWidth = 2;
}
}
if (currentStep >= 3 && i === 4) {
// Searching phase - highlight second "the"
fillColor = "#fff3e0";
strokeColor = "#f57c00";
strokeWidth = 2;
}
if (currentStep >= 4 && i === 0) {
// Found match
fillColor = "#e8f5e9";
strokeColor = "#388e3c";
strokeWidth = 2;
}
if (currentStep >= 5 && i === 1) {
// Retrieving "quick"
fillColor = "#c8e6c9";
strokeColor = "#2e7d32";
strokeWidth = 3;
}
if (currentStep >= 5 && i === 5) {
// Prediction slot
fillColor = "#c8e6c9";
strokeColor = "#2e7d32";
strokeWidth = 3;
}
// Token box
svg.append("rect")
.attr("x", x - 35)
.attr("y", tokenY - 18)
.attr("width", 70)
.attr("height", 36)
.attr("rx", 6)
.attr("fill", fillColor)
.attr("stroke", strokeColor)
.attr("stroke-width", strokeWidth);
// Token text
svg.append("text")
.attr("x", x)
.attr("y", tokenY + 5)
.attr("text-anchor", "middle")
.attr("font-size", "14px")
.attr("font-weight", "bold")
.attr("fill", "#333")
.text(i === 5 && currentStep >= 5 ? "quick" : token);
// Position number
svg.append("text")
.attr("x", x)
.attr("y", tokenY + 28)
.attr("text-anchor", "middle")
.attr("font-size", "10px")
.attr("fill", "#666")
.text(`pos ${i}`);
});
// Draw arrows for pattern matching (step 4+)
if (currentStep >= 4) {
// Arrow from position 4 back to position 0
svg.append("path")
.attr("d", `M ${startX + 4*tokenSpacing} ${tokenY + 40}
Q ${startX + 2*tokenSpacing} ${tokenY + 90}
${startX + 0*tokenSpacing} ${tokenY + 40}`)
.attr("fill", "none")
.attr("stroke", "#f57c00")
.attr("stroke-width", 2)
.attr("marker-end", "url(#arrowhead)");
svg.append("text")
.attr("x", startX + 2*tokenSpacing)
.attr("y", tokenY + 100)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#f57c00")
.text("'the' matches! Check what followed...");
}
// Draw retrieval arrow (step 5)
if (currentStep >= 5) {
// Arrow from position 1 to prediction
svg.append("path")
.attr("d", `M ${startX + 1*tokenSpacing} ${tokenY - 30}
Q ${startX + 3*tokenSpacing} ${tokenY - 70}
${startX + 5*tokenSpacing} ${tokenY - 30}`)
.attr("fill", "none")
.attr("stroke", "#388e3c")
.attr("stroke-width", 2)
.attr("marker-end", "url(#arrowhead-green)");
svg.append("text")
.attr("x", startX + 3*tokenSpacing)
.attr("y", tokenY - 75)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#388e3c")
.text("Copy 'quick' to prediction!");
}
// Arrow markers
svg.append("defs").append("marker")
.attr("id", "arrowhead")
.attr("viewBox", "0 -5 10 10")
.attr("refX", 8)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-5L10,0L0,5")
.attr("fill", "#f57c00");
svg.append("defs").append("marker")
.attr("id", "arrowhead-green")
.attr("viewBox", "0 -5 10 10")
.attr("refX", 8)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-5L10,0L0,5")
.attr("fill", "#388e3c");
// Layer indicators
const layerY = 180;
// Layer 1 box
const layer1Active = currentStep >= 1 && currentStep <= 2;
svg.append("rect")
.attr("x", 50)
.attr("y", layerY)
.attr("width", 280)
.attr("height", 70)
.attr("rx", 8)
.attr("fill", layer1Active ? "#e3f2fd" : "#fafafa")
.attr("stroke", layer1Active ? "#1976d2" : "#ddd")
.attr("stroke-width", layer1Active ? 2 : 1);
svg.append("text")
.attr("x", 190)
.attr("y", layerY + 20)
.attr("text-anchor", "middle")
.attr("font-size", "12px")
.attr("font-weight", "bold")
.attr("fill", layer1Active ? "#1976d2" : "#666")
.text("Layer 1: Previous Token Head");
svg.append("text")
.attr("x", 190)
.attr("y", layerY + 40)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#666")
.text("Records: \"B follows A\"");
if (currentStep >= 2) {
svg.append("text")
.attr("x", 190)
.attr("y", layerY + 55)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#1976d2")
.text("→ \"quick follows the\"");
}
// Layer 2 box
const layer2Active = currentStep >= 3;
svg.append("rect")
.attr("x", 370)
.attr("y", layerY)
.attr("width", 280)
.attr("height", 70)
.attr("rx", 8)
.attr("fill", layer2Active ? "#fff3e0" : "#fafafa")
.attr("stroke", layer2Active ? "#f57c00" : "#ddd")
.attr("stroke-width", layer2Active ? 2 : 1);
svg.append("text")
.attr("x", 510)
.attr("y", layerY + 20)
.attr("text-anchor", "middle")
.attr("font-size", "12px")
.attr("font-weight", "bold")
.attr("fill", layer2Active ? "#f57c00" : "#666")
.text("Layer 2: Induction Head");
svg.append("text")
.attr("x", 510)
.attr("y", layerY + 40)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#666")
.text("Searches for matching previous token");
if (currentStep >= 5) {
svg.append("text")
.attr("x", 510)
.attr("y", layerY + 55)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#388e3c")
.text("→ Predicts: \"quick\"");
}
// Step description
svg.append("rect")
.attr("x", 30)
.attr("y", 290)
.attr("width", 640)
.attr("height", 60)
.attr("rx", 8)
.attr("fill", "#f5f5f5")
.attr("stroke", "#ddd");
svg.append("text")
.attr("x", 350)
.attr("y", 325)
.attr("text-anchor", "middle")
.attr("font-size", "13px")
.attr("fill", "#333")
.text(stepDescriptions[currentStep]);
return svg.node();
}
19.3.3 Why Two Layers?
Transformers can’t implement induction in a single layer because attention is computed from the current residual stream state. At position \(j\), the model needs to:
1. Look backward for previous occurrences of the current token
2. To do this, know what the previous token at each earlier position was
But at position \(i\), the residual stream doesn’t natively contain “what’s the previous token?” information. Layer 1 writes this information into the stream, enabling layer 2 to use it.
This is K-composition (Chapter 8): layer 1’s output modifies layer 2’s keys, changing what layer 2 attends to.
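Here is a minimal numpy sketch of the K-composition idea, using toy dimensions, random embeddings, and idealized weight matrices (all assumptions, purely for illustration): because layer 1 writes each position’s previous token into the stream, a layer-2 query built from the current token can match exactly the position that follows the earlier occurrence of that token.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 10
E = rng.normal(size=(vocab, d_model))            # toy token embeddings

# Toy sequence of token ids: A B C D E A (the current token is the second A)
tokens = [0, 1, 2, 3, 4, 0]
x = E[tokens]                                    # residual stream before layer 1

# Layer 1 ("previous token head"): write each position's previous token into the
# stream. Real heads write a learned transform; copying the raw embedding into a
# separate component is a simplification.
prev_component = np.vstack([np.zeros(d_model), x[:-1]])

# Layer 2 ("induction head"), idealizing W_Q to read the current-token part and
# W_K to read the previous-token component written by layer 1.
query = E[tokens[-1]]                            # "which token am I?"  (A)
keys = prev_component                            # "which token came just before me?"
scores = keys @ query                            # attention logits over key positions
print(int(scores.argmax()))                      # 1: the position right after the first A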
19.4 Discovery Through Attention Patterns
The first clue that induction heads exist came from visualizing attention patterns.
19.4.1 The Signature Pattern
When you visualize what an induction head attends to, you see a distinctive diagonal stripe pattern:
Position:  0 1 2 3 4 5 6 7 8
Token:     A B C D E A B C D
Layer 2 attention at position 5 (the second \(A\)):
Position:  0 1 2 3 4 5 6 7 8
Attention: 0 █ 0 0 0 0 0 0 0
The head at position 5 (the second \(A\)) strongly attends to position 1, the position immediately after the earlier \(A\), whose previous token matches the current token. It then copies that token (\(B\)) as its prediction.
This creates diagonal stripes across the attention matrix because:
- When processing position 5, attend to position 1 (offset -4)
- When processing position 6, attend to position 2 (offset -4)
- When processing position 7, attend to position 3 (offset -4)
The constant offset creates a diagonal.
19.4.2 Searching for Induction Heads
Researchers developed an induction score to automatically detect these heads:
- Create inputs with repeated sequences: “[random] [random] [random] [random]…”
- Measure whether head \(H\) at position \(i\) attends to position \(j\) where \(\text{token}[j-1] = \text{token}[i]\) (i.e., the position just after an earlier occurrence of the current token)
- High score → likely induction head (a minimal scoring sketch follows below)
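A minimal TransformerLens sketch of this scoring recipe, assuming gpt2-small and the standard repeated-random-token setup (the sequence length, token range, and top-k reporting here are arbitrary choices):
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

# Repeated random tokens: [BOS] r_1..r_T r_1..r_T
T = 50
rand = torch.randint(1000, 10000, (1, T))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=1)

_, cache = model.run_with_cache(tokens)

# Induction score for each head: average attention from positions in the second
# repeat to the position one *after* the earlier occurrence of the same token,
# i.e. the key position j with token[j-1] == token[i]. Here that is offset -(T-1).
scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]            # (batch, head, query_pos, key_pos)
    diag = pattern[0, :, T + 1 :, 2 : T + 1].diagonal(dim1=-2, dim2=-1)
    scores[layer] = diag.mean(dim=-1)

# Heads with high scores are induction-head candidates
top = torch.topk(scores.flatten(), k=5)
for val, idx in zip(top.values.tolist(), top.indices.tolist()):
    layer, head = divmod(idx, model.cfg.n_heads)
    print(f"L{layer}H{head}: induction score {val:.2f}")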
Finding: In every transformer language model tested (GPT-2, GPT-3, BLOOM, LLaMA), induction heads emerge relatively early in the network (as early as the second layer in very small models, and typically in the early-to-middle layers of larger ones), comprising 5-15% of all attention heads. Crucially, the circuit requires at least two layers—a mathematical necessity proven by communication complexity arguments. Single-layer transformers would need dramatically larger models to solve induction tasks.
19.5 The Phase Transition
One of the most striking findings: induction heads don’t exist at initialization. They emerge suddenly during training.
19.5.1 The Training Dynamics
Anthropic researchers tracked training on a small model, measuring:
- Induction score: How strong the diagonal attention pattern is
- In-context learning performance: How well the model continues patterns
Both metrics show a sharp phase transition:
| Training steps | 0 | 5K | 10K | 15K | 20K |
|---|---|---|---|---|---|
| Induction score | 0.0 | 0.0 | 0.0 | 0.8 | 0.9 |
| ICL accuracy | 20% | 21% | 22% | 72% | 85% |
Around step 15K, both metrics spike simultaneously:
- Induction heads suddenly develop the diagonal attention pattern
- In-context learning capability suddenly improves
The transition isn’t gradual—it’s a discrete shift from “no induction heads” to “strong induction heads” over just a few thousand steps.
Induction heads aren’t present from the start—they’re discovered by gradient descent as a sharp improvement to the loss. Their sudden emergence suggests they’re a discrete algorithmic solution that the optimizer finds and implements quickly once conditions are right.
Recent research reveals that induction heads are just one phase in a sequence of algorithmic discoveries during training. Studies tracking circuit formation found models go through multiple distinct phases—developing token copying first, then pattern matching, then more sophisticated contextual algorithms. Each phase shows its own sharp transition. This suggests induction heads are a stepping stone, not a final destination: the model builds increasingly sophisticated circuits by composing simpler ones learned earlier.
19.5.2 What Causes the Transition?
The phase transition happens when:
1. Earlier layers have learned useful features (token identities, positional information)
2. The model has enough capacity to implement the two-layer circuit
3. Training data provides sufficient signal for the induction pattern
Before the transition: the model relies on simple heuristics (unigram frequencies, positional biases).
After the transition: the model uses genuine in-context learning via pattern matching.
19.6 Reverse-Engineering the Circuit
Let’s apply our interpretability toolkit to verify the induction head mechanism.
19.6.1 Step 1: Attribution
Which components contribute to induction predictions?
Run the model on: “the quick brown fox jumped over the quick → ___”
Measure logit attribution for “brown”:
| Component | Attribution to “brown” |
|---|---|
| Head 1.5 (previous token) | +0.8 |
| Head 2.3 (induction) | +2.4 |
| MLP layer 2 | +0.6 |
| Other heads | < 0.2 each |
Finding: Head 2.3 has high attribution. This is a candidate induction head.
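The numbers above are illustrative, but the measurement itself is mechanical. A sketch of per-head direct logit attribution with TransformerLens follows; it assumes gpt2-small, skips the final layer norm for simplicity, and uses an arbitrary reporting threshold:
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
model.set_use_attn_result(True)      # expose per-head outputs via hook_result

prompt = "the quick brown fox jumped over the quick"
target = model.to_single_token(" brown")
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Direct logit attribution: project each head's output at the final position onto
# the unembedding direction of the target token. Skipping the final layer norm is
# a simplification, so treat the numbers as approximate.
direction = model.W_U[:, target]                          # (d_model,)
for layer in range(model.cfg.n_layers):
    head_out = cache["result", layer][0, -1]              # (n_heads, d_model)
    attribution = head_out @ direction                    # (n_heads,)
    for head in range(model.cfg.n_heads):
        if attribution[head].abs() > 1.0:                 # arbitrary reporting threshold
            print(f"L{layer}H{head}: {attribution[head].item():+.2f}")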
19.6.2 Step 2: Attention Pattern Analysis
Visualize where head 2.3 attends:
When processing “the quick” (second occurrence), head 2.3 strongly attends to “the quick” (first occurrence) with offset matching the previous token pattern.
Confirmation: The attention pattern matches the induction head signature.
19.6.3 Step 3: Ablation
What happens if we remove head 2.3?
Ablate head 2.3 (mean ablation) and measure:
- Baseline: 87% accuracy on induction tasks
- Ablated: 23% accuracy
Finding: Performance collapses. Head 2.3 is necessary.
What about head 1.5 (previous token head)?
- Ablated head 1.5: 19% accuracy
Finding: Head 1.5 is also necessary. Both components of the circuit are critical.
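A sketch of a single-head mean ablation with TransformerLens hooks. The layer/head indices are placeholders (substitute whichever heads your induction-score scan flags), and the mean is taken over this one prompt’s positions rather than a reference dataset, which is a simplification:
import transformer_lens as tl
from transformer_lens import utils

model = tl.HookedTransformer.from_pretrained("gpt2-small")

LAYER, HEAD = 5, 5        # placeholders: use heads flagged by your induction-score scan

prompt = "the quick brown fox jumped over the quick"
tokens = model.to_tokens(prompt)
target = model.to_single_token(" brown")

# Mean of this head's output over the prompt's positions (a real mean ablation
# would average over a reference dataset; this is a simplification)
_, cache = model.run_with_cache(tokens)
mean_z = cache["z", LAYER][0, :, HEAD].mean(dim=0)        # (d_head,)

def mean_ablate_head(z, hook):
    # z: (batch, pos, head_index, d_head); overwrite one head at every position
    z[:, :, HEAD, :] = mean_z
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), mean_ablate_head)],
)
print("clean logit for ' brown':  ", clean_logits[0, -1, target].item())
print("ablated logit for ' brown':", ablated_logits[0, -1, target].item())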
19.6.4 Step 4: Patching
Clean: “A B C D E F A B → C”
Corrupted: “A B C D E F X Y → ?”
Patch head 2.3’s output from clean to corrupted:
- Corrupted logit for “C”: -1.2
- Patched logit for “C”: +0.8
Recovery: 80%+. Patching head 2.3 largely restores the correct prediction.
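A sketch of the corresponding activation patch: cache the clean run, then overwrite one head’s output during the corrupted run. The prompts are the chapter’s, the layer/head indices are placeholders, and the resulting numbers will not match the illustrative figures above:
import transformer_lens as tl
from transformer_lens import utils

model = tl.HookedTransformer.from_pretrained("gpt2-small")

LAYER, HEAD = 5, 5        # placeholders for your candidate induction head

clean = model.to_tokens("A B C D E F A B")
corrupted = model.to_tokens("A B C D E F X Y")   # same length, pattern destroyed
target = model.to_single_token(" C")

# Cache the clean run, then overwrite the head's output during the corrupted run
_, clean_cache = model.run_with_cache(clean)

def patch_head_z(z, hook):
    # z: (batch, pos, head_index, d_head)
    z[:, :, HEAD, :] = clean_cache["z", LAYER][:, :, HEAD, :]
    return z

corrupted_logits = model(corrupted)
patched_logits = model.run_with_hooks(
    corrupted,
    fwd_hooks=[(utils.get_act_name("z", LAYER), patch_head_z)],
)
print("corrupted logit for ' C':", corrupted_logits[0, -1, target].item())
print("patched logit for ' C':  ", patched_logits[0, -1, target].item())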
Path patching: Does the head 1.5 → head 2.3 path matter?
Patch head 1.5’s contribution to head 2.3’s keys:
- Recovery: 65%
Finding: The connection from head 1.5 to head 2.3 is causally important. This confirms the two-layer circuit.
19.6.5 Step 5: Feature Analysis (with SAEs)
Train a sparse autoencoder on layer 2 activations.
Which features activate strongly during induction?
- Feature 1,248: “Repeated token detection”
- Feature 3,891: “Previous position offset”
- Feature 7,102: “Copy operation”
Ablating feature 1,248: Accuracy drops from 87% to 31%.
Finding: Specific features encode the components of the induction algorithm.
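What this step might look like in code, heavily hedged: the SAE below is a randomly initialized stand-in (so the sketch runs at all), and you would swap in a trained SAE from whatever library or checkpoint you use; the hook point and any feature indices are illustrative, not the ones named above:
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

class ToySAE(torch.nn.Module):
    """Randomly initialized stand-in for a *trained* SAE, so the sketch runs."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, n_features) / d_model ** 0.5)
        self.b_enc = torch.nn.Parameter(torch.zeros(n_features))
    def encode(self, acts):
        return torch.relu(acts @ self.W_enc + self.b_enc)

HOOK = "blocks.5.hook_resid_post"                    # illustrative hook point
sae = ToySAE(model.cfg.d_model, n_features=16384)    # replace with a real trained SAE

prompt = "the quick brown fox jumped over the quick"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

acts = cache[HOOK][0, -1]             # residual stream at the final position
feature_acts = sae.encode(acts)       # (n_features,) sparse feature activations

# The most active features at the moment of the induction prediction
top = torch.topk(feature_acts, k=10)
for val, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"feature {idx}: activation {val:.2f}")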
19.7 The Complete Circuit Diagram
Putting it all together:
Position i: [token A]
↓
Layer 1: Previous Token Head (1.5)
Writes "token A followed by ___" to residual stream
↓
Position i+1: [token B]
Reads residual stream
Associates "A → B"
↓
...later...
↓
Position j: [token A] (repeated)
↓
Layer 2: Induction Head (2.3)
Query: "Where did A appear before?"
Key matching: Finds position i (using previous token info from head 1.5)
Value: Retrieves token at position i+1
↓
Output: Predict "B"
Verification:
- Attribution: ✓ (head 2.3 has highest attribution)
- Attention: ✓ (diagonal stripe pattern)
- Ablation: ✓ (removing either head breaks the circuit)
- Patching: ✓ (patching restores behavior)
- Features: ✓ (interpretable features encode the algorithm)
This is a fully reverse-engineered circuit.
19.8 Generalizations and Variations
Induction heads aren’t a single universal algorithm—they’re a family of related circuits.
19.8.1 Fuzzy Matching
Some induction heads don’t require exact token matches. They trigger on:
- Semantic similarity (“Paris” → “France” matches “Berlin” → “Germany”)
- Structural similarity (matching syntax, not content)
These “fuzzy induction heads” enable more sophisticated in-context learning.
19.8.2 Multi-Token Patterns
Some induction heads track longer sequences: [A][B][C] … [A][B] → [C]
These enable learning from richer context.
19.8.3 Position-Dependent Induction
Some heads combine induction with positional information:
- “This token appeared \(k\) positions ago”
- “Copy, but only if within the last \(n\) tokens”
These add constraints to the copying mechanism.
19.8.4 Translation Induction
In multilingual models:
- “French word X translates to English word Y”
- Later: “French word Z translates to…” → retrieve the translation pattern
This is induction across languages.
19.9 Why Induction Heads Matter
Induction heads are foundational:
19.9.1 1. They Enable In-Context Learning
The core capability that makes few-shot prompting work. Without induction heads, language models couldn’t generalize from examples in context.
19.9.2 2. They Emerge Reliably
Every large language model develops induction heads. This suggests they’re a convergent solution—gradient descent discovers them independently across architectures, scales, and training regimes.
19.9.3 3. They’re Understandable
Unlike most neural network behaviors, the induction circuit is: - Localizable (specific heads in specific layers) - Interpretable (the algorithm is clear) - Verifiable (all techniques confirm the mechanism)
This makes induction heads the best-understood capability in transformers.
If you want to see induction heads directly, TransformerLens makes it straightforward:
import transformer_lens as tl
model = tl.HookedTransformer.from_pretrained("gpt2-small")
# Create a sequence with repetition
prompt = "A B C D E A B C D" # Should predict "E"
# Get attention patterns
_, cache = model.run_with_cache(prompt)
# Look at a layer where induction heads tend to live (in gpt2-small they are
# usually reported around layers 5-7, not the very first layers)
# Induction heads show diagonal stripes in their attention patterns
attn = cache["pattern", 5]  # Shape: (batch, head, query_pos, key_pos)
Look for heads with strong attention to positions whose previous token matches the current token. That’s the induction signature.
19.9.4 4. They Demonstrate Composition
The circuit requires two layers working together—K-composition between previous token heads and induction heads. This is proof that transformers build complex algorithms by composing simple components.
19.10 Connections to Broader Capabilities
Induction heads aren’t isolated—they connect to many model capabilities.
19.10.1 Translation
Parallel corpus learning: “French: bonjour. English: hello. French: merci. English: → thank you”
Induction pattern: [source language token] → [target language token]
19.10.2 Code Completion
Pattern: Function signature → function body
19.10.3 Analogical Reasoning
“King is to Queen as Man is to → Woman”
This is induction across semantic spaces.
19.10.4 Instruction Following
“Q: What is 2+2? A: 4. Q: What is 3+3? A: → 6”
The Q-A structure is learned via induction.
Many “emergent capabilities” may be sophisticated applications of induction heads. The basic copying circuit, combined with semantic features, enables learning from examples across domains.
19.11 Limitations and Open Questions
Despite being well-understood, induction heads leave questions unanswered:
19.11.1 What’s the Capacity Limit?
How many patterns can induction heads track simultaneously? Early experiments suggest ~10-20, but this varies by model and context length.
19.11.2 How Do They Interact with Other Circuits?
Induction heads are part of a larger system. How do they interact with:
- Factual recall circuits
- Reasoning circuits
- Output formatting circuits
The interfaces aren’t fully mapped.
19.11.3 Why This Algorithm?
Gradient descent discovered induction heads, but are they optimal? Could there be better algorithms for in-context learning that transformers haven’t found?
19.11.4 Do They Scale?
Induction heads are clear in small models (GPT-2, 124M parameters). In large models (70B+ parameters), are the circuits still as clean? Early evidence suggests more redundancy and fuzzier boundaries.
The induction head circuit is remarkably well-understood—for GPT-2 Small. Here’s an honest calibration:
What we know well:
- The two-layer circuit (previous token + induction head) in 124M-1B parameter models
- The phase transition during training
- That every tested model develops some form of induction heads
What we know less well:
- Exact circuit details in 70B+ parameter production models
- How induction heads interact with other circuits in complex prompts
- Whether the clean two-head story holds at scale or becomes messier
What we don’t know:
- The capacity limits (how many patterns simultaneously?)
- Whether there are better algorithms the models haven’t found
- How much in-context learning is induction heads vs. other mechanisms
Numbers without error bars: The accuracy numbers in this chapter (87% baseline, 23% ablated) are illustrative, not from a single definitive study. Real numbers vary by model, prompt, and measurement method. When replicating, expect variance.
The induction head story is the best story we have about any transformer circuit. It’s also incomplete. Both facts are important.
19.12 Polya’s Perspective: Worked Example
This chapter applies Polya’s heuristic: study worked examples.
Before trying to reverse-engineer every capability, understand one capability completely. Induction heads are that worked example:
- Well-defined behavior
- Discoverable circuit
- Verifiable mechanism
- Applicable techniques
Once you’ve reverse-engineered one circuit completely, you have a template for reverse-engineering others. The process (attribution → attention analysis → ablation → patching → features → circuit diagram) transfers.
“Study solutions to related problems.” You can’t learn proof techniques by reading theory alone—you need worked examples. Induction heads are the worked example for mechanistic interpretability. Master this case, then apply the approach to other circuits.
19.13 Looking Ahead
We’ve now seen the full interpretability workflow in action, applied to a real capability.
But interpretability research is incomplete. Many fundamental questions remain open:
- How much of model behavior can we explain with circuits?
- What capabilities resist circuit-based explanation?
- How do we scale interpretability to 100B+ parameter models?
- Can we use interpretability to improve safety and alignment?
These questions are the subject of the next chapter: Open Problems in Mechanistic Interpretability.
After that, we’ll close with A Practice Regime—concrete advice for how to actually do interpretability research, from choosing problems to debugging circuits to publishing results.
19.14 Key Takeaways
┌────────────────────────────────────────────────────────────┐
│ INDUCTION HEADS: A Complete Case Study │
├────────────────────────────────────────────────────────────┤
│ │
│ WHAT THEY DO: Enable in-context learning (few-shot) │
│ Pattern: [A][B]...[A] → predict [B] │
│ │
│ THE CIRCUIT (2 layers): │
│ Layer 1: Previous Token Head │
│ → Records "B follows A" in residual stream │
│ Layer 2: Induction Head │
│ → Finds where A appeared, retrieves what │
│ followed, predicts it will repeat │
│ │
│ KEY FINDINGS: │
│ • Phase transition: emerges SUDDENLY during training │
│ • Found in ALL transformer LLMs tested │
│ • K-composition: Layer 1 output → Layer 2 keys │
│ │
│ VERIFICATION CHECKLIST: │
│ ✓ Attribution (high logit contribution) │
│ ✓ Attention pattern (diagonal stripe) │
│ ✓ Ablation (removing breaks the circuit) │
│ ✓ Patching (restoring recovers behavior) │
│ ✓ Features (interpretable SAE features) │
│ │
│ WHY IT MATTERS: │
│ Best-understood circuit in transformers │
│ Template for reverse-engineering other capabilities │
│ │
└────────────────────────────────────────────────────────────┘
19.15 Check Your Understanding
Question 1: Why does the induction circuit require two layers?
Answer: At any position, attention is computed based on the current residual stream state. To implement induction, the model needs to:
1. Look backward for previous occurrences of the current token
2. Know what the previous token was at each earlier position
But the residual stream at position \(i\) doesn’t natively contain “what was the previous token?” information. Layer 1 writes this information into the stream, enabling Layer 2 to use it. This is K-composition: Layer 1’s output modifies Layer 2’s keys, changing what Layer 2 attends to. Single-layer transformers would need exponentially larger models to solve induction tasks.
Question 2: What does an induction head’s attention pattern look like, and how do researchers detect induction heads automatically?
Answer: When you visualize an induction head’s attention, you see positions attending to earlier positions with a constant offset. For example:
- Position 5 attends to position 1 (offset -4)
- Position 6 attends to position 2 (offset -4)
- Position 7 attends to position 3 (offset -4)
This creates diagonal stripes across the attention matrix. The pattern emerges because the head is looking for “positions where the previous token matches my previous token”—and in repeated sequences, these matches occur at consistent offsets. Researchers use an induction score (measuring this diagonal pattern) to automatically detect induction heads.
Question 3: What is the induction-head phase transition, and what does its sharpness tell us?
Answer: Induction heads don’t exist at initialization—they emerge suddenly around a specific training step. Both the diagonal attention pattern and in-context learning performance spike simultaneously within just a few thousand steps, not gradually.
This tells us:
1. Induction heads are a discrete algorithmic solution, not a gradual improvement
2. Gradient descent “discovers” this algorithm once conditions are right (useful features in early layers, sufficient capacity)
3. Before the transition: model uses simple heuristics (unigram frequencies)
4. After: genuine in-context learning via pattern matching
The sharpness suggests induction heads are a qualitative leap in capability, not just quantitative improvement.
19.16 Further Reading
In-Context Learning and Induction Heads — Anthropic: The definitive paper on induction heads, including the phase transition discovery.
A Mathematical Framework for Transformer Circuits — Anthropic: The theoretical foundations for understanding composition in transformers.
Progress Measures for Grokking — arXiv:2301.05217: Analysis of the phase transition and what causes sudden capability emergence.
Induction Head Replication — Neel Nanda: Step-by-step guide to finding induction heads in any transformer.
The Quantization Model of Neural Scaling — arXiv:2303.13506: Theoretical framework explaining why capabilities emerge suddenly (phase transitions).
Transformer Circuits Thread — Anthropic: Collection of papers reverse-engineering transformer circuits, with induction heads as a central example.