flowchart LR
A["Token A"] --> PTH["Previous Token Head<br/>(Layer 1)"]
B["Token B"] --> PTH
PTH --> RS["Residual Stream:<br/>'B follows A'"]
RS --> IH["Induction Head<br/>(Layer 2)"]
A2["Token A<br/>(repeated)"] --> IH
IH --> OUT["Predict: B"]
19 Induction Heads
A complete case study
- How induction heads enable in-context learning (few-shot prompting)
- The two-layer circuit: previous-token head + induction head
- The phase transition: how induction heads emerge suddenly during training
- How to apply attribution, patching, and ablation to verify a circuit
Required: All of Arc II (Features, Superposition, Circuits) and Arc III (SAEs, Attribution, Patching, Ablation). This chapter synthesizes everything you’ve learned.
From Arc II & III, recall:
- Features are directions; circuits are how they compose (Chapters 5, 8)
- SAEs extract interpretable features from superposition (Chapter 9)
- Attribution finds candidates; patching tests causation; ablation tests necessity (Chapters 10-12)
We have theory and tools. Now we apply everything: reverse-engineer a complete circuit.
19.1 The Complete Picture
We’ve built a complete interpretability toolkit:
Theory (Arc II):
- Features as directions
- Superposition and sparsity
- Circuits as composable algorithms
Techniques (Arc III):
- SAEs for extracting features
- Attribution for finding correlations
- Patching for testing causation
- Ablation for identifying necessity
Now we apply everything to understand a single capability: in-context learning—the ability of language models to learn from examples within a prompt, without any gradient updates.
This chapter is a complete case study. We’ll reverse-engineer the circuit responsible for this capability, using every technique we’ve learned.
In-context learning: Given examples in a prompt, the model predicts continuations that follow the demonstrated pattern—without any training.
Example: “The cat chased the mouse. The dog chased the ___” → “bone” (following the “[animal] chased [prey/toy]” pattern)
How does this work? The answer is induction heads.
19.2 The Pattern
Induction heads detect and continue copying patterns.
The simplest case:
Input: [A] [B] ... [A] → ?
Output: [B]
When the model sees token \(A\) followed later by token \(B\), then encounters \(A\) again, it predicts \(B\).
19.2.1 Concrete Examples
Exact copying:
the quick brown fox jumped over the quick → brown
Pattern generalization:
When John went to the store, John bought milk. When Mary went to the store, Mary → bought
(The names differ, but the model mirrors the structure it saw the first time: “[Name] went to the store, [Name] bought…”)
In-context learning:
Q: France. A: Paris. Q: Germany. A: Berlin. Q: Spain. A: → Madrid
(The model learns the Q-A pattern from the examples in the prompt and retrieves the matching capital)
All three involve the same core mechanism: detecting a repeated token and retrieving what followed it previously.
Induction heads are the primary mechanism for few-shot learning in transformers. When you show GPT-4 three examples and it generalizes on the fourth, induction heads are doing much of the work. Understanding induction heads is understanding how transformers learn from context.
19.3 The Two-Layer Circuit
Induction heads aren’t a single component—they’re a circuit involving multiple attention heads across two layers.
In 2022, Catherine Olsson, Nelson Elhage, and collaborators at Anthropic were tracking how language models develop capabilities during training. They noticed something strange: at a specific point in training, both in-context learning ability and a distinctive attention pattern appeared simultaneously—within just a few thousand steps. When they investigated, they found a beautiful two-layer circuit. The first layer records “what follows each token.” The second layer uses that information to copy patterns. They called these “induction heads” because they perform induction—generalizing from examples. The discovery was published as “In-context Learning and Induction Heads” and remains one of the clearest examples of reverse-engineering a neural network capability.
19.3.1 The Algorithm
Layer 1: Previous Token Head
- Attention pattern: Each position attends to the previous token
- Effect: Writes information to the residual stream indicating “token \(B\) came after token \(A\)”
Layer 2: Induction Head
- Attention pattern: Looks for positions where the previous token matches the current token
- Effect: When it finds a match, it copies the token that followed
The composition:
1. At position \(i\), token is \(A\)
2. Previous token head at position \(i+1\) writes “\(B\) follows \(A\)” into the residual stream
3. Later, at position \(j\), token is again \(A\)
4. Induction head at position \(j\) searches for positions whose previous token was also \(A\)
5. The match is position \(i+1\), immediately after the earlier \(A\)
6. It attends there and copies the token at position \(i+1\), which is \(B\)
Previous token head: “Record what came before”
Induction head: “Find where this token appeared before and copy what followed”
Together: pattern matching and retrieval.
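To make the division of labor concrete, here is a toy, model-free sketch of the same algorithm in plain Python. Nothing in it is how the transformer actually computes (there is no dictionary in the residual stream); it simply mirrors the two roles: a first pass that records what follows each token, and a second pass that matches the current token and copies the recorded continuation.
def induction_predict(tokens):
    """Toy, non-neural sketch of the two-layer induction algorithm."""
    # "Previous token head": record what followed each token (latest occurrence wins)
    follows = {}
    for i in range(len(tokens) - 1):
        follows[tokens[i]] = tokens[i + 1]
    # "Induction head": match the current token against those records and copy
    current = tokens[-1]
    return follows.get(current)  # None if the current token hasn't appeared before

print(induction_predict(["the", "quick", "brown", "fox", "the"]))  # -> quick
In the real circuit, the “record” is the previous-token information the layer-1 head writes into each position’s residual stream, and the “lookup” is the layer-2 head’s query-key match.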
19.3.2 Interactive: Watch the Induction Algorithm
Step through the induction head algorithm to see how two layers work together. The Previous Token Head records what follows each token, and the Induction Head uses this to predict continuations when tokens repeat.
Code
viewof currentStep = Inputs.range([0, 5], {step: 1, value: 0, label: "Algorithm step"})
tokens = ["the", "quick", "brown", "fox", "the", "?"]
positions = [0, 1, 2, 3, 4, 5]
stepDescriptions = [
"Initial sequence: We see 'the quick brown fox the ?' — What should follow the second 'the'?",
"Layer 1: Previous Token Head processes the sequence, recording what follows each token.",
"At position 1, it records: 'quick follows the' into the residual stream.",
"Layer 2: When we reach position 4 (second 'the'), the Induction Head searches...",
"It finds position 0 where 'the' appeared before, and checks what followed (position 1).",
"Result: 'quick' followed 'the' before, so predict 'quick' again! ✓"
]
// Visualization
{
const width = 700;
const height = 380;
const svg = d3.create("svg")
.attr("viewBox", [0, 0, width, height])
.attr("width", width)
.attr("height", height);
// Token display area
const tokenY = 60;
const tokenSpacing = 90;
const startX = 80;
// Draw tokens
tokens.forEach((token, i) => {
const x = startX + i * tokenSpacing;
// Highlight based on step
let fillColor = "#f5f5f5";
let strokeColor = "#999";
let strokeWidth = 1;
if (currentStep >= 1 && i <= 1) {
// Recording phase - highlight first "the" and "quick"
if (currentStep === 2 && (i === 0 || i === 1)) {
fillColor = "#e3f2fd";
strokeColor = "#1976d2";
strokeWidth = 2;
}
}
if (currentStep >= 3 && i === 4) {
// Searching phase - highlight second "the"
fillColor = "#fff3e0";
strokeColor = "#f57c00";
strokeWidth = 2;
}
if (currentStep >= 4 && i === 0) {
// Found match
fillColor = "#e8f5e9";
strokeColor = "#388e3c";
strokeWidth = 2;
}
if (currentStep >= 5 && i === 1) {
// Retrieving "quick"
fillColor = "#c8e6c9";
strokeColor = "#2e7d32";
strokeWidth = 3;
}
if (currentStep >= 5 && i === 5) {
// Prediction slot
fillColor = "#c8e6c9";
strokeColor = "#2e7d32";
strokeWidth = 3;
}
// Token box
svg.append("rect")
.attr("x", x - 35)
.attr("y", tokenY - 18)
.attr("width", 70)
.attr("height", 36)
.attr("rx", 6)
.attr("fill", fillColor)
.attr("stroke", strokeColor)
.attr("stroke-width", strokeWidth);
// Token text
svg.append("text")
.attr("x", x)
.attr("y", tokenY + 5)
.attr("text-anchor", "middle")
.attr("font-size", "14px")
.attr("font-weight", "bold")
.attr("fill", "#333")
.text(i === 5 && currentStep >= 5 ? "quick" : token);
// Position number
svg.append("text")
.attr("x", x)
.attr("y", tokenY + 28)
.attr("text-anchor", "middle")
.attr("font-size", "10px")
.attr("fill", "#666")
.text(`pos ${i}`);
});
// Draw arrows for pattern matching (step 4+)
if (currentStep >= 4) {
// Arrow from position 4 back to position 0
svg.append("path")
.attr("d", `M ${startX + 4*tokenSpacing} ${tokenY + 40}
Q ${startX + 2*tokenSpacing} ${tokenY + 90}
${startX + 0*tokenSpacing} ${tokenY + 40}`)
.attr("fill", "none")
.attr("stroke", "#f57c00")
.attr("stroke-width", 2)
.attr("marker-end", "url(#arrowhead)");
svg.append("text")
.attr("x", startX + 2*tokenSpacing)
.attr("y", tokenY + 100)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#f57c00")
.text("'the' matches! Check what followed...");
}
// Draw retrieval arrow (step 5)
if (currentStep >= 5) {
// Arrow from position 1 to prediction
svg.append("path")
.attr("d", `M ${startX + 1*tokenSpacing} ${tokenY - 30}
Q ${startX + 3*tokenSpacing} ${tokenY - 70}
${startX + 5*tokenSpacing} ${tokenY - 30}`)
.attr("fill", "none")
.attr("stroke", "#388e3c")
.attr("stroke-width", 2)
.attr("marker-end", "url(#arrowhead-green)");
svg.append("text")
.attr("x", startX + 3*tokenSpacing)
.attr("y", tokenY - 75)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#388e3c")
.text("Copy 'quick' to prediction!");
}
// Arrow markers
svg.append("defs").append("marker")
.attr("id", "arrowhead")
.attr("viewBox", "0 -5 10 10")
.attr("refX", 8)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-5L10,0L0,5")
.attr("fill", "#f57c00");
svg.append("defs").append("marker")
.attr("id", "arrowhead-green")
.attr("viewBox", "0 -5 10 10")
.attr("refX", 8)
.attr("markerWidth", 6)
.attr("markerHeight", 6)
.attr("orient", "auto")
.append("path")
.attr("d", "M0,-5L10,0L0,5")
.attr("fill", "#388e3c");
// Layer indicators
const layerY = 180;
// Layer 1 box
const layer1Active = currentStep >= 1 && currentStep <= 2;
svg.append("rect")
.attr("x", 50)
.attr("y", layerY)
.attr("width", 280)
.attr("height", 70)
.attr("rx", 8)
.attr("fill", layer1Active ? "#e3f2fd" : "#fafafa")
.attr("stroke", layer1Active ? "#1976d2" : "#ddd")
.attr("stroke-width", layer1Active ? 2 : 1);
svg.append("text")
.attr("x", 190)
.attr("y", layerY + 20)
.attr("text-anchor", "middle")
.attr("font-size", "12px")
.attr("font-weight", "bold")
.attr("fill", layer1Active ? "#1976d2" : "#666")
.text("Layer 1: Previous Token Head");
svg.append("text")
.attr("x", 190)
.attr("y", layerY + 40)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#666")
.text("Records: \"B follows A\"");
if (currentStep >= 2) {
svg.append("text")
.attr("x", 190)
.attr("y", layerY + 55)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#1976d2")
.text("→ \"quick follows the\"");
}
// Layer 2 box
const layer2Active = currentStep >= 3;
svg.append("rect")
.attr("x", 370)
.attr("y", layerY)
.attr("width", 280)
.attr("height", 70)
.attr("rx", 8)
.attr("fill", layer2Active ? "#fff3e0" : "#fafafa")
.attr("stroke", layer2Active ? "#f57c00" : "#ddd")
.attr("stroke-width", layer2Active ? 2 : 1);
svg.append("text")
.attr("x", 510)
.attr("y", layerY + 20)
.attr("text-anchor", "middle")
.attr("font-size", "12px")
.attr("font-weight", "bold")
.attr("fill", layer2Active ? "#f57c00" : "#666")
.text("Layer 2: Induction Head");
svg.append("text")
.attr("x", 510)
.attr("y", layerY + 40)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#666")
.text("Searches for matching previous token");
if (currentStep >= 5) {
svg.append("text")
.attr("x", 510)
.attr("y", layerY + 55)
.attr("text-anchor", "middle")
.attr("font-size", "11px")
.attr("fill", "#388e3c")
.text("→ Predicts: \"quick\"");
}
// Step description
svg.append("rect")
.attr("x", 30)
.attr("y", 290)
.attr("width", 640)
.attr("height", 60)
.attr("rx", 8)
.attr("fill", "#f5f5f5")
.attr("stroke", "#ddd");
svg.append("text")
.attr("x", 350)
.attr("y", 325)
.attr("text-anchor", "middle")
.attr("font-size", "13px")
.attr("fill", "#333")
.text(stepDescriptions[currentStep]);
return svg.node();
}
19.3.3 Why Two Layers?
Transformers can’t implement induction in a single layer because attention is computed from the current residual stream state. At position \(j\), the model needs to:
1. Look backward for previous occurrences of the current token
2. To do this, know what the previous token at each earlier position was
But at position \(i\), the residual stream doesn’t natively contain “what’s the previous token?” information. Layer 1 writes this information into the stream, enabling layer 2 to use it.
This is K-composition (Chapter 8): layer 1’s output modifies layer 2’s keys, changing what layer 2 attends to.
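Here is a minimal numpy sketch of the K-composition idea, using toy dimensions, random embeddings, and idealized weight matrices (all assumptions, purely for illustration): because layer 1 writes each position’s previous token into the stream, a layer-2 query built from the current token can match exactly the position that follows the earlier occurrence of that token.
import numpy as np

rng = np.random.default_rng(0)
d_model, vocab = 64, 10
E = rng.normal(size=(vocab, d_model))            # toy token embeddings

# Toy sequence of token ids: A B C D E A (the current token is the second A)
tokens = [0, 1, 2, 3, 4, 0]
x = E[tokens]                                    # residual stream before layer 1

# Layer 1 ("previous token head"): write each position's previous token into the
# stream. Real heads write a learned transform; copying the raw embedding into a
# separate component is a simplification.
prev_component = np.vstack([np.zeros(d_model), x[:-1]])

# Layer 2 ("induction head"), idealizing W_Q to read the current-token part and
# W_K to read the previous-token component written by layer 1.
query = E[tokens[-1]]                            # "which token am I?"  (A)
keys = prev_component                            # "which token came just before me?"
scores = keys @ query                            # attention logits over key positions
print(int(scores.argmax()))                      # 1: the position right after the first A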
19.4 Discovery Through Attention Patterns
The first clue that induction heads exist came from visualizing attention patterns.
19.4.1 The Signature Pattern
When you visualize what an induction head attends to, you see a distinctive diagonal stripe pattern:
Position:  0 1 2 3 4 5 6 7 8
Token:     A B C D E A B C D
Layer 2 attention at position 5 (the second \(A\)):
Position:  0 1 2 3 4 5 6 7 8
Attention: 0 █ 0 0 0 0 0 0 0
The head at position 5 (the second \(A\)) strongly attends to position 1, the position immediately after the earlier \(A\), whose previous token matches the current token. It then copies that token (\(B\)) as its prediction.
This creates diagonal stripes across the attention matrix because:
- When processing position 5, attend to position 1 (offset -4)
- When processing position 6, attend to position 2 (offset -4)
- When processing position 7, attend to position 3 (offset -4)
The constant offset creates a diagonal.
19.4.2 Searching for Induction Heads
Researchers developed an induction score to automatically detect these heads:
- Create inputs with repeated sequences: “[random] [random] [random] [random]…”
- Measure whether head \(H\) at position \(i\) attends to position \(j\) where \(\text{token}[j-1] = \text{token}[i]\) (i.e., the position just after an earlier occurrence of the current token)
- High score → likely induction head (a minimal scoring sketch follows below)
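A minimal TransformerLens sketch of this scoring recipe, assuming gpt2-small and the standard repeated-random-token setup (the sequence length, token range, and top-k reporting here are arbitrary choices):
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

# Repeated random tokens: [BOS] r_1..r_T r_1..r_T
T = 50
rand = torch.randint(1000, 10000, (1, T))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=1)

_, cache = model.run_with_cache(tokens)

# Induction score for each head: average attention from positions in the second
# repeat to the position one *after* the earlier occurrence of the same token,
# i.e. the key position j with token[j-1] == token[i]. Here that is offset -(T-1).
scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer]            # (batch, head, query_pos, key_pos)
    diag = pattern[0, :, T + 1 :, 2 : T + 1].diagonal(dim1=-2, dim2=-1)
    scores[layer] = diag.mean(dim=-1)

# Heads with high scores are induction-head candidates
top = torch.topk(scores.flatten(), k=5)
for val, idx in zip(top.values.tolist(), top.indices.tolist()):
    layer, head = divmod(idx, model.cfg.n_heads)
    print(f"L{layer}H{head}: induction score {val:.2f}")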
Finding: In every transformer language model tested (GPT-2, GPT-3, BLOOM, LLaMA), induction heads emerge relatively early in the network (as early as the second layer in very small models, and typically in the early-to-middle layers of larger ones), comprising 5-15% of all attention heads. Crucially, the circuit requires at least two layers—a mathematical necessity proven by communication complexity arguments. Single-layer transformers would need dramatically larger models to solve induction tasks.
19.5 The Phase Transition
One of the most striking findings: induction heads don’t exist at initialization. They emerge suddenly during training.
19.5.1 The Training Dynamics
Anthropic researchers tracked training on a small model, measuring:
- Induction score: How strong the diagonal attention pattern is
- In-context learning performance: How well the model continues patterns
Both metrics show a sharp phase transition:
| Training steps | 0 | 5K | 10K | 15K | 20K |
|---|---|---|---|---|---|
| Induction score | 0.0 | 0.0 | 0.0 | 0.8 | 0.9 |
| ICL accuracy | 20% | 21% | 22% | 72% | 85% |
Around step 15K, both metrics spike simultaneously:
- Induction heads suddenly develop the diagonal attention pattern
- In-context learning capability suddenly improves
The transition isn’t gradual—it’s a discrete shift from “no induction heads” to “strong induction heads” over just a few thousand steps.
Induction heads aren’t present from the start—they’re discovered by gradient descent as a sharp improvement to the loss. Their sudden emergence suggests they’re a discrete algorithmic solution that the optimizer finds and implements quickly once conditions are right.
Recent research reveals that induction heads are just one phase in a sequence of algorithmic discoveries during training. Studies tracking circuit formation found models go through multiple distinct phases—developing token copying first, then pattern matching, then more sophisticated contextual algorithms. Each phase shows its own sharp transition. This suggests induction heads are a stepping stone, not a final destination: the model builds increasingly sophisticated circuits by composing simpler ones learned earlier.
19.5.2 What Causes the Transition?
The phase transition happens when:
1. Earlier layers have learned useful features (token identities, positional information)
2. The model has enough capacity to implement the two-layer circuit
3. Training data provides sufficient signal for the induction pattern
Before the transition: the model relies on simple heuristics (unigram frequencies, positional biases).
After the transition: the model uses genuine in-context learning via pattern matching.
19.6 Reverse-Engineering the Circuit
Let’s apply our interpretability toolkit to verify the induction head mechanism.
19.6.1 Step 1: Attribution
Which components contribute to induction predictions?
Run the model on: “the quick brown fox jumped over the quick → ___”
Measure logit attribution for “brown”:
| Component | Attribution to “brown” |
|---|---|
| Head 1.5 (previous token) | +0.8 |
| Head 2.3 (induction) | +2.4 |
| MLP layer 2 | +0.6 |
| Other heads | < 0.2 each |
Finding: Head 2.3 has high attribution. This is a candidate induction head.
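The numbers above are illustrative, but the measurement itself is mechanical. A sketch of per-head direct logit attribution with TransformerLens follows; it assumes gpt2-small, skips the final layer norm for simplicity, and uses an arbitrary reporting threshold:
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
model.set_use_attn_result(True)      # expose per-head outputs via hook_result

prompt = "the quick brown fox jumped over the quick"
target = model.to_single_token(" brown")
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

# Direct logit attribution: project each head's output at the final position onto
# the unembedding direction of the target token. Skipping the final layer norm is
# a simplification, so treat the numbers as approximate.
direction = model.W_U[:, target]                          # (d_model,)
for layer in range(model.cfg.n_layers):
    head_out = cache["result", layer][0, -1]              # (n_heads, d_model)
    attribution = head_out @ direction                    # (n_heads,)
    for head in range(model.cfg.n_heads):
        if attribution[head].abs() > 1.0:                 # arbitrary reporting threshold
            print(f"L{layer}H{head}: {attribution[head].item():+.2f}")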
19.6.2 Step 2: Attention Pattern Analysis
Visualize where head 2.3 attends:
When processing “the quick” (second occurrence), head 2.3 strongly attends to “the quick” (first occurrence) with offset matching the previous token pattern.
Confirmation: The attention pattern matches the induction head signature.
19.6.3 Step 3: Ablation
What happens if we remove head 2.3?
Ablate head 2.3 (mean ablation) and measure:
- Baseline: 87% accuracy on induction tasks
- Ablated: 23% accuracy
Finding: Performance collapses. Head 2.3 is necessary.
What about head 1.5 (previous token head)?
- Ablated head 1.5: 19% accuracy
Finding: Head 1.5 is also necessary. Both components of the circuit are critical.
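A sketch of a single-head mean ablation with TransformerLens hooks. The layer/head indices are placeholders (substitute whichever heads your induction-score scan flags), and the mean is taken over this one prompt’s positions rather than a reference dataset, which is a simplification:
import transformer_lens as tl
from transformer_lens import utils

model = tl.HookedTransformer.from_pretrained("gpt2-small")

LAYER, HEAD = 5, 5        # placeholders: use heads flagged by your induction-score scan

prompt = "the quick brown fox jumped over the quick"
tokens = model.to_tokens(prompt)
target = model.to_single_token(" brown")

# Mean of this head's output over the prompt's positions (a real mean ablation
# would average over a reference dataset; this is a simplification)
_, cache = model.run_with_cache(tokens)
mean_z = cache["z", LAYER][0, :, HEAD].mean(dim=0)        # (d_head,)

def mean_ablate_head(z, hook):
    # z: (batch, pos, head_index, d_head); overwrite one head at every position
    z[:, :, HEAD, :] = mean_z
    return z

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", LAYER), mean_ablate_head)],
)
print("clean logit for ' brown':  ", clean_logits[0, -1, target].item())
print("ablated logit for ' brown':", ablated_logits[0, -1, target].item())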
19.6.4 Step 4: Patching
Clean: “A B C D E F A B → C”
Corrupted: “A B C D E F X Y → ?”
Patch head 2.3’s output from clean to corrupted:
- Corrupted logit for “C”: -1.2
- Patched logit for “C”: +0.8
Recovery: 80%+. Patching head 2.3 largely restores the correct prediction.
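A sketch of the corresponding activation patch: cache the clean run, then overwrite one head’s output during the corrupted run. The prompts are the chapter’s, the layer/head indices are placeholders, and the resulting numbers will not match the illustrative figures above:
import transformer_lens as tl
from transformer_lens import utils

model = tl.HookedTransformer.from_pretrained("gpt2-small")

LAYER, HEAD = 5, 5        # placeholders for your candidate induction head

clean = model.to_tokens("A B C D E F A B")
corrupted = model.to_tokens("A B C D E F X Y")   # same length, pattern destroyed
target = model.to_single_token(" C")

# Cache the clean run, then overwrite the head's output during the corrupted run
_, clean_cache = model.run_with_cache(clean)

def patch_head_z(z, hook):
    # z: (batch, pos, head_index, d_head)
    z[:, :, HEAD, :] = clean_cache["z", LAYER][:, :, HEAD, :]
    return z

corrupted_logits = model(corrupted)
patched_logits = model.run_with_hooks(
    corrupted,
    fwd_hooks=[(utils.get_act_name("z", LAYER), patch_head_z)],
)
print("corrupted logit for ' C':", corrupted_logits[0, -1, target].item())
print("patched logit for ' C':  ", patched_logits[0, -1, target].item())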
Path patching: Does the head 1.5 → head 2.3 path matter?
Patch head 1.5’s contribution to head 2.3’s keys:
- Recovery: 65%
Finding: The connection from head 1.5 to head 2.3 is causally important. This confirms the two-layer circuit.
19.6.5 Step 5: Feature Analysis (with SAEs)
Train a sparse autoencoder on layer 2 activations.
Which features activate strongly during induction?
- Feature 1,248: “Repeated token detection”
- Feature 3,891: “Previous position offset”
- Feature 7,102: “Copy operation”
Ablating feature 1,248: Accuracy drops from 87% to 31%.
Finding: Specific features encode the components of the induction algorithm.
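What this step might look like in code, heavily hedged: the SAE below is a randomly initialized stand-in (so the sketch runs at all), and you would swap in a trained SAE from whatever library or checkpoint you use; the hook point and any feature indices are illustrative, not the ones named above:
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

class ToySAE(torch.nn.Module):
    """Randomly initialized stand-in for a *trained* SAE, so the sketch runs."""
    def __init__(self, d_model, n_features):
        super().__init__()
        self.W_enc = torch.nn.Parameter(torch.randn(d_model, n_features) / d_model ** 0.5)
        self.b_enc = torch.nn.Parameter(torch.zeros(n_features))
    def encode(self, acts):
        return torch.relu(acts @ self.W_enc + self.b_enc)

HOOK = "blocks.5.hook_resid_post"                    # illustrative hook point
sae = ToySAE(model.cfg.d_model, n_features=16384)    # replace with a real trained SAE

prompt = "the quick brown fox jumped over the quick"
tokens = model.to_tokens(prompt)
_, cache = model.run_with_cache(tokens)

acts = cache[HOOK][0, -1]             # residual stream at the final position
feature_acts = sae.encode(acts)       # (n_features,) sparse feature activations

# The most active features at the moment of the induction prediction
top = torch.topk(feature_acts, k=10)
for val, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"feature {idx}: activation {val:.2f}")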
19.7 The Complete Circuit Diagram
Putting it all together:
Position i: [token A]
↓
Layer 1: Previous Token Head (1.5)
Writes "token A followed by ___" to residual stream
↓
Position i+1: [token B]
Reads residual stream
Associates "A → B"
↓
...later...
↓
Position j: [token A] (repeated)
↓
Layer 2: Induction Head (2.3)
Query: "Where did A appear before?"
Key matching: Finds position i (using previous token info from head 1.5)
Value: Retrieves token at position i+1
↓
Output: Predict "B"
Verification:
- Attribution: ✓ (head 2.3 has highest attribution)
- Attention: ✓ (diagonal stripe pattern)
- Ablation: ✓ (removing either head breaks the circuit)
- Patching: ✓ (patching restores behavior)
- Features: ✓ (interpretable features encode the algorithm)
This is a fully reverse-engineered circuit.
19.8 Generalizations and Variations
Induction heads aren’t a single universal algorithm—they’re a family of related circuits.
19.8.1 Fuzzy Matching
Some induction heads don’t require exact token matches. They trigger on:
- Semantic similarity (“Paris” → “France” matches “Berlin” → “Germany”)
- Structural similarity (matching syntax, not content)
These “fuzzy induction heads” enable more sophisticated in-context learning.
19.8.2 Multi-Token Patterns
Some induction heads track longer sequences: [A][B][C] … [A][B] → [C]
These enable learning from richer context.
19.8.3 Position-Dependent Induction
Some heads combine induction with positional information:
- “This token appeared \(k\) positions ago”
- “Copy, but only if within the last \(n\) tokens”
These add constraints to the copying mechanism.
19.8.4 Translation Induction
In multilingual models:
- “French word X translates to English word Y”
- Later: “French word Z translates to…” → retrieve the translation pattern
This is induction across languages.
19.9 Why Induction Heads Matter
Induction heads are foundational:
19.9.1 1. They Enable In-Context Learning
The core capability that makes few-shot prompting work. Without induction heads, language models couldn’t generalize from examples in context.
19.9.2 2. They Emerge Reliably
Every large language model develops induction heads. This suggests they’re a convergent solution—gradient descent discovers them independently across architectures, scales, and training regimes.
19.9.3 3. They’re Understandable
Unlike most neural network behaviors, the induction circuit is: - Localizable (specific heads in specific layers) - Interpretable (the algorithm is clear) - Verifiable (all techniques confirm the mechanism)
This makes induction heads the best-understood capability in transformers.
If you want to see induction heads directly, TransformerLens makes it straightforward:
import transformer_lens as tl
model = tl.HookedTransformer.from_pretrained("gpt2-small")
# Create a sequence with repetition
prompt = "A B C D E A B C D" # Should predict "E"
# Get attention patterns
_, cache = model.run_with_cache(prompt)
# Look at a layer where induction heads tend to live (in gpt2-small they are
# usually reported around layers 5-7, not the very first layers)
# Induction heads show diagonal stripes in their attention patterns
attn = cache["pattern", 5]  # Shape: (batch, head, query_pos, key_pos)
Look for heads with strong attention to positions whose previous token matches the current token. That’s the induction signature.
19.9.4 4. They Demonstrate Composition
The circuit requires two layers working together—K-composition between previous token heads and induction heads. This is proof that transformers build complex algorithms by composing simple components.
19.10 Connections to Broader Capabilities
Induction heads aren’t isolated—they connect to many model capabilities.
19.10.1 Translation
Parallel corpus learning: “French: bonjour. English: hello. French: merci. English: → thank you”
Induction pattern: [source language token] → [target language token]
19.10.2 Code Completion
Pattern: Function signature → function body
19.10.3 Analogical Reasoning
“King is to Queen as Man is to → Woman”
This is induction across semantic spaces.
19.10.4 Instruction Following
“Q: What is 2+2? A: 4. Q: What is 3+3? A: → 6”
The Q-A structure is learned via induction.
Many “emergent capabilities” may be sophisticated applications of induction heads. The basic copying circuit, combined with semantic features, enables learning from examples across domains.
19.11 Limitations and Open Questions
Despite being well-understood, induction heads leave questions unanswered:
19.11.1 What’s the Capacity Limit?
How many patterns can induction heads track simultaneously? Early experiments suggest ~10-20, but this varies by model and context length.
19.11.2 How Do They Interact with Other Circuits?
Induction heads are part of a larger system. How do they interact with:
- Factual recall circuits
- Reasoning circuits
- Output formatting circuits
The interfaces aren’t fully mapped.
19.11.3 Why This Algorithm?
Gradient descent discovered induction heads, but are they optimal? Could there be better algorithms for in-context learning that transformers haven’t found?
19.11.4 Do They Scale?
Induction heads are clear in small models (GPT-2, 124M parameters). In large models (70B+ parameters), are the circuits still as clean? Early evidence suggests more redundancy and fuzzier boundaries.
The induction head circuit is remarkably well-understood—for GPT-2 Small. Here’s an honest calibration:
What we know well:
- The two-layer circuit (previous token + induction head) in 124M-1B parameter models
- The phase transition during training
- That every tested model develops some form of induction heads
What we know less well:
- Exact circuit details in 70B+ parameter production models
- How induction heads interact with other circuits in complex prompts
- Whether the clean two-head story holds at scale or becomes messier
What we don’t know:
- The capacity limits (how many patterns simultaneously?)
- Whether there are better algorithms the models haven’t found
- How much in-context learning is induction heads vs. other mechanisms
Numbers without error bars: The accuracy numbers in this chapter (87% baseline, 23% ablated) are illustrative, not from a single definitive study. Real numbers vary by model, prompt, and measurement method. When replicating, expect variance.
The induction head story is the best story we have about any transformer circuit. It’s also incomplete. Both facts are important.
19.12 Polya’s Perspective: Worked Example
This chapter applies Polya’s heuristic: study worked examples.
Before trying to reverse-engineer every capability, understand one capability completely. Induction heads are that worked example:
- Well-defined behavior
- Discoverable circuit
- Verifiable mechanism
- Applicable techniques
Once you’ve reverse-engineered one circuit completely, you have a template for reverse-engineering others. The process (attribution → attention analysis → ablation → patching → features → circuit diagram) transfers.
“Study solutions to related problems.” You can’t learn proof techniques by reading theory alone—you need worked examples. Induction heads are the worked example for mechanistic interpretability. Master this case, then apply the approach to other circuits.
19.13 Looking Ahead
We’ve now seen the full interpretability workflow in action, applied to a real capability.
But interpretability research is incomplete. Many fundamental questions remain open:
- How much of model behavior can we explain with circuits?
- What capabilities resist circuit-based explanation?
- How do we scale interpretability to 100B+ parameter models?
- Can we use interpretability to improve safety and alignment?
These questions are the subject of the next chapter: Open Problems in Mechanistic Interpretability.
After that, we’ll close with A Practice Regime—concrete advice for how to actually do interpretability research, from choosing problems to debugging circuits to publishing results.
19.14 Key Takeaways
┌────────────────────────────────────────────────────────────┐
│ INDUCTION HEADS: A Complete Case Study │
├────────────────────────────────────────────────────────────┤
│ │
│ WHAT THEY DO: Enable in-context learning (few-shot) │
│ Pattern: [A][B]...[A] → predict [B] │
│ │
│ THE CIRCUIT (2 layers): │
│ Layer 1: Previous Token Head │
│ → Records "B follows A" in residual stream │
│ Layer 2: Induction Head │
│ → Finds where A appeared, retrieves what │
│ followed, predicts it will repeat │
│ │
│ KEY FINDINGS: │
│ • Phase transition: emerges SUDDENLY during training │
│ • Found in ALL transformer LLMs tested │
│ • K-composition: Layer 1 output → Layer 2 keys │
│ │
│ VERIFICATION CHECKLIST: │
│ ✓ Attribution (high logit contribution) │
│ ✓ Attention pattern (diagonal stripe) │
│ ✓ Ablation (removing breaks the circuit) │
│ ✓ Patching (restoring recovers behavior) │
│ ✓ Features (interpretable SAE features) │
│ │
│ WHY IT MATTERS: │
│ Best-understood circuit in transformers │
│ Template for reverse-engineering other capabilities │
│ │
└────────────────────────────────────────────────────────────┘
19.15 Check Your Understanding
Question 1: Why does the induction circuit require two layers?
Answer: At any position, attention is computed based on the current residual stream state. To implement induction, the model needs to:
1. Look backward for previous occurrences of the current token
2. Know what the previous token was at each earlier position
But the residual stream at position \(i\) doesn’t natively contain “what was the previous token?” information. Layer 1 writes this information into the stream, enabling Layer 2 to use it. This is K-composition: Layer 1’s output modifies Layer 2’s keys, changing what Layer 2 attends to. Single-layer transformers would need exponentially larger models to solve induction tasks.
Question 2: What does an induction head’s attention pattern look like, and how do researchers detect induction heads automatically?
Answer: When you visualize an induction head’s attention, you see positions attending to earlier positions with a constant offset. For example:
- Position 5 attends to position 1 (offset -4)
- Position 6 attends to position 2 (offset -4)
- Position 7 attends to position 3 (offset -4)
This creates diagonal stripes across the attention matrix. The pattern emerges because the head is looking for “positions where the previous token matches my previous token”—and in repeated sequences, these matches occur at consistent offsets. Researchers use an induction score (measuring this diagonal pattern) to automatically detect induction heads.
Question 3: What is the induction-head phase transition, and what does its sharpness tell us?
Answer: Induction heads don’t exist at initialization—they emerge suddenly around a specific training step. Both the diagonal attention pattern and in-context learning performance spike simultaneously within just a few thousand steps, not gradually.
This tells us:
1. Induction heads are a discrete algorithmic solution, not a gradual improvement
2. Gradient descent “discovers” this algorithm once conditions are right (useful features in early layers, sufficient capacity)
3. Before the transition: model uses simple heuristics (unigram frequencies)
4. After: genuine in-context learning via pattern matching
The sharpness suggests induction heads are a qualitative leap in capability, not just quantitative improvement.
19.16 Further Reading
In-Context Learning and Induction Heads — Anthropic: The definitive paper on induction heads, including the phase transition discovery.
A Mathematical Framework for Transformer Circuits — Anthropic: The theoretical foundations for understanding composition in transformers.
Progress Measures for Grokking — arXiv:2301.05217: Analysis of the phase transition and what causes sudden capability emergence.
Induction Head Replication — Neel Nanda: Step-by-step guide to finding induction heads in any transformer.
The Quantization Model of Neural Scaling — arXiv:2303.13506: Theoretical framework explaining why capabilities emerge suddenly (phase transitions).
Transformer Circuits Thread — Anthropic: Collection of papers reverse-engineering transformer circuits, with induction heads as a central example.