# Chapter 13: Induction Heads - Hands-On Notebook

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ttsugriy/mechinterp-first-principles/blob/main/notebooks/13-induction-heads.ipynb)

This notebook accompanies [Chapter 13: Induction Heads](https://ttsugriy.github.io/mechinterp-first-principles/chapters/13-induction-heads.html).

**What you'll do:**
1. Find induction heads in GPT-2 Small
2. Verify they perform the induction pattern
3. Ablate them and measure the effect
4. Understand the two-layer circuit

**Time:** ~45 minutes

## Setup

In [None]:
# Step 1: Install libraries (run this cell, then restart runtime!)
!pip install -q transformer-lens circuitsvis einops

print("Installation complete!")
print("Now restart the runtime: Runtime -> Restart runtime")
print("Then skip this cell and run the next one.")

In [None]:
# Step 2: Import libraries (run this after restarting runtime)
import torch
import numpy as np
import matplotlib.pyplot as plt
from transformer_lens import HookedTransformer, utils
import circuitsvis as cv
from einops import rearrange

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

torch.manual_seed(42)

In [None]:
model = HookedTransformer.from_pretrained("gpt2-small", device=device)
print(f"Loaded GPT-2 Small: {model.cfg.n_layers} layers, {model.cfg.n_heads} heads per layer")

## 1. The Induction Task

Induction heads detect repeated patterns: if "A B" appeared before, and we see "A" again, predict "B".
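
The rule above can be written as a few lines of plain Python. This is only a sketch of the behavior the head approximates, not of how the model computes it; `induction_predict` is a hypothetical helper name and the token list is hand-written rather than produced by the tokenizer.

```python
def induction_predict(tokens):
    """Return the token that followed the most recent earlier
    occurrence of the last token, or None if it never appeared before."""
    current = tokens[-1]
    # Scan earlier positions from right to left for a match
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_predict(["A", "B", "C", "A"]))  # B
print(induction_predict(["A", "B", "C"]))       # None
```

A real induction head implements a soft version of this lookup with attention weights, so it shifts probability toward the copied token instead of returning it outright.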

In [None]:
# Create a natural-language sequence with a repeated token
# Induction format: ... A B ... A -> should predict B
# Here A is "Paris" and B is the token that followed it the first time (" is")

def create_induction_sequence(token_a="Paris", token_b="France"):
    """Create a sentence in which token_a appears twice."""
    text = f"The city of {token_a} is in {token_b}. I visited {token_a}"
    return text

text = create_induction_sequence()
print(f"Test sequence: '{text}'")
print("Strict induction would copy ' is', the token that followed the first 'Paris'.")
print("In natural text, other continuations may still rank higher.")

In [None]:
# Run the model
tokens = model.to_tokens(text)
logits, cache = model.run_with_cache(tokens)

# What does it predict?
final_logits = logits[0, -1]
top_preds = torch.topk(final_logits, k=5)

print("Top predictions after 'Paris':")
for idx, logit in zip(top_preds.indices, top_preds.values):
    print(f"  '{model.to_string(idx)}' (logit={logit:.2f})")

## 2. Find Induction Heads

Induction heads have a distinctive attention pattern: they attend to the token that *followed* the previous occurrence of the current token.
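
To make the target positions concrete, here is a minimal sketch that computes, for each query position, the key position an induction head would be expected to favor. `induction_targets` is a hypothetical helper, and the token list is hand-written under the assumption that the tokenizer prepends a BOS token at position 0.

```python
def induction_targets(tokens):
    """Map each query position to the position right after the most
    recent earlier occurrence of the same token."""
    last_seen = {}   # token -> most recent position where it appeared
    targets = {}
    for pos, tok in enumerate(tokens):
        if tok in last_seen:
            targets[pos] = last_seen[tok] + 1
        last_seen[tok] = pos
    return targets


# Hand-written stand-in for the tokenizer output (assumption: BOS at index 0)
tokens = ["<BOS>", " A", " B", " C", " D", " E", " A", " B", " C", " D"]
print(induction_targets(tokens))  # {6: 2, 7: 3, 8: 4, 9: 5}
```

Because the repeat starts five positions after the original, every target sits at `query_pos - 4`, which is the offset the scoring code in this section relies on.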

In [None]:
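# Create a simple repeated sequence for clearer visualization
# The leading space makes both copies of each letter tokenize identically
# (GPT-2 treats "A" and " A" as different tokens)
repeated_text = " A B C D E A B C D"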
repeated_tokens = model.to_tokens(repeated_text)
_, repeated_cache = model.run_with_cache(repeated_tokens)

token_strs = [model.to_string(t) for t in repeated_tokens[0]]
print(f"Tokens: {token_strs}")
print("\nIf induction works: position 6 (second 'A') should attend to position 2 (first 'B')")
print("Because 'B' followed 'A' at positions 1-2 (position 0 is the BOS token)")

In [None]:
def measure_induction_score(cache, seq_len):
    """
    Measure how much each head attends to the "induction position".
    For the tokens [BOS] A B C D E A B C D (BOS is position 0):
    - Position 6 (second A) should attend to position 2 (first B)
    - Position 7 (second B) should attend to position 3 (first C)
    etc.
    """
    scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
    
    for layer in range(model.cfg.n_layers):
        attn = cache["pattern", layer][0]  # (heads, seq, seq)
        
        for head in range(model.cfg.n_heads):
            # For the repeated part, check if it attends to the induction position
            # Position 6 should attend to position 2: the first A is at 6 - 5 = 1,
            # and the token that followed it is at 1 + 1 = 2
            induction_attn = 0
            count = 0
            
            # Positions 6, 7, 8, 9 are the repeated A, B, C, D
            for query_pos in range(6, min(10, seq_len)):
                # The induction target is query_pos - 5 + 1 = query_pos - 4
                key_pos = query_pos - 4
                if key_pos > 0 and key_pos < query_pos:
                    induction_attn += attn[head, query_pos, key_pos].item()
                    count += 1
            
            if count > 0:
                scores[layer, head] = induction_attn / count
    
    return scores

induction_scores = measure_induction_score(repeated_cache, len(token_strs))

# Find top induction heads
flat_scores = induction_scores.flatten()
top_indices = torch.topk(flat_scores, k=10).indices

print("Top 10 heads by induction score:")
for idx in top_indices:
    layer = idx.item() // model.cfg.n_heads
    head = idx.item() % model.cfg.n_heads
    score = induction_scores[layer, head].item()
    print(f"  Layer {layer}, Head {head}: {score:.3f}")

In [None]:
# Visualize the induction scores as a heatmap
plt.figure(figsize=(12, 6))
plt.imshow(induction_scores.numpy(), cmap='Reds', aspect='auto')
plt.colorbar(label='Induction Score')
plt.xlabel('Head')
plt.ylabel('Layer')
plt.title('Induction Scores by Head\n(Higher = more induction-like behavior)')
plt.xticks(range(model.cfg.n_heads))
plt.yticks(range(model.cfg.n_layers))
plt.tight_layout()
plt.show()

## 3. Visualize an Induction Head

Let's look at the attention pattern of a high-scoring induction head.

In [None]:
# Get the best induction head
best_idx = torch.argmax(induction_scores)
best_layer = best_idx.item() // model.cfg.n_heads
best_head = best_idx.item() % model.cfg.n_heads

print(f"Best induction head: Layer {best_layer}, Head {best_head}")
print(f"Score: {induction_scores[best_layer, best_head]:.3f}")

In [None]:
# Visualize its attention pattern
attn_pattern = repeated_cache["pattern", best_layer][0, best_head].cpu().numpy()

plt.figure(figsize=(10, 8))
plt.imshow(attn_pattern, cmap='Blues')
plt.colorbar(label='Attention Weight')
plt.xticks(range(len(token_strs)), token_strs, rotation=45, ha='right')
plt.yticks(range(len(token_strs)), token_strs)
plt.xlabel('Key (attending to)')
plt.ylabel('Query (attending from)')
plt.title(f'Induction Head Attention Pattern\nLayer {best_layer}, Head {best_head}')

# Draw boxes around induction positions
for i, (q, k) in enumerate([(6, 2), (7, 3), (8, 4), (9, 5)]):
    if q < len(token_strs) and k < len(token_strs):
        plt.plot(k, q, 'ro', markersize=15, fillstyle='none', markeredgewidth=2)

plt.tight_layout()
plt.show()

print("Red circles mark induction positions:")
print("Position 6 (A) → Position 2 (B): attend to what followed previous A")

## 4. Ablate the Induction Head

If we remove the induction head, the model should perform worse on induction tasks.
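
The logic of the ablation test can be sketched with a toy model in which the final logit of each candidate token is the sum of per-head contributions. This is a deliberate simplification (it ignores layer norm and downstream effects), and every name and number below is made up for illustration; none of them is a GPT-2 measurement.

```python
# Hypothetical per-head contributions to the final logits (illustrative values)
head_contributions = {
    "induction_head": {" world": 4.0, " the": 0.5, " a": 0.2},
    "other_head": {" world": 1.0, " the": 2.0, " a": 1.5},
}


def total_logits(contributions, ablated=()):
    """Sum contributions over heads, skipping (zero-ablating) the listed heads."""
    totals = {}
    for name, contrib in contributions.items():
        if name in ablated:
            continue  # zero ablation: this head adds nothing
        for tok, value in contrib.items():
            totals[tok] = totals.get(tok, 0.0) + value
    return totals


def rank_of(logits, target):
    """0-based rank: how many tokens have a strictly higher logit than target."""
    return sum(1 for v in logits.values() if v > logits[target])


clean = total_logits(head_contributions)
ablated = total_logits(head_contributions, ablated={"induction_head"})
print(rank_of(clean, " world"), rank_of(ablated, " world"))  # 0 2
print(clean[" world"] - ablated[" world"])                   # 4.0
```

In the real model the same two quantities are tracked: the drop in the target logit and the change in the target's rank once the head's output is zeroed.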

In [None]:
def get_induction_loss(model, text, target_token):
    """Return the target token's logit and 0-based rank at the last position.

    Despite the name, this is not a loss: a higher logit and a lower rank
    mean the model predicts the target better.
    """
    tokens = model.to_tokens(text)
    logits = model(tokens)
    
    # Get logit for target token at last position
    target_id = model.to_tokens(target_token)[0, 1]  # Skip BOS
    target_logit = logits[0, -1, target_id].item()
    
    # Rank = number of tokens with a higher logit (0 means top prediction)
    rank = (logits[0, -1] > target_logit).sum().item()
    
    return target_logit, rank

# Test on our induction sequence
test_text = "The word hello is followed by world. The word hello is followed by"
target = " world"

clean_logit, clean_rank = get_induction_loss(model, test_text, target)
print(f"Clean model:")
print(f"  Target '{target}' logit: {clean_logit:.2f}")
print(f"  Target rank: {clean_rank}")

In [None]:
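def ablate_head(layer, head):
    """Create a hook that zero-ablates one head's output (hook_z)."""
    def hook_fn(activation, hook):
        # activation has shape (batch, pos, head_index, d_head).
        # `layer` is unused here: the layer is selected by the hook name.
        activation[:, :, head, :] = 0
        return activation
    return hook_fn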

# Ablate the best induction head
hook_name = f"blocks.{best_layer}.attn.hook_z"

with model.hooks(fwd_hooks=[(hook_name, ablate_head(best_layer, best_head))]):
    ablated_logit, ablated_rank = get_induction_loss(model, test_text, target)

print(f"\nWith Layer {best_layer} Head {best_head} ablated:")
print(f"  Target '{target}' logit: {ablated_logit:.2f} (was {clean_logit:.2f})")
print(f"  Target rank: {ablated_rank} (was {clean_rank})")
print(f"\n  Logit drop: {clean_logit - ablated_logit:.2f}")

In [None]:
# Ablate all top induction heads
top_heads = [(idx.item() // model.cfg.n_heads, idx.item() % model.cfg.n_heads) 
             for idx in torch.topk(flat_scores, k=5).indices]

hooks = [(f"blocks.{l}.attn.hook_z", ablate_head(l, h)) for l, h in top_heads]

with model.hooks(fwd_hooks=hooks):
    multi_ablated_logit, multi_ablated_rank = get_induction_loss(model, test_text, target)

print(f"With top 5 induction heads ablated:")
print(f"  Target logit: {multi_ablated_logit:.2f} (was {clean_logit:.2f})")
print(f"  Target rank: {multi_ablated_rank} (was {clean_rank})")
print(f"\n  Total logit drop: {clean_logit - multi_ablated_logit:.2f}")

## 5. The Two-Layer Circuit

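Induction heads work via composition with "previous token" heads in earlier layers. A previous-token head copies information about token *i-1* into position *i*; the induction head then uses that information in its keys to find the position right after an earlier copy of the current token.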
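
The two steps can be imitated in plain Python. This is a symbolic sketch under the usual description of the circuit (the induction head's keys read what the previous-token head wrote), not the model's actual computation; both function names are hypothetical.

```python
def previous_token_head(tokens):
    """Step 1: each position stores the token that came just before it."""
    return [None] + tokens[:-1]


def induction_head(tokens, prev_info):
    """Step 2: from the last position, find the most recent earlier key whose
    stored previous token equals the current token, and copy the token there."""
    query = tokens[-1]
    last = len(tokens) - 1
    for key_pos in range(last - 1, 0, -1):
        if prev_info[key_pos] == query:
            return key_pos, tokens[key_pos]
    return None, None


tokens = ["A", "B", "C", "A"]
prev_info = previous_token_head(tokens)      # [None, 'A', 'B', 'C']
print(induction_head(tokens, prev_info))     # (1, 'B')
```

Without step 1, position 1 would carry no record that "A" preceded it, so the second "A" would have nothing to match against. That dependence is why the circuit needs at least two layers.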

In [None]:
def measure_prev_token_score(cache, seq_len):
    """
    Measure how much each head attends to the previous token.
    Previous-token heads put most of their attention just below the main
    diagonal of the pattern, i.e. on key position = query position - 1.
    """
    scores = torch.zeros(model.cfg.n_layers, model.cfg.n_heads)
    
    for layer in range(model.cfg.n_layers):
        attn = cache["pattern", layer][0]  # (heads, seq, seq)
        
        for head in range(model.cfg.n_heads):
            # Sum attention to previous token (diagonal - 1)
            prev_attn = 0
            for pos in range(1, seq_len):
                prev_attn += attn[head, pos, pos-1].item()
            scores[layer, head] = prev_attn / (seq_len - 1)
    
    return scores

prev_token_scores = measure_prev_token_score(repeated_cache, len(token_strs))

# Find top previous token heads
flat_prev_scores = prev_token_scores.flatten()
top_prev_indices = torch.topk(flat_prev_scores, k=5).indices

print("Top 5 'previous token' heads:")
for idx in top_prev_indices:
    layer = idx.item() // model.cfg.n_heads
    head = idx.item() % model.cfg.n_heads
    score = prev_token_scores[layer, head].item()
    print(f"  Layer {layer}, Head {head}: {score:.3f}")

In [None]:
# Compare the two score maps side by side: in the induction circuit,
# previous-token heads must sit in earlier layers than the induction heads
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

im1 = axes[0].imshow(prev_token_scores.numpy(), cmap='Blues', aspect='auto')
axes[0].set_xlabel('Head')
axes[0].set_ylabel('Layer')
axes[0].set_title('Previous Token Scores')
plt.colorbar(im1, ax=axes[0])

im2 = axes[1].imshow(induction_scores.numpy(), cmap='Reds', aspect='auto')
axes[1].set_xlabel('Head')
axes[1].set_ylabel('Layer')
axes[1].set_title('Induction Scores')
plt.colorbar(im2, ax=axes[1])

plt.tight_layout()
plt.show()

print("Compare where the bright cells fall in each map.")
print("A previous-token head can only feed an induction head in a later layer,")
print("so look for strong previous-token heads below the induction heads.")
print("This is the two-layer circuit: early heads enable later heads!")

## Exercises

### Exercise 1: Test on different patterns
Try longer repeated sequences. Does the induction pattern still work?

### Exercise 2: Ablate previous token heads
What happens if you ablate the previous token heads instead of the induction heads?

### Exercise 3: Find the composition
Can you measure how the induction head queries are influenced by the previous token head outputs?

In [None]:
# Exercise 1: Your code here
longer_text = "X Y Z W X Y Z W X Y Z W X Y"
# Does induction still work?

## Summary

You've now:
1. Found induction heads in GPT-2 using attention pattern analysis
2. Visualized the distinctive "stripe" pattern of induction attention
3. Tested causality through ablation
4. Identified the two-layer circuit: previous token heads → induction heads

This follows the core mechanistic interpretability workflow:
**observe → hypothesize → intervene → verify**

**Next:** [Chapter 14: Open Problems](https://ttsugriy.github.io/mechinterp-first-principles/chapters/14-open-problems.html)