22  The Running Example

One behavior, every technique

This page follows a single behavior through every technique in the book. As you learn each method, return here to see how it applies to the same example.

Tip: How to Use This Page

This is a companion to the main chapters, not a replacement. Read the technique chapters first, then return here to see how each technique illuminates the same phenomenon from a different angle.

22.1 The Behavior: Factual Recall

We’ll analyze how GPT-2 Small completes:

“The Eiffel Tower is located in” → “ Paris”

This is a good running example because:

  • It’s simple enough to fully analyze
  • It involves both attention (finding relevant context) and MLPs (storing knowledge)
  • It’s different from induction heads (pattern completion vs. factual recall)
  • It can be studied with every technique we’ll learn

22.2 Setup (Used Throughout)

import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

prompt = "The Eiffel Tower is located in"
answer = " Paris"

tokens = model.to_tokens(prompt)
answer_token = model.to_single_token(answer)

# Verify the model knows this
logits = model(tokens)
top_token = logits[0, -1].argmax()
print(f"Model predicts: '{model.tokenizer.decode(top_token)}'")
# Should print: Model predicts: ' Paris'

22.3 Arc I: Understanding the Substrate

22.3.1 Chapter 2: The Forward Pass

At the most basic level, the prediction happens through:

  1. Tokens are embedded
  2. 12 layers of attention + MLP process the embeddings
  3. The final residual stream is unembedded to logits
  4. “Paris” has the highest logit

# The forward pass is a chain of matrix multiplications and nonlinearities
logits, cache = model.run_with_cache(tokens)

print(f"Input shape: {tokens.shape}")  # [1, 7] - batch, sequence
print(f"Final residual: {cache['resid_post', 11].shape}")  # [1, 7, 768]
print(f"Logits shape: {logits.shape}")  # [1, 7, 50257]
print(f"Paris logit: {logits[0, -1, answer_token]:.2f}")

22.3.2 Chapter 3: The Residual Stream

The prediction “Paris” is, up to the final LayerNorm, the sum of all component contributions:

# Decompose the final logit
paris_dir = model.W_U[:, answer_token]  # Unembedding direction for Paris

contributions = {}

# Embedding
embed = cache["embed"][0, -1]
contributions["embed"] = (embed @ paris_dir).item()

# Positional embedding (so the pieces sum to the full residual stream)
pos_embed = cache["pos_embed"][0, -1]
contributions["pos_embed"] = (pos_embed @ paris_dir).item()

# Each layer
for layer in range(12):
    attn_out = cache["attn_out", layer][0, -1]
    mlp_out = cache["mlp_out", layer][0, -1]
    contributions[f"L{layer}_attn"] = (attn_out @ paris_dir).item()
    contributions[f"L{layer}_mlp"] = (mlp_out @ paris_dir).item()

# Sort by contribution
sorted_contrib = sorted(contributions.items(), key=lambda x: x[1], reverse=True)
print("Top contributors to 'Paris' logit:")
for name, value in sorted_contrib[:5]:
    print(f"  {name}: {value:+.2f}")

Typical finding: Mid-to-late MLPs contribute most to factual recall.

Tip: Try It

Change the prompt to “The Louvre is located in”. Which components now contribute most? Are they the same layers, or does the model use different circuits for different facts?

22.3.3 Chapter 4: Geometry

The “Paris” direction is just one direction in 768-dimensional space:

# What's the cosine similarity between Paris and France?
france_token = model.to_single_token(" France")
paris_dir = model.W_U[:, answer_token]
france_dir = model.W_U[:, france_token]

cos_sim = torch.cosine_similarity(paris_dir, france_dir, dim=0)
print(f"Cosine similarity Paris-France: {cos_sim:.3f}")
# These directions are related but not identical

22.4 Arc II: Core Theory

22.4.1 Chapter 5: Features

The residual stream contains features—but which ones are relevant here?

# The residual stream at the final position encodes information
# that will be used to predict "Paris"

final_resid = cache["resid_post", 11][0, -1]  # [768]

# This vector encodes features like:
# - "We're completing a sentence about location"
# - "The subject is a famous landmark"
# - "The landmark is the Eiffel Tower"
# - "The Eiffel Tower is in Paris"

# The model "knows" Paris because these features activate
# and together they point toward the Paris direction
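
A rough way to check that last claim empirically: scale the final residual with ln_final (what the unembedding actually sees) and project it onto a few city directions. This is a minimal sketch; the comparison cities are arbitrary choices.

# Project the scaled final residual onto a few city directions;
# " Paris" should score highest if the features point the right way
scaled = model.ln_final(final_resid[None, None, :])[0, 0]
for city in [" Paris", " London", " Rome", " Tokyo"]:
    tok = model.to_single_token(city)
    print(f"{city!r}: {(scaled @ model.W_U[:, tok]).item():.2f}")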

22.4.2 Chapter 6: Superposition

Many features are packed into 768 dimensions:

# The model needs to represent:
# - Thousands of landmarks
# - Thousands of cities
# - Relationships between them
# All in 768 dimensions

# This works because features are sparse:
# "Eiffel Tower" rarely co-occurs with "Big Ben" in the same position
# So they can share dimensions without interference
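
A quick way to see the geometry: token directions are nearly, but not exactly, orthogonal, with related tokens (two cities) overlapping more than unrelated ones. The token list below is an arbitrary choice.

import itertools

# Pairwise cosine similarities between embedding directions:
# small but nonzero overlaps are the signature of superposition
compare = [" Paris", " Rome", " banana", " quantum"]
embs = {t: model.W_E[model.to_single_token(t)] for t in compare}
for a, b in itertools.combinations(compare, 2):
    sim = torch.cosine_similarity(embs[a], embs[b], dim=0)
    print(f"{a} / {b}: {sim:.3f}")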

22.4.3 Chapter 8: Circuits

Factual recall involves a circuit:

  1. Attention heads attend to “Eiffel Tower” from the final position
  2. MLPs in middle layers retrieve the associated location
  3. Later layers refine the prediction

# Which attention heads look at "Eiffel Tower"?
# "Eiffel" may span several BPE tokens, and to_tokens prepends BOS,
# so check the tokenization before hardcoding a position
print(model.to_str_tokens(prompt))
eiffel_pos = 2  # first "Eiffel" subword (after BOS and "The"); adjust if needed

for layer in range(12):
    pattern = cache["pattern", layer][0, :, -1, :]  # [n_heads, seq_len]
    for head in range(12):
        attn_to_eiffel = pattern[head, eiffel_pos].item()
        if attn_to_eiffel > 0.1:
            print(f"L{layer}H{head} attends to 'Eiffel' with weight {attn_to_eiffel:.2f}")

22.5 Arc III: Techniques

22.5.1 Chapter 9: SAE Features

Extract interpretable features using sparse autoencoders:

from sae_lens import SAE

# Load SAE for layer 8 (where factual recall happens)
sae, cfg, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre"
)

# Get features active at the final position
resid = cache["resid_pre", 8][0, -1]  # [768]
feature_acts = sae.encode(resid.unsqueeze(0))  # [1, n_features]

# Top features
top_k = feature_acts[0].topk(10)
print("Top active SAE features:")
for idx, val in zip(top_k.indices, top_k.values):
    print(f"  Feature {idx.item()}: {val.item():.2f}")
    # Look these up on Neuronpedia to see what they represent!

What you might find: Features for “famous landmarks”, “European cities”, “France-related concepts”.

22.5.2 Chapter 10: Attribution

Measure each component’s contribution to predicting “Paris”:

def logit_attribution(cache, model, target_token: int) -> list[tuple[str, float]]:
    """Full attribution to target token."""
    target_dir = model.W_U[:, target_token]
    results = []

    for layer in range(model.cfg.n_layers):
        # Per-head attribution: compute each head's contribution from z and W_O
        z = cache["z", layer][0, -1]  # [n_heads, d_head]
        W_O = model.W_O[layer]  # [n_heads, d_head, d_model]
        for head in range(model.cfg.n_heads):
            head_out = z[head] @ W_O[head]  # [d_model]
            results.append((f"L{layer}H{head}", (head_out @ target_dir).item()))

        # MLP attribution
        mlp_out = cache["mlp_out", layer][0, -1]
        results.append((f"L{layer}_MLP", (mlp_out @ target_dir).item()))

    return sorted(results, key=lambda x: -x[1])

top_components = logit_attribution(cache, model, answer_token)[:10]
print("Top 10 components for 'Paris':")
for name, contrib in top_components:
    print(f"  {name}: {contrib:+.2f}")

Typical finding: MLP layers 7-9 contribute most.

22.5.3 Chapter 11: Patching

Test causality: Does patching from a different landmark change the prediction?

# Corrupted prompt: same structure, different landmark
corrupted_prompt = "The Colosseum is located in"
corrupted_tokens = model.to_tokens(corrupted_prompt)
_, corrupted_cache = model.run_with_cache(corrupted_tokens)

# Expected answer for corrupted: " Rome"
rome_token = model.to_single_token(" Rome")

def patch_layer(layer: int, component: str = "resid_post") -> tuple[float, float]:
    """Patch `component` at the final position from the corrupted run into the clean one."""
    corrupted_act = corrupted_cache[component, layer]

    def hook(act, hook):
        act[:, -1, :] = corrupted_act[:, -1, :]
        return act

    patched_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_{component}", hook)]
    )
    return patched_logits[0, -1, answer_token].item(), patched_logits[0, -1, rome_token].item()

print("Layer | Paris logit | Rome logit | Prediction")
print("-" * 50)
for layer in range(12):
    paris, rome = patch_layer(layer)
    pred = "Paris" if paris > rome else "Rome"
    print(f"  {layer:2d}  |   {paris:+.2f}    |   {rome:+.2f}   | {pred}")

What to look for: At which layer does patching flip the prediction from Paris to Rome?

Tip: Try It

What happens if you patch from “The Big Ben is located in” instead? (The article is unidiomatic, but it keeps the two prompts structurally parallel.) Does the prediction flip to “London”? Try patching at the position of the landmark name (“Eiffel” vs “Big”) rather than the final position, as in the sketch below.
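
A minimal sketch of patching at a chosen position, reusing the Colosseum cache from above (for the Big Ben version, rebuild corrupted_tokens and corrupted_cache first). It assumes the clean and corrupted prompts tokenize to the same length; check with model.to_str_tokens before trusting the indices.

def patch_position(layer: int, pos: int) -> float:
    """Patch the residual stream at one position from the corrupted run into the clean one."""
    corrupted_act = corrupted_cache["resid_post", layer]

    def hook(act, hook):
        act[:, pos, :] = corrupted_act[:, pos, :]
        return act

    patched_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", hook)]
    )
    return patched_logits[0, -1, answer_token].item()

# Example: patch the residual stream at the landmark position in layer 8
print(f"Patched Paris logit: {patch_position(8, eiffel_pos):.2f}")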

22.5.4 Chapter 12: Ablation

Test necessity: What happens if we ablate the key components?

def ablate_mlp(layer: int) -> float:
    """Zero-ablate an MLP layer."""
    def hook(act, hook):
        return torch.zeros_like(act)

    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_mlp_out", hook)]
    )
    return ablated_logits[0, -1, answer_token].item()

original_logit = logits[0, -1, answer_token].item()
print(f"Original Paris logit: {original_logit:.2f}")
print("\nAblating each MLP:")
for layer in range(12):
    ablated = ablate_mlp(layer)
    change = ablated - original_logit
    print(f"  Layer {layer}: {ablated:.2f} (change: {change:+.2f})")

What to look for: Which layers, when ablated, most reduce the Paris logit?

Tip: Try It

Try mean-ablation instead of zero-ablation: replace the MLP output with its average over a batch of random inputs. This can reveal whether the effect is due to the specific computation or just the magnitude of the output.
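
A minimal sketch of that idea, assuming a small hand-picked baseline batch (the prompts below are arbitrary placeholders):

def mean_ablate_mlp(layer: int, baseline_prompts: list[str]) -> float:
    """Replace hook_mlp_out with its mean over a baseline batch."""
    outs = []
    for p in baseline_prompts:
        _, c = model.run_with_cache(model.to_tokens(p))
        outs.append(c["mlp_out", layer].mean(dim=(0, 1)))  # [d_model]
    mean_out = torch.stack(outs).mean(dim=0)

    def hook(act, hook):
        # Broadcast the mean output over batch and position
        return mean_out.expand_as(act)

    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_mlp_out", hook)]
    )
    return ablated_logits[0, -1, answer_token].item()

baseline = ["The cat sat on the mat", "I bought three apples today"]
print(f"Mean-ablated L8 Paris logit: {mean_ablate_mlp(8, baseline):.2f}")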


22.6 Arc IV: Synthesis

22.6.1 Chapter 13: Comparison to Induction Heads

Factual recall and induction are different mechanisms:

Aspect     Induction Heads                        Factual Recall
What       Pattern completion [A][B]…[A] → [B]    Knowledge retrieval
Where      Attention heads (layers 1-2)           MLPs (layers 7-9)
How        Attention pattern matching             MLP key-value lookup
Requires   Pattern in context                     Knowledge in weights
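
The contrast in the table is easy to demonstrate: induction only works when the pattern is in the context window, so a repeated random sequence becomes much more predictable on its second pass, while factual recall needs no in-context pattern at all. A rough sketch (the token range is an arbitrary choice):

# The second copy of a random sequence is far more predictable than
# the first, because induction heads can copy from the context
rand = torch.randint(1000, 10000, (1, 10))
rep_tokens = torch.cat([rand, rand], dim=-1)  # [1, 20]

log_probs = model(rep_tokens).log_softmax(dim=-1)
# Log prob assigned to each actual next token
correct = log_probs[0, :-1].gather(-1, rep_tokens[0, 1:, None])[:, 0]
print(f"First copy mean log prob:  {correct[:10].mean():.2f}")
print(f"Second copy mean log prob: {correct[10:].mean():.2f}")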

22.6.2 The Complete Picture

For “The Eiffel Tower is located in” → “ Paris”:

  1. Embedding: Tokens become vectors
  2. Early attention: Heads attend to “Eiffel Tower”, building context
  3. Mid MLPs (layers 7-9): Retrieve “Paris” from learned associations
  4. Late layers: Refine and commit to prediction
  5. Unembedding: “Paris” direction has highest logit

22.7 Summary: One Behavior, All Techniques

Technique         What We Learn About Factual Recall
Forward pass      12 layers transform “Eiffel Tower” into “Paris”
Residual stream   MLPs contribute most to final prediction
Geometry          “Paris” is a direction in vocabulary space
Features          Relevant concepts activate at final position
SAE               Can identify specific features like “France-related”
Attribution       MLP 8-9 have highest logit contribution
Patching          Swapping landmark info changes prediction
Ablation          Mid-layer MLPs are necessary for recall

Each technique reveals a different facet of the same phenomenon. Together, they build a complete mechanistic understanding.


22.8 Try It Yourself

  1. Different fact: Try “The capital of Japan is” → “Tokyo”
  2. Harder fact: Try “The 16th president of the US was” → “Lincoln”
  3. Compare: How does the circuit differ for well-known vs. obscure facts?
  4. Break it: Find a fact the model gets wrong and analyze why (the helper sketch below makes these quick to run)
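
A small helper for these exercises, a sketch that wraps the setup code from the top of the page (answers must be single tokens, or to_single_token will complain):

def check_fact(prompt: str, answer: str) -> None:
    """Print the model's top prediction and the logit of the expected answer."""
    fact_logits = model(model.to_tokens(prompt))[0, -1]
    top = fact_logits.argmax().item()
    ans_tok = model.to_single_token(answer)
    print(f"{prompt!r} -> {model.tokenizer.decode(top)!r} "
          f"(answer logit: {fact_logits[ans_tok]:.2f})")

check_fact("The capital of Japan is", " Tokyo")
check_fact("The 16th president of the US was", " Lincoln")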