22 The Running Example
One behavior, every technique
This page follows a single behavior through every technique in the book. It is a companion to the main chapters, not a replacement: read each technique chapter first, then return here to see how it illuminates the same phenomenon from a different angle.
22.1 The Behavior: Factual Recall
We’ll analyze how GPT-2 Small completes:
“The Eiffel Tower is located in” → “ Paris”
This is a good running example because:
- It’s simple enough to fully analyze
- It involves both attention (finding relevant context) and MLPs (storing knowledge)
- It’s different from induction heads (pattern completion vs. factual recall)
- It can be studied with every technique we’ll learn
22.2 Setup (Used Throughout)
import torch
import transformer_lens as tl
model = tl.HookedTransformer.from_pretrained("gpt2-small")
prompt = "The Eiffel Tower is located in"
answer = " Paris"
tokens = model.to_tokens(prompt)
answer_token = model.to_single_token(answer)
# Verify the model knows this
logits = model(tokens)
top_token = logits[0, -1].argmax()
print(f"Model predicts: '{model.tokenizer.decode(top_token)}'")
# Should print: Model predicts: ' Paris'

22.3 Arc I: Understanding the Substrate
22.3.1 Chapter 2: The Forward Pass
At the most basic level, the prediction happens through:
1. Tokens are embedded
2. 12 layers of attention + MLP process the embeddings
3. The final residual stream is unembedded to logits
4. “Paris” has the highest logit
# The forward pass is just matrix multiplications
logits, cache = model.run_with_cache(tokens)
print(f"Input shape: {tokens.shape}") # [1, 7] - batch, sequence
print(f"Final residual: {cache['resid_post', 11].shape}") # [1, 7, 768]
print(f"Logits shape: {logits.shape}") # [1, 7, 50257]
print(f"Paris logit: {logits[0, -1, answer_token]:.2f}")22.3.2 Chapter 3: The Residual Stream
The prediction “Paris” is the sum of all component contributions:
# Decompose the final logit
paris_dir = model.W_U[:, answer_token] # Unembedding direction for Paris
contributions = {}
# Embedding
embed = cache["embed"][0, -1]
contributions["embed"] = (embed @ paris_dir).item()
# Each layer
for layer in range(12):
    attn_out = cache["attn_out", layer][0, -1]
    mlp_out = cache["mlp_out", layer][0, -1]
    contributions[f"L{layer}_attn"] = (attn_out @ paris_dir).item()
    contributions[f"L{layer}_mlp"] = (mlp_out @ paris_dir).item()
# Sort by contribution (note: this decomposition ignores the final
# LayerNorm rescaling, so treat the values as approximate)
sorted_contrib = sorted(contributions.items(), key=lambda x: x[1], reverse=True)
print("Top contributors to 'Paris' logit:")
for name, value in sorted_contrib[:5]:
    print(f"  {name}: {value:+.2f}")

Typical finding: Mid-to-late MLPs contribute most to factual recall.
Change the prompt to “The Louvre is located in”. Which components now contribute most? Are they the same layers, or does the model use different circuits for different facts?
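One way to start: wrap the decomposition above in a reusable helper (decompose_logit is our own name for this sketch, not a TransformerLens function):

def decompose_logit(prompt: str, answer: str) -> dict[str, float]:
    """Per-component direct contributions to the answer logit (ignores final LayerNorm)."""
    tokens = model.to_tokens(prompt)
    answer_token = model.to_single_token(answer)
    _, cache = model.run_with_cache(tokens)
    target_dir = model.W_U[:, answer_token]
    contribs = {"embed": (cache["embed"][0, -1] @ target_dir).item()}
    for layer in range(model.cfg.n_layers):
        contribs[f"L{layer}_attn"] = (cache["attn_out", layer][0, -1] @ target_dir).item()
        contribs[f"L{layer}_mlp"] = (cache["mlp_out", layer][0, -1] @ target_dir).item()
    return contribs

# The Louvre is also in Paris, so the target token stays " Paris"
louvre = decompose_logit("The Louvre is located in", " Paris")
print(sorted(louvre.items(), key=lambda x: -x[1])[:5])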
22.3.3 Chapter 4: Geometry
The “Paris” direction is just one direction in 768-dimensional space:
# What's the cosine similarity between Paris and France?
france_token = model.to_single_token(" France")
paris_dir = model.W_U[:, answer_token]
france_dir = model.W_U[:, france_token]
cos_sim = torch.cosine_similarity(paris_dir, france_dir, dim=0)
print(f"Cosine similarity Paris-France: {cos_sim:.3f}")
The Paris and France directions are related but not identical.

22.4 Arc II: Core Theory
22.4.1 Chapter 5: Features
The residual stream contains features—but which ones are relevant here?
# The residual stream at the final position encodes information
# that will be used to predict "Paris"
final_resid = cache["resid_post", 11][0, -1] # [768]
# This vector encodes features like:
# - "We're completing a sentence about location"
# - "The subject is a famous landmark"
# - "The landmark is the Eiffel Tower"
# - "The Eiffel Tower is in Paris"
The model “knows” Paris because these features activate and together they point toward the Paris direction.

22.4.2 Chapter 6: Superposition
Many features are packed into 768 dimensions:
# The model needs to represent:
# - Thousands of landmarks
# - Thousands of cities
# - Relationships between them
# All in 768 dimensions
# This works because features are sparse:
# "Eiffel Tower" rarely co-occurs with "Big Ben" in the same position
So they can share dimensions without interference.

22.4.3 Chapter 8: Circuits
Factual recall involves a circuit:
1. Attention heads attend to “Eiffel Tower” from the final position
2. MLPs in middle layers retrieve the associated location
3. Later layers refine the prediction
# Which attention heads look at "Eiffel Tower"?
# model.to_tokens prepends a BOS token, so check where the landmark sits first
print(model.to_str_tokens(prompt))
eiffel_pos = 2  # position of the first "Eiffel" subtoken - adjust to match the printout
for layer in range(12):
    pattern = cache["pattern", layer][0, :, -1, :]  # [n_heads, seq_len]
    for head in range(12):
        attn_to_eiffel = pattern[head, eiffel_pos].item()
        if attn_to_eiffel > 0.1:
            print(f"L{layer}H{head} attends to 'Eiffel' with weight {attn_to_eiffel:.2f}")

22.5 Arc III: Techniques
22.5.1 Chapter 9: SAE Features
Extract interpretable features using sparse autoencoders:
from sae_lens import SAE
# Load SAE for layer 8 (where factual recall happens)
sae, cfg, _ = SAE.from_pretrained(
release="gpt2-small-res-jb",
sae_id="blocks.8.hook_resid_pre"
)
# Get features active at the final position
resid = cache["resid_pre", 8][0, -1] # [768]
feature_acts = sae.encode(resid.unsqueeze(0)) # [1, n_features]
# Top features
top_k = feature_acts[0].topk(10)
print("Top active SAE features:")
for idx, val in zip(top_k.indices, top_k.values):
    print(f"  Feature {idx.item()}: {val.item():.2f}")
# Look these up on Neuronpedia to see what they represent!

What you might find: Features for “famous landmarks”, “European cities”, “France-related concepts”.
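A useful follow-up: project a feature's decoder direction through the unembedding to see which tokens it promotes. This sketch assumes the sae_lens SAE exposes decoder weights as W_dec with shape [n_features, d_model]:

feature_idx = top_k.indices[0].item()  # or whichever feature looked interesting
decoder_dir = sae.W_dec[feature_idx]  # [d_model]
token_effects = decoder_dir @ model.W_U  # [d_vocab]
top_tokens = token_effects.topk(5)
print(f"Tokens promoted by feature {feature_idx}:")
for tok_id, score in zip(top_tokens.indices, top_tokens.values):
    print(f"  {model.tokenizer.decode(tok_id)!r}: {score:.2f}")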
22.5.2 Chapter 10: Attribution
Measure each component’s contribution to predicting “Paris”:
def logit_attribution(cache, model, target_token: int) -> list[tuple[str, float]]:
    """Full attribution to target token."""
    target_dir = model.W_U[:, target_token]
    results = []
    for layer in range(model.cfg.n_layers):
        # Per-head attribution: compute each head's contribution from z and W_O
        z = cache["z", layer][0, -1]  # [n_heads, d_head]
        W_O = model.W_O[layer]  # [n_heads, d_head, d_model]
        for head in range(model.cfg.n_heads):
            head_out = z[head] @ W_O[head]  # [d_model]
            results.append((f"L{layer}H{head}", (head_out @ target_dir).item()))
        # MLP attribution
        mlp_out = cache["mlp_out", layer][0, -1]
        results.append((f"L{layer}_MLP", (mlp_out @ target_dir).item()))
    return sorted(results, key=lambda x: -x[1])
top_components = logit_attribution(cache, model, answer_token)[:10]
print("Top 10 components for 'Paris':")
for name, contrib in top_components:
    print(f"  {name}: {contrib:+.2f}")

Typical finding: MLP layers 7-9 contribute most.
22.5.3 Chapter 11: Patching
Test causality: Does patching from a different landmark change the prediction?
# Corrupted prompt: same structure, different landmark
corrupted_prompt = "The Colosseum is located in"
corrupted_tokens = model.to_tokens(corrupted_prompt)
_, corrupted_cache = model.run_with_cache(corrupted_tokens)
# Expected answer for corrupted: " Rome"
rome_token = model.to_single_token(" Rome")
def patch_layer(layer: int, component: str = "resid_post") -> tuple[float, float]:
    """Patch the final position of one layer from corrupted into clean."""
    corrupted_act = corrupted_cache[component, layer]
    def hook(act, hook):
        act[:, -1, :] = corrupted_act[:, -1, :]
        return act
    patched_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_{component}", hook)]
    )
    return patched_logits[0, -1, answer_token].item(), patched_logits[0, -1, rome_token].item()
print("Layer | Paris logit | Rome logit | Prediction")
print("-" * 50)
for layer in range(12):
    paris, rome = patch_layer(layer)
    pred = "Paris" if paris > rome else "Rome"
    print(f" {layer:2d} | {paris:+.2f} | {rome:+.2f} | {pred}")

What to look for: At which layer does patching flip the prediction from Paris to Rome?
What happens if you patch from “Big Ben is located in” instead? Does the prediction flip to “London”? Try patching at the position of the landmark name (“Eiffel” vs “Big”) rather than the final position.
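A sketch for the position-patching variant. One caveat: position-wise patching only lines up if both prompts tokenize to the same length, so print model.to_str_tokens for each prompt and pick a pair that matches.

def patch_position(pos: int, layer: int) -> float:
    """Patch one position of resid_post from corrupted into clean."""
    corrupted_act = corrupted_cache["resid_post", layer]
    def hook(act, hook):
        act[:, pos, :] = corrupted_act[:, pos, :]
        return act
    patched_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_post", hook)]
    )
    return patched_logits[0, -1, answer_token].item()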
22.5.4 Chapter 12: Ablation
Test necessity: What happens if we ablate the key components?
def ablate_mlp(layer: int) -> float:
    """Zero-ablate an MLP layer."""
    def hook(act, hook):
        return torch.zeros_like(act)
    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_mlp_out", hook)]
    )
    return ablated_logits[0, -1, answer_token].item()
original_logit = logits[0, -1, answer_token].item()
print(f"Original Paris logit: {original_logit:.2f}")
print("\nAblating each MLP:")
for layer in range(12):
    ablated = ablate_mlp(layer)
    change = ablated - original_logit
    print(f"  Layer {layer}: {ablated:.2f} (change: {change:+.2f})")

What to look for: Which layers, when ablated, most reduce the Paris logit?
Try mean-ablation instead of zero-ablation: replace the MLP output with its average over a batch of random inputs. This can reveal whether the effect is due to the specific computation or just the magnitude of the output.
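A minimal sketch of mean-ablation. The baseline prompts below are arbitrary placeholders, and averaging over both batch and position is just one reasonable choice:

baseline_prompts = [
    "The weather today is quite",
    "She opened the door and",
    "In the middle of the night",
]
baseline_tokens = model.to_tokens(baseline_prompts)
_, baseline_cache = model.run_with_cache(baseline_tokens)

def mean_ablate_mlp(layer: int) -> float:
    """Replace an MLP's output with its mean over the baseline batch."""
    mean_act = baseline_cache["mlp_out", layer].mean(dim=(0, 1))  # [d_model]
    def hook(act, hook):
        return mean_act.expand_as(act).clone()
    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(f"blocks.{layer}.hook_mlp_out", hook)]
    )
    return ablated_logits[0, -1, answer_token].item()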
22.6 Arc IV: Synthesis
22.6.1 Chapter 13: Comparison to Induction Heads
Factual recall and induction are different mechanisms:
| Aspect | Induction Heads | Factual Recall |
|---|---|---|
| What | Pattern completion [A][B]…[A] → [B] | Knowledge retrieval |
| Where | Attention heads (layers 1-2) | MLPs (layers 7-9) |
| How | Attention pattern matching | MLP key-value lookup |
| Requires | Pattern in context | Knowledge in weights |
22.6.2 The Complete Picture
For “The Eiffel Tower is located in” → ” Paris”:
- Embedding: Tokens become vectors
- Early attention: Heads attend to “Eiffel Tower”, building context
- Mid MLPs (layers 7-9): Retrieve “Paris” from learned associations
- Late layers: Refine and commit to prediction
- Unembedding: “Paris” direction has highest logit
22.7 Summary: One Behavior, All Techniques
| Technique | What We Learn About Factual Recall |
|---|---|
| Forward pass | 12 layers transform “Eiffel Tower” into “Paris” |
| Residual stream | MLPs contribute most to final prediction |
| Geometry | “Paris” is a direction in vocabulary space |
| Features | Relevant concepts activate at final position |
| SAE | Can identify specific features like “France-related” |
| Attribution | MLP 8-9 have highest logit contribution |
| Patching | Swapping landmark info changes prediction |
| Ablation | Mid-layer MLPs are necessary for recall |
Each technique reveals a different facet of the same phenomenon. Together, they build a complete mechanistic understanding.
22.8 Try It Yourself
- Different fact: Try “The capital of Japan is” → “Tokyo”
- Harder fact: Try “The 16th president of the US was” → “Lincoln”
- Compare: How does the circuit differ for well-known vs. obscure facts?
- Break it: Find a fact the model gets wrong and analyze why
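A small helper to get started on these (analyze_fact is our own name; if to_single_token raises an error, the answer spans multiple tokens and needs different handling):

def analyze_fact(prompt: str, answer: str) -> None:
    """Quick check: does the model predict `answer` after `prompt`?"""
    tokens = model.to_tokens(prompt)
    answer_token = model.to_single_token(answer)
    logits = model(tokens)
    top_token = logits[0, -1].argmax()
    print(f"{prompt!r} -> predicts {model.tokenizer.decode(top_token)!r}, "
          f"logit for {answer!r}: {logits[0, -1, answer_token].item():.2f}")

analyze_fact("The capital of Japan is", " Tokyo")
analyze_fact("The 16th president of the US was", " Lincoln")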