25  Quick Reference

Cheat sheets for mechanistic interpretability techniques

Printable one-page summaries for each core technique. Keep these open while working.

25.1 TransformerLens Essentials

import transformer_lens as tl

# Load a model
model = tl.HookedTransformer.from_pretrained("gpt2-small")

# Basic inference
logits, cache = model.run_with_cache("Hello world")

# Access activations
resid = cache["resid_post", 5]        # Residual stream after layer 5
attn = cache["pattern", 3]             # Attention patterns for layer 3
mlp_out = cache["mlp_out", 7]          # MLP output at layer 7

# Get specific head output
head_out = cache["z", 4][:, :, 2, :]   # Layer 4, head 2 output (before W_O)

# Token info
tokens = model.to_tokens("Hello world")
str_tokens = model.to_str_tokens("Hello world")
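
A minimal sketch for reading predictions off the logits (the prompt and token string are illustrative):

# Inspect the model's next-token prediction
logits, cache = model.run_with_cache("The Eiffel Tower is in the city of")
final_logits = logits[0, -1]                   # Logits at the last position
top_id = final_logits.argmax().item()
print(model.tokenizer.decode(top_id))          # Most likely next token
paris_id = model.to_single_token(" Paris")     # Token id for a specific string
print(final_logits[paris_id].item())           # Its logit (used for attribution below)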

25.2 Technique Cheat Sheets

25.2.1 Attribution

| What | Code Pattern | Interpretation |
|---|---|---|
| Logit attribution | (component_output @ model.W_U)[:, -1, target_token] | How much did this component push toward the target token? |
| Direct logit attribution | cache.stack_head_results() @ model.W_U[:, target] | Per-head contribution to the target logit |
| Logit lens | model.unembed(model.ln_final(cache["resid_post", layer])) | What would the model predict at this layer? |

Key insight: Attribution is correlational, not causal. Always validate with patching.

# Quick logit attribution
def logit_attribution(cache, model, target_token: int) -> list[tuple[str, float]]:
    """Attribute the final-position logit to each component.

    Ignores the final LayerNorm and bias terms, so treat the scores as approximate.
    """
    target_dir = model.W_U[:, target_token]  # Unembedding direction for the target token
    contributions = {}

    for layer in range(model.cfg.n_layers):
        # Each attention head's contribution (compute from z and W_O)
        z = cache["z", layer][0, -1]  # [n_heads, d_head]
        W_O = model.W_O[layer]  # [n_heads, d_head, d_model]
        for head in range(model.cfg.n_heads):
            head_out = z[head] @ W_O[head]  # [d_model]
            contributions[f"L{layer}H{head}"] = (head_out @ target_dir).item()

        # MLP's contribution
        mlp_out = cache["mlp_out", layer][0, -1]
        contributions[f"L{layer}_MLP"] = (mlp_out @ target_dir).item()

    return sorted(contributions.items(), key=lambda x: -abs(x[1]))
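
Example usage, with an illustrative prompt and target token:

# Rank components by direct contribution to the " Mary" logit
prompt = "When John and Mary went to the store, John gave a drink to"
target = model.to_single_token(" Mary")
_, cache = model.run_with_cache(prompt)
for name, score in logit_attribution(cache, model, target)[:10]:
    print(f"{name}: {score:+.3f}")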

25.2.2 Activation Patching

| Type | What’s Patched | Use Case |
|---|---|---|
| Residual patching | Residual stream at a layer | Where is information stored? |
| Attention patching | Attention head output | Which heads matter? |
| MLP patching | MLP output | Which MLPs matter? |
| Path patching | Specific paths (e.g., head→head) | How does information flow? |

Key insight: Patching is causal, but the direction matters: noising tests what’s necessary, while denoising tests what’s sufficient (see Noising vs Denoising below).

# Basic activation patching template
def patch_activation(model, clean_input: str, corrupted_input: str,
                     layer: int, component: str = "resid_post"):
    """Patch the component from the corrupted run into the clean run."""
    _, corrupted_cache = model.run_with_cache(corrupted_input)
    corrupted_act = corrupted_cache[component, layer]

    def patch_hook(act, hook):
        # Assumes clean and corrupted inputs tokenize to the same length
        act[:] = corrupted_act
        return act

    return model.run_with_hooks(
        clean_input,
        fwd_hooks=[(tl.utils.get_act_name(component, layer), patch_hook)]
    )

Noising vs Denoising:

  • Noising (clean → corrupted): “Does breaking this break the behavior?”
  • Denoising (corrupted → clean): “Does fixing this fix the behavior?”
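
A minimal sketch of both directions at one residual-stream site, using the answer-vs-distractor logit difference as the metric (the prompt pair, layer, and metric are illustrative):

# Noising vs denoising at a single residual-stream site
clean = "When John and Mary went to the store, John gave the bag to"
corrupted = "When John and Mary went to the store, Mary gave the bag to"
answer = model.to_single_token(" Mary")
distractor = model.to_single_token(" John")

def logit_diff(logits):
    # Higher = model prefers " Mary" over " John" at the final position
    return (logits[0, -1, answer] - logits[0, -1, distractor]).item()

layer = 9
hook_name = f"blocks.{layer}.hook_resid_post"
_, clean_cache = model.run_with_cache(clean)
_, corrupted_cache = model.run_with_cache(corrupted)

def restore_clean(act, hook):
    act[:] = clean_cache[hook.name]        # Denoising: put the clean activation back
    return act

def insert_corrupted(act, hook):
    act[:] = corrupted_cache[hook.name]    # Noising: inject the corrupted activation
    return act

denoised = model.run_with_hooks(corrupted, fwd_hooks=[(hook_name, restore_clean)])
noised = model.run_with_hooks(clean, fwd_hooks=[(hook_name, insert_corrupted)])

print("clean:    ", logit_diff(model(clean)))
print("corrupted:", logit_diff(model(corrupted)))
print("denoised: ", logit_diff(denoised))
print("noised:   ", logit_diff(noised))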

25.2.3 Ablation

| Type | Method | Pros | Cons |
|---|---|---|---|
| Zero ablation | Set to 0 | Simple | Distribution shift |
| Mean ablation | Set to dataset mean | Less distribution shift | Loses input-dependence |
| Resample ablation | Random other input | Preserves distribution | Noisy |

Key insight: Ablation tests necessity. A behavior that survives ablation of a component may be supported by backup circuits.

# Mean ablation template
def mean_ablate_head(model, input_text, layer, head, mean_cache):
    """Ablate a specific head with its mean activation."""
    mean_act = mean_cache["z", layer][:, :, head, :].mean(dim=0)

    def ablate_hook(z, hook):
        z[:, :, head, :] = mean_act
        return z

    return model.run_with_hooks(
        input_text,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_z", ablate_hook)]
    )
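
For comparison, a resample-ablation sketch in the same style (other_text is an illustrative reference input that must tokenize to the same length as input_text):

# Resample ablation template
def resample_ablate_head(model, input_text, other_text, layer, head):
    """Ablate a head by substituting its activation from a different input."""
    _, other_cache = model.run_with_cache(other_text)
    other_z = other_cache["z", layer][:, :, head, :]

    def ablate_hook(z, hook):
        z[:, :, head, :] = other_z  # Assumes matching sequence lengths
        return z

    return model.run_with_hooks(
        input_text,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_z", ablate_hook)]
    )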

25.2.4 SAE Feature Analysis

| Task | Code Pattern |
|---|---|
| Load SAE | sae, cfg, sparsity = SAE.from_pretrained(release, sae_id) |
| Encode | feature_acts = sae.encode(activations) |
| Decode | reconstructed = sae.decode(feature_acts) |
| Top features | top_features = feature_acts.topk(k=10) |
| Feature direction | sae.W_dec[feature_idx] |

from sae_lens import SAE

# Load SAE for GPT-2 layer 8
sae, cfg, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre"
)

# Get feature activations (text is any prompt string)
_, cache = model.run_with_cache(text)
acts = cache["resid_pre", 8]
feature_acts = sae.encode(acts)

# Find top-activating features at the final token position
top_k = feature_acts[0, -1].topk(10)
for idx, val in zip(top_k.indices, top_k.values):
    print(f"Feature {idx.item()}: {val.item():.2f}")

Steering with features:

# Add a feature direction to the residual stream during generation
def steering_hook(resid, hook):
    feature_dir = sae.W_dec[feature_idx]  # [d_model] decoder direction for the chosen feature
    resid += strength * feature_dir       # strength controls how hard to steer
    return resid
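
One way to apply the hook during sampling, assuming feature_idx and strength have been chosen and that the SAE’s config exposes its hook point as sae.cfg.hook_name:

# Register the steering hook while generating
feature_idx, strength = 1234, 5.0      # Illustrative feature id and coefficient
hook_name = sae.cfg.hook_name          # e.g. "blocks.8.hook_resid_pre"
with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    print(model.generate("I think that", max_new_tokens=20))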

25.3 Common Hook Points

| Hook Name | Shape | What It Contains |
|---|---|---|
| hook_embed | [batch, seq, d_model] | Token embeddings |
| hook_pos_embed | [batch, seq, d_model] | Position embeddings |
| blocks.{L}.hook_resid_pre | [batch, seq, d_model] | Residual before layer L |
| blocks.{L}.hook_resid_post | [batch, seq, d_model] | Residual after layer L |
| blocks.{L}.attn.hook_q | [batch, seq, n_heads, d_head] | Queries |
| blocks.{L}.attn.hook_k | [batch, seq, n_heads, d_head] | Keys |
| blocks.{L}.attn.hook_v | [batch, seq, n_heads, d_head] | Values |
| blocks.{L}.attn.hook_z | [batch, seq, n_heads, d_head] | Head outputs (before W_O) |
| blocks.{L}.attn.hook_pattern | [batch, n_heads, seq, seq] | Attention patterns |
| blocks.{L}.attn.hook_result | [batch, seq, n_heads, d_model] | Head outputs (after W_O) |
| blocks.{L}.hook_mlp_out | [batch, seq, d_model] | MLP output |
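
Exact hook names can be checked on a loaded model; note that hook_result is only populated once per-head results are enabled, at extra memory cost:

# List the attention hooks in block 0 to check spellings
print([name for name in model.hook_dict if name.startswith("blocks.0.attn")])

# Needed before blocks.{L}.attn.hook_result is populated
model.set_use_attn_result(True)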

25.4 Model Dimensions Quick Reference

| Model | Layers | Heads | d_model | d_head | d_mlp | Vocab |
|---|---|---|---|---|---|---|
| GPT-2 Small | 12 | 12 | 768 | 64 | 3072 | 50257 |
| GPT-2 Medium | 24 | 16 | 1024 | 64 | 4096 | 50257 |
| GPT-2 Large | 36 | 20 | 1280 | 64 | 5120 | 50257 |
| GPT-2 XL | 48 | 25 | 1600 | 64 | 6400 | 50257 |
| Pythia-70M | 6 | 8 | 512 | 64 | 2048 | 50304 |
| Pythia-160M | 12 | 12 | 768 | 64 | 3072 | 50304 |
| Pythia-410M | 24 | 16 | 1024 | 64 | 4096 | 50304 |
| Pythia-1B | 16 | 8 | 2048 | 256 | 8192 | 50304 |
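
These numbers can also be read off a loaded model’s config rather than memorized:

# Print the key dimensions for whichever model is loaded
cfg = model.cfg
print(f"layers={cfg.n_layers}, heads={cfg.n_heads}, d_model={cfg.d_model}, "
      f"d_head={cfg.d_head}, d_mlp={cfg.d_mlp}, vocab={cfg.d_vocab}")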

25.5 Decision Flowchart: Which Technique?

Start: "I want to understand behavior X"
  │
  ├─► "What correlates with X?"
  │     → Use ATTRIBUTION (fast, one forward pass)
  │     → Validate findings with patching
  │
  ├─► "What causes X?"
  │     → Use PATCHING (slower, many forward passes)
  │     → Compare clean vs corrupted inputs
  │
  ├─► "What's necessary for X?"
  │     → Use ABLATION
  │     → Try mean ablation first, then resample
  │
  └─► "What concepts are involved in X?"
        → Use SAE FEATURES
        → Check Neuronpedia for interpretation
        → Validate with steering

25.6 Validation Checklist

Before claiming you’ve found a circuit:

  • Attribution findings are validated causally with activation patching.
  • Ablating the claimed components degrades the behavior, and backup circuits have been checked for.
  • Noising and denoising patching implicate the same components.
  • Any SAE feature interpretations are checked against Neuronpedia and validated with steering.
  • The result holds on prompts beyond the ones used to find the circuit.

25.7 Comparison Tables

25.7.1 Attention vs MLPs

| Aspect | Attention Heads | MLPs |
|---|---|---|
| Operation | Move information between positions | Transform information at each position |
| Function | “What to look at” | “What to do with it” |
| Knowledge | Patterns, relationships | Facts, associations |
| Example | Induction heads, name movers | Factual recall, arithmetic |
| Ablation effect | Breaks attention-dependent behaviors | Breaks knowledge-dependent behaviors |
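
A rough sketch of the last row’s contrast: zero-ablate one layer’s attention output versus its MLP output and compare the loss (the prompt and layer are illustrative):

# Zero-ablate attention vs MLP at one layer and compare loss
def zero_ablate_loss(model, text, hook_name):
    def zero_hook(act, hook):
        act[:] = 0.0
        return act
    return model.run_with_hooks(text, return_type="loss",
                                fwd_hooks=[(hook_name, zero_hook)]).item()

text = "The Eiffel Tower is located in the city of Paris"
layer = 8
print("baseline loss   :", model(text, return_type="loss").item())
print("attention zeroed:", zero_ablate_loss(model, text, f"blocks.{layer}.hook_attn_out"))
print("MLP zeroed      :", zero_ablate_loss(model, text, f"blocks.{layer}.hook_mlp_out"))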

25.7.2 Patching vs Ablation

| Aspect | Patching | Ablation |
|---|---|---|
| Question | “Where is this info processed?” | “Is this component necessary?” |
| Method | Swap activations between runs | Remove or zero out component |
| Requires | Two contrasting inputs | Single input + baseline |
| Measures | Sufficiency for change | Necessity for behavior |
| Limitation | Needs good corrupted input | Backup circuits mask importance |

25.7.3 Noising vs Denoising Patching

| Aspect | Noising | Denoising |
|---|---|---|
| Direction | Clean → Corrupted | Corrupted → Clean |
| Question | “Does breaking X break behavior?” | “Does restoring X restore behavior?” |
| Finds | What’s necessary | What’s sufficient |
| Bias | May miss redundant components | May overestimate component importance |

25.7.4 SAE Metrics

| Metric | Measures | Good Values |
|---|---|---|
| L0 (sparsity) | Average number of features active per token | 20-100 for residual SAEs |
| Reconstruction loss | How much information is lost | Lower is better (<10% degradation) |
| CE loss recovered | Task performance with the SAE reconstruction spliced in | >90% of original |
| Feature absorption | Do features capture consistent concepts? | Minimize with larger dictionaries |
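
A sketch of measuring L0 and CE loss recovered on a single prompt, reusing model and sae from 25.2.4 (the prompt is illustrative, and sae.cfg.hook_name is assumed to hold the SAE’s hook point):

import torch

text = "The quick brown fox jumps over the lazy dog"
hook_name = sae.cfg.hook_name  # e.g. "blocks.8.hook_resid_pre"
_, cache = model.run_with_cache(text)
feature_acts = sae.encode(cache[hook_name])
recon = sae.decode(feature_acts)

# L0: average number of active features per token
l0 = (feature_acts > 0).float().sum(-1).mean().item()

# CE loss recovered: clean loss vs loss with the SAE reconstruction spliced in,
# normalized against zero-ablating the hook point entirely
def splice_recon(act, hook):
    return recon

def zero_out(act, hook):
    return torch.zeros_like(act)

clean_loss = model(text, return_type="loss").item()
sae_loss = model.run_with_hooks(text, return_type="loss",
                                fwd_hooks=[(hook_name, splice_recon)]).item()
zero_loss = model.run_with_hooks(text, return_type="loss",
                                 fwd_hooks=[(hook_name, zero_out)]).item()
ce_recovered = (zero_loss - sae_loss) / (zero_loss - clean_loss)
print(f"L0 = {l0:.1f}, CE loss recovered = {ce_recovered:.1%}")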

25.7.5 Hook Types for Different Goals

| Goal | Primary Hook | Why |
|---|---|---|
| Read attention patterns | hook_pattern | Direct access to attention weights |
| Modify what heads attend to | hook_q or hook_k | Changes the attention computation |
| Change what heads output | hook_z or hook_result | Modifies the head’s contribution |
| Patch entire layer | hook_resid_post | Captures all layer computation |
| Inject features | hook_resid_pre | Before the next layer processes it |