25  Quick Reference

Cheat sheets for mechanistic interpretability techniques

Printable one-page summaries for each core technique. Keep these open while working.

25.1 TransformerLens Essentials

import transformer_lens as tl

# Load a model
model = tl.HookedTransformer.from_pretrained("gpt2-small")

# Basic inference
logits, cache = model.run_with_cache("Hello world")

# Access activations
resid = cache["resid_post", 5]        # Residual stream after layer 5
attn = cache["pattern", 3]             # Attention patterns for layer 3
mlp_out = cache["mlp_out", 7]          # MLP output at layer 7

# Get specific head output
head_out = cache["z", 4][:, :, 2, :]   # Layer 4, head 2 output (before W_O)

# Token info
tokens = model.to_tokens("Hello world")
str_tokens = model.to_str_tokens("Hello world")
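
A minimal sketch for reading predictions off the logits (the prompt and token string are illustrative):

# Inspect the model's next-token prediction
logits, cache = model.run_with_cache("The Eiffel Tower is in the city of")
final_logits = logits[0, -1]                   # Logits at the last position
top_id = final_logits.argmax().item()
print(model.tokenizer.decode(top_id))          # Most likely next token
paris_id = model.to_single_token(" Paris")     # Token id for a specific string
print(final_logits[paris_id].item())           # Its logit (used for attribution below)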

25.2 Technique Cheat Sheets

25.2.1 Attribution

| What | Code Pattern | Interpretation |
|---|---|---|
| Logit attribution | (component_output @ model.W_U)[:, -1, target_token] | How much did this component push toward the target token? |
| Direct logit attribution | cache.stack_head_results() @ model.W_U[:, target] | Per-head contribution to the target logit |
| Logit lens | model.unembed(model.ln_final(cache["resid_post", layer])) | What would the model predict at this layer? |

Key insight: Attribution is correlational, not causal. Always validate with patching.

# Quick logit attribution
def logit_attribution(cache, model, target_token: int) -> list[tuple[str, float]]:
    """Attribute the final-position logit to each component.

    Ignores the final LayerNorm and bias terms, so treat the scores as approximate.
    """
    target_dir = model.W_U[:, target_token]  # Unembedding direction for the target token
    contributions = {}

    for layer in range(model.cfg.n_layers):
        # Each attention head's contribution (compute from z and W_O)
        z = cache["z", layer][0, -1]  # [n_heads, d_head]
        W_O = model.W_O[layer]  # [n_heads, d_head, d_model]
        for head in range(model.cfg.n_heads):
            head_out = z[head] @ W_O[head]  # [d_model]
            contributions[f"L{layer}H{head}"] = (head_out @ target_dir).item()

        # MLP's contribution
        mlp_out = cache["mlp_out", layer][0, -1]
        contributions[f"L{layer}_MLP"] = (mlp_out @ target_dir).item()

    return sorted(contributions.items(), key=lambda x: -abs(x[1]))
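
Example usage, with an illustrative prompt and target token:

# Rank components by direct contribution to the " Mary" logit
prompt = "When John and Mary went to the store, John gave a drink to"
target = model.to_single_token(" Mary")
_, cache = model.run_with_cache(prompt)
for name, score in logit_attribution(cache, model, target)[:10]:
    print(f"{name}: {score:+.3f}")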

25.2.2 Activation Patching

| Type | What’s Patched | Use Case |
|---|---|---|
| Residual patching | Residual stream at a layer | Where is information stored? |
| Attention patching | Attention head output | Which heads matter? |
| MLP patching | MLP output | Which MLPs matter? |
| Path patching | Specific paths (e.g., head→head) | How does information flow? |

Key insight: Patching is causal, but the direction matters: noising tests what’s necessary, while denoising tests what’s sufficient (see Noising vs Denoising below).

# Basic activation patching template
def patch_activation(model, clean_input: str, corrupted_input: str,
                     layer: int, component: str = "resid_post"):
    """Patch the component from the corrupted run into the clean run."""
    _, corrupted_cache = model.run_with_cache(corrupted_input)
    corrupted_act = corrupted_cache[component, layer]

    def patch_hook(act, hook):
        # Assumes clean and corrupted inputs tokenize to the same length
        act[:] = corrupted_act
        return act

    return model.run_with_hooks(
        clean_input,
        fwd_hooks=[(tl.utils.get_act_name(component, layer), patch_hook)]
    )

Noising vs Denoising:

  • Noising (clean → corrupted): “Does breaking this break the behavior?”
  • Denoising (corrupted → clean): “Does fixing this fix the behavior?”
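
A minimal sketch of both directions at one residual-stream site, using the answer-vs-distractor logit difference as the metric (the prompt pair, layer, and metric are illustrative):

# Noising vs denoising at a single residual-stream site
clean = "When John and Mary went to the store, John gave the bag to"
corrupted = "When John and Mary went to the store, Mary gave the bag to"
answer = model.to_single_token(" Mary")
distractor = model.to_single_token(" John")

def logit_diff(logits):
    # Higher = model prefers " Mary" over " John" at the final position
    return (logits[0, -1, answer] - logits[0, -1, distractor]).item()

layer = 9
hook_name = f"blocks.{layer}.hook_resid_post"
_, clean_cache = model.run_with_cache(clean)
_, corrupted_cache = model.run_with_cache(corrupted)

def restore_clean(act, hook):
    act[:] = clean_cache[hook.name]        # Denoising: put the clean activation back
    return act

def insert_corrupted(act, hook):
    act[:] = corrupted_cache[hook.name]    # Noising: inject the corrupted activation
    return act

denoised = model.run_with_hooks(corrupted, fwd_hooks=[(hook_name, restore_clean)])
noised = model.run_with_hooks(clean, fwd_hooks=[(hook_name, insert_corrupted)])

print("clean:    ", logit_diff(model(clean)))
print("corrupted:", logit_diff(model(corrupted)))
print("denoised: ", logit_diff(denoised))
print("noised:   ", logit_diff(noised))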

25.2.3 Ablation

| Type | Method | Pros | Cons |
|---|---|---|---|
| Zero ablation | Set to 0 | Simple | Distribution shift |
| Mean ablation | Set to dataset mean | Less distribution shift | Loses input-dependence |
| Resample ablation | Random other input | Preserves distribution | Noisy |

Key insight: Ablation tests necessity. A behavior that survives ablation of a component may be supported by backup circuits.

# Mean ablation template
def mean_ablate_head(model, input_text, layer, head, mean_cache):
    """Ablate a specific head with its mean activation."""
    mean_act = mean_cache["z", layer][:, :, head, :].mean(dim=0)

    def ablate_hook(z, hook):
        z[:, :, head, :] = mean_act
        return z

    return model.run_with_hooks(
        input_text,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_z", ablate_hook)]
    )
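
For comparison, a resample-ablation sketch in the same style (other_text is an illustrative reference input that must tokenize to the same length as input_text):

# Resample ablation template
def resample_ablate_head(model, input_text, other_text, layer, head):
    """Ablate a head by substituting its activation from a different input."""
    _, other_cache = model.run_with_cache(other_text)
    other_z = other_cache["z", layer][:, :, head, :]

    def ablate_hook(z, hook):
        z[:, :, head, :] = other_z  # Assumes matching sequence lengths
        return z

    return model.run_with_hooks(
        input_text,
        fwd_hooks=[(f"blocks.{layer}.attn.hook_z", ablate_hook)]
    )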

25.2.4 SAE Feature Analysis

| Task | Code Pattern |
|---|---|
| Load SAE | sae, cfg, sparsity = SAE.from_pretrained(release, sae_id) |
| Encode | feature_acts = sae.encode(activations) |
| Decode | reconstructed = sae.decode(feature_acts) |
| Top features | top_features = feature_acts.topk(k=10) |
| Feature direction | sae.W_dec[feature_idx] |

from sae_lens import SAE

# Load SAE for GPT-2 layer 8
sae, cfg, sparsity = SAE.from_pretrained(
    release="gpt2-small-res-jb",
    sae_id="blocks.8.hook_resid_pre"
)

# Get feature activations (text is any prompt string)
_, cache = model.run_with_cache(text)
acts = cache["resid_pre", 8]
feature_acts = sae.encode(acts)

# Find top-activating features at the final token position
top_k = feature_acts[0, -1].topk(10)
for idx, val in zip(top_k.indices, top_k.values):
    print(f"Feature {idx.item()}: {val.item():.2f}")

Steering with features:

# Add a feature direction to the residual stream during generation
def steering_hook(resid, hook):
    feature_dir = sae.W_dec[feature_idx]  # [d_model] decoder direction for the chosen feature
    resid += strength * feature_dir       # strength controls how hard to steer
    return resid
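
One way to apply the hook during sampling, assuming feature_idx and strength have been chosen and that the SAE’s config exposes its hook point as sae.cfg.hook_name:

# Register the steering hook while generating
feature_idx, strength = 1234, 5.0      # Illustrative feature id and coefficient
hook_name = sae.cfg.hook_name          # e.g. "blocks.8.hook_resid_pre"
with model.hooks(fwd_hooks=[(hook_name, steering_hook)]):
    print(model.generate("I think that", max_new_tokens=20))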

25.3 Common Hook Points

| Hook Name | Shape | What It Contains |
|---|---|---|
| hook_embed | [batch, seq, d_model] | Token embeddings |
| hook_pos_embed | [batch, seq, d_model] | Position embeddings |
| blocks.{L}.hook_resid_pre | [batch, seq, d_model] | Residual before layer L |
| blocks.{L}.hook_resid_post | [batch, seq, d_model] | Residual after layer L |
| blocks.{L}.attn.hook_q | [batch, seq, n_heads, d_head] | Queries |
| blocks.{L}.attn.hook_k | [batch, seq, n_heads, d_head] | Keys |
| blocks.{L}.attn.hook_v | [batch, seq, n_heads, d_head] | Values |
| blocks.{L}.attn.hook_z | [batch, seq, n_heads, d_head] | Head outputs (before W_O) |
| blocks.{L}.attn.hook_pattern | [batch, n_heads, seq, seq] | Attention patterns |
| blocks.{L}.attn.hook_result | [batch, seq, n_heads, d_model] | Head outputs (after W_O) |
| blocks.{L}.hook_mlp_out | [batch, seq, d_model] | MLP output |
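
Exact hook names can be checked on a loaded model; note that hook_result is only populated once per-head results are enabled, at extra memory cost:

# List the attention hooks in block 0 to check spellings
print([name for name in model.hook_dict if name.startswith("blocks.0.attn")])

# Needed before blocks.{L}.attn.hook_result is populated
model.set_use_attn_result(True)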

25.4 Model Dimensions Quick Reference

| Model | Layers | Heads | d_model | d_head | d_mlp | Vocab |
|---|---|---|---|---|---|---|
| GPT-2 Small | 12 | 12 | 768 | 64 | 3072 | 50257 |
| GPT-2 Medium | 24 | 16 | 1024 | 64 | 4096 | 50257 |
| GPT-2 Large | 36 | 20 | 1280 | 64 | 5120 | 50257 |
| GPT-2 XL | 48 | 25 | 1600 | 64 | 6400 | 50257 |
| Pythia-70M | 6 | 8 | 512 | 64 | 2048 | 50304 |
| Pythia-160M | 12 | 12 | 768 | 64 | 3072 | 50304 |
| Pythia-410M | 24 | 16 | 1024 | 64 | 4096 | 50304 |
| Pythia-1B | 16 | 8 | 2048 | 256 | 8192 | 50304 |
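
These numbers can also be read off a loaded model’s config rather than memorized:

# Print the key dimensions for whichever model is loaded
cfg = model.cfg
print(f"layers={cfg.n_layers}, heads={cfg.n_heads}, d_model={cfg.d_model}, "
      f"d_head={cfg.d_head}, d_mlp={cfg.d_mlp}, vocab={cfg.d_vocab}")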

25.5 Decision Flowchart: Which Technique?

Start: "I want to understand behavior X"
  │
  ├─► "What correlates with X?"
  │     → Use ATTRIBUTION (fast, one forward pass)
  │     → Validate findings with patching
  │
  ├─► "What causes X?"
  │     → Use PATCHING (slower, many forward passes)
  │     → Compare clean vs corrupted inputs
  │
  ├─► "What's necessary for X?"
  │     → Use ABLATION
  │     → Try mean ablation first, then resample
  │
  └─► "What concepts are involved in X?"
        → Use SAE FEATURES
        → Check Neuronpedia for interpretation
        → Validate with steering

25.6 Validation Checklist

Before claiming you’ve found a circuit:

  • Attribution findings are validated causally with activation patching.
  • Ablating the claimed components degrades the behavior, and backup circuits have been checked for.
  • Noising and denoising patching implicate the same components.
  • Any SAE feature interpretations are checked against Neuronpedia and validated with steering.
  • The result holds on prompts beyond the ones used to find the circuit.

25.7 Comparison Tables

25.7.1 Attention vs MLPs

| Aspect | Attention Heads | MLPs |
|---|---|---|
| Operation | Move information between positions | Transform information at each position |
| Function | “What to look at” | “What to do with it” |
| Knowledge | Patterns, relationships | Facts, associations |
| Example | Induction heads, name movers | Factual recall, arithmetic |
| Ablation effect | Breaks attention-dependent behaviors | Breaks knowledge-dependent behaviors |
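
A rough sketch of the last row’s contrast: zero-ablate one layer’s attention output versus its MLP output and compare the loss (the prompt and layer are illustrative):

# Zero-ablate attention vs MLP at one layer and compare loss
def zero_ablate_loss(model, text, hook_name):
    def zero_hook(act, hook):
        act[:] = 0.0
        return act
    return model.run_with_hooks(text, return_type="loss",
                                fwd_hooks=[(hook_name, zero_hook)]).item()

text = "The Eiffel Tower is located in the city of Paris"
layer = 8
print("baseline loss   :", model(text, return_type="loss").item())
print("attention zeroed:", zero_ablate_loss(model, text, f"blocks.{layer}.hook_attn_out"))
print("MLP zeroed      :", zero_ablate_loss(model, text, f"blocks.{layer}.hook_mlp_out"))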

25.7.2 Patching vs Ablation

| Aspect | Patching | Ablation |
|---|---|---|
| Question | “Where is this info processed?” | “Is this component necessary?” |
| Method | Swap activations between runs | Remove or zero out component |
| Requires | Two contrasting inputs | Single input + baseline |
| Measures | Sufficiency for change | Necessity for behavior |
| Limitation | Needs good corrupted input | Backup circuits mask importance |

25.7.3 Noising vs Denoising Patching

| Aspect | Noising | Denoising |
|---|---|---|
| Direction | Clean → Corrupted | Corrupted → Clean |
| Question | “Does breaking X break behavior?” | “Does restoring X restore behavior?” |
| Finds | What’s necessary | What’s sufficient |
| Bias | May miss redundant components | May overestimate component importance |

25.7.4 SAE Metrics

| Metric | Measures | Good Values |
|---|---|---|
| L0 (sparsity) | Average number of features active per token | 20-100 for residual SAEs |
| Reconstruction loss | How much information is lost | Lower is better (<10% degradation) |
| CE loss recovered | Task performance with the SAE reconstruction spliced in | >90% of original |
| Feature absorption | Do features capture consistent concepts? | Minimize with larger dictionaries |
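
A sketch of measuring L0 and CE loss recovered on a single prompt, reusing model and sae from 25.2.4 (the prompt is illustrative, and sae.cfg.hook_name is assumed to hold the SAE’s hook point):

import torch

text = "The quick brown fox jumps over the lazy dog"
hook_name = sae.cfg.hook_name  # e.g. "blocks.8.hook_resid_pre"
_, cache = model.run_with_cache(text)
feature_acts = sae.encode(cache[hook_name])
recon = sae.decode(feature_acts)

# L0: average number of active features per token
l0 = (feature_acts > 0).float().sum(-1).mean().item()

# CE loss recovered: clean loss vs loss with the SAE reconstruction spliced in,
# normalized against zero-ablating the hook point entirely
def splice_recon(act, hook):
    return recon

def zero_out(act, hook):
    return torch.zeros_like(act)

clean_loss = model(text, return_type="loss").item()
sae_loss = model.run_with_hooks(text, return_type="loss",
                                fwd_hooks=[(hook_name, splice_recon)]).item()
zero_loss = model.run_with_hooks(text, return_type="loss",
                                 fwd_hooks=[(hook_name, zero_out)]).item()
ce_recovered = (zero_loss - sae_loss) / (zero_loss - clean_loss)
print(f"L0 = {l0:.1f}, CE loss recovered = {ce_recovered:.1%}")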

25.7.5 Hook Types for Different Goals

| Goal | Primary Hook | Why |
|---|---|---|
| Read attention patterns | hook_pattern | Direct access to attention weights |
| Modify what heads attend to | hook_q or hook_k | Changes the attention computation |
| Change what heads output | hook_z or hook_result | Modifies the head’s contribution |
| Patch entire layer | hook_resid_post | Captures all layer computation |
| Inject features | hook_resid_pre | Before the next layer processes it |