```mermaid
flowchart LR
    subgraph ARC3["Arc III: The Toolkit"]
        A9["Chapter 9<br/>SAEs"] --> A10["Chapter 10<br/>Attribution"]
        A10 --> A11["Chapter 11<br/>Patching"]
        A11 --> A12["Chapter 12<br/>Ablation"]
    end
```
18 Arc III Summary
The interpretability toolkit: what each technique does and when to use it
You’ve completed Arc III: Techniques. Before moving to Arc IV (Synthesis), consolidate your understanding of the interpretability toolkit.
18.1 The Big Picture
You now have a complete methodology for reverse-engineering neural networks: SAEs to discover features, attribution to generate hypotheses, patching to establish causation, and ablation to test necessity.
18.2 Key Concepts to Remember
18.2.1 From Chapter 9: Sparse Autoencoders
- SAEs decompose polysemantic activations into monosemantic features
- Architecture: Overcomplete (more latent dimensions than input) + sparse (most latents are zero on any given input)
- Training: Reconstruction loss + sparsity loss (L1 penalty)
- Trade-off: More sparsity → more interpretable, but worse reconstruction
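The architecture and training objective above can be sketched in a few lines of PyTorch. This is a minimal toy, not any particular library's implementation; the class name, dimensions, and the `l1_coeff` value are illustrative.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete latent space with ReLU-enforced sparsity."""
    def __init__(self, d_model: int, d_latent: int):
        super().__init__()
        self.enc = nn.Linear(d_model, d_latent)  # d_latent > d_model (overcomplete)
        self.dec = nn.Linear(d_latent, d_model)

    def forward(self, x):
        f = torch.relu(self.enc(x))   # sparse feature activations (non-negative)
        x_hat = self.dec(f)           # reconstruction of the input activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    recon = (x - x_hat).pow(2).mean()        # reconstruction loss (MSE)
    sparsity = f.abs().sum(dim=-1).mean()    # L1 penalty on feature activations
    return recon + l1_coeff * sparsity       # raising l1_coeff trades reconstruction for sparsity

# Toy usage on random "activations"
sae = SparseAutoencoder(d_model=8, d_latent=32)
x = torch.randn(4, 8)
x_hat, f = sae(x)
loss = sae_loss(x, x_hat, f)
```

Raising `l1_coeff` is exactly the trade-off named above: features get sparser (and more interpretable) while reconstruction degrades.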
18.2.2 From Chapter 10: Attribution
- Attribution decomposes the output into per-component contributions
- Key insight: The logit is a sum, so we can see what each component added
- Logit lens: Peek at intermediate predictions by applying unembedding early
- Limitation: Attribution shows correlation, not causation
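The "logit is a sum" insight can be demonstrated with toy tensors; the arrays below are illustrative stand-ins for real component outputs and an unembedding column, not a real model.

```python
import torch

# The residual stream is a sum of per-component outputs, and a logit is a
# dot product of the residual stream with the answer token's unembedding
# column -- so the logit decomposes linearly into per-component contributions.
torch.manual_seed(0)
d_model, n_components = 16, 3
component_outputs = torch.randn(n_components, d_model)  # e.g. embed, attn, mlp
unembed_direction = torch.randn(d_model)                # W_U column for the answer token

residual = component_outputs.sum(dim=0)
total_logit = residual @ unembed_direction

# Direct logit attribution: each component's additive share of the logit
contributions = component_outputs @ unembed_direction
assert torch.allclose(contributions.sum(), total_logit)
```

The same projection applied to an intermediate residual stream (before the final layer) is the logit lens: it reads off what the model would predict at that depth.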
18.2.3 From Chapter 11: Activation Patching
- Patching proves causation by surgically replacing activations
- Clean/corrupted paradigm: Two minimally different inputs, one that elicits the correct behavior and one that breaks it
- Patch from clean → corrupted: If output recovers, component is causal
- Path patching: Trace specific information flows between components
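The clean → corrupted patch can be sketched with forward hooks. The toy two-layer network below is hypothetical (real experiments hook a transformer component), but the mechanics are the same: cache an activation on the clean run, then splice it into the corrupted run.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
clean, corrupted = torch.randn(1, 4), torch.randn(1, 4)

cache = {}
def save_hook(module, inp, out):
    cache["act"] = out.detach()          # cache the clean activation

def patch_hook(module, inp, out):
    return cache["act"]                  # replace the activation with the clean one

layer = model[0]                         # the component we are testing

handle = layer.register_forward_hook(save_hook)
clean_out = model(clean)                 # clean run: cache the activation
handle.remove()

handle = layer.register_forward_hook(patch_hook)
patched_out = model(corrupted)           # corrupted input, clean activation patched in
handle.remove()

# If the patched output recovers the clean output, the component is causal
# for the behavior. In this toy, everything downstream depends only on
# layer 0's output, so recovery is complete.
assert torch.allclose(patched_out, clean_out)
```

In a real circuit, recovery is usually partial; the fraction of the clean-vs-corrupted gap recovered is the standard patching metric.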
18.2.4 From Chapter 12: Ablation
- Ablation removes a component entirely and measures the effect
- Zero ablation: Set to zero (simple but can cause distribution shift)
- Mean ablation: Set to dataset mean (preserves magnitude, removes specificity)
- Resample ablation: Replace with value from different input (preserves distribution)
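The three ablation baselines can be compared on a toy batch of activations (random stand-ins; in practice these come from a model hook over a dataset).

```python
import torch

torch.manual_seed(0)
acts = torch.randn(100, 8)                 # activations for 100 inputs, d_model=8

zero_abl = torch.zeros(8)                  # zero ablation: off-distribution value
mean_abl = acts.mean(dim=0)                # mean ablation: dataset mean
resample_abl = acts[torch.randperm(100)]   # resample ablation: another input's activation

# Mean ablation keeps the expected magnitude: what it removes is exactly
# the input-specific deviation from the mean, which averages to zero.
deviation = acts - mean_abl
assert torch.allclose(deviation.mean(dim=0), torch.zeros(8), atol=1e-6)
```

Zero ablation hands the network a value it never saw in training (distribution shift); resample ablation stays on-distribution but injects another input's information rather than a neutral baseline.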
18.3 The Interpretability Pipeline
```mermaid
flowchart TD
    subgraph DISCOVERY["Phase 1: Discovery"]
        SAE["SAE Feature Extraction<br/>What concepts exist?"]
        ATTR["Attribution<br/>What correlates with output?"]
    end
    subgraph VALIDATION["Phase 2: Validation"]
        PATCH["Patching<br/>What causes the behavior?"]
        ABL["Ablation<br/>What's necessary?"]
    end
    subgraph RESULT["Phase 3: Understanding"]
        CIRCUIT["Circuit Description<br/>Features + Connections + Roles"]
    end
    SAE --> ATTR
    ATTR --> PATCH
    PATCH --> ABL
    ABL --> CIRCUIT
```
18.4 When to Use Each Technique
| Technique | Question Answered | Cost | Use When |
|---|---|---|---|
| SAE | What concepts exist here? | Training time | You want interpretable features |
| Attribution | What contributed to this output? | 1-2 forward passes | Generating hypotheses |
| Patching | Is this component causally necessary? | ~100 passes/component | Testing hypotheses |
| Ablation | What happens without this component? | 1 pass/component | Finding minimal circuits |
18.5 Self-Test: Can You Answer These?

Question 1: Why run attribution before patching instead of patching every component?

Cost efficiency. Attribution requires only 1-2 forward passes to identify all component contributions. Patching requires ~100 passes per component. Use attribution to narrow from 144 attention heads to the top 10 candidates, then patch only those.

Question 2: Attribution shows a component contributing strongly to the correct answer, but patching it barely changes the output. What does that mean?

The component is correlated but not necessary. Its output points toward the correct answer, but other paths carry the same information. This is common in large models with redundant circuits (backup pathways).

Question 3: When should you use mean ablation instead of zero ablation?

- Zero ablation: Simpler, but can cause distribution shift (the network never saw zero during training)
- Mean ablation: Preserves the expected magnitude, removes only input-specific information

Use mean ablation when you want cleaner measurements. Use zero ablation for quick exploration or when you need a stronger intervention.
18.6 Achievements Unlocked
After completing Arc III, you have new abilities:
- Train or use an SAE to extract interpretable features from model activations
- Run attribution analysis to find which components contribute to a prediction
- Design clean/corrupted pairs for patching experiments
- Distinguish correlation from causation using interventional techniques
- Choose the right technique for each question: SAEs for discovery, attribution for hypotheses, patching for causation, ablation for necessity
You have a complete interpretability toolkit. You can investigate any behavior in any transformer. Arc IV shows these tools in action and reveals their limits.
18.7 What’s Next
Arc IV brings it all together:
- Chapter 13 (Induction Heads): A complete case study applying all techniques
- Chapter 14 (Open Problems): What we can’t do yet and why
- Chapter 15 (Practice Regime): How to actually do interpretability research
You have the tools. Now you’ll see them applied—and understand their limits.