```mermaid
flowchart TD
START["Experiment not working"] --> CHECK["Check basics first"]
CHECK --> Q1{"Is the model<br/>doing the task<br/>at all?"}
Q1 -->|No| FIX1["Task too hard, or<br/>wrong task definition"]
Q1 -->|Yes| Q2{"Are attributions<br/>making sense?"}
Q2 -->|No| FIX2["Wrong layer, or<br/>distributed computation"]
Q2 -->|Yes| Q3{"Do ablations<br/>have effect?"}
Q3 -->|No| FIX3["Backup circuits, or<br/>wrong ablation type"]
Q3 -->|Yes| Q4{"Is patching<br/>clean?"}
Q4 -->|No| FIX4["Bad clean/corrupted pair,<br/>or distribution shift"]
Q4 -->|Yes| SUCCESS["Circuit analysis<br/>proceeding normally"]
FIX1 --> RETRY["Simplify & Retry"]
FIX2 --> RETRY
FIX3 --> RETRY
FIX4 --> RETRY
```
21 A Practice Regime
From reading to research
- How to set up your environment (TransformerLens, SAELens)
- A week-by-week practice curriculum for building skills
- How to choose good research problems
- Common pitfalls and debugging strategies
Recommended: the entire series, especially Chapter 13 (Induction Heads), for seeing the techniques in action
21.1 From Theory to Practice
You’ve read the preceding chapters. You understand:
- What we’re trying to do (reverse-engineer neural networks)
- Why it’s hard (superposition, scale, composition)
- The concepts (features, circuits, the residual stream)
- The techniques (SAEs, attribution, patching, ablation)
- The case study (induction heads)
- The open problems (scaling, validation, coverage)
Now what?
This final chapter is about doing—turning conceptual understanding into research practice. How do you actually find features, trace circuits, and contribute to the field?
By the end of this chapter, you should have a concrete plan for your first interpretability project—and the debugging skills to carry it through.
21.2 Starting From Zero
If you’ve never run an interpretability experiment, start here.
You will feel confused. You will run experiments that don’t work. You will misinterpret results. This is normal. The researchers who wrote the papers you’ve been reading? They felt confused too. The difference is they persisted through the confusion until patterns emerged.
Interpretability research is less “execute algorithm, get answer” and more “wander through fog, occasionally glimpse something.” The fog is part of the process.
21.2.1 Week 1: Environment Setup
1. Install TransformerLens
TransformerLens is the standard library for transformer interpretability. Install it, then load a model:
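A minimal setup sketch (GPT-2 Small is an assumption here; any model TransformerLens supports works the same way):

```python
# If needed: pip install transformer_lens
from transformer_lens import HookedTransformer

# GPT-2 Small ("gpt2") is small enough to explore on a laptop
model = HookedTransformer.from_pretrained("gpt2")
```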
2. Run your first forward pass with caching
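For example (the prompt is arbitrary; any short text works):

```python
text = "The quick brown fox jumps over the lazy dog"
tokens = model.to_tokens(text)  # [batch, seq_len], with a BOS token prepended

# run_with_cache returns the logits plus a cache of every intermediate activation
logits, cache = model.run_with_cache(tokens)
print(logits.shape)  # [batch, seq_len, d_vocab]
```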
3. Explore the cache
```python
# What's cached?
print(cache.keys())

# Look at the residual stream at layer 5
residual = cache["blocks.5.hook_resid_post"]
print(residual.shape)  # [batch, seq_len, d_model]

# Look at attention patterns in layer 3 (head 3.2 is attn[0, 2])
attn = cache["blocks.3.attn.hook_pattern"]
print(attn.shape)  # [batch, n_heads, seq_len, seq_len]
```
4. Visualize an attention pattern
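CircuitsVis (see Tools below) gives interactive attention views; as a quick sketch, a plain matplotlib heatmap also works, reusing text and cache from the previous steps:

```python
import matplotlib.pyplot as plt

str_tokens = model.to_str_tokens(text)               # token labels for the axes
pattern = cache["blocks.3.attn.hook_pattern"][0, 2]  # head 3.2: [seq_len, seq_len]

plt.imshow(pattern.detach().cpu().numpy(), cmap="viridis")
plt.xticks(range(len(str_tokens)), str_tokens, rotation=90)
plt.yticks(range(len(str_tokens)), str_tokens)
plt.xlabel("Key position")
plt.ylabel("Query position")
plt.title("Attention pattern, head 3.2")
plt.show()
```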
Spend time exploring. Poke at different cache keys. Visualize different heads. Get comfortable with the API.
21.2.2 Week 2: Replicate a Known Result
Before discovering anything new, replicate something known. This verifies your setup works and builds intuition.
Suggested replication: Find an induction head
- Create repeated sequences:
- Find heads where the second occurrence of “CD” attends to the token that followed its first occurrence (the induction pattern):
```python
# An illustrative period-4 repeated sequence; exact token positions depend on
# tokenization, so print model.to_str_tokens(text) and adjust the indices below.
text = "AB CD EF GH AB CD EF GH"

tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

# For each head, measure the induction pattern
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]
        # An induction head at the second "CD" (position 6) attends to the token
        # that followed the first "CD" (position 3)
        induction_score = pattern[6, 3].item()
        if induction_score > 0.3:
            print(f"Potential induction head: {layer}.{head}, score: {induction_score:.2f}")
```
If a few heads show consistently high scores (and low scores on non-repeated text), you’ve likely found induction heads. Compare with known results for your model to verify.
Why replication matters: You’ll make mistakes. Replication catches bugs before they matter—you know what the answer should be.
21.2.3 Week 3: Your First Original Observation
Make one small observation nobody has made before.
Suggestions:
- Find which head activates most strongly on your name
- Trace what happens when you type your favorite programming language
- Find the attention pattern on a specific meme or phrase
The observation doesn’t need to be important. It needs to be yours—something you discovered through exploration.
Interpretability research is exploration. The most important skill is curiosity: “What happens if I…?” Run the experiment. See what happens. Follow surprises.
21.3 Choosing a Research Problem
After exploration, you need focus. How do you choose what to work on?
21.3.1 The Problem Selection Framework
Evaluate problems on three dimensions:
1. Tractability: Can this actually be solved with current methods?
- Good: “Find the circuit for three-digit addition in GPT-2”
- Bad: “Fully explain GPT-4’s reasoning capabilities”
Start with narrow, well-defined behaviors.
2. Importance: Does the solution matter?
- Good: “Understand how models represent deception” (safety-relevant)
- Good: “Find circuits that transfer across model sizes” (methodological)
- Mediocre: “Catalog every attention pattern in layer 3” (low insight)
3. Personal fit: Is this something you can uniquely contribute to?
- Your background (performance engineering? linguistics? mathematics?)
- Your interests (what do you find fascinating?)
- Your resources (compute? collaborators? time?)
21.3.2 Concrete Problem Types
Feature discovery: What features exist?
- Train SAEs on unexplored layers/models
- Find features for specific domains (code, math, safety)
- Study feature geometry and clustering

Circuit analysis: How does capability X work?
- Pick a narrow behavior (parenthesis matching, country-capital, etc.)
- Apply the full methodology: attribution → ablation → patching → diagram
- Consider automated circuit discovery tools (ACDC, CD-T) for efficiency
Beyond manual patching, automated circuit discovery tools such as ACDC and CD-T can identify candidate minimal circuits, and CD-T in particular scales efficiently to larger models. Automated methods still require human interpretation of the discovered components: they find which components matter, not why.
Methodology development: Better tools
- Improved SAE architectures
- Faster patching methods
- Better visualization tools

Scaling studies: What changes with size?
- Compare circuits across model sizes
- Study how features evolve during training
- Test whether small-model findings transfer
21.3.3 Neel Nanda’s 200 Problems
Neel Nanda maintains a list of 200 concrete open problems. Each is:
- Specific enough to start on
- Open enough to be original
- Calibrated for difficulty
This is the best starting point for finding a project.
21.4 The Research Workflow
Once you have a problem, how do you make progress?
21.4.1 Phase 1: Hypothesis Formation
Before running experiments, write down your hypothesis:
“I believe that [capability X] is implemented by [component Y] because [reasoning Z].”
Example: “I believe that parenthesis matching is implemented by attention heads that track nesting depth, because this requires sequential counting across positions.”
Be specific. Vague hypotheses lead to vague experiments.
21.4.2 Phase 2: Baseline Measurements
Establish what you’re measuring:
- Behavioral baseline: What does the model actually do?
  - Accuracy on your task
  - Logit differences between correct/incorrect answers (see the sketch after this list)
  - Edge cases and failure modes
- Attribution baseline: What components seem involved?
  - Run logit attribution on several examples
  - Note which heads/layers have consistent high attribution
  - Form preliminary hypotheses
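A minimal behavioral-baseline sketch, measuring the logit difference on an IOI-style prompt (the prompt, the answer tokens, and GPT-2 Small are illustrative assumptions):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

correct = model.to_single_token(" Mary")
incorrect = model.to_single_token(" John")

logits = model(tokens)  # [batch, seq_len, d_vocab]
final = logits[0, -1]   # logits at the final position
logit_diff = (final[correct] - final[incorrect]).item()
print(f"Logit difference (correct - incorrect): {logit_diff:.2f}")
```

A clearly positive logit difference across many such prompts confirms the model is actually doing the task before you start intervening.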
21.4.3 Phase 3: Intervention Experiments
Test your hypotheses with causal interventions:
Ablation sweep: Ablate each suspected component individually (a sweep sketch follows this list)
Patching validation: For the most important components, run patching to confirm causality
Path patching: Trace connections between important components
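As an example of an ablation sweep, here is a minimal sketch that zero-ablates one attention head at a time and watches the logit of the expected answer (GPT-2 Small, the prompt, and the 1.0 threshold are illustrative assumptions; swap in your own task and metric):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")
answer = model.to_single_token(" Paris")

baseline = model(tokens)[0, -1, answer].item()

def make_zero_head_hook(head):
    def zero_head(z, hook):
        # z: [batch, seq_len, n_heads, d_head]; zero out one head's output
        z[:, :, head, :] = 0.0
        return z
    return zero_head

for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        logits = model.run_with_hooks(
            tokens,
            fwd_hooks=[(f"blocks.{layer}.attn.hook_z", make_zero_head_hook(head))],
        )
        drop = baseline - logits[0, -1, answer].item()
        if drop > 1.0:  # arbitrary threshold for "this head mattered"
            print(f"Head {layer}.{head}: logit drop {drop:.2f}")
```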
21.4.4 Phase 4: Circuit Synthesis
If your experiments succeed:
- Draw the circuit diagram
- Label each component’s function
- Write the algorithm in pseudocode
- List predictions your circuit makes
21.4.5 Phase 5: Verification
Test your circuit’s predictions:
- Does ablating the circuit break the behavior?
- Does the circuit explain variations in behavior?
- Does it generalize to held-out examples?
Research rarely follows this sequence linearly. You’ll form a hypothesis, run experiments, find surprising results, revise the hypothesis, run more experiments. This is normal. The structure is a guide, not a prescription.
21.5 Debugging Interpretability Experiments
Things will go wrong. Here’s how to debug systematically.
21.5.1 Problem: “Nothing has high attribution”
Possible causes:
- Task isn’t actually hard (model gets it “for free”)
- Attribution distributed across many components
- Looking at wrong layer

Fixes:
- Choose a harder prompt where the model barely succeeds
- Sum attribution across component groups (all layer 5 heads)
- Try different layers
21.5.2 Problem: “Ablation has no effect”
Possible causes:
- Backup circuits compensating
- Wrong ablation type (zero vs. mean)
- Component not actually used for this task

Fixes:
- Ablate multiple components simultaneously
- Try different ablation methods (a mean-ablation sketch follows this list)
- Verify the component has high attribution first
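A minimal mean-ablation sketch (the reference prompts, the choice of head 5.5, and GPT-2 Small are illustrative assumptions; in practice, take the mean over a proper reference dataset):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, head = 5, 5  # an illustrative head

# Rough mean of this head's output over a tiny reference batch
# (padding positions are included here; use a real dataset in practice)
ref_tokens = model.to_tokens(["The cat sat on the mat.", "Paris is the capital of France."])
_, ref_cache = model.run_with_cache(ref_tokens)
mean_z = ref_cache[f"blocks.{layer}.attn.hook_z"][:, :, head, :].mean(dim=(0, 1))  # [d_head]

def mean_ablate(z, hook):
    z[:, :, head, :] = mean_z  # replace the head's output with its mean everywhere
    return z

tokens = model.to_tokens("The Eiffel Tower is in the city of")
answer = model.to_single_token(" Paris")
clean = model(tokens)[0, -1, answer].item()
ablated = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{layer}.attn.hook_z", mean_ablate)]
)[0, -1, answer].item()
print(f"' Paris' logit: clean {clean:.2f}, mean-ablated {ablated:.2f}")
```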
21.5.3 Problem: “Patching results are noisy”
Possible causes:
- Bad clean/corrupted pair (too different or too similar)
- Distribution shift from patching
- Small effect size

Fixes:
- Construct cleaner minimal pairs (see the patching sketch after this list)
- Use resample ablation instead of zero
- Average over many examples
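A minimal activation-patching sketch over a clean/corrupted pair that differs in exactly one token, patching the residual stream at the final position layer by layer (the prompts, hook point, and GPT-2 Small are illustrative assumptions):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Minimal pair: identical token-for-token except for the repeated name
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupted = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

_, clean_cache = model.run_with_cache(clean)
print(f"clean: {logit_diff(model(clean)):.2f}, corrupted: {logit_diff(model(corrupted)):.2f}")

END = clean.shape[1] - 1  # final token position (both prompts have the same length)

def patch_final_resid(resid, hook):
    # Copy the clean run's residual stream into the corrupted run at one position
    resid[:, END, :] = clean_cache[hook.name][:, END, :]
    return resid

for layer in range(model.cfg.n_layers):
    patched = model.run_with_hooks(
        corrupted,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", patch_final_resid)],
    )
    print(f"layer {layer}: patched logit diff {logit_diff(patched):.2f}")
```

The layer at which the patched logit difference jumps toward the clean value tells you roughly where the decisive information reaches the final position.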
21.5.4 Problem: “Can’t find the circuit”
Possible causes:
- Behavior is distributed (no clean circuit)
- Looking at wrong level (need features, not heads)
- Behavior is too complex for current methods

Fixes:
- Try SAE feature-level analysis
- Simplify the behavior (narrower task)
- Accept partial understanding (important components without full circuit)
21.5.5 Problem: “Results don’t replicate”
Possible causes:
- Random seed sensitivity
- Prompt sensitivity
- Bugs in code

Fixes:
- Run with multiple seeds
- Test on many prompts
- Review code carefully (or get someone else to)
When confused, simplify. Reduce to the smallest example that shows the phenomenon. If you can’t replicate on a simple example, you probably don’t understand the phenomenon.
21.5.6 When to Change Strategy
Sometimes the problem isn’t a bug—it’s a signal that your approach needs rethinking.
Signs the task is too hard for current methods:
- More than 50% of components have moderate attribution (nothing stands out)
- Ablating any single component changes behavior by <5%
- Results vary wildly across semantically similar prompts
- You’ve spent 2+ weeks without progress
Signs you need a different approach:
| If you’re doing… | Try instead… |
|---|---|
| Single-head ablation | Multi-component ablation |
| Zero ablation | Mean/resample ablation |
| Component-level analysis | SAE feature-level analysis |
| Looking at late layers | Looking at earlier layers |
| One example | Many examples with statistics |
When to accept partial understanding:
Not every behavior has a clean circuit. It’s scientifically valid to report:
- “These 5 heads are the most important, but account for only 40% of the effect”
- “The behavior appears distributed across many components”
- “We found the circuit for the simple case; the complex case remains unclear”
Partial results are still valuable—they constrain future research.
21.5.7 The Debugging Checklist
Before concluding that something doesn’t work, verify:
□ Model actually performs the task (check logits, not just sampling)
□ Using the right token positions (off-by-one errors are common)
□ Cache is from the right input (easy to mix up in notebooks)
□ Patching direction is correct (clean→corrupted vs corrupted→clean)
□ Ablation value is sensible (zero? mean over dataset? mean over sequence?)
□ Batch dimensions are handled correctly
□ Running on GPU if expected (CPU can give different numerical results)
□ Random seeds are set for reproducibility
21.6 Tools and Infrastructure
21.6.1 Essential Libraries
TransformerLens: Core library for hooked model execution
- run_with_cache() for activation access
- run_with_hooks() for interventions
- Built-in support for common models
- As of 2024-2025, supports most major model architectures, including Llama, Gemma, and Mistral

SAELens: Sparse autoencoder training and analysis
- Train SAEs on cached activations
- Supports TopK, Gated, and JumpReLU SAE architectures
- Feature visualization tools
- Integration with Neuronpedia for sharing features

CircuitsVis: Visualization library
- Attention pattern visualization
- Activation plots
- Interactive exploration

Neuronpedia: Interactive feature explorer (neuronpedia.org)
- Browse SAE features across models
- Community-contributed feature labels
- As of late 2024, expanded to include features from Gemma 2, Llama 3, and other modern architectures
When evaluating SAE quality, be aware that proxy metrics (reconstruction loss, sparsity, interpretability scores) don’t reliably predict practical performance. The SAEBench benchmark suite tests SAEs on downstream tasks like feature steering and circuit discovery. Consider using these practical evaluations rather than relying solely on traditional metrics.
21.6.2 Compute Considerations
Minimum: Laptop with 8GB RAM
- GPT-2 Small experiments
- Small-scale SAE training
- Visualization and exploration

Recommended: GPU with 16GB+ VRAM (or cloud equivalent)
- GPT-2 Medium/Large experiments
- Production SAE training
- Systematic sweeps

Ideal: Multiple GPUs or cloud compute
- Large model experiments
- Full circuit discovery
- Parameter sweeps
Cloud options: Google Colab (free tier for exploration), Modal, Lambda Labs
21.6.3 Experiment Tracking
Track your experiments systematically:
- What hypothesis were you testing?
- What exact code did you run?
- What were the results?
- What did you learn?
Tools: W&B, MLflow, or even a simple lab notebook. The format matters less than consistency.
21.7 Publishing and Community
21.7.2 What Makes a Good Post
- Clear claim: What did you discover?
- Reproducible methods: Code or a detailed procedure
- Honest limitations: What doesn’t this show?
- Connection to context: How does this fit with prior work?
21.7.3 The Community
Mechanistic interpretability has an unusually collaborative culture:
- Researchers share work in progress
- Code is typically open source
- Feedback is constructive
Engage genuinely. Ask questions. Share your work even when preliminary.
21.8 The Learning Progression
21.8.1 Stage 1: Replication (1-3 months)
- Replicate 2-3 known results
- Get comfortable with tools
- Build intuition for what results “look like”
21.8.2 Stage 2: Extension (3-6 months)
- Take a known result and extend it
- “What happens if we do the same analysis on a different task?”
- “Does this circuit exist in a different model?”
21.8.3 Stage 3: Original Research (6+ months)
- Find your own circuits
- Develop new methods
- Contribute to open problems
21.8.4 Stage 4: Field Building (ongoing)
- Mentor newcomers
- Build tools
- Set research agendas
Most people take 3-6 months to make their first original contribution. This is normal.
21.9 A Sample Curriculum
Week 1-2: Environment setup, basic exploration
Week 3-4: Replicate induction head finding
Week 5-6: Replicate IOI circuit (simplified version)
Week 7-8: Train your first SAE on a small model
Week 9-10: Choose a problem from Neel Nanda’s list
Week 11-12: Run initial experiments, debug, iterate
Week 13+: Continue research, write up findings
Adjust based on your pace. Some finish faster; some need more time. Both are fine.
21.10 Polya’s Perspective: Learning by Doing
Polya’s central thesis: you learn problem-solving by solving problems—not by reading about solving problems.
This entire series has been reading. Essential reading—you need concepts to think with. But reading is preparation, not the destination.
The destination is practice. Real experiments on real models, finding real results.
“Mathematics is not a spectator sport.” Neither is interpretability. You understand transformer circuits by analyzing transformer circuits—not by reading chapters about analyzing transformer circuits. This chapter ends; your practice begins.
21.11 Conclusion: The Journey Ahead
You now have everything you need to start:
- Conceptual foundation
- Technical toolkit
- Research methodology
- Debugging heuristics
- Community resources
What you don’t have—what you can only get through practice—is intuition. The sense of what’s worth investigating. The pattern recognition that spots anomalies. The judgment that knows when to push deeper and when to pivot.
These come with time and experience. There’s no shortcut.
Mechanistic interpretability is a young field working on hard problems with uncertain methods. You will get stuck. You will find bugs. You will have weeks where nothing works.
This is normal. It’s also what makes the field exciting. The open problems from Chapter 14 are genuinely open. Contributions are genuinely possible. Insights that matter are within reach—not only for senior researchers at big labs, but for newcomers with fresh perspectives.
The transformer’s circuits are waiting. Go find them.
21.12 Resources
21.12.1 Your First Project: Concrete Suggestions
Pick ONE project based on your available time:
Weekend Project (~8 hours)
- Replicate induction head finding in GPT-2 Small using the notebook
- Find a polysemantic neuron and document what concepts it responds to
- Explore 50 SAE features on Neuronpedia and write up the most interesting 5
Week Project (~20-40 hours)
- Train your own SAE on GPT-2 Small layer 6 using SAE Lens
- Replicate the IOI circuit analysis (find the 26 heads)
- Find a new “simple” circuit (e.g., detecting questions, predicting punctuation)
Research Project (~1-3 months)
- Investigate a phenomenon from Chapter 14’s open problems
- Apply interpretability techniques to a new model or task
- Develop improved methods for feature finding or circuit discovery
21.12.2 Getting Started
- TransformerLens Documentation — GitHub
- SAELens Documentation — GitHub
- Neel Nanda’s 200 Problems — neelnanda.io
- ARENA Curriculum — GitHub: Structured exercises for learning interpretability
21.12.3 Community
- EleutherAI Discord — discord.gg/eleutherai: Active #interpretability channel, great for questions
- Alignment Forum — alignmentforum.org: Interpretability research discussions and papers
- MATS Program — matsprogram.org: Mentored research, applications twice yearly
- AI Safety Camp — aisafety.camp: Intensive research programs
21.12.4 Who to Follow
- Neel Nanda (@NeelNanda5): TransformerLens creator, prolific educator
- Chris Olah (@ch402): Anthropic interpretability lead
- Anthropic Interpretability Team (@AnthropicAI): Major research releases
- Joseph Bloom (@jbloomaus): SAE Lens creator
21.12.5 Reference
- TransformerLens Docs — Complete API reference
- Anthropic’s Circuits Papers — transformer-circuits.pub
- Neuronpedia — neuronpedia.org: Interactive feature explorer