21  A Practice Regime

From reading to research

synthesis
practice
Author

Taras Tsugrii

Published

January 5, 2025

Tip: What You’ll Learn
  • How to set up your environment (TransformerLens, SAELens)
  • A week-by-week practice curriculum for building skills
  • How to choose good research problems
  • Common pitfalls and debugging strategies
Warning: Prerequisites

Recommended: The entire series, especially Chapter 13: Induction Heads for seeing techniques in action

21.1 From Theory to Practice

You’ve read fourteen chapters. You understand:

  • What we’re trying to do (reverse-engineer neural networks)
  • Why it’s hard (superposition, scale, composition)
  • The concepts (features, circuits, the residual stream)
  • The techniques (SAEs, attribution, patching, ablation)
  • The case study (induction heads)
  • The open problems (scaling, validation, coverage)

Now what?

This final chapter is about doing—turning conceptual understanding into research practice. How do you actually find features, trace circuits, and contribute to the field?

Note: The Goal

By the end of this chapter, you should have a concrete plan for your first interpretability project—and the debugging skills to carry it through.

21.2 Starting From Zero

If you’ve never run an interpretability experiment, start here.

Tip: The Uncomfortable Truth

You will feel confused. You will run experiments that don’t work. You will misinterpret results. This is normal. The researchers who wrote the papers you’ve been reading? They felt confused too. The difference is they persisted through the confusion until patterns emerged.

Interpretability research is less “execute algorithm, get answer” and more “wander through fog, occasionally glimpse something.” The fog is part of the process.

21.2.1 Week 1: Environment Setup

1. Install TransformerLens

The standard library for transformer interpretability:

pip install transformer-lens
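The examples later in this chapter also use CircuitsVis and SAELens; as of this writing both install from PyPI:

pip install circuitsvis sae-lens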

Load a model:

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2-small")

2. Run your first forward pass with caching

text = "The capital of France is"
tokens = model.to_tokens(text)
logits, cache = model.run_with_cache(tokens)

# What does the model predict?
next_token = logits[0, -1].argmax()
print(model.to_string(next_token))  # Should be " Paris"

3. Explore the cache

# What's cached?
print(cache.keys())

# Look at residual stream at layer 5
residual = cache["blocks.5.hook_resid_post"]
print(residual.shape)  # [batch, seq_len, d_model]

# Look at attention patterns for all heads in layer 3
attn = cache["blocks.3.attn.hook_pattern"]
print(attn.shape)  # [batch, n_heads, seq_len, seq_len]

4. Visualize an attention pattern

import circuitsvis as cv

# Show attention patterns for all heads in layer 3
cv.attention.attention_patterns(
    tokens=model.to_str_tokens(text),
    attention=cache["blocks.3.attn.hook_pattern"][0]
)

Spend time exploring. Poke at different cache keys. Visualize different heads. Get comfortable with the API.

21.2.2 Week 2: Replicate a Known Result

Before discovering anything new, replicate something known. This verifies your setup works and builds intuition.

Suggested replication: Find an induction head

  1. Create repeated sequences:
text = "AB CD EF AB CD"
  2. Find heads where the second “AB” attends to the first “CD” (the token that followed the first “AB”):
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

# Exact token positions depend on the tokenizer (and the prepended BOS token),
# so print them first and read the indices off before hard-coding anything
print(list(enumerate(model.to_str_tokens(text))))

query_pos = 4  # position of the second "AB" -- adjust to match your tokenization
key_pos = 2    # position of the first " CD" -- adjust to match your tokenization

# For each head, measure the induction pattern
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]
        induction_score = pattern[query_pos, key_pos].item()
        if induction_score > 0.3:
            print(f"Potential induction head: {layer}.{head}, score: {induction_score:.2f}")

If you find heads with high scores (in GPT-2 Small the well-known induction heads sit in the middle layers, around layers 5-7), you’ve found induction heads. Compare with known results to verify.
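A single hand-written prompt is noisy and tokenization-dependent. A more robust check (a sketch of the standard repeated-random-token trick; the 0.3 threshold and sequence length are arbitrary choices) runs a batch of random tokens repeated twice and averages attention along the “induction stripe”:

import torch

seq_len = 50
batch = 4
prefix = torch.full((batch, 1), model.tokenizer.bos_token_id, dtype=torch.long)
rand = torch.randint(1000, 10000, (batch, seq_len), dtype=torch.long)
rep_tokens = torch.cat([prefix, rand, rand], dim=1).to(model.cfg.device)

_, rep_cache = model.run_with_cache(rep_tokens)

for layer in range(model.cfg.n_layers):
    pattern = rep_cache[f"blocks.{layer}.attn.hook_pattern"]  # [batch, head, dest, src]
    # In the repeated half, an induction head attends (seq_len - 1) positions back:
    # to the token right after the previous copy of the current token
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    score = stripe.mean(dim=(0, -1))  # average over batch and positions -> [n_heads]
    for head in range(model.cfg.n_heads):
        s = score[head].item()
        if s > 0.3:
            print(f"Induction head candidate: {layer}.{head}, score {s:.2f}")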

Why replication matters: You’ll make mistakes. Replication catches bugs before they matter—you know what the answer should be.

21.2.3 Week 3: Your First Original Observation

Make one small observation nobody has made before.

Suggestions:

  • Find which heads attend most strongly to your name (a sketch follows below)
  • Trace what happens when the model reads your favorite programming language
  • Look at the attention patterns on a specific meme or phrase

The observation doesn’t need to be important. It needs to be yours—something you discovered through exploration.
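For instance, here is a rough sketch of the first suggestion. The prompt, the name, and the crude substring matching are all illustrative choices; adapt them freely:

name = "Ada Lovelace"  # any string you care about
text = f"My favorite researcher is {name} because"
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

# Positions whose token text appears inside the name (crude, but fine for exploration)
str_tokens = model.to_str_tokens(text)
name_positions = [i for i, t in enumerate(str_tokens) if t.strip() and t.strip() in name]

scores = []
for layer in range(model.cfg.n_layers):
    pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0]  # [head, dest, src]
    for head in range(model.cfg.n_heads):
        # Total attention from the final position back to the name's tokens
        attention_to_name = pattern[head, -1, name_positions].sum().item()
        scores.append((attention_to_name, f"{layer}.{head}"))

for attention_to_name, head_name in sorted(scores, reverse=True)[:5]:
    print(f"Head {head_name}: attention to name = {attention_to_name:.2f}")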

Tip: The Exploration Mindset

Interpretability research is exploration. The most important skill is curiosity: “What happens if I…?” Run the experiment. See what happens. Follow surprises.

21.3 Choosing a Research Problem

After exploration, you need focus. How do you choose what to work on?

21.3.1 The Problem Selection Framework

Evaluate problems on three dimensions:

1. Tractability: Can this actually be solved with current methods?

  • Good: “Find the circuit for three-digit addition in GPT-2”
  • Bad: “Fully explain GPT-4’s reasoning capabilities”

Start with narrow, well-defined behaviors.

2. Importance: Does the solution matter?

  • Good: “Understand how models represent deception” (safety-relevant)
  • Good: “Find circuits that transfer across model sizes” (methodological)
  • Mediocre: “Catalog every attention pattern in layer 3” (low insight)

3. Personal fit: Is this something you can uniquely contribute to?

  • Your background (performance engineering? linguistics? mathematics?)
  • Your interests (what do you find fascinating?)
  • Your resources (compute? collaborators? time?)

21.3.2 Concrete Problem Types

Feature discovery: What features exist?

  • Train SAEs on unexplored layers/models
  • Find features for specific domains (code, math, safety)
  • Study feature geometry and clustering

Circuit analysis: How does capability X work?

  • Pick a narrow behavior (parenthesis matching, country-capital, etc.)
  • Apply the full methodology: attribution → ablation → patching → diagram
  • Consider automated circuit discovery tools (ACDC, CD-T) for efficiency

Note: Automated Circuit Discovery (2024-2025)

Beyond manual patching, tools like ACDC and CD-T (contextual decomposition for transformers) can automatically identify minimal circuits. CD-T is particularly efficient for larger models. However, automated methods still require human interpretation of the discovered components—they find which components matter, not why.

Methodology development: Better tools

  • Improved SAE architectures
  • Faster patching methods
  • Better visualization tools

Scaling studies: What changes with size?

  • Compare circuits across model sizes
  • Study how features evolve during training
  • Test whether small-model findings transfer

21.3.3 Neel Nanda’s 200 Problems

Neel Nanda maintains a list of 200 concrete open problems. Each is:

  • Specific enough to start on
  • Open enough to be original
  • Calibrated for difficulty

This is the best starting point for finding a project.

21.4 The Research Workflow

Once you have a problem, how do you make progress?

21.4.1 Phase 1: Hypothesis Formation

Before running experiments, write down your hypothesis:

“I believe that [capability X] is implemented by [component Y] because [reasoning Z].”

Example: “I believe that parenthesis matching is implemented by attention heads that track nesting depth, because this requires sequential counting across positions.”

Be specific. Vague hypotheses lead to vague experiments.

21.4.2 Phase 2: Baseline Measurements

Establish what you’re measuring:

  1. Behavioral baseline: What does the model actually do?
    • Accuracy on your task
    • Logit differences between correct/incorrect answers
    • Edge cases and failure modes
  2. Attribution baseline: What components seem involved?
    • Run logit attribution on several examples (a sketch follows this list)
    • Note which heads/layers have consistent high attribution
    • Form preliminary hypotheses
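A sketch of the per-head attribution step, assuming a single prompt with a known single-token answer (the prompt and answer here are illustrative):

prompt = "The capital of France is"
answer = " Paris"
tokens = model.to_tokens(prompt)
answer_token = model.to_single_token(answer)

_, cache = model.run_with_cache(tokens)

# Stack every head's contribution to the final-position residual stream,
# apply the final LayerNorm scale, and project onto the answer's unembedding direction
per_head, labels = cache.stack_head_results(layer=-1, pos_slice=-1, return_labels=True)
per_head = cache.apply_ln_to_stack(per_head, layer=-1, pos_slice=-1)
answer_dir = model.W_U[:, answer_token]        # [d_model]
attribution = per_head[:, 0] @ answer_dir      # [n_components], batch index 0

for label, attr in sorted(zip(labels, attribution.tolist()), key=lambda x: -x[1])[:10]:
    print(f"{label}: {attr:+.3f}")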

21.4.3 Phase 3: Intervention Experiments

Test your hypotheses with causal interventions:

Ablation sweep: Ablate each suspected component individually

# Pseudocode -- run_with_ablation() stands in for a hook-based ablation
# (a concrete sketch follows)
for component in suspected_components:
    ablated_accuracy = run_with_ablation(component)
    print(f"{component}: {ablated_accuracy}")
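To make the sweep concrete, here is a sketch that zero-ablates one attention head at a time and measures the drop in the answer token’s logit. The prompt, answer, and candidate heads are illustrative stand-ins for your own task:

from functools import partial

from transformer_lens import utils

prompt = "The capital of France is"
answer_token = model.to_single_token(" Paris")
tokens = model.to_tokens(prompt)

clean_logit = model(tokens)[0, -1, answer_token].item()

def zero_head(z, hook, head):
    # z has shape [batch, pos, head, d_head]; zero out one head's output
    z[:, :, head, :] = 0.0
    return z

suspected_components = [(5, 1), (5, 5), (6, 9)]  # (layer, head) pairs to test
for layer, head in suspected_components:
    ablated_logits = model.run_with_hooks(
        tokens,
        fwd_hooks=[(utils.get_act_name("z", layer), partial(zero_head, head=head))],
    )
    ablated_logit = ablated_logits[0, -1, answer_token].item()
    print(f"L{layer}H{head}: logit drop {clean_logit - ablated_logit:.3f}")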

Patching validation: For the most important components, run patching to confirm causality

Path patching: Trace connections between important components
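And a minimal activation-patching sketch: copy the clean run’s residual stream at one layer and position into the corrupted run, and see how much of the clean behavior comes back. The prompt pair, layer, and position are illustrative choices; in practice you sweep over layers and positions:

from functools import partial

from transformer_lens import utils

clean_tokens = model.to_tokens("The capital of France is")
corrupted_tokens = model.to_tokens("The capital of Germany is")
assert clean_tokens.shape == corrupted_tokens.shape  # positions must line up
answer_token = model.to_single_token(" Paris")

clean_logit = model(clean_tokens)[0, -1, answer_token].item()
corrupted_logit = model(corrupted_tokens)[0, -1, answer_token].item()

_, clean_cache = model.run_with_cache(clean_tokens)

def patch_resid(resid, hook, pos):
    # resid: [batch, pos, d_model]; overwrite one position with the clean activation
    resid[:, pos, :] = clean_cache[hook.name][:, pos, :]
    return resid

layer, pos = 6, 4  # the " France"/" Germany" position here -- verify with to_str_tokens
patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[(utils.get_act_name("resid_pre", layer), partial(patch_resid, pos=pos))],
)
patched_logit = patched_logits[0, -1, answer_token].item()
print(f"' Paris' logit: clean {clean_logit:.2f}, corrupted {corrupted_logit:.2f}, patched {patched_logit:.2f}")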

21.4.4 Phase 4: Circuit Synthesis

If your experiments succeed:

  • Draw the circuit diagram
  • Label each component’s function
  • Write the algorithm in pseudocode (see the example below)
  • List predictions your circuit makes
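As an illustration of the pseudocode step, the induction behavior from the case-study chapter might be written up roughly like this (a sketch of the algorithm, not the model’s literal computation):

# Pseudocode for the induction behavior "... A B ... A -> predict B"
def induction_predict(tokens, position):
    current = tokens[position]
    # Previous-token head: each position "knows" which token preceded it
    prev_token_info = {i: tokens[i - 1] for i in range(1, position + 1)}
    # Induction head: attend to positions whose previous token matches the current token
    matches = [i for i, prev in prev_token_info.items() if prev == current]
    if matches:
        # Copy the token found there, boosting its logit
        return tokens[matches[-1]]
    return None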

21.4.5 Phase 5: Verification

Test your circuit’s predictions:

  • Does ablating the circuit break the behavior?
  • Does the circuit explain variations in behavior?
  • Does it generalize to held-out examples?

Important: The Iteration Loop

Research rarely follows this sequence linearly. You’ll form a hypothesis, run experiments, find surprising results, revise the hypothesis, run more experiments. This is normal. The structure is a guide, not a prescription.

21.5 Debugging Interpretability Experiments

Things will go wrong. Here’s how to debug systematically.

flowchart TD
    START["Experiment not working"] --> CHECK["Check basics first"]
    CHECK --> Q1{"Is the model<br/>doing the task<br/>at all?"}
    Q1 -->|No| FIX1["Task too hard, or<br/>wrong task definition"]
    Q1 -->|Yes| Q2{"Are attributions<br/>making sense?"}
    Q2 -->|No| FIX2["Wrong layer, or<br/>distributed computation"]
    Q2 -->|Yes| Q3{"Do ablations<br/>have effect?"}
    Q3 -->|No| FIX3["Backup circuits, or<br/>wrong ablation type"]
    Q3 -->|Yes| Q4{"Is patching<br/>clean?"}
    Q4 -->|No| FIX4["Bad clean/corrupted pair,<br/>or distribution shift"]
    Q4 -->|Yes| SUCCESS["Circuit analysis<br/>proceeding normally"]

    FIX1 --> RETRY["Simplify & Retry"]
    FIX2 --> RETRY
    FIX3 --> RETRY
    FIX4 --> RETRY

Systematic debugging workflow: follow this decision tree when experiments don’t work.

21.5.1 Problem: “Nothing has high attribution”

Possible causes:

  • Task isn’t actually hard (model gets it “for free”)
  • Attribution distributed across many components
  • Looking at wrong layer

Fixes:

  • Choose a harder prompt where the model barely succeeds
  • Sum attribution across component groups (all layer 5 heads)
  • Try different layers

21.5.2 Problem: “Ablation has no effect”

Possible causes:

  • Backup circuits compensating
  • Wrong ablation type (zero vs. mean)
  • Component not actually used for this task

Fixes:

  • Ablate multiple components simultaneously
  • Try different ablation methods
  • Verify the component has high attribution first

21.5.3 Problem: “Patching results are noisy”

Possible causes:

  • Bad clean/corrupted pair (too different or too similar)
  • Distribution shift from patching
  • Small effect size

Fixes:

  • Construct cleaner minimal pairs
  • Use resample ablation instead of zero
  • Average over many examples

21.5.4 Problem: “Can’t find the circuit”

Possible causes:

  • Behavior is distributed (no clean circuit)
  • Looking at wrong level (need features, not heads)
  • Behavior is too complex for current methods

Fixes:

  • Try SAE feature-level analysis
  • Simplify the behavior (narrower task)
  • Accept partial understanding (important components without full circuit)

21.5.5 Problem: “Results don’t replicate”

Possible causes:

  • Random seed sensitivity
  • Prompt sensitivity
  • Bugs in code

Fixes:

  • Run with multiple seeds
  • Test on many prompts
  • Review code carefully (or get someone else to)

Tip: The Cardinal Rule

When confused, simplify. Reduce to the smallest example that shows the phenomenon. If you can’t replicate on a simple example, you probably don’t understand the phenomenon.

21.5.6 When to Change Strategy

Sometimes the problem isn’t a bug—it’s a signal that your approach needs rethinking.

Signs the task is too hard for current methods:

  • More than 50% of components have moderate attribution (nothing stands out)
  • Ablating any single component changes behavior by <5%
  • Results vary wildly across semantically similar prompts
  • You’ve spent 2+ weeks without progress

Signs you need a different approach:

If you’re doing…               Try instead…
Single-head ablation           Multi-component ablation
Zero ablation                  Mean/resample ablation
Component-level analysis       SAE feature-level analysis
Looking at late layers         Looking at earlier layers
One example                    Many examples with statistics

When to accept partial understanding:

Not every behavior has a clean circuit. It’s scientifically valid to report:

  • “These 5 heads are the most important, but account for only 40% of the effect”
  • “The behavior appears distributed across many components”
  • “We found the circuit for the simple case; the complex case remains unclear”

Partial results are still valuable—they constrain future research.

21.5.7 The Debugging Checklist

Before concluding that something doesn’t work, verify:

□ Model actually performs the task (check logits, not just sampling)
□ Using the right token positions (off-by-one errors are common)
□ Cache is from the right input (easy to mix up in notebooks)
□ Patching direction is correct (clean→corrupted vs corrupted→clean)
□ Ablation value is sensible (zero? mean over dataset? mean over sequence?)
□ Batch dimensions are handled correctly
□ Running on GPU if expected (CPU can give different numerical results)
□ Random seeds are set for reproducibility
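For the last item, a small helper that pins the usual RNG sources (a minimal sketch; extend it if you rely on other libraries):

import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Seed the common RNG sources for reproducible experiments."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

set_seed(42)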

21.6 Tools and Infrastructure

21.6.1 Essential Libraries

TransformerLens: Core library for hooked model execution

  • run_with_cache() for activation access
  • run_with_hooks() for interventions
  • Built-in support for common models
  • As of 2024-2025, supports most major model architectures including Llama, Gemma, and Mistral

SAELens: Sparse autoencoder training and analysis

  • Train SAEs on cached activations
  • Supports TopK, Gated, and JumpReLU SAE architectures
  • Feature visualization tools
  • Integration with Neuronpedia for sharing features

CircuitsVis: Visualization library

  • Attention pattern visualization
  • Activation plots
  • Interactive exploration

Neuronpedia: Interactive feature explorer (neuronpedia.org)

  • Browse SAE features across models
  • Community-contributed feature labels
  • As of late 2024, expanded to include features from Gemma 2, Llama 3, and other modern architectures
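To make the SAELens and Neuronpedia entries concrete, here is a sketch of loading a pretrained SAE and checking which features fire on a prompt. The release and SAE ID strings are examples from the SAELens registry; exact names and the from_pretrained return signature can change between versions, so check the current docs:

from sae_lens import SAE

sae, cfg_dict, _ = SAE.from_pretrained(
    release="gpt2-small-res-jb",        # GPT-2 Small residual-stream SAEs
    sae_id="blocks.6.hook_resid_pre",   # layer 6 residual stream
    device=str(model.cfg.device),
)

text = "The capital of France is"
_, cache = model.run_with_cache(model.to_tokens(text))

# Encode the cached residual stream into SAE feature activations
feature_acts = sae.encode(cache["blocks.6.hook_resid_pre"])
print(feature_acts.shape)  # [batch, seq_len, n_features]

# Strongest features at the final position -- look these up on Neuronpedia
top = feature_acts[0, -1].topk(5)
for value, idx in zip(top.values.tolist(), top.indices.tolist()):
    print(f"feature {idx}: activation {value:.2f}")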

Tip: Evaluation with SAEBench (2025)

When evaluating SAE quality, be aware that proxy metrics (reconstruction loss, sparsity, interpretability scores) don’t reliably predict practical performance. The SAEBench benchmark suite tests SAEs on downstream tasks like feature steering and circuit discovery. Consider using these practical evaluations rather than relying solely on traditional metrics.

21.6.2 Compute Considerations

Minimum: Laptop with 8GB RAM

  • GPT-2 Small experiments
  • Small-scale SAE training
  • Visualization and exploration

Recommended: GPU with 16GB+ VRAM (or cloud equivalent)

  • GPT-2 Medium/Large experiments
  • Production SAE training
  • Systematic sweeps

Ideal: Multiple GPUs or cloud compute

  • Large model experiments
  • Full circuit discovery
  • Parameter sweeps

Cloud options: Google Colab (free tier for exploration), Modal, Lambda Labs

21.6.3 Experiment Tracking

Track your experiments systematically:

  • What hypothesis were you testing?
  • What exact code did you run?
  • What were the results?
  • What did you learn?

Tools: W&B, MLflow, or even a simple lab notebook. The format matters less than consistency.
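A minimal lab-notebook variant that needs nothing beyond the standard library (the field names and values below are illustrative):

import json
import time

def log_experiment(path, hypothesis, config, results, notes=""):
    """Append one experiment record to a JSON-lines lab notebook."""
    record = {
        "timestamp": time.strftime("%Y-%m-%d %H:%M:%S"),
        "hypothesis": hypothesis,
        "config": config,
        "results": results,
        "notes": notes,
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_experiment(
    "lab_notebook.jsonl",
    hypothesis="Head 5.5 is necessary for induction on repeated random tokens",
    config={"model": "gpt2-small", "ablation": "zero", "n_prompts": 32},
    results={"induction_score_drop": 0.42},  # illustrative number
    notes="Effect smaller than expected; check for backup heads in layer 6.",
)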

21.7 Publishing and Community

21.7.1 Where to Share

LessWrong/AF: The primary venue for interpretability results

  • Conceptual pieces and research writeups
  • Active community feedback
  • Low barrier to entry

arXiv/Conferences: For more polished work

  • ICML, NeurIPS, ICLR for major results
  • Mechanistic interpretability workshops

Twitter/X: For quick results and discussion

  • Build visibility
  • Get fast feedback
  • Connect with researchers

21.7.2 What Makes a Good Post

  • Clear claim: What did you discover?
  • Reproducible methods: Code or detailed procedure
  • Honest limitations: What doesn’t this show?
  • Connection to context: How does this fit with prior work?

21.7.3 The Community

Mechanistic interpretability has an unusually collaborative culture:

  • Researchers share work in progress
  • Code is typically open source
  • Feedback is constructive

Engage genuinely. Ask questions. Share your work even when preliminary.

21.8 The Learning Progression

21.8.1 Stage 1: Replication (1-3 months)

  • Replicate 2-3 known results
  • Get comfortable with tools
  • Build intuition for what results “look like”

21.8.2 Stage 2: Extension (3-6 months)

  • Take a known result and extend it
  • “What happens if we do the same analysis on a different task?”
  • “Does this circuit exist in a different model?”

21.8.3 Stage 3: Original Research (6+ months)

  • Find your own circuits
  • Develop new methods
  • Contribute to open problems

21.8.4 Stage 4: Field Building (ongoing)

  • Mentor newcomers
  • Build tools
  • Set research agendas

Most people take 3-6 months to make their first original contribution. This is normal.

21.9 A Sample Curriculum

Week 1-2: Environment setup, basic exploration

Week 3-4: Replicate induction head finding

Week 5-6: Replicate IOI circuit (simplified version)

Week 7-8: Train your first SAE on a small model

Week 9-10: Choose a problem from Neel Nanda’s list

Week 11-12: Run initial experiments, debug, iterate

Week 13+: Continue research, write up findings

Adjust based on your pace. Some finish faster; some need more time. Both are fine.

21.10 Polya’s Perspective: Learning by Doing

Polya’s central thesis: you learn problem-solving by solving problems—not by reading about solving problems.

This entire series has been reading. Essential reading—you need concepts to think with. But reading is preparation, not the destination.

The destination is practice. Real experiments on real models, finding real results.

Tip: Polya’s Insight

“Mathematics is not a spectator sport.” Neither is interpretability. You understand transformer circuits by analyzing transformer circuits—not by reading chapters about analyzing transformer circuits. This chapter ends; your practice begins.

21.11 Conclusion: The Journey Ahead

You now have everything you need to start:

  • Conceptual foundation
  • Technical toolkit
  • Research methodology
  • Debugging heuristics
  • Community resources

What you don’t have—what you can only get through practice—is intuition. The sense of what’s worth investigating. The pattern recognition that spots anomalies. The judgment that knows when to push deeper and when to pivot.

These come with time and experience. There’s no shortcut.

Mechanistic interpretability is a young field working on hard problems with uncertain methods. You will get stuck. You will find bugs. You will have weeks where nothing works.

This is normal. It’s also what makes the field exciting. The open problems from Chapter 14 are genuinely open. Contributions are genuinely possible. Insights that matter are within reach—not only for senior researchers at big labs, but for newcomers with fresh perspectives.

The transformer’s circuits are waiting. Go find them.


21.12 Resources

21.12.1 Your First Project: Concrete Suggestions

Important: Choose Your Adventure

Pick ONE project based on your available time:

Weekend Project (~8 hours)

  • Replicate induction head finding in GPT-2 Small using the notebook
  • Find a polysemantic neuron and document what concepts it responds to
  • Explore 50 SAE features on Neuronpedia and write up the most interesting 5

Week Project (~20-40 hours)

  • Train your own SAE on GPT-2 Small layer 6 using SAELens
  • Replicate the IOI circuit analysis (find the 26 heads)
  • Find a new “simple” circuit (e.g., detecting questions, predicting punctuation)

Research Project (~1-3 months)

  • Investigate a phenomenon from Chapter 14’s open problems
  • Apply interpretability techniques to a new model or task
  • Develop improved methods for feature finding or circuit discovery

21.12.2 Getting Started

  • TransformerLens Documentation (GitHub)
  • SAELens Documentation (GitHub)
  • Neel Nanda’s 200 Problems (neelnanda.io)
  • ARENA Curriculum (GitHub): Structured exercises for learning interpretability

21.12.3 Community

21.12.4 Who to Follow

  • Neel Nanda (@NeelNanda5): TransformerLens creator, prolific educator
  • Chris Olah (@ch402): Anthropic interpretability lead
  • Anthropic Interpretability Team (@AnthropicAI): Major research releases
  • Joseph Bloom (@jbloomaus): SAELens creator
