```mermaid
flowchart TD
START["Experiment not working"] --> CHECK["Check basics first"]
CHECK --> Q1{"Is the model<br/>doing the task<br/>at all?"}
Q1 -->|No| FIX1["Task too hard, or<br/>wrong task definition"]
Q1 -->|Yes| Q2{"Are attributions<br/>making sense?"}
Q2 -->|No| FIX2["Wrong layer, or<br/>distributed computation"]
Q2 -->|Yes| Q3{"Do ablations<br/>have effect?"}
Q3 -->|No| FIX3["Backup circuits, or<br/>wrong ablation type"]
Q3 -->|Yes| Q4{"Is patching<br/>clean?"}
Q4 -->|No| FIX4["Bad clean/corrupted pair,<br/>or distribution shift"]
Q4 -->|Yes| SUCCESS["Circuit analysis<br/>proceeding normally"]
FIX1 --> RETRY["Simplify & Retry"]
FIX2 --> RETRY
FIX3 --> RETRY
FIX4 --> RETRY
```
21 A Practice Regime
From reading to research
- How to set up your environment (TransformerLens, SAELens)
- A week-by-week practice curriculum for building skills
- How to choose good research problems
- Common pitfalls and debugging strategies
Recommended: the entire series, especially Chapter 13 (Induction Heads), for seeing the techniques in action
21.1 From Theory to Practice
You’ve read the preceding chapters. You understand:
- What we’re trying to do (reverse-engineer neural networks)
- Why it’s hard (superposition, scale, composition)
- The concepts (features, circuits, the residual stream)
- The techniques (SAEs, attribution, patching, ablation)
- The case study (induction heads)
- The open problems (scaling, validation, coverage)
Now what?
This final chapter is about doing—turning conceptual understanding into research practice. How do you actually find features, trace circuits, and contribute to the field?
By the end of this chapter, you should have a concrete plan for your first interpretability project—and the debugging skills to carry it through.
21.2 Starting From Zero
If you’ve never run an interpretability experiment, start here.
You will feel confused. You will run experiments that don’t work. You will misinterpret results. This is normal. The researchers who wrote the papers you’ve been reading? They felt confused too. The difference is they persisted through the confusion until patterns emerged.
Interpretability research is less “execute algorithm, get answer” and more “wander through fog, occasionally glimpse something.” The fog is part of the process.
21.2.1 Week 1: Environment Setup
1. Install TransformerLens
TransformerLens is the standard library for transformer interpretability. Install it, then load a model:
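A minimal setup sketch (GPT-2 Small is an assumption here; any model TransformerLens supports works the same way):

```python
# If needed: pip install transformer_lens
from transformer_lens import HookedTransformer

# GPT-2 Small ("gpt2") is small enough to explore on a laptop
model = HookedTransformer.from_pretrained("gpt2")
```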
2. Run your first forward pass with caching
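For example (the prompt is arbitrary; any short text works):

```python
text = "The quick brown fox jumps over the lazy dog"
tokens = model.to_tokens(text)  # [batch, seq_len], with a BOS token prepended

# run_with_cache returns the logits plus a cache of every intermediate activation
logits, cache = model.run_with_cache(tokens)
print(logits.shape)  # [batch, seq_len, d_vocab]
```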
3. Explore the cache
```python
# What's cached?
print(cache.keys())

# Look at the residual stream at layer 5
residual = cache["blocks.5.hook_resid_post"]
print(residual.shape)  # [batch, seq_len, d_model]

# Look at attention patterns in layer 3 (head 3.2 is attn[0, 2])
attn = cache["blocks.3.attn.hook_pattern"]
print(attn.shape)  # [batch, n_heads, seq_len, seq_len]
```
4. Visualize an attention pattern
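CircuitsVis (see Tools below) gives interactive attention views; as a quick sketch, a plain matplotlib heatmap also works, reusing text and cache from the previous steps:

```python
import matplotlib.pyplot as plt

str_tokens = model.to_str_tokens(text)               # token labels for the axes
pattern = cache["blocks.3.attn.hook_pattern"][0, 2]  # head 3.2: [seq_len, seq_len]

plt.imshow(pattern.detach().cpu().numpy(), cmap="viridis")
plt.xticks(range(len(str_tokens)), str_tokens, rotation=90)
plt.yticks(range(len(str_tokens)), str_tokens)
plt.xlabel("Key position")
plt.ylabel("Query position")
plt.title("Attention pattern, head 3.2")
plt.show()
```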
Spend time exploring. Poke at different cache keys. Visualize different heads. Get comfortable with the API.
21.2.2 Week 2: Replicate a Known Result
Before discovering anything new, replicate something known. This verifies your setup works and builds intuition.
Suggested replication: Find an induction head
- Create repeated sequences:
- Find heads where the second occurrence of “CD” attends to the token that followed its first occurrence (the induction pattern):
```python
# An illustrative period-4 repeated sequence; exact token positions depend on
# tokenization, so print model.to_str_tokens(text) and adjust the indices below.
text = "AB CD EF GH AB CD EF GH"

tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

# For each head, measure the induction pattern
for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        pattern = cache[f"blocks.{layer}.attn.hook_pattern"][0, head]
        # An induction head at the second "CD" (position 6) attends to the token
        # that followed the first "CD" (position 3)
        induction_score = pattern[6, 3].item()
        if induction_score > 0.3:
            print(f"Potential induction head: {layer}.{head}, score: {induction_score:.2f}")
```
If a few heads show consistently high scores (and low scores on non-repeated text), you’ve likely found induction heads. Compare with known results for your model to verify.
Why replication matters: You’ll make mistakes. Replication catches bugs before they matter—you know what the answer should be.
21.2.3 Week 3: Your First Original Observation
Make one small observation nobody has made before.
Suggestions:
- Find which head activates most strongly on your name
- Trace what happens when you type your favorite programming language
- Find the attention pattern on a specific meme or phrase
The observation doesn’t need to be important. It needs to be yours—something you discovered through exploration.
Interpretability research is exploration. The most important skill is curiosity: “What happens if I…?” Run the experiment. See what happens. Follow surprises.
21.3 Choosing a Research Problem
After exploration, you need focus. How do you choose what to work on?
21.3.1 The Problem Selection Framework
Evaluate problems on three dimensions:
1. Tractability: Can this actually be solved with current methods?
- Good: “Find the circuit for three-digit addition in GPT-2”
- Bad: “Fully explain GPT-4’s reasoning capabilities”
Start with narrow, well-defined behaviors.
2. Importance: Does the solution matter?
- Good: “Understand how models represent deception” (safety-relevant)
- Good: “Find circuits that transfer across model sizes” (methodological)
- Mediocre: “Catalog every attention pattern in layer 3” (low insight)
3. Personal fit: Is this something you can uniquely contribute to?
- Your background (performance engineering? linguistics? mathematics?)
- Your interests (what do you find fascinating?)
- Your resources (compute? collaborators? time?)
21.3.2 Concrete Problem Types
Feature discovery: What features exist?
- Train SAEs on unexplored layers/models
- Find features for specific domains (code, math, safety)
- Study feature geometry and clustering

Circuit analysis: How does capability X work?
- Pick a narrow behavior (parenthesis matching, country-capital, etc.)
- Apply the full methodology: attribution → ablation → patching → diagram
- Consider automated circuit discovery tools (ACDC, CD-T) for efficiency
Beyond manual patching, automated circuit discovery tools such as ACDC and CD-T can identify candidate minimal circuits, and CD-T in particular scales efficiently to larger models. Automated methods still require human interpretation of the discovered components: they find which components matter, not why.
Methodology development: Better tools
- Improved SAE architectures
- Faster patching methods
- Better visualization tools

Scaling studies: What changes with size?
- Compare circuits across model sizes
- Study how features evolve during training
- Test whether small-model findings transfer
21.3.3 Neel Nanda’s 200 Problems
Neel Nanda maintains a list of 200 concrete open problems. Each is:
- Specific enough to start on
- Open enough to be original
- Calibrated for difficulty
This is the best starting point for finding a project.
21.4 The Research Workflow
Once you have a problem, how do you make progress?
21.4.1 Phase 1: Hypothesis Formation
Before running experiments, write down your hypothesis:
“I believe that [capability X] is implemented by [component Y] because [reasoning Z].”
Example: “I believe that parenthesis matching is implemented by attention heads that track nesting depth, because this requires sequential counting across positions.”
Be specific. Vague hypotheses lead to vague experiments.
21.4.2 Phase 2: Baseline Measurements
Establish what you’re measuring:
- Behavioral baseline: What does the model actually do?
  - Accuracy on your task
  - Logit differences between correct/incorrect answers (see the sketch after this list)
  - Edge cases and failure modes
- Attribution baseline: What components seem involved?
  - Run logit attribution on several examples
  - Note which heads/layers have consistent high attribution
  - Form preliminary hypotheses
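A minimal behavioral-baseline sketch, measuring the logit difference on an IOI-style prompt (the prompt, the answer tokens, and GPT-2 Small are illustrative assumptions):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "When John and Mary went to the store, John gave a drink to"
tokens = model.to_tokens(prompt)

correct = model.to_single_token(" Mary")
incorrect = model.to_single_token(" John")

logits = model(tokens)  # [batch, seq_len, d_vocab]
final = logits[0, -1]   # logits at the final position
logit_diff = (final[correct] - final[incorrect]).item()
print(f"Logit difference (correct - incorrect): {logit_diff:.2f}")
```

A clearly positive logit difference across many such prompts confirms the model is actually doing the task before you start intervening.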
21.4.3 Phase 3: Intervention Experiments
Test your hypotheses with causal interventions:
Ablation sweep: Ablate each suspected component individually (a sweep sketch follows this list)
Patching validation: For the most important components, run patching to confirm causality
Path patching: Trace connections between important components
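As an example of an ablation sweep, here is a minimal sketch that zero-ablates one attention head at a time and watches the logit of the expected answer (GPT-2 Small, the prompt, and the 1.0 threshold are illustrative assumptions; swap in your own task and metric):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is in the city of")
answer = model.to_single_token(" Paris")

baseline = model(tokens)[0, -1, answer].item()

def make_zero_head_hook(head):
    def zero_head(z, hook):
        # z: [batch, seq_len, n_heads, d_head]; zero out one head's output
        z[:, :, head, :] = 0.0
        return z
    return zero_head

for layer in range(model.cfg.n_layers):
    for head in range(model.cfg.n_heads):
        logits = model.run_with_hooks(
            tokens,
            fwd_hooks=[(f"blocks.{layer}.attn.hook_z", make_zero_head_hook(head))],
        )
        drop = baseline - logits[0, -1, answer].item()
        if drop > 1.0:  # arbitrary threshold for "this head mattered"
            print(f"Head {layer}.{head}: logit drop {drop:.2f}")
```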
21.4.4 Phase 4: Circuit Synthesis
If your experiments succeed:
- Draw the circuit diagram
- Label each component’s function
- Write the algorithm in pseudocode
- List predictions your circuit makes
21.4.5 Phase 5: Verification
Test your circuit’s predictions:
- Does ablating the circuit break the behavior?
- Does the circuit explain variations in behavior?
- Does it generalize to held-out examples?
Research rarely follows this sequence linearly. You’ll form a hypothesis, run experiments, find surprising results, revise the hypothesis, run more experiments. This is normal. The structure is a guide, not a prescription.
21.5 Debugging Interpretability Experiments
Things will go wrong. Here’s how to debug systematically.
21.5.1 Problem: “Nothing has high attribution”
Possible causes:
- Task isn’t actually hard (model gets it “for free”)
- Attribution distributed across many components
- Looking at wrong layer

Fixes:
- Choose a harder prompt where the model barely succeeds
- Sum attribution across component groups (all layer 5 heads)
- Try different layers
21.5.2 Problem: “Ablation has no effect”
Possible causes:
- Backup circuits compensating
- Wrong ablation type (zero vs. mean)
- Component not actually used for this task

Fixes:
- Ablate multiple components simultaneously
- Try different ablation methods (a mean-ablation sketch follows this list)
- Verify the component has high attribution first
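A minimal mean-ablation sketch (the reference prompts, the choice of head 5.5, and GPT-2 Small are illustrative assumptions; in practice, take the mean over a proper reference dataset):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
layer, head = 5, 5  # an illustrative head

# Rough mean of this head's output over a tiny reference batch
# (padding positions are included here; use a real dataset in practice)
ref_tokens = model.to_tokens(["The cat sat on the mat.", "Paris is the capital of France."])
_, ref_cache = model.run_with_cache(ref_tokens)
mean_z = ref_cache[f"blocks.{layer}.attn.hook_z"][:, :, head, :].mean(dim=(0, 1))  # [d_head]

def mean_ablate(z, hook):
    z[:, :, head, :] = mean_z  # replace the head's output with its mean everywhere
    return z

tokens = model.to_tokens("The Eiffel Tower is in the city of")
answer = model.to_single_token(" Paris")
clean = model(tokens)[0, -1, answer].item()
ablated = model.run_with_hooks(
    tokens, fwd_hooks=[(f"blocks.{layer}.attn.hook_z", mean_ablate)]
)[0, -1, answer].item()
print(f"' Paris' logit: clean {clean:.2f}, mean-ablated {ablated:.2f}")
```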
21.5.3 Problem: “Patching results are noisy”
Possible causes:
- Bad clean/corrupted pair (too different or too similar)
- Distribution shift from patching
- Small effect size

Fixes:
- Construct cleaner minimal pairs (see the patching sketch after this list)
- Use resample ablation instead of zero
- Average over many examples
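A minimal activation-patching sketch over a clean/corrupted pair that differs in exactly one token, patching the residual stream at the final position layer by layer (the prompts, hook point, and GPT-2 Small are illustrative assumptions):

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# Minimal pair: identical token-for-token except for the repeated name
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupted = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")
mary, john = model.to_single_token(" Mary"), model.to_single_token(" John")

def logit_diff(logits):
    return (logits[0, -1, mary] - logits[0, -1, john]).item()

_, clean_cache = model.run_with_cache(clean)
print(f"clean: {logit_diff(model(clean)):.2f}, corrupted: {logit_diff(model(corrupted)):.2f}")

END = clean.shape[1] - 1  # final token position (both prompts have the same length)

def patch_final_resid(resid, hook):
    # Copy the clean run's residual stream into the corrupted run at one position
    resid[:, END, :] = clean_cache[hook.name][:, END, :]
    return resid

for layer in range(model.cfg.n_layers):
    patched = model.run_with_hooks(
        corrupted,
        fwd_hooks=[(f"blocks.{layer}.hook_resid_pre", patch_final_resid)],
    )
    print(f"layer {layer}: patched logit diff {logit_diff(patched):.2f}")
```

The layer at which the patched logit difference jumps toward the clean value tells you roughly where the decisive information reaches the final position.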
21.5.4 Problem: “Can’t find the circuit”
Possible causes:
- Behavior is distributed (no clean circuit)
- Looking at wrong level (need features, not heads)
- Behavior is too complex for current methods

Fixes:
- Try SAE feature-level analysis
- Simplify the behavior (narrower task)
- Accept partial understanding (important components without full circuit)
21.5.5 Problem: “Results don’t replicate”
Possible causes:
- Random seed sensitivity
- Prompt sensitivity
- Bugs in code

Fixes:
- Run with multiple seeds
- Test on many prompts
- Review code carefully (or get someone else to)
When confused, simplify. Reduce to the smallest example that shows the phenomenon. If you can’t replicate on a simple example, you probably don’t understand the phenomenon.
21.5.6 When to Change Strategy
Sometimes the problem isn’t a bug—it’s a signal that your approach needs rethinking.
Signs the task is too hard for current methods:
- More than 50% of components have moderate attribution (nothing stands out)
- Ablating any single component changes behavior by <5%
- Results vary wildly across semantically similar prompts
- You’ve spent 2+ weeks without progress
Signs you need a different approach:
| If you’re doing… | Try instead… |
|---|---|
| Single-head ablation | Multi-component ablation |
| Zero ablation | Mean/resample ablation |
| Component-level analysis | SAE feature-level analysis |
| Looking at late layers | Looking at earlier layers |
| One example | Many examples with statistics |
When to accept partial understanding:
Not every behavior has a clean circuit. It’s scientifically valid to report:
- “These 5 heads are the most important, but account for only 40% of the effect”
- “The behavior appears distributed across many components”
- “We found the circuit for the simple case; the complex case remains unclear”
Partial results are still valuable—they constrain future research.
21.5.7 The Debugging Checklist
Before concluding that something doesn’t work, verify:
□ Model actually performs the task (check logits, not just sampling)
□ Using the right token positions (off-by-one errors are common)
□ Cache is from the right input (easy to mix up in notebooks)
□ Patching direction is correct (clean→corrupted vs corrupted→clean)
□ Ablation value is sensible (zero? mean over dataset? mean over sequence?)
□ Batch dimensions are handled correctly
□ Running on GPU if expected (CPU can give different numerical results)
□ Random seeds are set for reproducibility
21.6 Tools and Infrastructure
21.6.1 Essential Libraries
TransformerLens: Core library for hooked model execution
- run_with_cache() for activation access
- run_with_hooks() for interventions
- Built-in support for common models
- As of 2024-2025, supports most major model architectures, including Llama, Gemma, and Mistral

SAELens: Sparse autoencoder training and analysis
- Train SAEs on cached activations
- Supports TopK, Gated, and JumpReLU SAE architectures
- Feature visualization tools
- Integration with Neuronpedia for sharing features

CircuitsVis: Visualization library
- Attention pattern visualization
- Activation plots
- Interactive exploration

Neuronpedia: Interactive feature explorer (neuronpedia.org)
- Browse SAE features across models
- Community-contributed feature labels
- As of late 2024, expanded to include features from Gemma 2, Llama 3, and other modern architectures
When evaluating SAE quality, be aware that proxy metrics (reconstruction loss, sparsity, interpretability scores) don’t reliably predict practical performance. The SAEBench benchmark suite tests SAEs on downstream tasks like feature steering and circuit discovery. Consider using these practical evaluations rather than relying solely on traditional metrics.
21.6.2 Compute Considerations
Minimum: Laptop with 8GB RAM
- GPT-2 Small experiments
- Small-scale SAE training
- Visualization and exploration

Recommended: GPU with 16GB+ VRAM (or cloud equivalent)
- GPT-2 Medium/Large experiments
- Production SAE training
- Systematic sweeps

Ideal: Multiple GPUs or cloud compute
- Large model experiments
- Full circuit discovery
- Parameter sweeps
Cloud options: Google Colab (free tier for exploration), Modal, Lambda Labs
21.6.3 Experiment Tracking
Track your experiments systematically:
- What hypothesis were you testing?
- What exact code did you run?
- What were the results?
- What did you learn?
Tools: W&B, MLflow, or even a simple lab notebook. The format matters less than consistency.
21.7 Publishing and Community
21.7.2 What Makes a Good Post
- Clear claim: What did you discover?
- Reproducible methods: Code or a detailed procedure
- Honest limitations: What doesn’t this show?
- Connection to context: How does this fit with prior work?
21.7.3 The Community
Mechanistic interpretability has an unusually collaborative culture:
- Researchers share work in progress
- Code is typically open source
- Feedback is constructive
Engage genuinely. Ask questions. Share your work even when preliminary.
21.8 The Learning Progression
21.8.1 Stage 1: Replication (1-3 months)
- Replicate 2-3 known results
- Get comfortable with tools
- Build intuition for what results “look like”
21.8.2 Stage 2: Extension (3-6 months)
- Take a known result and extend it
- “What happens if we do the same analysis on a different task?”
- “Does this circuit exist in a different model?”
21.8.3 Stage 3: Original Research (6+ months)
- Find your own circuits
- Develop new methods
- Contribute to open problems
21.8.4 Stage 4: Field Building (ongoing)
- Mentor newcomers
- Build tools
- Set research agendas
Most people take 3-6 months to make their first original contribution. This is normal.
21.9 A Sample Curriculum
Week 1-2: Environment setup, basic exploration
Week 3-4: Replicate induction head finding
Week 5-6: Replicate IOI circuit (simplified version)
Week 7-8: Train your first SAE on a small model
Week 9-10: Choose a problem from Neel Nanda’s list
Week 11-12: Run initial experiments, debug, iterate
Week 13+: Continue research, write up findings
Adjust based on your pace. Some finish faster; some need more time. Both are fine.
21.10 Polya’s Perspective: Learning by Doing
Polya’s central thesis: you learn problem-solving by solving problems—not by reading about solving problems.
This entire series has been reading. Essential reading—you need concepts to think with. But reading is preparation, not the destination.
The destination is practice. Real experiments on real models, finding real results.
“Mathematics is not a spectator sport.” Neither is interpretability. You understand transformer circuits by analyzing transformer circuits—not by reading chapters about analyzing transformer circuits. This chapter ends; your practice begins.
21.11 Conclusion: The Journey Ahead
You now have everything you need to start:
- Conceptual foundation
- Technical toolkit
- Research methodology
- Debugging heuristics
- Community resources
What you don’t have—what you can only get through practice—is intuition. The sense of what’s worth investigating. The pattern recognition that spots anomalies. The judgment that knows when to push deeper and when to pivot.
These come with time and experience. There’s no shortcut.
Mechanistic interpretability is a young field working on hard problems with uncertain methods. You will get stuck. You will find bugs. You will have weeks where nothing works.
This is normal. It’s also what makes the field exciting. The open problems from Chapter 14 are genuinely open. Contributions are genuinely possible. Insights that matter are within reach—not only for senior researchers at big labs, but for newcomers with fresh perspectives.
The transformer’s circuits are waiting. Go find them.
21.12 Resources
21.12.1 Your First Project: Concrete Suggestions
Pick ONE project based on your available time:
Weekend Project (~8 hours)
- Replicate induction head finding in GPT-2 Small using the notebook
- Find a polysemantic neuron and document what concepts it responds to
- Explore 50 SAE features on Neuronpedia and write up the most interesting 5
Week Project (~20-40 hours)
- Train your own SAE on GPT-2 Small layer 6 using SAE Lens
- Replicate the IOI circuit analysis (find the 26 heads)
- Find a new “simple” circuit (e.g., detecting questions, predicting punctuation)
Research Project (~1-3 months)
- Investigate a phenomenon from Chapter 14’s open problems
- Apply interpretability techniques to a new model or task
- Develop improved methods for feature finding or circuit discovery
21.12.2 Getting Started
- TransformerLens Documentation — GitHub
- SAELens Documentation — GitHub
- Neel Nanda’s 200 Problems — neelnanda.io
- ARENA Curriculum — GitHub: Structured exercises for learning interpretability
21.12.3 Community
- EleutherAI Discord — discord.gg/eleutherai: Active #interpretability channel, great for questions
- Alignment Forum — alignmentforum.org: Interpretability research discussions and papers
- MATS Program — matsprogram.org: Mentored research, applications twice yearly
- AI Safety Camp — aisafety.camp: Intensive research programs
21.12.4 Who to Follow
- Neel Nanda (@NeelNanda5): TransformerLens creator, prolific educator
- Chris Olah (@ch402): Anthropic interpretability lead
- Anthropic Interpretability Team (@AnthropicAI): Major research releases
- Joseph Bloom (@jbloomaus): SAE Lens creator
21.12.5 Reference
- TransformerLens Docs — Complete API reference
- Anthropic’s Circuits Papers — transformer-circuits.pub
- Neuronpedia — neuronpedia.org: Interactive feature explorer