20 Open Problems
Where the field stands and where it needs to go
- The scaling problem: why techniques that work on GPT-2 may not work on GPT-4
- The coverage problem: how much of model behavior can we actually explain?
- The validation problem: how do we know our interpretations are correct?
- Where the field needs breakthrough ideas vs. incremental progress
Required: Chapter 13: Induction Heads — seeing the full interpretability pipeline in action
From Chapter 13 (Induction Heads), recall:
- Induction heads are the best-understood circuit in transformers
- We verified them with attribution, patching, and ablation
- They emerge suddenly during training (phase transition)
- They enable in-context learning (few-shot prompting)
We’ve seen interpretability work beautifully on one circuit. Now we ask: Does this scale? What can’t we do?
20.1 Intellectual Honesty
We’ve built a compelling toolkit:
- A theory of features, superposition, and circuits
- Techniques for attribution, patching, ablation, and feature extraction
- A complete case study (induction heads) showing these methods in action
It would be easy to conclude that mechanistic interpretability is “solved”—that we just need to apply these techniques to larger models to understand AI systems completely.
This conclusion would be wrong.
The honest assessment: mechanistic interpretability has achieved meaningful successes, but fundamental problems remain unsolved. The gap between “understanding induction heads in GPT-2” and “understanding everything GPT-4 can do” is vast—and it’s not clear our current methods can bridge it.
This chapter surveys the open problems honestly. Some may be solved with incremental progress. Others may require fundamentally new ideas.
We can understand some things about some models with some confidence. We cannot yet understand all things about any model with high confidence. The gap matters for safety.
20.2 The Scaling Problem
Our best successes are on small models. GPT-2 Small (124M parameters) has yielded beautiful, complete circuits: IOI, induction heads, greater-than comparison.
Large models (70B+ parameters) are different.
20.2.1 What Changes at Scale
More components: GPT-2 Small has 12 layers × 12 heads = 144 attention heads. GPT-4 has an estimated 120 layers × 128 heads = 15,360+ heads. Manual inspection doesn’t scale.
(If you spent one minute understanding each head in GPT-4, you’d need 256 hours—over six 40-hour work weeks—just for the attention heads. And you haven’t touched the MLPs.)
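As a quick sanity check on that arithmetic, a back-of-envelope calculation in Python (the GPT-4 head count is the estimate quoted above, not a published figure):

```python
# Back-of-envelope: time to spend one minute inspecting every attention head.
# The GPT-4 numbers are an estimate; the architecture is not public.
models = {
    "GPT-2 Small": 12 * 12,     # 12 layers x 12 heads = 144 heads
    "GPT-4 (est.)": 120 * 128,  # estimated 15,360 heads
}
for name, n_heads in models.items():
    hours = n_heads / 60          # one minute per head
    weeks = hours / 40            # 40-hour work weeks
    print(f"{name}: {n_heads} heads -> {hours:.1f} hours ({weeks:.1f} work weeks)")
```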
More redundancy: Large models have backup circuits and distributed representations. Ablating one head has minimal effect because others compensate. This makes circuits harder to isolate.
More composition: Complex behaviors emerge from many components working together. The IOI circuit had 26 heads in 7 functional groups. A reasoning circuit in GPT-4 might involve hundreds of components across dozens of layers.
Different algorithms: Large models might implement qualitatively different algorithms than small models. The induction head that works beautifully in GPT-2 might be replaced by something more sophisticated in GPT-4.
20.2.2 Current Status
SAEs have been applied to Claude 3 Sonnet (a production-scale model), finding millions of interpretable features. This is genuine progress.
But feature extraction isn’t circuit analysis. We don’t have complete circuit diagrams for any complex capability in any large model. The scaling gap is real.
20.2.3 What’s Needed
- Automated circuit discovery that scales beyond thousands of components
- Abstraction methods that summarize many components as single functional units
- Transfer learning for interpretability (apply findings from small models to large ones)
Compute has historically solved problems that manual effort couldn’t. Perhaps interpretability will follow the same pattern: automated methods that scale with model size, rather than human-intensive analysis.
20.3 The Coverage Problem
How much of model behavior can we explain?
20.3.1 Explained vs. Unexplained
For induction heads: high coverage. We understand the algorithm, the components, the composition. The circuit explains ~70-85% of in-context learning performance.
For general language modeling: low coverage. What circuit produces creative writing? What circuit decides whether a joke is funny? What circuit determines that a political statement might be controversial?
Most of what language models do remains unexplained. The techniques work—but they’ve been applied to a tiny fraction of model capabilities.
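Coverage claims like the “~70-85%” figure above are typically backed by ablation-style measurements: remove the circuit and see how much of the behavior disappears. A minimal sketch, assuming a model that maps token ids to logits and an `ablate_heads` helper (both are assumptions for illustration), using the common loss-difference definition of the in-context learning score:

```python
import torch

def icl_score(model, tokens, early=50, late=500):
    """In-context learning score: how much lower is the per-token loss late
    in the context than early in the context? (Positions are illustrative.)"""
    with torch.no_grad():
        logits = model(tokens)                               # [batch, pos, vocab]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Negative log-likelihood of each next token
    nll = -log_probs[:, :-1].gather(-1, tokens[:, 1:, None]).squeeze(-1)
    return (nll[:, early] - nll[:, late]).mean().item()

def circuit_coverage(model, tokens, heads, ablate_heads):
    """Fraction of the ICL score lost when the candidate circuit is ablated.
    `ablate_heads(model, heads)` is an assumed helper returning an ablated copy."""
    base = icl_score(model, tokens)
    ablated = icl_score(ablate_heads(model, heads), tokens)
    return (base - ablated) / base   # ~0.7-0.85 would correspond to high coverage
```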
20.3.2 The Long Tail
Capabilities follow a long-tail distribution:
- A few common patterns (copying, retrieval, pattern completion)
- Many rare capabilities (specific facts, unusual reasoning, creative synthesis)
Current interpretability succeeds on the head of the distribution—common, stereotyped behaviors. The tail is harder: each capability may have its own circuit, requiring individual analysis.
Implication: Full interpretability might require understanding millions of circuits, not dozens.
20.3.3 The Compositionality Challenge
Complex behaviors aren’t single circuits—they’re compositions of circuits:
“Explain why the French Revolution happened” requires:
- Historical knowledge retrieval
- Causal reasoning
- Narrative construction
- Language generation
- Audience modeling
Each component might be its own circuit. Understanding the composition—how circuits interact to produce coherent explanations—is harder than understanding circuits in isolation.
20.4 The Validation Problem
How do we know our interpretations are correct?
20.4.1 The Ground Truth Problem
For toy models: we know the ground truth (we created it). We can verify that the network learned the intended features.
For language models: there is no ground truth. We infer features from behavior, then check whether those features explain behavior. This is circular—confirmation bias is a constant risk.
20.4.2 Multiple SAEs, Different Features
Train two SAEs on the same activations with different random seeds. They produce different features.
Which is “right”? Both explain the activations. Both have interpretable features. But they’re not identical.
Implication: SAE features might be artifacts of the extraction process, not genuine properties of the model.
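One way to quantify this instability is to match features across seeds by the cosine similarity of their decoder directions. A sketch (tensor shapes and the 0.9 “close match” cutoff are illustrative choices):

```python
import torch

def feature_match_quality(W_dec_a, W_dec_b):
    """For each feature in SAE A, find its best match in SAE B by cosine
    similarity of decoder directions. Shapes: [n_features, d_model].
    A low median similarity suggests the two seeds learned different dictionaries."""
    a = torch.nn.functional.normalize(W_dec_a, dim=-1)
    b = torch.nn.functional.normalize(W_dec_b, dim=-1)
    sims = a @ b.T                      # [n_features_a, n_features_b]
    best = sims.max(dim=-1).values      # best match per feature in A
    return best.median().item(), (best > 0.9).float().mean().item()

# Example (hypothetical tensors): two SAEs trained on the same activations
# median_sim, frac_close = feature_match_quality(sae_seed0.W_dec, sae_seed1.W_dec)
```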
20.4.3 Causal vs. Correlational
Patching and ablation provide causal tests. But even causal evidence has limits:
Sufficiency vs. necessity: A component might be sufficient for a behavior without being how the model normally does it. Backup circuits mean patching might restore behavior through abnormal paths.
Distribution shift: Interventions create out-of-distribution activations. The measured causal effect might differ from the natural effect under normal operation.
Emergent effects: Removing component A might change how component B behaves, confounding interpretation.
20.4.4 What Would Validation Look Like?
Stronger validation might require:
- Predictions on held-out behaviors: “This circuit should also explain X, Y, Z”
- Intervention success: “Modifying this circuit should change behavior in predicted ways”
- Cross-model consistency: “Similar models should have similar circuits”
Current interpretability has some of these, but not systematically.
Is “the IOI circuit is the mechanism for indirect object identification” falsifiable? What evidence would prove it wrong? If we can’t answer this clearly, interpretability risks being unfalsifiable—which would undermine its scientific status.
Recent work (2024) has made progress on this question. Researchers found that purpose-built minimal models can solve IOI with just 2 attention heads—far fewer than the 26 found in GPT-2 Small. This comparison helps distinguish between “how this model does it” (26 heads, with redundancy) and “what’s minimally necessary” (2 heads). Studying minimal circuits alongside pretrained circuits may provide the falsifiability we need.
20.5 The Alignment Tax
Interpretability is expensive. Is it worth it?
20.5.1 The Cost
Computational cost: SAE training requires billions of tokens of cached activations. Patching requires many forward passes. Full circuit analysis takes weeks of GPU time.
Human cost: Interpreting features requires human judgment. Circuit verification requires expert analysis. This doesn’t scale to millions of features.
Capability cost: Time spent on interpretability is time not spent on capability improvement. Organizations must choose.
20.5.2 The Benefit
The safety case for interpretability:
- Detect deceptive alignment before deployment
- Identify unsafe behaviors at the representation level
- Verify that safety training actually changed the right things
- Enable targeted intervention on problematic behaviors
But these benefits are theoretical. No major safety incident has been prevented by interpretability. The counterfactual is hard to establish.
20.5.3 The Alignment Tax Question
If interpretability is very expensive and provides uncertain benefits, will organizations invest in it?
Market pressures push toward capability. Interpretability is a cost with unclear return on investment. This creates incentives to skip or minimize interpretability work.
Policy implication: External requirements (regulation, standards, liability) may be necessary to ensure interpretability investment.
20.6 Safety Applications: Current State
Despite the challenges, interpretability has concrete safety applications today—and more ambitious goals for the future.
20.6.1 What’s Already Possible
Feature-based safety classifiers: Use SAE features to detect concerning content:
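A minimal sketch of the idea, assuming an SAE object with an `encode` method and a hand-curated list of safety-relevant feature indices (the indices, threshold, and interface are illustrative assumptions, not a specific library’s API):

```python
# Hypothetical indices of safety-relevant SAE features (e.g. deception, violence)
SAFETY_FEATURE_IDS = [10231, 48077, 91250]
THRESHOLD = 4.0  # activation threshold, tuned on labeled examples

def flag_concerning_content(activations, sae):
    """Flag a prompt if any safety-relevant SAE feature fires strongly.
    `activations`: residual-stream activations, shape [n_tokens, d_model];
    `sae.encode` returns sparse feature activations, shape [n_tokens, n_features]."""
    feats = sae.encode(activations)
    triggered = {
        fid: feats[:, fid].max().item()
        for fid in SAFETY_FEATURE_IDS
        if feats[:, fid].max() > THRESHOLD
    }
    return bool(triggered), triggered   # which features fired, and how strongly
```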
This is more interpretable than a black-box classifier—you can inspect which features triggered.
Steering for safer outputs: Suppress features associated with harmful behavior during generation (see Chapter 9’s steering section). Unlike fine-tuning, this is reversible and inspectable.
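A sketch of the mechanism, assuming a PyTorch model where we can hook a residual-stream module and a unit-norm decoder direction for the feature to suppress (the module path and parameter values are illustrative):

```python
def make_suppression_hook(feature_direction, strength=1.0):
    """Return a forward hook that removes a feature's component from the
    residual stream. Reversible: removing the hook restores the original model.
    `feature_direction`: unit-norm vector of shape [d_model];
    strength 1.0 removes the positive component, >1.0 pushes against it."""
    def hook(module, inputs, output):
        resid = output[0] if isinstance(output, tuple) else output
        coeff = resid @ feature_direction                 # projection per token
        resid = resid - strength * coeff.clamp(min=0).unsqueeze(-1) * feature_direction
        return (resid, *output[1:]) if isinstance(output, tuple) else resid
    return hook

# handle = model.transformer.h[20].register_forward_hook(make_suppression_hook(direction))
# ... generate ...
# handle.remove()   # steering off, model unchanged
```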
Behavioral verification: After safety training (RLHF, Constitutional AI), use interpretability to check what changed:
- Did the “deception” feature’s activation pattern change?
- Did new safety-related features emerge?
- Are the changes localized or distributed?
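A sketch of the first check: measure how often a feature fires on the same evaluation prompts before and after safety training (the `get_residual_activations` helper and SAE interface are assumptions; note also that an SAE trained on the base model may not transfer cleanly to the fine-tuned one):

```python
def feature_firing_rate(model, sae, prompts, feature_id, threshold=1.0):
    """Fraction of tokens on which a feature fires above threshold."""
    fired, total = 0, 0
    for prompt in prompts:
        acts = get_residual_activations(model, prompt)   # assumed helper, [n_tokens, d_model]
        feats = sae.encode(acts)
        fired += (feats[:, feature_id] > threshold).sum().item()
        total += feats.shape[0]
    return fired / total

# rate_before = feature_firing_rate(base_model, sae, eval_prompts, DECEPTION_FEATURE)
# rate_after  = feature_firing_rate(rlhf_model, sae, eval_prompts, DECEPTION_FEATURE)
# A large drop is (weak) evidence that safety training touched the relevant representation.
```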
20.6.2 Safety-Relevant Features Discovered
Anthropic’s SAE work on Claude found features for:
| Feature Type | Example | Potential Application |
|---|---|---|
| Deception | “Being dishonest or misleading” | Detect deceptive reasoning |
| Sycophancy | “Agreeing with user despite knowing better” | Monitor for sycophantic drift |
| Harmful content | Violent scenarios, exploitation | Content filtering |
| Uncertainty | “I don’t know” indicators | Calibration monitoring |
| Unsafe code | Security vulnerabilities | Code safety |
| Power-seeking | Goal preservation, influence | Alignment monitoring |
These are early findings—more systematic mapping of safety-relevant features is ongoing.
20.6.3 The Safety Research Agenda
Near-term (currently tractable):
1. Catalog safety-relevant features across models
2. Build feature-based monitoring dashboards
3. Use steering for targeted safety interventions
4. Verify that safety training modifies the right features

Medium-term (requires progress on open problems):
1. Detect deceptive alignment before deployment
2. Verify absence of concerning capabilities
3. Understand how safety properties emerge in training
4. Build interpretability-verified safety constraints

Long-term (aspirational):
1. Formal proofs of safety properties from interpretability
2. Real-time interpretability during deployment
3. Interpretability as a core component of AI governance
Interpretability has found safety-relevant features. We can steer and monitor. But:
- We can’t yet prove a model lacks deceptive capabilities
- Features might hide deceptive computations (see the discussion of deceptive interpretability at the end of Section 20.8)
- Coverage is incomplete—most capabilities aren’t analyzed
- Scale remains a challenge for production models
Interpretability is a useful safety tool today. It is not yet sufficient for strong safety guarantees.
20.7 Superposition: Friend or Enemy?
Superposition is central to how neural networks work. It’s also central to why interpretability is hard.
20.7.1 The Dilemma
Superposition enables compression: millions of features in thousands of dimensions. This is why language models are so capable—they pack vast knowledge into manageable parameter counts.
But superposition creates polysemanticity: each neuron, each activation dimension, encodes multiple features. This is why interpretability is hard—there’s no clean mapping from components to concepts.
20.7.2 SAEs: The Current Solution
Sparse autoencoders “undo” superposition, recovering monosemantic features from polysemantic activations.
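For reference, a minimal sketch of the standard recipe (overcomplete ReLU encoder, linear decoder, L1 sparsity penalty); dimensions and coefficients are illustrative:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Standard SAE recipe: overcomplete ReLU encoder, linear decoder,
    L1 penalty to encourage sparse (ideally monosemantic) features."""
    def __init__(self, d_model=768, n_features=16384):
        super().__init__()
        self.W_enc = nn.Linear(d_model, n_features)
        self.W_dec = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.W_enc(x))   # sparse feature activations
        recon = self.W_dec(feats)           # reconstruction of the activation
        return recon, feats

def sae_loss(x, recon, feats, l1_coeff=1e-3):
    # Reconstruction error plus sparsity penalty; the reconstruction error that
    # remains after training is the gap discussed below.
    return ((recon - x) ** 2).mean() + l1_coeff * feats.abs().mean()
```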
But SAEs are imperfect:
- 10-40% reconstruction error means some information is lost or never captured
- Features split and absorb as dictionary size changes
- No ground truth for whether SAE features match “true” model features
20.7.3 Could We Train Without Superposition?
What if we trained models to avoid superposition—to use monosemantic representations from the start?
Attempts: Softmax linear units (SoLU), dictionary learning during training, sparsity constraints.
Results: Promising in small experiments. Large-scale viability unknown.
Trade-off: If superposition is computationally optimal (as toy models suggest), avoiding it might require larger models for the same capability—a capability tax for interpretability.
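For reference, the SoLU activation mentioned above replaces the usual MLP nonlinearity with a softmax-weighted self-gating term; a sketch:

```python
import torch

def solu(x):
    """Softmax Linear Unit: x * softmax(x) over the hidden dimension.
    Amplifies the largest pre-activations and suppresses the rest,
    which empirically pushes MLP neurons toward monosemanticity."""
    return x * torch.softmax(x, dim=-1)

# In the original formulation the activation used is LayerNorm(solu(x));
# the extra LayerNorm recovers some capability, at the cost of letting
# some superposition back in.
```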
20.8 Emergent Capabilities
Large language models exhibit capabilities they were never explicitly trained for:
- Arithmetic (up to some limit)
- Translation between languages not paired in training data
- Reasoning about novel scenarios
- Theory of mind (modeling other agents’ beliefs)
20.8.1 The Emergence Problem
Where do these capabilities come from? They weren’t explicitly optimized. They emerged as side effects of language modeling.
Current interpretability has trouble with emergence:
- We can’t predict which capabilities will emerge at which scale
- We can’t explain why language modeling produces arithmetic
- We can’t verify whether an emergent capability is robust or brittle
20.8.2 The Safety Implications
If capabilities emerge unpredictably, so might:
- Deceptive behaviors
- Power-seeking tendencies
- Goal generalization beyond training
Interpretability would ideally detect these emergent problems. Currently, we can’t—we don’t know where to look until after the capability appears.
Can we build interpretability tools that detect unknown capabilities or behaviors? This requires moving beyond “understand the circuit for X” to “find any circuits with concerning properties”—a harder problem.
A more troubling possibility: what if interpretability methods themselves can be fooled? Recent theoretical and empirical work explores scenarios where models could develop representations that appear interpretable but hide actual computations. Features that seem to track “honesty” might not actually govern the model’s behavior. This isn’t paranoia—it’s a genuine methodological concern. The proxy metric gap (SAEBench, 2025) showed that interpretability metrics don’t reliably predict practical utility, raising deeper questions about what our tools actually measure.
20.9 Specific Open Questions
Beyond these broad challenges, specific technical questions remain open:
20.9.1 What’s the Right Level of Abstraction?
Circuits can be described at many levels:
- Individual neurons
- Feature directions
- Attention head functions
- Layer-wise transformations
- Abstract algorithms
Which level is “right”? Different levels may be appropriate for different questions. But we lack a principled way to choose.
20.9.2 How Do We Handle Continuous Features?
Toy models treat features as binary (on/off). Real features are continuous and graded. How do continuous features compose? When does “slightly active” become “importantly active”?
20.9.3 What About Temporal Dynamics?
Our techniques analyze single forward passes. But many capabilities develop over sequences:
- Context building across dialogue
- Refinement through iteration
- Planning over multiple outputs
How do we interpret dynamics, not just snapshots?
20.9.4 Can We Interpret Training?
Interpretability focuses on trained models. But safety might require understanding the training process:
- Which examples teach which capabilities?
- How do circuits form during training?
- Can we predict training outcomes from early signals?
Training dynamics are much less understood than forward pass dynamics.
20.9.5 How Do We Scale Interpretation?
Even with SAEs finding millions of features, interpreting them requires human attention. How do we:
- Automatically label features (current LLM-based methods are imperfect; one common approach is sketched below)
- Find features relevant to specific behaviors
- Summarize feature sets at higher abstraction levels
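A sketch of the common LLM-based labeling loop; `label_with_llm` stands in for whatever model API is available and is an assumption, not a specific library call:

```python
def autolabel_feature(feature_id, top_examples, label_with_llm):
    """Build a labeling prompt from a feature's top-activating examples and
    ask an LLM for a short description. `top_examples` is a list of
    (text, max_activation) pairs; `label_with_llm` is an assumed callable."""
    examples = "\n".join(
        f"- (act={act:.1f}) {text}" for text, act in top_examples
    )
    prompt = (
        "These text snippets all strongly activate one feature of a language "
        "model. Describe, in a short phrase, what the feature responds to.\n"
        f"{examples}\nDescription:"
    )
    return label_with_llm(prompt)

# Known failure modes: labels that are too broad, labels fit to spurious
# correlations in the top examples, and features with no coherent description.
```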
20.9.6 What’s the Theory?
We have empirical observations (polytopes, phase transitions, induction heads) but limited theoretical understanding:
- Why do transformers represent features as directions?
- What determines which circuits emerge during training?
- Is there a computational theory of interpretation?
20.10 The Path Forward
Despite these problems, progress is possible.
20.10.1 Near-Term
Automated circuit discovery: Tools like ACDC that scale patching to larger models.
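A simplified sketch of the ACDC-style greedy pruning idea; the `run_with_patched_edge` and `kl_to_clean` helpers are assumptions about the experimental harness, not ACDC’s actual API:

```python
def prune_circuit(edges, clean_inputs, corrupt_inputs,
                  run_with_patched_edge, kl_to_clean, tau=0.01):
    """Greedily remove computational-graph edges whose corruption barely
    changes the output distribution. The edges that remain form a candidate circuit."""
    kept = list(edges)
    for edge in sorted(edges, key=lambda e: e.layer, reverse=True):  # sweep from late layers back
        # Replace this edge's activation with its value on corrupted inputs
        patched_logits = run_with_patched_edge(kept, edge, clean_inputs, corrupt_inputs)
        if kl_to_clean(patched_logits) < tau:   # output barely changed: edge not needed
            kept.remove(edge)
    return kept
```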
Better SAEs: Architectures with lower reconstruction error, less absorption, more consistent features.
Benchmark development: Standardized evaluations for interpretability methods—like ImageNet for circuit analysis.
20.10.2 Medium-Term
End-to-end interpretability: Training models with built-in interpretability constraints.
Formal verification: Mathematical proofs of model properties, not just empirical observations.
Integration with safety: Using interpretability for practical safety applications (detecting deception, verifying alignment).
20.10.3 Long-Term
Complete model understanding: The ability to fully explain any model output in terms of interpretable features and circuits.
Predictive interpretability: Understanding training well enough to predict model properties before training.
Interpretability by design: AI architectures that are interpretable from the start, without post-hoc analysis.
20.11 Polya’s Perspective: Acknowledging Unknowns
Polya emphasizes: “Understand what you don’t understand.”
Intellectual progress requires honestly identifying gaps. This chapter maps the gaps in mechanistic interpretability—not to discourage work, but to focus it.
The problems are real. The problems are hard. But problems clearly stated are problems that can be worked on.
“What is unknown?” is as important as “what is known.” Honest acknowledgment of limitations guides research toward the most important problems. The open questions in this chapter are the research agenda for the field.
20.12 Looking Ahead
The final chapter offers a Practice Regime—concrete guidance for actually doing interpretability research:
- How to choose problems
- How to structure experiments
- How to debug circuits that don’t work
- How to publish and share findings
So far, this series has focused on theory and concepts. The next chapter is about practice.
20.13 Further Reading
200 Concrete Open Problems in Mechanistic Interpretability — Neel Nanda: Exhaustive list of specific research questions.
Towards Monosemanticity (Limitations Section) — Anthropic: Honest discussion of SAE limitations by the developers.
Causal Scrubbing: Rigorous Circuit Evaluation — Redwood Research: Attempts to address the validation problem.
SAEBench: Evaluating SAEs on Practical Tasks — arXiv: Comprehensive benchmark revealing the proxy metric gap.
Interp Benchmarks (Proposal) — Various: Efforts toward standardized interpretability evaluation.
Softmax Linear Units for Interpretability — Anthropic: Attempts to train more interpretable models from scratch.
Scaling Interpretability Research — MATS: Programs for training interpretability researchers and scaling the field.