20  Open Problems

Where the field stands and where it needs to go

Categories: synthesis, open-problems

Author: Taras Tsugrii

Published: January 5, 2025

Tip: What You’ll Learn
  • The scaling problem: why techniques that work on GPT-2 may not work on GPT-4
  • The coverage problem: how much of model behavior can we actually explain?
  • The validation problem: how do we know our interpretations are correct?
  • Where the field needs breakthrough ideas vs. incremental progress
Warning: Prerequisites

Required: Chapter 13: Induction Heads — seeing the full interpretability pipeline in action

Note: Before You Read: Recall

From Chapter 13 (Induction Heads), recall:

  • Induction heads are the best-understood circuit in transformers
  • We verified them with attribution, patching, and ablation
  • They emerge suddenly during training (phase transition)
  • They enable in-context learning (few-shot prompting)

We’ve seen interpretability work beautifully on one circuit. Now we ask: Does this scale? What can’t we do?

20.1 Intellectual Honesty

We’ve built a compelling toolkit:

  • A theory of features, superposition, and circuits
  • Techniques for attribution, patching, ablation, and feature extraction
  • A complete case study (induction heads) showing these methods in action

It would be easy to conclude that mechanistic interpretability is “solved”—that we just need to apply these techniques to larger models to understand AI systems completely.

This conclusion would be wrong.

The honest assessment: mechanistic interpretability has achieved meaningful successes, but fundamental problems remain unsolved. The gap between “understanding induction heads in GPT-2” and “understanding everything GPT-4 can do” is vast—and it’s not clear our current methods can bridge it.

This chapter surveys the open problems honestly. Some may be solved with incremental progress. Others may require fundamentally new ideas.

Important: The Responsible Claim

We can understand some things about some models with some confidence. We cannot yet understand all things about any model with high confidence. The gap matters for safety.

20.2 The Scaling Problem

Our best successes are on small models. GPT-2 Small (124M parameters) has yielded beautiful, complete circuits: IOI, induction heads, greater-than comparison.

Large models (70B+ parameters) are different.

20.2.1 What Changes at Scale

More components: GPT-2 Small has 12 layers × 12 heads = 144 attention heads. GPT-4 is estimated to have around 120 layers × 128 heads ≈ 15,360 attention heads. Manual inspection doesn’t scale.

(If you spent one minute understanding each head in GPT-4, you’d need 256 hours—over six 40-hour work weeks—just for the attention heads. And you haven’t touched the MLPs.)
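As a sanity check on that arithmetic (using the same GPT-4 layer and head estimates quoted above, which are not published figures), the numbers work out as follows:

# Back-of-the-envelope check; the GPT-4 layer/head counts are estimates.
gpt2_heads = 12 * 12      # GPT-2 Small: 144 attention heads
gpt4_heads = 120 * 128    # GPT-4 (estimated): 15,360 attention heads

hours = gpt4_heads * 1 / 60   # one minute per head -> 256 hours
weeks = hours / 40            # -> about 6.4 forty-hour work weeks
print(f"{gpt4_heads} heads, {hours:.0f} hours, {weeks:.1f} work weeks")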

More redundancy: Large models have backup circuits and distributed representations. Ablating one head has minimal effect because others compensate. This makes circuits harder to isolate.

More composition: Complex behaviors emerge from many components working together. The IOI circuit had 26 heads in 7 functional groups. A reasoning circuit in GPT-4 might involve hundreds of components across dozens of layers.

Different algorithms: Large models might implement qualitatively different algorithms than small models. The induction head that works beautifully in GPT-2 might be replaced by something more sophisticated in GPT-4.

20.2.2 Current Status

SAEs have been applied to Claude 3 Sonnet (a production-scale model), finding millions of interpretable features. This is genuine progress.

But feature extraction isn’t circuit analysis. We don’t have complete circuit diagrams for any complex capability in any large model. The scaling gap is real.

20.2.3 What’s Needed

  • Automated circuit discovery that scales beyond thousands of components
  • Abstraction methods that summarize many components as single functional units
  • Transfer learning for interpretability (apply findings from small models to large ones)
Note: The Bitter Lesson Applied

Compute has historically solved problems that manual effort couldn’t. Perhaps interpretability will follow the same pattern: automated methods that scale with model size, rather than human-intensive analysis.

20.3 The Coverage Problem

How much of model behavior can we explain?

20.3.1 Explained vs. Unexplained

For induction heads: high coverage. We understand the algorithm, the components, the composition. The circuit explains ~70-85% of in-context learning performance.

For general language modeling: low coverage. What circuit produces creative writing? What circuit decides whether a joke is funny? What circuit determines that a political statement might be controversial?

Most of what language models do remains unexplained. The techniques work—but they’ve been applied to a tiny fraction of model capabilities.

20.3.2 The Long Tail

Capabilities follow a long-tail distribution:

  • A few common patterns (copying, retrieval, pattern completion)
  • Many rare capabilities (specific facts, unusual reasoning, creative synthesis)

Current interpretability succeeds on the head of the distribution—common, stereotyped behaviors. The tail is harder: each capability may have its own circuit, requiring individual analysis.

Implication: Full interpretability might require understanding millions of circuits, not dozens.

20.3.3 The Compositionality Challenge

Complex behaviors aren’t single circuits—they’re compositions of circuits:

“Explain why the French Revolution happened” requires:

  • Historical knowledge retrieval
  • Causal reasoning
  • Narrative construction
  • Language generation
  • Audience modeling

Each component might be its own circuit. Understanding the composition—how circuits interact to produce coherent explanations—is harder than understanding circuits in isolation.

20.4 The Validation Problem

How do we know our interpretations are correct?

20.4.1 The Ground Truth Problem

For toy models: we know the ground truth (we created it). We can verify that the network learned the intended features.

For language models: there is no ground truth. We infer features from behavior, then check whether those features explain behavior. This is circular—confirmation bias is a constant risk.

20.4.2 Multiple SAEs, Different Features

Train two SAEs on the same activations with different random seeds. They produce different features.

Which is “right”? Both explain the activations. Both have interpretable features. But they’re not identical.

Implication: SAE features might be artifacts of the extraction process, not genuine properties of the model.
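One way to make this concrete is to measure how many features in one SAE have a close counterpart in the other. The sketch below assumes we already have the two decoder dictionaries as matrices; the shapes and variable names are invented for illustration, and real SAE code will differ:

import numpy as np

def match_rate(dict_a, dict_b, threshold=0.9):
    """Fraction of SAE-A features whose decoder direction has a close
    (cosine similarity > threshold) counterpart in SAE B."""
    # Normalize each feature direction (one per row) to unit length.
    a = dict_a / np.linalg.norm(dict_a, axis=1, keepdims=True)
    b = dict_b / np.linalg.norm(dict_b, axis=1, keepdims=True)
    sims = a @ b.T                 # pairwise cosine similarities
    best = sims.max(axis=1)        # best match for each feature in A
    return (best > threshold).mean()

# Stand-in dictionaries: 4096 features in a 768-dimensional activation space.
# With real SAEs these would be the learned decoder weight matrices.
rng = np.random.default_rng(0)
sae_seed1 = rng.normal(size=(4096, 768))
sae_seed2 = rng.normal(size=(4096, 768))
print(f"Features with a close counterpart: {match_rate(sae_seed1, sae_seed2):.1%}")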

20.4.3 Causal vs. Correlational

Patching and ablation provide causal tests. But even causal evidence has limits:

  • Sufficiency vs. necessity: A component might be sufficient for a behavior without being how the model normally does it. Backup circuits mean patching might restore behavior through abnormal paths.

  • Distribution shift: Interventions create out-of-distribution activations. The measured causal effect might differ from the natural effect under normal operation.

  • Emergent effects: Removing component A might change how component B behaves, confounding interpretation.
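For readers who want the procedure itself pinned down, here is a minimal, self-contained sketch of activation patching on a toy two-layer network. It is a stand-in for a transformer; the shapes and the choice to patch only the first four dimensions are invented for illustration:

import torch
import torch.nn as nn

# Toy stand-in model with one internal site (the output of model[0]) to patch.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 1))
clean_input, corrupt_input = torch.randn(1, 8), torch.randn(1, 8)

# 1. Run on the clean input and cache the activation at the patch site.
cache = {}
handle = model[0].register_forward_hook(lambda m, i, o: cache.update(clean=o.detach()))
clean_out = model(clean_input)
handle.remove()

# 2. Run on the corrupted input, overwriting part of the site with the clean cache.
def patch_hook(module, inputs, output):
    patched = output.clone()
    patched[:, :4] = cache["clean"][:, :4]   # patch only a subset of dimensions
    return patched                           # returning a value replaces the output

handle = model[0].register_forward_hook(patch_hook)
patched_out = model(corrupt_input)
handle.remove()

# 3. Compare against the unpatched corrupted run: how much behavior is restored?
corrupt_out = model(corrupt_input)
restored = (patched_out - corrupt_out) / (clean_out - corrupt_out)
print(f"fraction of clean behavior restored: {restored.item():.2f}")

The caveats above apply even to this toy: a partial patch, or a patch into a redundant model, can restore behavior through paths the model never uses naturally.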

20.4.4 What Would Validation Look Like?

Stronger validation might require:

  • Predictions on held-out behaviors: “This circuit should also explain X, Y, Z”
  • Intervention success: “Modifying this circuit should change behavior in predicted ways”
  • Cross-model consistency: “Similar models should have similar circuits”

Current interpretability has some of these, but not systematically.

Important: The Falsifiability Question

Is “the IOI circuit is the mechanism for indirect object identification” falsifiable? What evidence would prove it wrong? If we can’t answer this clearly, interpretability risks being unfalsifiable—which would undermine its scientific status.

Note: A Positive Development: Minimal Circuit Research

Recent work (2024) has made progress on this question. Researchers found that purpose-built minimal models can solve IOI with just 2 attention heads—far fewer than the 26 found in GPT-2 Small. This comparison helps distinguish between “how this model does it” (26 heads, with redundancy) and “what’s minimally necessary” (2 heads). Studying minimal circuits alongside pretrained circuits may provide the falsifiability we need.

20.5 The Alignment Tax

Interpretability is expensive. Is it worth it?

20.5.1 The Cost

Computational cost: SAE training requires billions of tokens of cached activations. Patching requires many forward passes. Full circuit analysis takes weeks of GPU time.

Human cost: Interpreting features requires human judgment. Circuit verification requires expert analysis. This doesn’t scale to millions of features.

Capability cost: Time spent on interpretability is time not spent on capability improvement. Organizations must choose.

20.5.2 The Benefit

The safety case for interpretability:

  • Detect deceptive alignment before deployment
  • Identify unsafe behaviors at the representation level
  • Verify that safety training actually changed the right things
  • Enable targeted intervention on problematic behaviors

But these benefits are theoretical. No major safety incident has been prevented by interpretability. The counterfactual is hard to establish.

20.5.3 The Alignment Tax Question

If interpretability is very expensive and provides uncertain benefits, will organizations invest in it?

Market pressures push toward capability. Interpretability is a cost with unclear return on investment. This creates incentives to skip or minimize interpretability work.

Policy implication: External requirements (regulation, standards, liability) may be necessary to ensure interpretability investment.

20.6 Safety Applications: Current State

Despite the challenges, interpretability has concrete safety applications today—and more ambitious goals for the future.

20.6.1 What’s Already Possible

Feature-based safety classifiers: Use SAE features to detect concerning content:

# Conceptual sketch: flag inputs that strongly activate "harmful" SAE features.
# Assumes a trained `sae`, cached `model_activations`, a tuned `threshold`,
# and a `flag_for_review()` handler already exist elsewhere.
harmful_features = [4721, 8903, 12456]        # Hypothetical feature indices
feature_acts = sae.encode(model_activations)  # Sparse SAE feature activations
safety_score = sum(feature_acts[f] for f in harmful_features)
if safety_score > threshold:
    flag_for_review()

This is more interpretable than a black-box classifier—you can inspect which features triggered.

Steering for safer outputs: Suppress features associated with harmful behavior during generation (see Chapter 9’s steering section). Unlike fine-tuning, this is reversible and inspectable.
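As a sketch of the core operation: in practice the edit is applied to the residual stream during generation using an SAE decoder direction, but the vectors below are random stand-ins.

import numpy as np

# Suppress one feature by projecting its direction out of an activation vector.
rng = np.random.default_rng(0)
resid = rng.normal(size=768)             # residual-stream activation (stand-in)
direction = rng.normal(size=768)
direction /= np.linalg.norm(direction)   # unit-norm feature direction

strength = resid @ direction             # how strongly the feature is active
steered = resid - strength * direction   # remove the feature's component

print(f"feature activation before: {strength:.3f}, after: {steered @ direction:.3f}")

Because the edit is a simple vector operation applied at inference time, it can be turned off, scaled, or inspected in a way that a weight update cannot.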

Behavioral verification: After safety training (RLHF, Constitutional AI), use interpretability to check what changed (a sketch follows below):

  • Did the “deception” feature’s activation pattern change?
  • Did new safety-related features emerge?
  • Are the changes localized or distributed?
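A minimal version of that check, assuming we already have SAE feature activations on a fixed probe set before and after safety training. The data here is synthetic, and the feature index is the same hypothetical one used in the classifier sketch above:

import numpy as np

def mean_feature_activation(acts, feature_idx):
    """Mean activation of one SAE feature over a probe set.
    `acts` has shape (n_prompts, n_features)."""
    return acts[:, feature_idx].mean()

# Synthetic stand-ins for cached SAE activations before/after safety training.
rng = np.random.default_rng(0)
acts_before = rng.exponential(scale=0.10, size=(100, 4096))
acts_after = rng.exponential(scale=0.05, size=(100, 4096))

deception_feature = 4721   # hypothetical index, as in the sketch above
print(f"before: {mean_feature_activation(acts_before, deception_feature):.3f}, "
      f"after: {mean_feature_activation(acts_after, deception_feature):.3f}")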

20.6.2 Safety-Relevant Features Discovered

Anthropic’s SAE work on Claude found features for:

Feature Type | Example | Potential Application
Deception | “Being dishonest or misleading” | Detect deceptive reasoning
Sycophancy | “Agreeing with user despite knowing better” | Monitor for sycophantic drift
Harmful content | Violent scenarios, exploitation | Content filtering
Uncertainty | “I don’t know” indicators | Calibration monitoring
Unsafe code | Security vulnerabilities | Code safety
Power-seeking | Goal preservation, influence | Alignment monitoring

These are early findings—more systematic mapping of safety-relevant features is ongoing.

20.6.3 The Safety Research Agenda

Near-term (currently tractable):

  1. Catalog safety-relevant features across models
  2. Build feature-based monitoring dashboards
  3. Use steering for targeted safety interventions
  4. Verify that safety training modifies the right features

Medium-term (requires progress on open problems):

  1. Detect deceptive alignment before deployment
  2. Verify absence of concerning capabilities
  3. Understand how safety properties emerge in training
  4. Build interpretability-verified safety constraints

Long-term (aspirational):

  1. Formal proofs of safety properties from interpretability
  2. Real-time interpretability during deployment
  3. Interpretability as a core component of AI governance

Important: The Honest Assessment

Interpretability has found safety-relevant features. We can steer and monitor. But:

  • We can’t yet prove a model lacks deceptive capabilities
  • Features might hide deceptive computations (see the “Deceptive Interpretability” warning in Section 20.8)
  • Coverage is incomplete—most capabilities aren’t analyzed
  • Scale remains a challenge for production models

Interpretability is a useful safety tool today. It is not yet sufficient for strong safety guarantees.

20.7 Superposition: Friend or Enemy?

Superposition is central to how neural networks work. It’s also central to why interpretability is hard.

20.7.1 The Dilemma

Superposition enables compression: millions of features in thousands of dimensions. This is why language models are so capable—they pack vast knowledge into manageable parameter counts.

But superposition creates polysemanticity: each neuron, each activation dimension, encodes multiple features. This is why interpretability is hard—there’s no clean mapping from components to concepts.

20.7.2 SAEs: The Current Solution

Sparse autoencoders “undo” superposition, recovering monosemantic features from polysemantic activations.

But SAEs are imperfect:

  • 10-40% reconstruction error means some information is lost or never captured (see the sketch below for how this is measured)
  • Features split and absorb as dictionary size changes
  • No ground truth for whether SAE features match “true” model features
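The 10-40% figure refers to how much of the activations’ variance the SAE fails to reconstruct. A minimal sketch of that measurement, using synthetic activations and reconstructions in place of a real SAE:

import numpy as np

def fraction_variance_unexplained(acts, recon):
    """FVU: share of activation variance the reconstruction misses.
    0.0 is perfect; the 10-40% error quoted above corresponds to 0.1-0.4."""
    residual = ((acts - recon) ** 2).sum()
    total = ((acts - acts.mean(axis=0)) ** 2).sum()
    return residual / total

# Synthetic stand-ins: reconstructions that miss roughly 20% of the variance.
rng = np.random.default_rng(0)
acts = rng.normal(size=(1000, 768))
recon = acts + rng.normal(scale=0.45, size=acts.shape)   # 0.45**2 is about 0.2
print(f"FVU: {fraction_variance_unexplained(acts, recon):.2f}")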

20.7.3 Could We Train Without Superposition?

What if we trained models to avoid superposition—to use monosemantic representations from the start?

Attempts: Softmax linear units (SoLU), dictionary learning during training, sparsity constraints.
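To illustrate the “sparsity constraints” idea in its simplest form, the toy training loop below adds an L1 penalty on hidden activations so that fewer units fire per input. The architecture, data, and hyperparameters are all invented; this is a sketch of the general idea, not how SoLU or any production method works.

import torch
import torch.nn as nn

# Toy classifier trained with an extra L1 penalty on hidden activations.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 4))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_weight = 1e-3                              # strength of the sparsity penalty

x = torch.randn(256, 16)
y = torch.randint(0, 4, (256,))

for step in range(200):
    hidden = model[1](model[0](x))            # post-ReLU hidden activations
    logits = model[2](hidden)
    task_loss = nn.functional.cross_entropy(logits, y)
    sparsity_loss = hidden.abs().mean()       # L1 term pushes activations toward zero
    loss = task_loss + l1_weight * sparsity_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

active = (hidden > 0).float().sum(dim=1).mean()
print(f"average active hidden units per input: {active:.1f} / 64")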

Results: Promising in small experiments. Large-scale viability unknown.

Trade-off: If superposition is computationally optimal (as toy models suggest), avoiding it might require larger models for the same capability—a capability tax for interpretability.

20.8 Emergent Capabilities

Large language models exhibit capabilities that weren’t explicitly trained:

  • Arithmetic (up to some limit)
  • Translation between languages not paired in training data
  • Reasoning about novel scenarios
  • Theory of mind (modeling other agents’ beliefs)

20.8.1 The Emergence Problem

Where do these capabilities come from? They weren’t explicitly optimized. They emerged as side effects of language modeling.

Current interpretability has trouble with emergence:

  • We can’t predict which capabilities will emerge at which scale
  • We can’t explain why language modeling produces arithmetic
  • We can’t verify whether an emergent capability is robust or brittle

20.8.2 The Safety Implications

If capabilities emerge unpredictably, so might:

  • Deceptive behaviors
  • Power-seeking tendencies
  • Goal generalization beyond training

Interpretability would ideally detect these emergent problems. Currently, we can’t—we don’t know where to look until after the capability appears.

Note: The Detection Challenge

Can we build interpretability tools that detect unknown capabilities or behaviors? This requires moving beyond “understand the circuit for X” to “find any circuits with concerning properties”—a harder problem.

Warning: Deceptive Interpretability (2024-2025 Concern)

A more troubling possibility: what if interpretability methods themselves can be fooled? Recent theoretical and empirical work explores scenarios where models could develop representations that appear interpretable but hide actual computations. Features that seem to track “honesty” might not actually govern the model’s behavior. This isn’t paranoia—it’s a genuine methodological concern. The proxy metric gap (SAEBench, 2025) showed that interpretability metrics don’t reliably predict practical utility, raising deeper questions about what our tools actually measure.

20.9 Specific Open Questions

Beyond these broad challenges, specific technical questions remain open:

20.9.1 1. What’s the Right Level of Abstraction?

Circuits can be described at many levels:

  • Individual neurons
  • Feature directions
  • Attention head functions
  • Layer-wise transformations
  • Abstract algorithms

Which level is “right”? Different levels may be appropriate for different questions. But we lack a principled way to choose.

20.9.2 2. How Do We Handle Continuous Features?

Toy models treat features as binary (on/off). Real features are continuous and graded. How do continuous features compose? When does “slightly active” become “importantly active”?

20.9.3 3. What About Temporal Dynamics?

Our techniques analyze single forward passes. But many capabilities develop over sequences:

  • Context building across dialogue
  • Refinement through iteration
  • Planning over multiple outputs

How do we interpret dynamics, not just snapshots?

20.9.4 4. Can We Interpret Training?

Interpretability focuses on trained models. But safety might require understanding the training process:

  • Which examples teach which capabilities?
  • How do circuits form during training?
  • Can we predict training outcomes from early signals?

Training dynamics are much less understood than forward pass dynamics.

20.9.5 5. How Do We Scale Interpretation?

Even with SAEs finding millions of features, interpreting them requires human attention. How do we:

  • Automatically label features (current LLM-based methods are imperfect; see the sketch below)
  • Find features relevant to specific behaviors
  • Summarize feature sets at higher abstraction levels
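For the auto-labeling piece, the usual approach is to show a labeling model a feature’s top-activating examples and ask for a short description. Below is a minimal sketch of just the prompt-building step, with invented inputs and no particular labeling model assumed:

def build_labeling_prompt(top_examples, max_examples=10):
    """Assemble an auto-interpretation prompt from a feature's top-activating
    text spans. `top_examples` is a list of (text, activation) pairs."""
    lines = [
        "The following snippets all strongly activate one feature of a language model.",
        "Suggest a short label describing what the feature responds to.",
        "",
    ]
    ranked = sorted(top_examples, key=lambda pair: pair[1], reverse=True)
    for text, activation in ranked[:max_examples]:
        lines.append(f"(activation {activation:.2f}) {text}")
    lines.append("")
    lines.append("Label:")
    return "\n".join(lines)

examples = [("the dog chased its tail", 4.2), ("cats nap in the sun", 3.1)]
print(build_labeling_prompt(examples))

Scaling this to millions of features runs into exactly the problems listed above: the labels are only as good as the labeling model, and nobody reviews most of them.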

20.9.6 6. What’s the Theory?

We have empirical observations (polytopes, phase transitions, induction heads) but limited theoretical understanding:

  • Why do transformers represent features as directions?
  • What determines which circuits emerge during training?
  • Is there a computational theory of interpretation?

20.10 The Path Forward

Despite these problems, progress is possible.

20.10.1 Near-Term

Automated circuit discovery: Tools like ACDC that scale patching to larger models.

Better SAEs: Architectures with lower reconstruction error, less absorption, more consistent features.

Benchmark development: Standardized evaluations for interpretability methods—like ImageNet for circuit analysis.

20.10.2 Medium-Term

End-to-end interpretability: Training models with built-in interpretability constraints.

Formal verification: Mathematical proofs of model properties, not just empirical observations.

Integration with safety: Using interpretability for practical safety applications (detecting deception, verifying alignment).

20.10.3 Long-Term

Complete model understanding: The ability to fully explain any model output in terms of interpretable features and circuits.

Predictive interpretability: Understanding training well enough to predict model properties before training.

Interpretability by design: AI architectures that are interpretable from the start, without post-hoc analysis.

20.11 Polya’s Perspective: Acknowledging Unknowns

Polya emphasizes: “Understand what you don’t understand.”

Intellectual progress requires honestly identifying gaps. This chapter maps the gaps in mechanistic interpretability—not to discourage work, but to focus it.

The problems are real. The problems are hard. But problems clearly stated are problems that can be worked on.

Tip: Polya’s Insight

“What is unknown?” is as important as “what is known.” Honest acknowledgment of limitations guides research toward the most important problems. The open questions in this chapter are the research agenda for the field.

20.12 Looking Ahead

The final chapter offers a Practice Regime—concrete guidance for actually doing interpretability research:

  • How to choose problems
  • How to structure experiments
  • How to debug circuits that don’t work
  • How to publish and share findings

This series has been theory and concepts. The next chapter is about practice.


20.13 Further Reading

  1. 200 Concrete Open Problems in Mechanistic Interpretability (Neel Nanda): Exhaustive list of specific research questions.

  2. Towards Monosemanticity, Limitations Section (Anthropic): Honest discussion of SAE limitations by the developers.

  3. Causal Scrubbing: Rigorous Circuit Evaluation (Redwood Research): Attempts to address the validation problem.

  4. SAEBench: Evaluating SAEs on Practical Tasks (arXiv): Comprehensive benchmark revealing the proxy metric gap.

  5. Interp Benchmarks proposal (Various): Efforts toward standardized interpretability evaluation.

  6. Softmax Linear Units for Interpretability (Anthropic): Attempts to train more interpretable models from scratch.

  7. Scaling Interpretability Research (MATS): Programs for training interpretability researchers and scaling the field.