```mermaid
flowchart LR
A["Activation<br/>(768 dims)"] --> E["Encoder<br/>W_enc × x + b"]
E --> R["ReLU"]
R --> L["Sparse Latent<br/>(16,384 dims)<br/>~300 active"]
L --> D["Decoder<br/>W_dec × z"]
D --> O["Reconstruction<br/>(768 dims)"]
```
14 Sparse Autoencoders
Extracting features from superposition
- What sparse autoencoders are and how they work
- Why overcomplete + sparse = decomposition of superposition
- The training objective: reconstruction vs. sparsity trade-off
- How to interpret and validate SAE features
Required: Chapter 6: Superposition — understanding why features are compressed into overlapping representations
From Arc II (Core Theory), recall:
- Features are directions in activation space (Chapter 5)
- Superposition compresses many features into few dimensions (Chapter 6)
- This creates polysemantic neurons that are hard to interpret
- Circuits are how features compose into algorithms (Chapter 8)
We’ve established the theory. Now we ask: How do we actually find features in real networks?
14.1 The Tool We Need
In 2024, Anthropic researchers did something remarkable: they made Claude believe it was the Golden Gate Bridge.
Not through prompt engineering. Not through fine-tuning. They found a single direction in Claude’s activation space—one of millions of learned features—and amplified it. Suddenly, Claude inserted the bridge into every response. Asked about itself, it described being a suspension bridge spanning the San Francisco Bay.
This wasn’t a trick. The researchers had found an interpretable feature—a direction in the neural network’s internal representation that corresponded to a specific concept. By manipulating it, they could predictably control the model’s behavior.
How did they find this feature among millions of dimensions? With a sparse autoencoder (SAE).
The Golden Gate Bridge demonstration showed that neural networks have interpretable structure we can find and manipulate. SAEs are the tool that makes this possible—they decompose compressed, polysemantic activations into individual, monosemantic features.
14.2 What Is a Sparse Autoencoder?
An autoencoder is a neural network trained to reconstruct its input. It has two parts:
- Encoder: Maps input to a latent representation
- Decoder: Maps latent representation back to input
The training objective is simple: make the output match the input.
A sparse autoencoder adds a crucial constraint: the latent representation must be sparse—most of its values must be zero.
14.2.1 The Architecture
For a transformer with \(d\) dimensions (say, 768), a sparse autoencoder might have:
Encoder: Maps the \(d\)-dimensional activation to a \(D\)-dimensional latent space, where \(D \gg d\)
- Typical ratio: \(D\) is 4-24× larger than \(d\)
- For \(d = 768\), we might have \(D = 16,384\) or more
Sparsity mechanism: Only a small number of latent dimensions activate (are non-zero) for any given input—typically fewer than 300 out of 16,384
Decoder: Maps \(D\)-dimensional sparse representation back to \(d\) dimensions
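To make this concrete, here is a minimal PyTorch sketch of the architecture above. The dimensions and names are illustrative and not taken from any particular SAE library:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode into an overcomplete latent, decode back."""
    def __init__(self, d_model: int = 768, d_latent: int = 16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)               # W_enc x + b
        self.decoder = nn.Linear(d_latent, d_model, bias=False)   # W_dec z

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # latent; becomes sparse once trained with a sparsity penalty
        x_hat = self.decoder(z)           # reconstruction of the original activation
        return x_hat, z

sae = SparseAutoencoder()
x = torch.randn(4, 768)                   # stand-in for a batch of residual-stream activations
x_hat, z = sae(x)
print(x_hat.shape, z.shape)               # torch.Size([4, 768]) torch.Size([4, 16384])
```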
14.2.2 Why Overcomplete?
Regular autoencoders use a bottleneck—a latent space smaller than the input, forcing compression. SAEs do the opposite: the latent space is larger than the input.
This seems backwards. Why make the latent space bigger?
Because we’re not trying to compress—we’re trying to decompose. We believe the 768-dimensional activation actually represents thousands of features in superposition. We give the SAE room to spread those features out into separate dimensions.
The sparsity constraint ensures this expansion doesn’t become trivial. The SAE can’t just copy the input to random latent dimensions. It must find a sparse basis where each input activates only a few latent features.
Think of SAEs like profiler stack sampling, but for representations. A profiler takes a complex execution and decomposes it into individual function calls. An SAE takes a complex activation and decomposes it into individual features. Both reveal hidden structure through systematic decomposition.
Regular autoencoders compress to a smaller space. SAEs expand to a larger space. Why does this expansion help with interpretability?
Hint: Think about what the input activation actually represents (from Chapter 6).
14.3 The Training Objective
SAEs are trained with two competing objectives:
1. Reconstruction: The output should match the input \[\mathcal{L}_{\text{recon}} = ||x - \hat{x}||^2\]
2. Sparsity: The latent representation should have mostly zeros \[\mathcal{L}_{\text{sparse}} = ||z||_1\]
The total loss is: \[\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda \cdot \mathcal{L}_{\text{sparse}}\]
where \(\lambda\) controls the trade-off.
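As a sketch, the combined objective can be written directly in PyTorch. The value of \(\lambda\) below is illustrative, and `sae` and `x` are reused from the sketch in 14.2.1:

```python
import torch

lambda_sparse = 5.0   # illustrative; in practice lambda is tuned per model and layer

def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # ||x - x_hat||^2, averaged over the batch
    sparsity = z.abs().sum(dim=-1).mean()           # ||z||_1, averaged over the batch
    return recon + lambda_sparse * sparsity

# One optimization step, assuming `sae` and a batch `x` as in the earlier sketch
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```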
14.3.1 The Trade-off
This is the fundamental tension:
- Low \(\lambda\): Accurate reconstruction, but dense (non-sparse) latents → less interpretable
- High \(\lambda\): Very sparse latents → more interpretable, but worse reconstruction
In practice, we accept 10-40% reconstruction error to achieve sufficient sparsity. This means SAEs don’t perfectly capture everything in the activation—but what they capture is interpretable.
You cannot have both perfect reconstruction and perfect interpretability. Some information in neural network activations may be fundamentally non-sparse—distributed in ways that resist decomposition. SAEs give us interpretable features at the cost of missing some signal.
14.4 What SAEs Produce
After training, an SAE gives you two things:
14.4.1 A Feature Dictionary
The decoder weights form a dictionary of feature directions. Each column of the decoder matrix is a feature vector—a direction in the original activation space.
If the SAE has 16,384 latent dimensions, you get 16,384 feature vectors. Each represents a potential concept the model might use.
14.4.2 Sparse Activations
For any input activation, the SAE produces a sparse vector of feature activations. If features 47, 892, and 3,041 are active (non-zero), those are the features present in this input.
The magnitude of each activation tells you how strongly that feature is present.
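Continuing the sketch from 14.2.1, both outputs are easy to read off: the decoder's columns are the feature dictionary, and the non-zero entries of the latent are the active features:

```python
import torch

# Reusing `sae` and `x` from the sketch in 14.2.1
feature_directions = sae.decoder.weight        # shape (768, 16384): column i is feature i's direction
x_hat, z = sae(x)                              # z: (batch, 16384); mostly zeros once trained

active = torch.nonzero(z[0]).squeeze(-1)       # indices of features present in the first input
for idx, strength in zip(active.tolist(), z[0, active].tolist()):
    print(f"feature {idx}: activation {strength:.3f}")
```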
14.4.3 Connection to Toy Models
Remember the pentagon from Chapter 7? In that toy model, 5 sparse features arranged themselves as a regular pentagon in 2D—the optimal geometry for minimizing interference.
SAEs are doing the same thing, but in reverse:
| Toy Models | SAEs |
|---|---|
| We define features, watch network learn geometry | Network has learned geometry, we recover features |
| 5 features → pentagon in 2D | Millions of features → complex polytope in 768D |
| Ground truth known | Ground truth unknown |
| Validates superposition theory | Applies superposition theory |
The toy model shows what should happen under ideal conditions. SAEs attempt to undo what happened in real networks. When an SAE finds features that are nearly orthogonal and activate sparsely, it’s recovering the structure that toy models predict.
Toy models predict that sparse features should arrange as near-orthogonal directions. SAE decoder columns (feature vectors) can be checked for this property. Finding that SAE features are indeed nearly orthogonal validates both the SAE and the superposition hypothesis.
In practice, SAE features in large models show:
- High sparsity: <300 features active per token (out of millions)
- Near-orthogonality: Decoder columns have low pairwise cosine similarity
- Polytope-like structure: Features cluster into interpretable groups
This is exactly what toy models predict—scaled up by orders of magnitude.
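One way to check near-orthogonality yourself is to measure the pairwise cosine similarity of decoder columns. A minimal sketch, using a small random matrix as a stand-in (a full 16,384-column check would need to be chunked to fit in memory):

```python
import torch
import torch.nn.functional as F

def max_offdiag_cosine(decoder_weight: torch.Tensor) -> float:
    """decoder_weight: (d_model, d_latent); each column is a feature direction."""
    dirs = F.normalize(decoder_weight, dim=0)   # unit-norm columns
    sims = dirs.T @ dirs                        # pairwise cosine similarities
    sims.fill_diagonal_(0.0)                    # ignore self-similarity
    return sims.abs().max().item()              # worst-case interference between two features

# For a trained SAE this is typically small but non-zero (near-orthogonal, not orthogonal).
# Random directions in high dimensions are already nearly orthogonal, so compare against that baseline.
print(max_offdiag_cosine(torch.randn(768, 4096)))
```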
14.5 Interpreting Features
How do you know what a feature means?
14.5.1 Max-Activating Examples
The primary method: find inputs that maximally activate each feature.
- Run many inputs through the SAE
- For each feature, record which inputs caused it to activate most strongly
- Look at those inputs—what do they have in common?
If feature 892 activates most strongly on text about cooking, mentions of recipes, and kitchen descriptions, it’s probably a “cooking” feature.
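A minimal sketch of this procedure, assuming a hypothetical `get_activations(text)` helper that returns the relevant residual-stream activations for each token:

```python
import heapq
import torch

def top_activating_examples(sae, texts, get_activations, feature_idx: int, k: int = 10):
    """Return the k texts whose strongest token-level activation of `feature_idx` is largest.

    `get_activations(text)` is a hypothetical helper returning a (n_tokens, d_model) tensor.
    """
    scored = []
    for text in texts:
        acts = get_activations(text)            # (n_tokens, d_model)
        _, z = sae(acts)                        # (n_tokens, d_latent)
        score = z[:, feature_idx].max().item()  # strongest activation anywhere in this text
        scored.append((score, text))
    return heapq.nlargest(k, scored)            # highest-scoring texts first
```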
14.5.2 Feature Steering
A more rigorous test: artificially boost or suppress the feature and observe behavior changes.
If amplifying feature 892 makes the model insert cooking references into unrelated text, you’ve verified it’s causally connected to “cooking.”
The famous “Golden Gate Bridge” feature was discovered this way: amplifying it made Claude identify as the Golden Gate Bridge.
Wrong: “SAE features are the actual features the model uses internally.”
Right: SAE features are a reconstruction that’s convenient for us to interpret. The real model may use different bases, or may not have discrete “features” at all in some representations. SAE features are useful proxies—they help us understand the model—but they’re not necessarily how the model “thinks.” The 10-40% reconstruction error reminds us that SAEs are approximations.
14.5.3 Automated Interpretation
For millions of features, manual inspection doesn’t scale. Researchers use:
- LLM-based feature labeling (ask another model to describe max-activating examples)
- Semantic clustering (group similar features)
- Activation correlation analysis
14.6 Anthropic’s Scaling Results
In 2024, Anthropic applied SAEs to Claude 3 Sonnet at unprecedented scale, training SAEs with up to 34 million features.
14.6.1 What They Found
- Millions of interpretable features across all layers
- Features for concepts ranging from specific entities (“Golden Gate Bridge”) to abstract properties (“deception,” “uncertainty”)
- Features for code patterns, mathematical concepts, linguistic phenomena
- Safety-relevant features: “backdoor,” “unsafe code,” “sycophancy”
14.6.2 Key Insights
Features cluster semantically: Similar features are geometrically nearby. Safety-related features cluster together. Domain knowledge features cluster together.
Features split with scale: Larger SAEs find finer-grained features. “Animals” might split into “mammals,” “birds,” “reptiles” with more capacity.
Features are sparse: Fewer than 300 features active per token, out of millions available.
Features transfer: Some features discovered on text also activate appropriately on images (for multimodal models).
Neuronpedia hosts SAE features for multiple models. Explore what interpretable features look like:
- Visit neuronpedia.org/gpt2-small for GPT-2 Small features
- Try the “Random Feature” button to see a randomly selected feature
- Look at the max-activating examples—can you guess what the feature represents?
- Search for features related to concepts like “programming” or “sentiment”
- Compare early-layer vs. late-layer features—notice how abstraction increases with depth
This is what SAE output looks like in practice. Each feature is a direction in activation space; Neuronpedia shows you what each direction “means.”
Before SAEs: We knew polysemanticity existed but couldn’t see through it.
After SAEs: We can decompose activations into millions of interpretable features.
This is the tool that makes large-scale interpretability research possible.
14.7 Practical Considerations
14.7.1 Where to Apply SAEs
Residual stream: The most common target. Contains the accumulated signal that all components read from and write to. Features here are often highly interpretable.
Layer selection: Different layers have different feature types:
- Early layers: Syntactic features, token patterns, formatting
- Middle layers: Semantic features, concepts, relationships
- Late layers: Task-specific features, output-relevant information
Attention outputs: Can also be productive, capturing input-output relationships specific to attention heads.
14.7.2 Computational Cost
Activation caching: First, run the base model on a large dataset (billions of tokens) and save activations. This is a one-time cost.
SAE training: With cached activations, training takes hours to days depending on SAE size. Recent work achieves training in under 30 minutes with optimized pipelines.
Memory: Proportional to dictionary size. A 16M feature SAE requires substantial GPU memory.
14.7.3 Key Hyperparameters
Dictionary size (\(D\)): Larger dictionaries find more features but cost more and may have more “dead” features that never activate.
Sparsity coefficient (\(\lambda\)): Higher values produce sparser (more interpretable) features but worse reconstruction. Typical values: 1-100.
Training data: Diverse data produces more general features. Billions of tokens recommended for convergence.
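Putting these together, a training configuration might look like the sketch below. Every value here is an illustrative assumption, not a recommended recipe:

```python
# Illustrative SAE training configuration (all values are assumptions)
sae_config = {
    "d_model": 768,                        # width of the layer being decomposed
    "expansion_factor": 32,                # d_latent = 32 * 768 = 24,576
    "sparsity_coefficient": 5.0,           # lambda; higher -> sparser latents, worse reconstruction
    "lr": 1e-4,                            # learning rate for the SAE optimizer
    "n_training_tokens": 2_000_000_000,    # billions of cached-activation tokens for convergence
}
```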
14.8 Limitations and Failure Modes
SAEs aren’t perfect. Understanding their limitations is essential.
14.8.1 Reconstruction Error
SAEs typically achieve 60-90% reconstruction accuracy. The missing 10-40% represents:
- Noise that doesn’t correspond to interpretable features
- Information that’s truly distributed (not sparse)
- Features the SAE failed to learn
This limits what we can conclude from SAE features alone.
14.8.2 Dead Features
Some latent dimensions never activate during training—they’re “dead.” Rates of 5-20% are common. These represent wasted capacity and may indicate the dictionary is too large.
Recent approaches (TopK SAEs, auxiliary losses) reduce dead features.
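A simple way to measure this is to track which latents ever fire over a large sample of cached activations, along these lines (reusing the `sae` sketch from 14.2.1):

```python
import torch

@torch.no_grad()
def dead_feature_fraction(sae, activation_batches, d_latent: int = 16_384) -> float:
    """Fraction of latent dimensions that never fire across the given batches (sketch)."""
    ever_fired = torch.zeros(d_latent, dtype=torch.bool)
    for x in activation_batches:            # each x: (batch, d_model) of cached activations
        _, z = sae(x)
        ever_fired |= (z > 0).any(dim=0)    # mark any latent that fired on this batch
    return 1.0 - ever_fired.float().mean().item()
```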
14.8.3 Feature Splitting
As you increase dictionary size, broad features split into finer-grained sub-features:
- “Math” → “algebra,” “geometry,” “calculus”
- “Animals” → “mammals,” “birds,” “reptiles”
This is sometimes useful (more precision) but sometimes problematic (which granularity is “correct”?).
14.8.4 Feature Absorption
A more concerning failure mode: when dictionary size increases, some features absorb others instead of splitting cleanly. A very common concept (like “Paris”) might absorb a broader concept (like “European capitals”), causing the broader feature to stop firing where it should.
Recent research (2024) has shown that feature absorption is more fundamental than initially understood. When features form hierarchies (parent-child relationships), the decomposition becomes theoretically unstable. This isn’t a bug to be fixed with better hyperparameters—it’s a structural challenge that may require fundamentally new approaches. Varying SAE sizes or sparsity penalties is insufficient.
14.8.5 No Ground Truth
The most fundamental limitation: we don’t know what the “true” features are. We can’t verify that SAE features are the same features the model “actually uses.”
Multiple SAEs trained on the same data produce somewhat different features. Which is right? We don’t know.
SAEs produce interpretable features, but interpretability isn’t proof of correctness. The features might be artifacts of the SAE training process rather than genuine properties of the original model. This is an open problem.
14.9 Alternatives and Improvements
14.9.1 TopK SAEs
Instead of an L1 penalty that gradually suppresses activations, TopK SAEs keep exactly the \(k\) largest activations and zero the rest.
Advantages: Direct control over sparsity, cleaner thresholding.
Disadvantages: The selection step is non-differentiable and requires auxiliary losses to handle dead features.
Recent work shows TopK SAEs can achieve better reconstruction-sparsity trade-offs.
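The TopK activation itself is only a few lines. Here is a sketch (the choice \(k = 64\) is illustrative):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Keep the k largest pre-activations per example, zero the rest (sketch)."""
    values, indices = pre_acts.topk(k, dim=-1)
    z = torch.zeros_like(pre_acts)
    z.scatter_(-1, indices, torch.relu(values))   # relu keeps activations non-negative
    return z

z = topk_activation(torch.randn(4, 16_384), k=64)
print((z != 0).sum(dim=-1))   # at most 64 active features per example
```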
14.9.2 Gated SAEs
Separate the decision “which features to use” from “how much to use each”:
- Gate network: Binary decision of which features activate
- Magnitude network: How strongly each feature activates
This eliminates “shrinkage” (systematic underestimation of feature magnitudes) and reduces active feature count.
Recent advances (2024): Gated SAEs need roughly half as many firing features as standard L1-penalized SAEs to reach comparable reconstruction fidelity. This represents a significant improvement to the reconstruction-sparsity frontier while maintaining interpretability.
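Schematically, the gated encoder separates the two decisions as in the sketch below. This is a simplified illustration of the idea, not the exact published parameterization:

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    """Sketch: a hard gate decides *which* features fire; a magnitude path decides *how much*."""
    def __init__(self, d_model: int = 768, d_latent: int = 16_384):
        super().__init__()
        self.gate = nn.Linear(d_model, d_latent)       # which features activate
        self.magnitude = nn.Linear(d_model, d_latent)  # how strongly they activate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        active = (self.gate(x) > 0).float()            # hard 0/1 gating decision
        return active * torch.relu(self.magnitude(x))  # gated magnitudes
```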
14.9.3 Other Approaches
- JumpReLU SAEs (2024): Use discontinuous activation functions with straight-through estimators to directly optimize L0 sparsity. Achieves state-of-the-art reconstruction on models like Gemma 2 9B.
- Switch SAEs: Use mixture-of-experts routing between smaller “expert” SAEs for computational efficiency at scale.
- Matryoshka SAEs: Train across multiple sparsity levels simultaneously, showing strong feature disentanglement.
- Contrastive losses: Encourage features to be more distinct from each other.
Recent comprehensive benchmarking (SAEBench, 2025) reveals a troubling finding: improvements on traditional proxy metrics (reconstruction loss, sparsity, interpretability scores) don’t reliably translate to practical performance on downstream tasks. Some SAE variants that excel on proxy metrics underperform on real applications, while others (like Matryoshka SAEs) underperform on metrics but excel at feature disentanglement. This suggests the field needs better evaluation methods.
14.10 Using SAE Features
Once you have features, what can you do with them?
14.10.1 Understanding
The most basic use: look at what features activate for specific inputs. If you want to understand how the model processes “The capital of France is ___”, examine which features fire.
This provides a vocabulary for discussing model internals: “Features 47, 892, and 3,041 are active” is more meaningful than “Neuron 234 has value 0.72.”
14.10.2 Steering
Manipulate features during generation:
- Amplify a feature → model emphasizes that concept
- Suppress a feature → model avoids that concept
More targeted than prompt engineering, because you’re intervening on the model’s internal representations.
Steering is one of the most powerful applications of SAE features—you can modify model behavior by directly adjusting its internal representations.
14.10.3 The Basic Recipe
- Find a relevant feature: Use max-activating examples to identify a feature for the concept you want to influence
- Extract the feature direction: Get the decoder column for that feature
- Add or subtract during inference: Inject the direction into the residual stream
14.10.4 Conceptual Example
```python
# 1. Find the "sycophancy" feature (hypothetical index 4721)
sycophancy_direction = sae.decoder.weight[:, 4721]

# 2. During generation, subtract to reduce sycophantic behavior
def steering_hook(residual, hook):
    # Subtract the sycophancy direction (scaled)
    return residual - 3.0 * sycophancy_direction

# 3. Run with the intervention
with model.hooks(fwd_hooks=[("blocks.15.hook_resid_post", steering_hook)]):
    output = model.generate(prompt)
```
14.10.5 Steering Strength
The multiplier (3.0 above) controls intervention strength:
- Too weak: No noticeable effect
- Just right: Behavior shifts in the desired direction
- Too strong: Coherence breaks down, bizarre outputs
Finding the right strength requires experimentation. Start with 1x the feature’s typical activation magnitude, then adjust.
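A quick way to find a workable value is to sweep the multiplier and read the outputs, reusing `model`, `sae`, and `prompt` from the conceptual example above:

```python
# Sweep the steering coefficient, reusing `model`, `sae`, and `prompt` from the example above
direction = sae.decoder.weight[:, 4721]

for strength in [0.5, 1.0, 2.0, 4.0, 8.0]:
    # Default argument captures the current strength for this hook
    def steer(residual, hook, strength=strength):
        return residual - strength * direction
    with model.hooks(fwd_hooks=[("blocks.15.hook_resid_post", steer)]):
        print(strength, model.generate(prompt))
```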
14.10.6 When Steering Works Well
| Use Case | Why It Works |
|---|---|
| Style adjustment | Style features are often localized |
| Content emphasis | Semantic features steer topic focus |
| Safety guardrails | Suppressing harmful content features |
| Debugging | Verify a feature’s causal role |
14.10.7 When Steering Struggles
- Distributed capabilities: If behavior involves many features, steering one doesn’t suffice
- Feature absorption: The “right” feature might have absorbed into a related one
- Interference: Steering one feature may unexpectedly affect others
14.10.8 The Golden Gate Bridge Example
Anthropic’s famous demonstration amplified a “Golden Gate Bridge” feature in Claude. The model began inserting bridge references everywhere, even claiming to be the bridge. This dramatically illustrated both the power and the strangeness of feature-level intervention.
Key insight: The model doesn’t “resist” the intervention—it tries to make coherent sense of activations that now strongly include “Golden Gate Bridge,” even in contexts where that makes no sense.
| Property | Steering | Fine-tuning |
|---|---|---|
| Persistence | Per-inference only | Permanent |
| Precision | Single feature | Many weights |
| Reversibility | Instant | Requires retraining |
| Compute | Negligible | Expensive |
| Interpretability | Transparent | Opaque |
Steering is surgical intervention; fine-tuning is systemic change. For exploration and debugging, steering is often preferable.
14.10.9 Circuit Discovery
SAE features provide a cleaner basis for tracing circuits:
- Which features cause which downstream features?
- How do features combine across layers?
- What’s the path from input features to output features?
With monosemantic features, circuit discovery becomes more tractable than with polysemantic neurons.
14.10.10 Safety Applications
Identify features related to:
- Harmful content
- Deception
- Sycophancy
- Unsafe code patterns
Potentially suppress these features during deployment, or flag inputs where they activate strongly.
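As a sketch of the flagging idea (the feature indices and threshold here are hypothetical and would have to be identified and tuned per model):

```python
import torch

# Hypothetical safety-relevant feature indices and a threshold to be tuned on held-out data
SAFETY_FEATURES = {4721: "sycophancy", 9313: "unsafe-code"}
THRESHOLD = 5.0

def flag_safety_features(z: torch.Tensor) -> list[str]:
    """z: (n_tokens, d_latent) SAE feature activations for one input."""
    flagged = []
    for idx, name in SAFETY_FEATURES.items():
        if z[:, idx].max().item() > THRESHOLD:   # feature fires strongly somewhere in the input
            flagged.append(name)
    return flagged
```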
Explore real SAE features interactively at Neuronpedia. Browse features discovered in GPT-2, Gemma 2, Llama 3, and other models. Search for concepts like “code,” “emotion,” or “deception” to see what features researchers have found. Each feature shows max-activating examples, helping you build intuition for what SAEs actually discover.
14.11 Polya’s Perspective: Auxiliary Constructions
In Polya’s framework, the SAE is an auxiliary construction—something we add to make the problem tractable.
The original problem: understand what’s represented in a 768-dimensional activation vector packed with superposed features.
The auxiliary construction: train an SAE that decomposes the vector into a sparse combination of learned features.
The SAE isn’t part of the original model. It’s a tool we build to make the model’s internals visible. Like adding a construction line in geometry, it doesn’t change the underlying truth—it reveals structure that was always there.
“Introduce auxiliary elements.” When a problem is too hard to solve directly, add something that makes the structure visible. SAEs are auxiliary elements for interpretability: we add them to see what was hidden.
14.12 Looking Ahead
We now have a tool for extracting features. But features in isolation don’t explain model behavior. We need techniques for:
- Attribution: Which features contributed to this output? (Chapter 10)
- Patching: Is this feature causally necessary? (Chapter 11)
- Ablation: What happens if we remove this component? (Chapter 12)
SAEs give us the vocabulary (features). The next chapters give us the grammar (how to trace, verify, and understand feature contributions).
14.13 Common Confusions
SAE features are one decomposition of the model’s representations, not necessarily the decomposition. Different SAE training runs produce different features. The model doesn’t “know about” your SAE—you’re projecting a structure onto its activations. Treat SAE features as useful hypotheses, not ground truth.
Larger dictionaries capture more features but suffer from splitting (one concept becomes many redundant features) and absorption (common features absorb rare ones). There’s a sweet spot that depends on the model and layer. Typically 16x-32x expansion is good; 128x may be overkill.
Reconstruction loss measures how well the SAE approximates the original activations, not how interpretable or useful the features are. An SAE could achieve low loss with uninterpretable features. Always evaluate interpretability and downstream task performance, not just loss.
Steering shows that adding this direction changes behavior in a consistent way. It doesn’t prove the model “uses” this feature naturally. You’re demonstrating causal influence, not necessarily that the direction matches the model’s internal representation. Steering is evidence, not proof.
14.14 Further Reading
Towards Monosemanticity — Anthropic: The foundational paper introducing SAEs for interpretability.
Scaling Monosemanticity — Anthropic: Applying SAEs to Claude 3 Sonnet, finding millions of interpretable features.
Sparse Autoencoders Find Highly Interpretable Features — arXiv:2309.08600: Technical details on SAE training and evaluation.
A is for Absorption — arXiv:2409.14507: Analysis of feature splitting and absorption failure modes.
Improving Sparse Decomposition with Gated SAEs — NeurIPS 2024: The gated SAE variant that reduces shrinkage.
Neuronpedia — neuronpedia.org: Interactive explorer for SAE features across various models.