```mermaid
flowchart LR
A["Activation<br/>(768 dims)"] --> E["Encoder<br/>W_enc × x + b"]
E --> R["ReLU"]
R --> L["Sparse Latent<br/>(16,384 dims)<br/>~300 active"]
L --> D["Decoder<br/>W_dec × z"]
D --> O["Reconstruction<br/>(768 dims)"]
```
14 Sparse Autoencoders
Extracting features from superposition
- What sparse autoencoders are and how they work
- Why overcomplete + sparse = decomposition of superposition
- The training objective: reconstruction vs. sparsity trade-off
- How to interpret and validate SAE features
Required: Chapter 6: Superposition — understanding why features are compressed into overlapping representations
From Arc II (Core Theory), recall:
- Features are directions in activation space (Chapter 5)
- Superposition compresses many features into few dimensions (Chapter 6)
- This creates polysemantic neurons that are hard to interpret
- Circuits are how features compose into algorithms (Chapter 8)
We’ve established the theory. Now we ask: How do we actually find features in real networks?
14.1 The Tool We Need
In 2024, Anthropic researchers did something remarkable: they made Claude believe it was the Golden Gate Bridge.
Not through prompt engineering. Not through fine-tuning. They found a single direction in Claude’s activation space—one of millions of learned features—and amplified it. Suddenly, Claude inserted the bridge into every response. Asked about itself, it described being a suspension bridge spanning the San Francisco Bay.
This wasn’t a trick. The researchers had found an interpretable feature—a direction in the neural network’s internal representation that corresponded to a specific concept. By manipulating it, they could predictably control the model’s behavior.
How did they find this feature among millions of dimensions? With a sparse autoencoder (SAE).
The Golden Gate Bridge demonstration showed that neural networks have interpretable structure we can find and manipulate. SAEs are the tool that makes this possible—they decompose compressed, polysemantic activations into individual, monosemantic features.
14.2 What Is a Sparse Autoencoder?
An autoencoder is a neural network trained to reconstruct its input. It has two parts:
- Encoder: Maps input to a latent representation
- Decoder: Maps latent representation back to input
The training objective is simple: make the output match the input.
A sparse autoencoder adds a crucial constraint: the latent representation must be sparse—most of its values must be zero.
14.2.1 The Architecture
For a transformer with \(d\) dimensions (say, 768), a sparse autoencoder might have:
Encoder: Maps the \(d\)-dimensional activation to a \(D\)-dimensional latent space, where \(D \gg d\)
- Typical ratio: \(D\) is 4-24× larger than \(d\)
- For \(d = 768\), we might have \(D = 16,384\) or more
Sparsity mechanism: Only a small number of latent dimensions activate (are non-zero) for any given input—typically fewer than 300 out of 16,384
Decoder: Maps \(D\)-dimensional sparse representation back to \(d\) dimensions
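To make this concrete, here is a minimal PyTorch sketch of the architecture above. The dimensions and names are illustrative and not taken from any particular SAE library:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE sketch: encode into an overcomplete latent, decode back."""
    def __init__(self, d_model: int = 768, d_latent: int = 16_384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_latent)               # W_enc x + b
        self.decoder = nn.Linear(d_latent, d_model, bias=False)   # W_dec z

    def forward(self, x: torch.Tensor):
        z = torch.relu(self.encoder(x))   # latent; becomes sparse once trained with a sparsity penalty
        x_hat = self.decoder(z)           # reconstruction of the original activation
        return x_hat, z

sae = SparseAutoencoder()
x = torch.randn(4, 768)                   # stand-in for a batch of residual-stream activations
x_hat, z = sae(x)
print(x_hat.shape, z.shape)               # torch.Size([4, 768]) torch.Size([4, 16384])
```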
14.2.2 Why Overcomplete?
Regular autoencoders use a bottleneck—a latent space smaller than the input, forcing compression. SAEs do the opposite: the latent space is larger than the input.
This seems backwards. Why make the latent space bigger?
Because we’re not trying to compress—we’re trying to decompose. We believe the 768-dimensional activation actually represents thousands of features in superposition. We give the SAE room to spread those features out into separate dimensions.
The sparsity constraint ensures this expansion doesn’t become trivial. The SAE can’t just copy the input to random latent dimensions. It must find a sparse basis where each input activates only a few latent features.
Think of SAEs like profiler stack sampling, but for representations. A profiler takes a complex execution and decomposes it into individual function calls. An SAE takes a complex activation and decomposes it into individual features. Both reveal hidden structure through systematic decomposition.
Regular autoencoders compress to a smaller space. SAEs expand to a larger space. Why does this expansion help with interpretability?
Hint: Think about what the input activation actually represents (from Chapter 6).
14.3 The Training Objective
SAEs are trained with two competing objectives:
1. Reconstruction: The output should match the input \[\mathcal{L}_{\text{recon}} = ||x - \hat{x}||^2\]
2. Sparsity: The latent representation should have mostly zeros \[\mathcal{L}_{\text{sparse}} = ||z||_1\]
The total loss is: \[\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda \cdot \mathcal{L}_{\text{sparse}}\]
where \(\lambda\) controls the trade-off.
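As a sketch, the combined objective can be written directly in PyTorch. The value of \(\lambda\) below is illustrative, and `sae` and `x` are reused from the sketch in 14.2.1:

```python
import torch

lambda_sparse = 5.0   # illustrative; in practice lambda is tuned per model and layer

def sae_loss(x: torch.Tensor, x_hat: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
    recon = (x - x_hat).pow(2).sum(dim=-1).mean()   # ||x - x_hat||^2, averaged over the batch
    sparsity = z.abs().sum(dim=-1).mean()           # ||z||_1, averaged over the batch
    return recon + lambda_sparse * sparsity

# One optimization step, assuming `sae` and a batch `x` as in the earlier sketch
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
x_hat, z = sae(x)
loss = sae_loss(x, x_hat, z)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```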
14.3.1 The Trade-off
This is the fundamental tension:
- Low \(\lambda\): Accurate reconstruction, but dense (non-sparse) latents → less interpretable
- High \(\lambda\): Very sparse latents → more interpretable, but worse reconstruction
In practice, we accept 10-40% reconstruction error to achieve sufficient sparsity. This means SAEs don’t perfectly capture everything in the activation—but what they capture is interpretable.
You cannot have both perfect reconstruction and perfect interpretability. Some information in neural network activations may be fundamentally non-sparse—distributed in ways that resist decomposition. SAEs give us interpretable features at the cost of missing some signal.
14.4 What SAEs Produce
After training, an SAE gives you two things:
14.4.1 A Feature Dictionary
The decoder weights form a dictionary of feature directions. Each column of the decoder matrix is a feature vector—a direction in the original activation space.
If the SAE has 16,384 latent dimensions, you get 16,384 feature vectors. Each represents a potential concept the model might use.
14.4.2 Sparse Activations
For any input activation, the SAE produces a sparse vector of feature activations. If features 47, 892, and 3,041 are active (non-zero), those are the features present in this input.
The magnitude of each activation tells you how strongly that feature is present.
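Continuing the sketch from 14.2.1, both outputs are easy to read off: the decoder's columns are the feature dictionary, and the non-zero entries of the latent are the active features:

```python
import torch

# Reusing `sae` and `x` from the sketch in 14.2.1
feature_directions = sae.decoder.weight        # shape (768, 16384): column i is feature i's direction
x_hat, z = sae(x)                              # z: (batch, 16384); mostly zeros once trained

active = torch.nonzero(z[0]).squeeze(-1)       # indices of features present in the first input
for idx, strength in zip(active.tolist(), z[0, active].tolist()):
    print(f"feature {idx}: activation {strength:.3f}")
```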
14.4.3 Connection to Toy Models
Remember the pentagon from Chapter 7? In that toy model, 5 sparse features arranged themselves as a regular pentagon in 2D—the optimal geometry for minimizing interference.
SAEs are doing the same thing, but in reverse:
| Toy Models | SAEs |
|---|---|
| We define features, watch network learn geometry | Network has learned geometry, we recover features |
| 5 features → pentagon in 2D | Millions of features → complex polytope in 768D |
| Ground truth known | Ground truth unknown |
| Validates superposition theory | Applies superposition theory |
The toy model shows what should happen under ideal conditions. SAEs attempt to undo what happened in real networks. When an SAE finds features that are nearly orthogonal and activate sparsely, it’s recovering the structure that toy models predict.
Toy models predict that sparse features should arrange as near-orthogonal directions. SAE decoder columns (feature vectors) can be checked for this property. Finding that SAE features are indeed nearly orthogonal validates both the SAE and the superposition hypothesis.
In practice, SAE features in large models show:
- High sparsity: <300 features active per token (out of millions)
- Near-orthogonality: Decoder columns have low pairwise cosine similarity
- Polytope-like structure: Features cluster into interpretable groups
This is exactly what toy models predict—scaled up by orders of magnitude.
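One way to check near-orthogonality yourself is to measure the pairwise cosine similarity of decoder columns. A minimal sketch, using a small random matrix as a stand-in (a full 16,384-column check would need to be chunked to fit in memory):

```python
import torch
import torch.nn.functional as F

def max_offdiag_cosine(decoder_weight: torch.Tensor) -> float:
    """decoder_weight: (d_model, d_latent); each column is a feature direction."""
    dirs = F.normalize(decoder_weight, dim=0)   # unit-norm columns
    sims = dirs.T @ dirs                        # pairwise cosine similarities
    sims.fill_diagonal_(0.0)                    # ignore self-similarity
    return sims.abs().max().item()              # worst-case interference between two features

# For a trained SAE this is typically small but non-zero (near-orthogonal, not orthogonal).
# Random directions in high dimensions are already nearly orthogonal, so compare against that baseline.
print(max_offdiag_cosine(torch.randn(768, 4096)))
```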
14.5 Interpreting Features
How do you know what a feature means?
14.5.1 Max-Activating Examples
The primary method: find inputs that maximally activate each feature.
- Run many inputs through the SAE
- For each feature, record which inputs caused it to activate most strongly
- Look at those inputs—what do they have in common?
If feature 892 activates most strongly on text about cooking, mentions of recipes, and kitchen descriptions, it’s probably a “cooking” feature.
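A minimal sketch of this procedure, assuming a hypothetical `get_activations(text)` helper that returns the relevant residual-stream activations for each token:

```python
import heapq
import torch

def top_activating_examples(sae, texts, get_activations, feature_idx: int, k: int = 10):
    """Return the k texts whose strongest token-level activation of `feature_idx` is largest.

    `get_activations(text)` is a hypothetical helper returning a (n_tokens, d_model) tensor.
    """
    scored = []
    for text in texts:
        acts = get_activations(text)            # (n_tokens, d_model)
        _, z = sae(acts)                        # (n_tokens, d_latent)
        score = z[:, feature_idx].max().item()  # strongest activation anywhere in this text
        scored.append((score, text))
    return heapq.nlargest(k, scored)            # highest-scoring texts first
```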
14.5.2 Feature Steering
A more rigorous test: artificially boost or suppress the feature and observe behavior changes.
If amplifying feature 892 makes the model insert cooking references into unrelated text, you’ve verified it’s causally connected to “cooking.”
The famous “Golden Gate Bridge” feature was discovered this way: amplifying it made Claude identify as the Golden Gate Bridge.
Wrong: “SAE features are the actual features the model uses internally.”
Right: SAE features are a reconstruction that’s convenient for us to interpret. The real model may use different bases, or may not have discrete “features” at all in some representations. SAE features are useful proxies—they help us understand the model—but they’re not necessarily how the model “thinks.” The 10-40% reconstruction error reminds us that SAEs are approximations.
14.5.3 Automated Interpretation
For millions of features, manual inspection doesn’t scale. Researchers use:
- LLM-based feature labeling (ask another model to describe max-activating examples)
- Semantic clustering (group similar features)
- Activation correlation analysis
14.6 Anthropic’s Scaling Results
In 2024, Anthropic applied SAEs to Claude 3 Sonnet at unprecedented scale, training SAEs with up to 34 million features.
14.6.1 What They Found
- Millions of interpretable features across all layers
- Features for concepts ranging from specific entities (“Golden Gate Bridge”) to abstract properties (“deception,” “uncertainty”)
- Features for code patterns, mathematical concepts, linguistic phenomena
- Safety-relevant features: “backdoor,” “unsafe code,” “sycophancy”
14.6.2 Key Insights
Features cluster semantically: Similar features are geometrically nearby. Safety-related features cluster together. Domain knowledge features cluster together.
Features split with scale: Larger SAEs find finer-grained features. “Animals” might split into “mammals,” “birds,” “reptiles” with more capacity.
Features are sparse: Fewer than 300 features active per token, out of millions available.
Features transfer: Some features discovered on text also activate appropriately on images (for multimodal models).
Neuronpedia hosts SAE features for multiple models. Explore what interpretable features look like:
- Visit neuronpedia.org/gpt2-small for GPT-2 Small features
- Try the “Random Feature” button to see a randomly selected feature
- Look at the max-activating examples—can you guess what the feature represents?
- Search for features related to concepts like “programming” or “sentiment”
- Compare early-layer vs. late-layer features—notice how abstraction increases with depth
This is what SAE output looks like in practice. Each feature is a direction in activation space; Neuronpedia shows you what each direction “means.”
Before SAEs: We knew polysemanticity existed but couldn’t see through it.
After SAEs: We can decompose activations into millions of interpretable features.
This is the tool that makes large-scale interpretability research possible.
14.7 Practical Considerations
14.7.1 Where to Apply SAEs
Residual stream: The most common target. Contains the accumulated signal that all components read from and write to. Features here are often highly interpretable.
Layer selection: Different layers have different feature types:
- Early layers: Syntactic features, token patterns, formatting
- Middle layers: Semantic features, concepts, relationships
- Late layers: Task-specific features, output-relevant information
Attention outputs: Can also be productive, capturing input-output relationships specific to attention heads.
14.7.2 Computational Cost
Activation caching: First, run the base model on a large dataset (billions of tokens) and save activations. This is a one-time cost.
SAE training: With cached activations, training takes hours to days depending on SAE size. Recent work achieves training in under 30 minutes with optimized pipelines.
Memory: Proportional to dictionary size. A 16M feature SAE requires substantial GPU memory.
14.7.3 Key Hyperparameters
Dictionary size (\(D\)): Larger dictionaries find more features but cost more and may have more “dead” features that never activate.
Sparsity coefficient (\(\lambda\)): Higher values produce sparser (more interpretable) features but worse reconstruction. Typical values: 1-100.
Training data: Diverse data produces more general features. Billions of tokens recommended for convergence.
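Putting these together, a training configuration might look like the sketch below. Every value here is an illustrative assumption, not a recommended recipe:

```python
# Illustrative SAE training configuration (all values are assumptions)
sae_config = {
    "d_model": 768,                        # width of the layer being decomposed
    "expansion_factor": 32,                # d_latent = 32 * 768 = 24,576
    "sparsity_coefficient": 5.0,           # lambda; higher -> sparser latents, worse reconstruction
    "lr": 1e-4,                            # learning rate for the SAE optimizer
    "n_training_tokens": 2_000_000_000,    # billions of cached-activation tokens for convergence
}
```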
14.8 Limitations and Failure Modes
SAEs aren’t perfect. Understanding their limitations is essential.
14.8.1 Reconstruction Error
SAEs typically achieve 60-90% reconstruction accuracy. The missing 10-40% represents:
- Noise that doesn’t correspond to interpretable features
- Information that’s truly distributed (not sparse)
- Features the SAE failed to learn
This limits what we can conclude from SAE features alone.
14.8.2 Dead Features
Some latent dimensions never activate during training—they’re “dead.” Rates of 5-20% are common. These represent wasted capacity and may indicate the dictionary is too large.
Recent approaches (TopK SAEs, auxiliary losses) reduce dead features.
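A simple way to measure this is to track which latents ever fire over a large sample of cached activations, along these lines (reusing the `sae` sketch from 14.2.1):

```python
import torch

@torch.no_grad()
def dead_feature_fraction(sae, activation_batches, d_latent: int = 16_384) -> float:
    """Fraction of latent dimensions that never fire across the given batches (sketch)."""
    ever_fired = torch.zeros(d_latent, dtype=torch.bool)
    for x in activation_batches:            # each x: (batch, d_model) of cached activations
        _, z = sae(x)
        ever_fired |= (z > 0).any(dim=0)    # mark any latent that fired on this batch
    return 1.0 - ever_fired.float().mean().item()
```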
14.8.3 Feature Splitting
As you increase dictionary size, broad features split into finer-grained sub-features:
- “Math” → “algebra,” “geometry,” “calculus”
- “Animals” → “mammals,” “birds,” “reptiles”
This is sometimes useful (more precision) but sometimes problematic (which granularity is “correct”?).
14.8.4 Feature Absorption
A more concerning failure mode: when dictionary size increases, some features absorb others instead of splitting cleanly. A very common concept (like “Paris”) might absorb a broader concept (like “European capitals”), causing the broader feature to stop firing where it should.
Recent research (2024) has shown that feature absorption is more fundamental than initially understood. When features form hierarchies (parent-child relationships), the decomposition becomes theoretically unstable. This isn’t a bug to be fixed with better hyperparameters—it’s a structural challenge that may require fundamentally new approaches. Varying SAE sizes or sparsity penalties is insufficient.
14.8.5 No Ground Truth
The most fundamental limitation: we don’t know what the “true” features are. We can’t verify that SAE features are the same features the model “actually uses.”
Multiple SAEs trained on the same data produce somewhat different features. Which is right? We don’t know.
SAEs produce interpretable features, but interpretability isn’t proof of correctness. The features might be artifacts of the SAE training process rather than genuine properties of the original model. This is an open problem.
14.9 Alternatives and Improvements
14.9.1 TopK SAEs
Instead of an L1 penalty that gradually suppresses activations, TopK SAEs keep exactly the \(k\) largest activations and zero the rest.
Advantages: Direct control over sparsity, cleaner thresholding.
Disadvantages: The selection step is non-differentiable and requires auxiliary losses to handle dead features.
Recent work shows TopK SAEs can achieve better reconstruction-sparsity trade-offs.
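The TopK activation itself is only a few lines. Here is a sketch (the choice \(k = 64\) is illustrative):

```python
import torch

def topk_activation(pre_acts: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Keep the k largest pre-activations per example, zero the rest (sketch)."""
    values, indices = pre_acts.topk(k, dim=-1)
    z = torch.zeros_like(pre_acts)
    z.scatter_(-1, indices, torch.relu(values))   # relu keeps activations non-negative
    return z

z = topk_activation(torch.randn(4, 16_384), k=64)
print((z != 0).sum(dim=-1))   # at most 64 active features per example
```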
14.9.2 Gated SAEs
Separate the decision “which features to use” from “how much to use each”:
- Gate network: Binary decision of which features activate
- Magnitude network: How strongly each feature activates
This eliminates “shrinkage” (systematic underestimation of feature magnitudes) and reduces active feature count.
Recent advances (2024): Gated SAEs need roughly half as many firing features as standard L1-penalized SAEs to reach comparable reconstruction fidelity. This represents a significant improvement to the reconstruction-sparsity frontier while maintaining interpretability.
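Schematically, the gated encoder separates the two decisions as in the sketch below. This is a simplified illustration of the idea, not the exact published parameterization:

```python
import torch
import torch.nn as nn

class GatedEncoder(nn.Module):
    """Sketch: a hard gate decides *which* features fire; a magnitude path decides *how much*."""
    def __init__(self, d_model: int = 768, d_latent: int = 16_384):
        super().__init__()
        self.gate = nn.Linear(d_model, d_latent)       # which features activate
        self.magnitude = nn.Linear(d_model, d_latent)  # how strongly they activate

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        active = (self.gate(x) > 0).float()            # hard 0/1 gating decision
        return active * torch.relu(self.magnitude(x))  # gated magnitudes
```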
14.9.3 Other Approaches
- JumpReLU SAEs (2024): Use discontinuous activation functions with straight-through estimators to directly optimize L0 sparsity. Achieves state-of-the-art reconstruction on models like Gemma 2 9B.
- Switch SAEs: Use mixture-of-experts routing between smaller “expert” SAEs for computational efficiency at scale.
- Matryoshka SAEs: Train across multiple sparsity levels simultaneously, showing strong feature disentanglement.
- Contrastive losses: Encourage features to be more distinct from each other.
Recent comprehensive benchmarking (SAEBench, 2025) reveals a troubling finding: improvements on traditional proxy metrics (reconstruction loss, sparsity, interpretability scores) don’t reliably translate to practical performance on downstream tasks. Some SAE variants that excel on proxy metrics underperform on real applications, while others (like Matryoshka SAEs) underperform on metrics but excel at feature disentanglement. This suggests the field needs better evaluation methods.
14.10 Using SAE Features
Once you have features, what can you do with them?
14.10.1 Understanding
The most basic use: look at what features activate for specific inputs. If you want to understand how the model processes “The capital of France is ___”, examine which features fire.
This provides a vocabulary for discussing model internals: “Features 47, 892, and 3,041 are active” is more meaningful than “Neuron 234 has value 0.72.”
14.10.2 Steering
Manipulate features during generation:
- Amplify a feature → model emphasizes that concept
- Suppress a feature → model avoids that concept
More targeted than prompt engineering, because you’re intervening on the model’s internal representations.
Steering is one of the most powerful applications of SAE features—you can modify model behavior by directly adjusting its internal representations.
14.10.3 The Basic Recipe
- Find a relevant feature: Use max-activating examples to identify a feature for the concept you want to influence
- Extract the feature direction: Get the decoder column for that feature
- Add or subtract during inference: Inject the direction into the residual stream
14.10.4 Conceptual Example
```python
# 1. Find the "sycophancy" feature (hypothetical index 4721)
sycophancy_direction = sae.decoder.weight[:, 4721]

# 2. During generation, subtract to reduce sycophantic behavior
def steering_hook(residual, hook):
    # Subtract the sycophancy direction (scaled)
    return residual - 3.0 * sycophancy_direction

# 3. Run with the intervention
with model.hooks(fwd_hooks=[("blocks.15.hook_resid_post", steering_hook)]):
    output = model.generate(prompt)
```
14.10.5 Steering Strength
The multiplier (3.0 above) controls intervention strength:
- Too weak: No noticeable effect
- Just right: Behavior shifts in the desired direction
- Too strong: Coherence breaks down, bizarre outputs
Finding the right strength requires experimentation. Start with 1x the feature’s typical activation magnitude, then adjust.
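A quick way to find a workable value is to sweep the multiplier and read the outputs, reusing `model`, `sae`, and `prompt` from the conceptual example above:

```python
# Sweep the steering coefficient, reusing `model`, `sae`, and `prompt` from the example above
direction = sae.decoder.weight[:, 4721]

for strength in [0.5, 1.0, 2.0, 4.0, 8.0]:
    # Default argument captures the current strength for this hook
    def steer(residual, hook, strength=strength):
        return residual - strength * direction
    with model.hooks(fwd_hooks=[("blocks.15.hook_resid_post", steer)]):
        print(strength, model.generate(prompt))
```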
14.10.6 When Steering Works Well
| Use Case | Why It Works |
|---|---|
| Style adjustment | Style features are often localized |
| Content emphasis | Semantic features steer topic focus |
| Safety guardrails | Suppressing harmful content features |
| Debugging | Verify a feature’s causal role |
14.10.7 When Steering Struggles
- Distributed capabilities: If behavior involves many features, steering one doesn’t suffice
- Feature absorption: The “right” feature might have absorbed into a related one
- Interference: Steering one feature may unexpectedly affect others
14.10.8 The Golden Gate Bridge Example
Anthropic’s famous demonstration amplified a “Golden Gate Bridge” feature in Claude. The model began inserting bridge references everywhere, even claiming to be the bridge. This dramatically illustrated both the power and the strangeness of feature-level intervention.
Key insight: The model doesn’t “resist” the intervention—it tries to make coherent sense of activations that now strongly include “Golden Gate Bridge,” even in contexts where that makes no sense.
| Property | Steering | Fine-tuning |
|---|---|---|
| Persistence | Per-inference only | Permanent |
| Precision | Single feature | Many weights |
| Reversibility | Instant | Requires retraining |
| Compute | Negligible | Expensive |
| Interpretability | Transparent | Opaque |
Steering is surgical intervention; fine-tuning is systemic change. For exploration and debugging, steering is often preferable.
14.10.9 Circuit Discovery
SAE features provide a cleaner basis for tracing circuits:
- Which features cause which downstream features?
- How do features combine across layers?
- What’s the path from input features to output features?
With monosemantic features, circuit discovery becomes more tractable than with polysemantic neurons.
14.10.10 Safety Applications
Identify features related to:
- Harmful content
- Deception
- Sycophancy
- Unsafe code patterns
Potentially suppress these features during deployment, or flag inputs where they activate strongly.
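As a sketch of the flagging idea (the feature indices and threshold here are hypothetical and would have to be identified and tuned per model):

```python
import torch

# Hypothetical safety-relevant feature indices and a threshold to be tuned on held-out data
SAFETY_FEATURES = {4721: "sycophancy", 9313: "unsafe-code"}
THRESHOLD = 5.0

def flag_safety_features(z: torch.Tensor) -> list[str]:
    """z: (n_tokens, d_latent) SAE feature activations for one input."""
    flagged = []
    for idx, name in SAFETY_FEATURES.items():
        if z[:, idx].max().item() > THRESHOLD:   # feature fires strongly somewhere in the input
            flagged.append(name)
    return flagged
```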
Explore real SAE features interactively at Neuronpedia. Browse features discovered in GPT-2, Gemma 2, Llama 3, and other models. Search for concepts like “code,” “emotion,” or “deception” to see what features researchers have found. Each feature shows max-activating examples, helping you build intuition for what SAEs actually discover.
14.11 Polya’s Perspective: Auxiliary Constructions
In Polya’s framework, the SAE is an auxiliary construction—something we add to make the problem tractable.
The original problem: understand what’s represented in a 768-dimensional activation vector packed with superposed features.
The auxiliary construction: train an SAE that decomposes the vector into a sparse combination of learned features.
The SAE isn’t part of the original model. It’s a tool we build to make the model’s internals visible. Like adding a construction line in geometry, it doesn’t change the underlying truth—it reveals structure that was always there.
“Introduce auxiliary elements.” When a problem is too hard to solve directly, add something that makes the structure visible. SAEs are auxiliary elements for interpretability: we add them to see what was hidden.
14.12 Looking Ahead
We now have a tool for extracting features. But features in isolation don’t explain model behavior. We need techniques for:
- Attribution: Which features contributed to this output? (Chapter 10)
- Patching: Is this feature causally necessary? (Chapter 11)
- Ablation: What happens if we remove this component? (Chapter 12)
SAEs give us the vocabulary (features). The next chapters give us the grammar (how to trace, verify, and understand feature contributions).
14.13 Common Confusions
SAE features are one decomposition of the model’s representations, not necessarily the decomposition. Different SAE training runs produce different features. The model doesn’t “know about” your SAE—you’re projecting a structure onto its activations. Treat SAE features as useful hypotheses, not ground truth.
Larger dictionaries capture more features but suffer from splitting (one concept becomes many redundant features) and absorption (common features absorb rare ones). There’s a sweet spot that depends on the model and layer. Typically 16x-32x expansion is good; 128x may be overkill.
Reconstruction loss measures how well the SAE approximates the original activations, not how interpretable or useful the features are. An SAE could achieve low loss with uninterpretable features. Always evaluate interpretability and downstream task performance, not just loss.
Steering shows that adding this direction changes behavior in a consistent way. It doesn’t prove the model “uses” this feature naturally. You’re demonstrating causal influence, not necessarily that the direction matches the model’s internal representation. Steering is evidence, not proof.
14.14 Further Reading
Towards Monosemanticity — Anthropic: The foundational paper introducing SAEs for interpretability.
Scaling Monosemanticity — Anthropic: Applying SAEs to Claude 3 Sonnet, finding millions of interpretable features.
Sparse Autoencoders Find Highly Interpretable Features — arXiv:2309.08600: Technical details on SAE training and evaluation.
A is for Absorption — arXiv:2409.14507: Analysis of feature splitting and absorption failure modes.
Improving Sparse Decomposition with Gated SAEs — NeurIPS 2024: The gated SAE variant that reduces shrinkage.
Neuronpedia — neuronpedia.org: Interactive explorer for SAE features across various models.