9  Features: The Atoms of Representation

What neural networks actually represent

core-theory
features
Author: Taras Tsugrii

Published: January 5, 2025

Tip: What You’ll Learn
  • Why neurons are the wrong unit of analysis (polysemanticity)
  • What features actually are (directions, not neurons)
  • The difference between monosemantic and polysemantic representations
  • The feature ontology problem: what features “should” exist?
Warning: Prerequisites

Required: Chapter 4: Geometry — understanding activations as vectors in high-dimensional space

Note: Before You Read: Recall

From Chapters 1-4 (Arc I: Foundations), recall:

  • Activations are vectors in high-dimensional space (Chapter 4)
  • Vector arithmetic reveals meaningful structure: king - man + woman ≈ queen (Chapter 4)
  • The residual stream accumulates contributions additively (Chapter 3)
  • Transformers are matrix multiplication machines (Chapter 2)

We’ve built the computational substrate. Now we ask: What are the atoms of meaning in this space?

9.1 The Neuron That Liked Cats and Cars

In 2017, researchers at Google were trying to understand what individual neurons in an image classifier had learned. The approach seemed natural: look at which images maximally activate each neuron, and you’ll learn what that neuron “means.”

For some neurons, this worked beautifully. One neuron activated strongly for dog faces. Another for horizontal lines. Another for the color orange. These were interpretable—you could describe what they detected.

Then they found a neuron that activated for:

  • Cat faces
  • Cat legs
  • The fronts of cars

These are not related concepts. There’s no obvious property shared by cat anatomy and automobile design. The researchers tested whether the neuron might be detecting some abstract quality like “curvedness” or “sleekness.” It wasn’t—curved objects like snakes and ferrets didn’t activate it.

The neuron simply responded to cat parts and car fronts. It was polysemantic—encoding multiple unrelated features in a single unit.

Important: The Core Problem

This isn’t a rare bug. It’s the common case. Most neurons in trained neural networks are polysemantic—they respond to multiple, semantically unrelated features. The neuron is not the right unit of analysis.

This chapter introduces the concept that replaces the neuron: the feature. Understanding what features are, why they matter, and why they’re hard to find is the foundation of mechanistic interpretability.

9.2 Why Neurons Fail

Let’s understand why polysemanticity exists before defining what replaces it.

Consider a simple thought experiment. Suppose you’re training a network to recognize 10,000 different concepts: animals, objects, colors, textures, emotions, actions, and so on. Your network has 1,000 neurons in a given layer.

If each neuron encoded exactly one concept, you could represent at most 1,000 concepts. But you need 10,000.

What does the network do? It shares neurons across concepts. Multiple features get encoded in overlapping patterns of neuron activations. A single neuron might participate in representing “cat,” “car,” “curve,” and “corporate logo”—not because these are related, but because the network needs to pack more information than it has dedicated slots for.

This is efficient. It works because most features rarely co-occur: you don’t often see cat faces and car fronts in the same image, so using the same neuron for both doesn’t cause problems during inference. The network gets away with the overlap.

But it’s a disaster for interpretation. When you ask “what does neuron 847 mean?”, the answer is “it means cat faces and car fronts and probably several other things.” The neuron doesn’t have a single meaning. It’s a superposition of meanings (we’ll explore this phenomenon deeply in the next chapter).

Note: Incidental Polysemanticity

Recent research has revealed something even more striking: polysemanticity can emerge even when there’s no capacity pressure. Networks sometimes assign multiple unrelated features to the same neuron purely due to random initialization and training dynamics. Polysemanticity isn’t just about efficiency—it’s a fundamental property of how neural networks learn.

If you want to experience this frustration firsthand, try this in a notebook:

import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

# Pick a random MLP neuron
layer, neuron = 6, 1234

# Probe the neuron with a handful of diverse texts and see which ones activate it
test_texts = [
    "The weather today is beautiful and sunny",
    "She walked to the store to buy groceries",
    "The president announced a new policy",
    "I love programming in Python",
    "The cat sat on the mat",
    # ... add more diverse texts
]

for text in test_texts:
    tokens = model.to_tokens(text)
    _, cache = model.run_with_cache(tokens)

    # Get MLP neuron activation (after ReLU/GELU)
    mlp_acts = cache["post", layer][0, :, neuron]  # All positions

    max_act = mlp_acts.max().item()
    max_pos = mlp_acts.argmax().item()
    max_token = model.to_str_tokens(text)[max_pos]

    if max_act > 1.0:  # Threshold for "high activation"
        print(f"Act={max_act:.2f} at '{max_token}' in: {text[:50]}...")

What you’ll likely find: The neuron activates for seemingly random words across unrelated texts. There’s no clean interpretation. This is the problem.

Now compare with a sparse autoencoder (SAE) feature (we’ll cover SAEs in depth in a later chapter):

from sae_lens import SAE

sae, _, _ = SAE.from_pretrained("gpt2-small-res-jb", "blocks.6.hook_resid_pre")
# Feature 4521 might be "words related to science"
# Max-activating examples will be coherent

The SAE feature will have a clean interpretation. The raw neuron won’t. That’s the difference between neurons and features.
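If you want to see the contrast yourself, you can push residual-stream activations through the SAE’s encoder and inspect one feature’s activations per token. A rough sketch, reusing the model and sae objects loaded above; feature index 4521 is the hypothetical one from the comment, and the encode call may differ slightly between sae_lens versions:

text = "The experiment confirmed the hypothesis about the vaccine"
tokens = model.to_tokens(text)
_, cache = model.run_with_cache(tokens)

resid = cache["blocks.6.hook_resid_pre"]   # [batch, pos, d_model]
sae = sae.to(resid.device)                 # keep SAE on the same device as the activations
feature_acts = sae.encode(resid)           # [batch, pos, d_sae]

# Activation of one (hypothetical) feature at every token position
for tok, act in zip(model.to_str_tokens(text), feature_acts[0, :, 4521].tolist()):
    print(f"{act:6.2f}  {tok!r}")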

9.3 What Is a Feature?

If neurons aren’t the right unit, what is?

A feature is a property of the input that the network represents internally.

Let’s unpack this:

  • Property of the input: Features describe something about what the network is processing. “Is this an animal?” “Is this code?” “Is this sentence expressing anger?” “Does this image contain curves?”

  • Represented internally: The network has learned to detect this property. When the property is present, something specific happens in the network’s activations. When it’s absent, that something doesn’t happen.

Examples of features:

  • “This text is in French”
  • “This image contains a face”
  • “This code has a bug”
  • “This sentence is sarcastic”
  • “The current token follows the pattern [A][B]…[A], predicting [B]”

Features range from low-level (“this is a curve,” “this token starts with ‘S’”) to high-level (“this text discusses deception,” “this image is the Golden Gate Bridge”).

Tip: A Performance Engineering Parallel

Features are like the concepts in a profiler’s ontology. When you profile a system, you don’t think about individual CPU cycles—you think about “database queries,” “network I/O,” “garbage collection.” These are higher-level features of the computation. Similarly, mechanistic interpretability seeks the higher-level features of neural computation, not the individual neuron activations.

9.4 Features as Directions

In Chapter 4, we established that neural network activations are vectors in high-dimensional space, and that geometric structure carries meaning.

Here’s the key claim: features are directions in activation space.

This is the linear representation hypothesis. When a network represents “this text is in French,” it does so by having the activation vector point (at least partially) in the “French” direction. The more French-like the input, the more the vector points that way.

Note: Confidence Level: MEDIUM-HIGH

Evidence for: Linear probes work. Steering works. Vector arithmetic works. SAEs find interpretable directions.

Evidence against: Some features may be nonlinear or context-dependent. Feature absorption suggests the picture is more complex. High-level reasoning may not decompose linearly.

This claim might be wrong if: Future work finds that important representations are irreducibly distributed or curved in activation space.

We’re confident this is approximately true for many features, especially concrete concepts. We’re less confident it holds for abstract reasoning.

More precisely:

  • Each feature corresponds to a direction (a unit vector) in activation space
  • The presence of a feature means the activation has a positive component along that direction
  • The intensity of a feature corresponds to the magnitude of that component

If you want to detect whether an activation encodes “French-ness,” you project the activation onto the French direction and check if the result is positive (and how large it is).
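As a minimal sketch of that projection (assuming you already have a candidate feature direction from a probe or an SAE; the names and tensors here are hypothetical stand-ins):

import torch

def feature_score(activation: torch.Tensor, feature_dir: torch.Tensor) -> float:
    """Project an activation onto a candidate feature direction.

    A positive score suggests the feature is present; its magnitude
    suggests how strongly.
    """
    feature_dir = feature_dir / feature_dir.norm()   # make it a unit vector
    return torch.dot(activation, feature_dir).item()

# Hypothetical usage: a real french_dir would come from a probe or an SAE decoder row.
d_model = 768
activation = torch.randn(d_model)
french_dir = torch.randn(d_model)
print(feature_score(activation, french_dir))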

Why directions, not individual dimensions?

Because neurons are just one arbitrary basis for the space. The network doesn’t “know” which dimension is which—it just learns to produce useful activation vectors. The meaningful structure is in the geometry, not in the axis labels.

Think of it like coordinates on a map. You can use latitude/longitude, or UTM coordinates, or any rotated coordinate system. The cities don’t move—only the numbers you use to describe them change. Features are like cities: they’re geometric facts about the space, independent of which coordinate system you choose.

9.5 Evidence for Linear Features

The linear representation hypothesis isn’t just elegant—it’s empirically supported.

9.5.1 Linear Probes Succeed

If you train a simple linear classifier on neural network activations to predict some property (is this text toxic? is this image a dog? is this code correct?), it often works remarkably well.

A linear classifier draws a hyperplane through activation space. Points on one side are “yes,” points on the other are “no.” High accuracy means the property is linearly separable—which means it corresponds to a direction (the normal to the hyperplane).

If features weren’t linear, linear probes would fail. They don’t.
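Here is a self-contained sketch of that logic using synthetic activations with a planted direction (with a real model you would cache activations at a chosen layer and label the inputs; everything here is illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model, n = 768, 2000

# Synthetic stand-in for cached activations: examples with label 1 have a
# planted "French" direction added; examples with label 0 don't.
french_dir = rng.normal(size=d_model)
french_dir /= np.linalg.norm(french_dir)
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d_model)) + 3.0 * y[:, None] * french_dir

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("probe accuracy:", probe.score(X, y))

# The probe's weight vector is the normal of the separating hyperplane,
# i.e. a candidate feature direction. It roughly recovers the planted one.
learned = probe.coef_[0] / np.linalg.norm(probe.coef_[0])
print("cosine with planted direction:", float(learned @ french_dir))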

9.5.2 Steering Works

When researchers at Anthropic identified a “Golden Gate Bridge” direction in Claude’s activation space and amplified it during inference, the model started obsessing over the Golden Gate Bridge. It would bring up the bridge in unrelated contexts and even identify as the bridge when asked about itself.

This worked because the feature is encoded linearly: adding activation along that direction increases the feature’s influence on the model’s behavior. If features were encoded nonlinearly, vector addition wouldn’t have predictable effects.
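A minimal sketch of the mechanics with TransformerLens, using a random direction as a stand-in (a real steering vector would come from an SAE decoder row, a probe, or a difference of means; the layer, hook point, and scale are arbitrary choices):

import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

# Stand-in steering vector; replace with a real concept direction.
steer_dir = torch.randn(model.cfg.d_model, device=model.cfg.device)
steer_dir = steer_dir / steer_dir.norm()

def add_direction(resid, hook, alpha=8.0):
    # resid: [batch, pos, d_model]; push every position along the direction
    return resid + alpha * steer_dir

tokens = model.to_tokens("I went for a walk and saw")
logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.6.hook_resid_post", add_direction)],
)
# With a meaningful direction and a well-chosen alpha, continuations shift
# toward the steered concept; too large an alpha degrades coherence.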

9.5.3 Vector Arithmetic Works

As we saw in Chapter 4, word embeddings support arithmetic: king - man + woman ≈ queen. This works because gender is a linear direction—you can add and subtract it.

The same holds in transformer hidden states. Researchers have found directions for concepts like “truthfulness,” “refusal,” and “helpfulness.” Adding these directions to activations shifts model behavior in predictable ways.

9.6 Monosemantic vs. Polysemantic

With features defined, we can be precise about the problem we face.

A monosemantic neuron or feature corresponds to a single, interpretable concept. It activates when that concept is present and doesn’t activate when it’s absent. The cat-face neuron that only responds to cat faces would be monosemantic.

A polysemantic neuron corresponds to multiple unrelated concepts. It activates for cat faces and car fronts and maybe several other things. Most neurons in trained networks are polysemantic.

Here’s the key insight: even when neurons are polysemantic, features can be monosemantic.

The features are directions in activation space. Multiple features can be encoded in overlapping patterns across neurons. Each feature, as a direction, might be interpretable—“this direction means cat faces,” “that direction means car fronts.” But any individual neuron participates in multiple features, so looking at neurons individually gives you the confusing polysemantic picture.

Warning: Pause and Think

If you found a neuron that activates for both “Paris” and “pizza,” what are two possible explanations? How would you distinguish between them?

Hint: One explanation involves features, the other involves superposition.

flowchart LR
    subgraph Neurons["Neurons (Polysemantic)"]
        N1["Neuron 1<br/>🐱 + 🚗 + 🔵"]
        N2["Neuron 2<br/>🐱 + 🌲 + 🔵"]
        N3["Neuron 3<br/>🚗 + 🌲"]
    end

    subgraph Features["Features (Monosemantic)"]
        F1["Cat 🐱"]
        F2["Car 🚗"]
        F3["Tree 🌲"]
        F4["Blue 🔵"]
    end

    N1 --> F1
    N1 --> F2
    N1 --> F4
    N2 --> F1
    N2 --> F3
    N2 --> F4
    N3 --> F2
    N3 --> F3

Polysemantic neurons vs monosemantic features: each neuron participates in multiple features, but each feature direction is interpretable.
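To make the diagram concrete, here is a toy numerical sketch (purely illustrative; real feature directions come from methods like sparse autoencoders, covered later). We pack far more random feature directions than neurons into a space, activate a single feature, and compare the neuron-by-neuron view with the direction-by-direction view:

import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 50, 200

# 200 random unit-norm "feature" directions crammed into a 50-neuron space
features = rng.normal(size=(n_features, n_neurons))
features /= np.linalg.norm(features, axis=1, keepdims=True)

# An input where only feature 7 is present, with intensity 3.0
activation = 3.0 * features[7]

# Neuron view: many neurons fire a little -- no single neuron "means" feature 7
print("most active neurons:", np.argsort(-np.abs(activation))[:5])

# Feature view: projecting onto the feature directions singles out feature 7
scores = features @ activation
print("most active features:", np.argsort(-scores)[:5])   # 7 should come first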

Note: The Decomposition Goal

The goal of mechanistic interpretability is to decompose polysemantic neurons into monosemantic features. We want to find the directions that correspond to interpretable concepts, even though the neurons that encode those directions are a tangled mess.

Caution: Common Misconception: Neurons = Features

Wrong: “Each neuron corresponds to a concept.”

Right: Neurons are an arbitrary basis for activation space. Features are learned directions that may not align with any single neuron. A “cat” feature might involve small contributions from hundreds of neurons, none of which individually mean “cat.”

This is why early neural network interpretation (looking at individual neurons) often failed—it was looking in the wrong basis.

9.7 Concrete Examples: Features in the Wild

Let’s look at actual features discovered in real models.

9.7.1 The Golden Gate Bridge

In 2024, Anthropic researchers trained sparse autoencoders (more on these in a later chapter) on Claude 3 Sonnet and discovered millions of interpretable features. One was “The Golden Gate Bridge.”

This feature:

  • Activated strongly on text mentioning the Golden Gate Bridge
  • Activated on images of the bridge (despite being trained only on text!)
  • When amplified, caused the model to insert the bridge into unrelated conversations
  • When measured for downstream effects, produced exactly the behavioral changes you’d expect

This is a monosemantic feature: one concept, one direction.

9.7.2 Safety Features

The same research discovered features related to safety and alignment:

  • “Backdoor” (activates on discussions of hidden malicious functionality)
  • “Unsafe code” (activates on code with security vulnerabilities)
  • “Deception” (activates on text involving lies or manipulation)
  • “Sycophancy” (activates when the model is being excessively agreeable)

These features cluster together in activation space—they’re geometrically nearby, suggesting the model has learned a “safety-relevant” region of its representation space.

9.7.3 Domain Knowledge Features

Features also capture specialized knowledge:

  • Immunology features that activate on discussions of immune responses
  • Legal features that activate on contract language
  • Programming features that activate on specific coding patterns

These features show that the model’s “knowledge” is organized geometrically, with related concepts nearby.

Note: Try It Yourself: Explore Real Features

Neuronpedia lets you explore features discovered by sparse autoencoders in real models. Try:

  1. Browse GPT-2 features at neuronpedia.org/gpt2-small
  2. Search for a concept you’re interested in (try “code”, “emotion”, or “legal”)
  3. Click a feature to see its max-activating examples
  4. Notice how each feature responds to a coherent concept—this is monosemanticity in action

Spend 10 minutes exploring. You’ll develop intuition for what “feature as direction” means in practice.

9.7.4 Induction Features

Some features correspond to computational patterns, not just content:

  • Features that activate when the model detects a repeated sequence
  • Features that activate when the model is about to copy from earlier in the context
  • Features that signal “this is the kind of situation where I should look back at previous examples”

These are features about how to process, not just what is being processed.

9.8 Feature Hierarchies

Features don’t exist in isolation. They form hierarchies and families.

9.8.1 Feature Splitting

When you train larger sparse autoencoders (with more capacity to represent features), broader features “split” into more specific ones.

A small autoencoder might have a single feature for “text starting with S.” A larger one might split this into:

  • “Text starting with uppercase S”
  • “Text starting with lowercase s”
  • “The specific word ‘short’”
  • “The specific word ‘system’”

The broader feature isn’t wrong—it’s just less specific. More capacity reveals finer-grained structure.

9.8.2 Feature Absorption

The reverse also happens: specific features can “absorb” broader ones. A very specific feature for the word “Paris” might absorb the broader feature for “European capitals” because Paris is so common that it gets its own dedicated direction.

9.8.3 Compositional Structure

Features combine. “French text about cooking” might activate both a “French” feature and a “cooking” feature. The linear representation hypothesis predicts that the activation vector is (roughly) the sum of the individual feature vectors.

This compositional structure is what makes features useful: you don’t need a separate feature for every possible combination. You need base features that combine.
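A minimal numerical sketch of what linear composition predicts, using random stand-in directions (in a real model the residual stream also carries many other features):

import numpy as np

rng = np.random.default_rng(1)
d_model = 768

french = rng.normal(size=d_model)
french /= np.linalg.norm(french)
cooking = rng.normal(size=d_model)
cooking /= np.linalg.norm(cooking)

# Hypothetical activation for "French text about cooking":
# roughly the sum of the two feature vectors, with different intensities.
activation = 1.5 * french + 2.0 * cooking

# Projection recovers each feature's intensity almost exactly, because
# random high-dimensional directions are nearly orthogonal.
print("french score: ", float(activation @ french))    # ~1.5
print("cooking score:", float(activation @ cooking))   # ~2.0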

Tip: Like Atoms, But Fuzzier

The “atoms of representation” metaphor is apt but imperfect. Atoms combine in precise ways; features combine approximately. Atoms are countable; the number of features is unclear. But the basic idea—simple units that compose into complex structures—transfers.

9.9 The Ontology Question

Here’s a deep question the field is still grappling with: does the network have a “natural” set of features, or is the feature decomposition observer-dependent?

9.9.1 The Case for Natural Features

Some features seem obviously real:

  • The “Golden Gate Bridge” feature transfers across modalities (text to images)
  • Manipulating features produces predictable behavioral changes
  • Similar features cluster together without being told to

This suggests features aren’t arbitrary—they reflect something genuine about how the network represents information.

9.9.2 The Case for Observer-Dependence

But there are troubling observations:

  • Different analysis methods (different sparse autoencoders, different training objectives) yield different features
  • Feature splitting means the “right” granularity depends on your analysis capacity
  • We optimize for human interpretability, so are we finding our concepts or the network’s concepts?

A skeptic might argue: any high-dimensional space can be decomposed into directions. We’re finding directions that we find meaningful, but the network might not “care” about those specific directions.

9.9.3 The Pragmatic View

The honest answer is: we don’t fully know.

What we can say:

  • Some features are clearly real: they have causal power, transfer across contexts, and cluster meaningfully
  • Some features may be artifacts of our analysis methods
  • The distinction between “natural” and “observer-dependent” may be a spectrum, not a binary

For practical purposes, we work with features that:

  1. Activate consistently on semantically related inputs
  2. Produce predictable effects when manipulated
  3. Compose in expected ways

These are the features that matter for interpretation, regardless of metaphysical status.

Important: A Foundational Uncertainty

The field of mechanistic interpretability is built on a concept—“feature”—that lacks a rigorous formal definition. This is uncomfortable. But it’s also where we are. Progress has been made with working definitions; deeper foundations may come later.

9.10 Polya’s Perspective: Identifying the Unknown

In Polya’s framework, we’ve now identified what we’re looking for.

Chapter 1 asked: what are we trying to understand? (The algorithm.) Chapters 2-4 established: what’s the computational substrate? (Matrix multiplications in a shared residual stream with geometric structure.)

Now, this chapter answers: what are the atoms of that structure? Features.

This is Polya’s step of “identifying the unknown.” We don’t yet know which features a network uses, or how to find them efficiently, or how they combine into algorithms. But we know what we’re looking for: directions in activation space that correspond to interpretable properties.

Tip: Polya’s Heuristic: Name the Unknown

Polya emphasized naming things precisely. “A problem well-stated is half-solved.” By defining features—and distinguishing them from neurons—we’ve clarified what success looks like: finding the monosemantic directions that the network uses to represent information.

9.11 Looking Ahead

We’ve defined features and established that they’re directions in activation space. But we’ve also seen a problem: networks seem to represent more features than they have dimensions. A 768-dimensional residual stream encodes thousands (maybe millions) of features.

How is this possible?

The answer is superposition: features are almost orthogonal directions, packed more densely than true orthogonality would allow. The geometry of high dimensions makes this work—but it also makes interpretation hard.
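You can get a feel for why this is possible with a quick experiment (illustrative numbers; 768 matches GPT-2 small’s residual stream width):

import numpy as np

rng = np.random.default_rng(0)
d = 768                                   # residual stream width of GPT-2 small

# Sample far more random unit directions than there are dimensions
vecs = rng.normal(size=(10_000, d))
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

# How much does one direction overlap with all the others?
cos = vecs[1:] @ vecs[0]
print("max |cosine|: ", float(np.abs(cos).max()))    # typically around 0.15
print("mean |cosine|:", float(np.abs(cos).mean()))   # around 0.03: nearly orthogonal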

The next chapter dives into superposition: what it is, why it exists, and why it’s the central obstacle to mechanistic interpretability.

Here’s the question to carry forward: if the network packs more features than dimensions by using almost-orthogonal directions, what happens when features do interfere? When does superposition break down, and what are the consequences?


9.12 Key Takeaways

Tip: 📋 Summary Card
┌────────────────────────────────────────────────────────────┐
│  FEATURES: THE ATOMS OF REPRESENTATION                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  NEURON ≠ FEATURE                                          │
│    • Neurons are polysemantic (respond to many concepts)   │
│    • Features are monosemantic (one interpretable concept) │
│                                                            │
│  FEATURES ARE DIRECTIONS                                   │
│    • Each feature = a direction in activation space        │
│    • Presence = positive projection onto that direction    │
│    • Intensity = magnitude of projection                   │
│                                                            │
│  EVIDENCE IT WORKS:                                        │
│    ✓ Linear probes succeed (features are linear)           │
│    ✓ Steering works (add direction → change behavior)      │
│    ✓ Vector arithmetic works (king - man + woman ≈ queen)  │
│                                                            │
│  THE GOAL: Decompose polysemantic neurons into             │
│            monosemantic feature directions                 │
│                                                            │
└────────────────────────────────────────────────────────────┘

9.13 Check Your Understanding

Question 1: Why does the cat-faces-and-car-fronts neuron from the opening example tell you so little about what the network represents?

Answer: Because it’s polysemantic—it responds to multiple unrelated concepts. Looking at this neuron tells you a confusing mix of information. The neuron participates in representing multiple features (a “cat” feature and a “car” feature), but each feature is a direction in the full activation space, not a single neuron. To understand what the network represents, we need to find the feature directions, not study individual neurons.

Question 2: Geometrically, what does it mean for a network to represent a feature?

Answer: A feature corresponds to a unit vector (a direction) in the high-dimensional space of activations. When an input has that feature, the activation vector points partially in that direction—it has a positive component when projected onto the feature direction. The magnitude of this projection indicates how strongly the feature is present. This is the linear representation hypothesis: semantic properties are encoded as linear directions.

Question 3: Why did amplifying the “Golden Gate Bridge” direction change Claude’s behavior so predictably?

Answer: Because features are linearly encoded. If the “Golden Gate Bridge” feature is a direction in activation space, then adding to that direction increases the activation’s component along it—making the model act as if the Golden Gate Bridge concept is more present. This only works because of linearity: if features were nonlinearly tangled, simple addition wouldn’t have predictable effects.


9.14 Further Reading

  1. Neel Nanda’s Mechanistic Interpretability Glossary (neelnanda.io): The definitive reference for terminology, including the working definition of “feature.”

  2. Scaling Monosemanticity (Anthropic): Anthropic’s landmark 2024 paper discovering millions of interpretable features in Claude 3 Sonnet, including the Golden Gate Bridge feature.

  3. Toy Models of Superposition (Anthropic): The foundational paper explaining why polysemanticity exists and how features pack into limited dimensions.

  4. Sparse Autoencoders Find Highly Interpretable Features (arXiv:2309.08600): Technical introduction to using sparse autoencoders for feature discovery.

  5. A is for Absorption: Studying Feature Splitting and Absorption (arXiv:2409.14507): Recent work on how features split and merge depending on analysis capacity.

  6. Open Problems in Mechanistic Interpretability (arXiv:2501.16496): Survey of foundational questions, including the lack of a formal feature definition.