---
title: "Toy Models"
subtitle: "Understanding superposition in miniature"
author: "Taras Tsugrii"
date: 2025-01-05
categories: [core-theory, toy-models]
description: "Before tackling superposition in billion-parameter models, we study it in systems simple enough to fully understand. The insights transfer remarkably well."
---
::: {.callout-tip}
## What You'll Learn
- How to build the simplest possible network that exhibits superposition
- Why optimal arrangements form regular polytopes (pentagons, hexagons)
- How to observe and verify the phase transition experimentally
- Why Polya's "solve a simpler problem first" heuristic is essential
:::
::: {.callout-warning}
## Prerequisites
**Required**: [Chapter 6: Superposition](06-superposition.qmd) — understanding the superposition hypothesis and why sparsity matters
:::
::: {.callout-note}
## Before You Read: Recall
From Chapter 6, recall:
- Superposition: networks store more features than dimensions using *almost-orthogonal* directions
- Sparsity enables this: if features rarely co-occur, interference is manageable
- Phase transition: at critical sparsity, networks switch to superposition
- This is why polysemanticity exists and interpretation is hard
**Now we ask**: Can we study superposition in a controlled setting where we know the ground truth?
:::
## Solve a Simpler Problem First
In the previous chapter, we established that superposition is the central obstacle to mechanistic interpretability. Networks pack more features than dimensions, creating polysemantic neurons and hiding interpretable structure in a compressed representation.
But superposition in a real language model is hard to study. The model has billions of parameters, millions of features, and we don't know the ground truth—we don't know which features *should* exist or how they *should* be arranged.
This is where Polya's heuristic becomes essential: **solve a simpler problem first**.
What if we built a tiny network where:
- We know exactly how many features there are
- We control how sparse they are
- We can visualize the entire representation space
- We can verify whether the network finds the optimal arrangement
This is the toy model approach. Build a system simple enough to fully understand, observe superposition emerge, and use the insights to interpret real networks.
::: {.callout-note}
## The Core Idea
A toy model is an autoencoder trained to represent $n$ sparse features in $d < n$ dimensions. The network is forced to compress, and we can watch exactly how it chooses to do so.
:::
## The Setup
The Anthropic toy model has a beautifully simple architecture:
**Input**: A vector of $n$ features, where each feature is either "on" (value 1) or "off" (value 0). Each feature is active with probability $S$; this is the sparsity parameter, and smaller $S$ means sparser features.
**Bottleneck**: A hidden layer with only $d$ dimensions, where $d < n$. This forces compression.
**Output**: A reconstruction of the original $n$-dimensional input.
**Training objective**: Minimize reconstruction error.
The setup looks like this:
```
Input (n dims)  →  Hidden (d dims)  →  Output (n dims)
 [n features]       [compressed]       [reconstructed]
```
For a concrete example: 5 features compressed into 2 dimensions. The network must somehow represent 5 things using only 2 numbers.
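To make this concrete, here is a minimal sketch of the whole setup in NumPy. It assumes the ReLU-output architecture from the Anthropic paper, $\hat{x} = \text{ReLU}(W^\top W x + b)$, with plain SGD and hand-derived gradients; the original experiments use Adam and importance-weighted losses, so treat the hyperparameters here as illustrative choices.

```{python}
#| eval: false
# A minimal toy model: reconstruct n sparse features through a d-dim bottleneck.
# Architecture follows the Anthropic paper (x_hat = ReLU(W^T W x + b));
# the optimizer and hyperparameters are simplified stand-ins.
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 2                           # 5 features, 2 hidden dimensions
S = 0.05                              # each feature active with probability S
lr, steps, batch = 0.1, 20_000, 1024

W = rng.normal(0, 0.1, size=(d, n))   # columns = feature directions in 2D
b = np.zeros((n, 1))

for _ in range(steps):
    # Sparse binary inputs: each feature independently on with probability S
    X = (rng.random((n, batch)) < S).astype(float)
    A = W.T @ W @ X + b                                 # pre-activation output
    G = 2 * (np.maximum(A, 0) - X) * (A > 0) / batch    # dLoss/dA, squared error
    W -= lr * W @ (G @ X.T + X @ G.T)                   # chain rule through W^T W
    b -= lr * G.sum(axis=1, keepdims=True)

print(np.round(W, 2))  # the 5 columns should fan out toward a pentagon
```

Plotting the columns of `W` after training should recover the pentagon geometry discussed below.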
### Why This Forces Superposition
In 2 dimensions, you can have at most 2 orthogonal directions. If each feature got its own orthogonal direction, you could only represent 2 features perfectly.
But we're asking for 5 features. The network has a choice:
- Represent only the 2 most important features perfectly, ignore the other 3
- Represent all 5 imperfectly, accepting some interference
When features are sparse (rarely active), the second option is better. The network can pack 5 features into 2 dimensions as long as it accepts occasional errors when multiple features are active simultaneously.
This is superposition, and we can watch it emerge.
## The Pentagon Discovery
Here's what happens when you train this toy model with 5 sparse features in 2 dimensions:
The network learns to represent the 5 features as 5 directions arranged in a **regular pentagon**.
```
              F1
              ↑
              |
   F5 ←------ ● ------→ F2
             / \
            ↙   ↘
          F4     F3
```
Each feature gets a direction. Adjacent features are 72° apart (360° / 5). No two features are orthogonal, but they're spread as far apart as possible.
::: {.callout-important}
## Why a Pentagon?
The pentagon arrangement minimizes the maximum interference between any two features. Any other arrangement would have some pair of features closer together, causing more interference when both are active. Gradient descent discovers this optimal geometry automatically.
:::
### Measuring the Geometry
We can verify the pentagon arrangement directly:
1. Extract the weight matrix $W$ that maps inputs to the hidden layer
2. Each column of $W$ is the direction for one feature
3. Compute the angle between each pair of feature directions
4. Observe that adjacent features are 72° apart and non-adjacent features are 144° apart
The network has found the mathematically optimal arrangement without being told what it is. This is gradient descent discovering geometry.
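This check is easy to script. The sketch below stubs $W$ with an ideal pentagon so it runs standalone; with a trained model you would substitute the learned 2×5 matrix.

```{python}
#| eval: false
# Pairwise angles between feature directions (columns of W).
# W is stubbed with an exact pentagon; swap in a trained weight matrix.
import numpy as np

angles = np.pi / 2 + 2 * np.pi * np.arange(5) / 5
W = np.stack([np.cos(angles), np.sin(angles)])     # shape (2, 5)

U = W / np.linalg.norm(W, axis=0)                  # unit feature directions
cos_sims = U.T @ U                                 # pairwise cosine similarities
pair_deg = np.degrees(np.arccos(np.clip(cos_sims, -1, 1)))
print(pair_deg.round(1))  # 72.0 for adjacent pairs, 144.0 for non-adjacent
```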
```{python}
#| label: fig-pentagon
#| fig-cap: "5 features arranged as a pentagon in 2D. Adjacent features are 72° apart—the optimal arrangement that minimizes interference."
#| code-fold: true
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(6, 6))
# 5 features as a pentagon (72° spacing)
n_features = 5
angles = np.linspace(np.pi/2, np.pi/2 + 2*np.pi, n_features + 1)[:-1] # Start from top
colors = ['#e41a1c', '#377eb8', '#4daf4a', '#984ea3', '#ff7f00']
for i, angle in enumerate(angles):
    # Draw arrow for each feature direction
    ax.annotate('', xy=(np.cos(angle), np.sin(angle)), xytext=(0, 0),
                arrowprops=dict(arrowstyle='->', color=colors[i], lw=3))
    # Label
    ax.text(1.2*np.cos(angle), 1.2*np.sin(angle), f'Feature {i+1}',
            ha='center', va='center', fontsize=11, fontweight='bold', color=colors[i])

# Draw the pentagon outline
pentagon_x = [np.cos(a) for a in angles] + [np.cos(angles[0])]
pentagon_y = [np.sin(a) for a in angles] + [np.sin(angles[0])]
ax.plot(pentagon_x, pentagon_y, 'k--', alpha=0.3, lw=1)

# Angle annotation between Features 1 and 2
ax.annotate('', xy=(0.4*np.cos(angles[0]), 0.4*np.sin(angles[0])),
            xytext=(0.4*np.cos(angles[1]), 0.4*np.sin(angles[1])),
            arrowprops=dict(arrowstyle='<->', color='gray', lw=1))
ax.text(-0.3, 0.38, '72°', ha='center', fontsize=10, color='gray')
ax.set_xlim(-1.5, 1.5)
ax.set_ylim(-1.5, 1.5)
ax.set_aspect('equal')
ax.axhline(y=0, color='gray', linestyle='-', alpha=0.2)
ax.axvline(x=0, color='gray', linestyle='-', alpha=0.2)
ax.set_title('Pentagon Arrangement: 5 Features in 2D', fontsize=14)
ax.set_xlabel('Dimension 1')
ax.set_ylabel('Dimension 2')
ax.grid(True, alpha=0.2)
plt.tight_layout()
plt.show()
```
### Interactive: Explore Polytope Arrangements
Use the slider below to see how different numbers of features arrange themselves in 2D. Notice how the angles between features change as you add more.
```{ojs}
//| echo: false
viewof numFeatures = Inputs.range([3, 12], {step: 1, value: 5, label: "Number of features"})
```
```{ojs}
//| echo: false
{
  const width = 400;
  const height = 400;
  const radius = 150;
  const centerX = width / 2;
  const centerY = height / 2;

  // Calculate feature positions
  const features = [];
  for (let i = 0; i < numFeatures; i++) {
    const angle = (Math.PI / 2) + (2 * Math.PI * i / numFeatures);
    features.push({
      x: centerX + radius * Math.cos(angle),
      y: centerY - radius * Math.sin(angle),
      angle: angle,
      index: i + 1
    });
  }

  // Calculate angle between adjacent features
  const angleDegrees = (360 / numFeatures).toFixed(1);
  const cosineSimAdj = Math.cos(2 * Math.PI / numFeatures).toFixed(3);

  // Colors for features
  const colors = ["#e41a1c", "#377eb8", "#4daf4a", "#984ea3", "#ff7f00",
                  "#f781bf", "#a65628", "#999999", "#66c2a5", "#fc8d62",
                  "#8da0cb", "#e78ac3"];

  const svg = d3.create("svg")
    .attr("width", width)
    .attr("height", height)
    .attr("viewBox", [0, 0, width, height]);

  // Draw axes
  svg.append("line")
    .attr("x1", 20).attr("y1", centerY)
    .attr("x2", width - 20).attr("y2", centerY)
    .attr("stroke", "#ccc").attr("stroke-width", 1);
  svg.append("line")
    .attr("x1", centerX).attr("y1", 20)
    .attr("x2", centerX).attr("y2", height - 20)
    .attr("stroke", "#ccc").attr("stroke-width", 1);

  // Draw polygon outline
  const polygonPoints = features.map(f => `${f.x},${f.y}`).join(" ");
  svg.append("polygon")
    .attr("points", polygonPoints)
    .attr("fill", "none")
    .attr("stroke", "#333")
    .attr("stroke-width", 1)
    .attr("stroke-dasharray", "4,2")
    .attr("opacity", 0.3);

  // Draw feature vectors
  features.forEach((f, i) => {
    // Arrow line
    svg.append("line")
      .attr("x1", centerX).attr("y1", centerY)
      .attr("x2", f.x).attr("y2", f.y)
      .attr("stroke", colors[i % colors.length])
      .attr("stroke-width", 3);
    // Arrowhead
    const arrowSize = 10;
    const angle = Math.atan2(centerY - f.y, f.x - centerX);
    svg.append("polygon")
      .attr("points", `${f.x},${f.y} ${f.x - arrowSize * Math.cos(angle - 0.3)},${f.y + arrowSize * Math.sin(angle - 0.3)} ${f.x - arrowSize * Math.cos(angle + 0.3)},${f.y + arrowSize * Math.sin(angle + 0.3)}`)
      .attr("fill", colors[i % colors.length]);
    // Label
    const labelRadius = radius + 25;
    svg.append("text")
      .attr("x", centerX + labelRadius * Math.cos(f.angle))
      .attr("y", centerY - labelRadius * Math.sin(f.angle))
      .attr("text-anchor", "middle")
      .attr("dominant-baseline", "middle")
      .attr("font-size", "12px")
      .attr("font-weight", "bold")
      .attr("fill", colors[i % colors.length])
      .text(`F${f.index}`);
  });

  // Info text
  svg.append("text")
    .attr("x", centerX).attr("y", height - 40)
    .attr("text-anchor", "middle")
    .attr("font-size", "14px")
    .text(`${numFeatures} features → ${angleDegrees}° between adjacent`);
  svg.append("text")
    .attr("x", centerX).attr("y", height - 20)
    .attr("text-anchor", "middle")
    .attr("font-size", "12px")
    .attr("fill", "#666")
    .text(`Adjacent cosine similarity: ${cosineSimAdj}`);

  return svg.node();
}
```
::: {.callout-note}
## What to Notice
- As you add more features, the angle between adjacent features shrinks
- More features → higher cosine similarity → more interference when multiple are active
- At some point (depending on sparsity), adding more features isn't worth the interference cost
- This is why the phase transition exists: there's a limit to useful superposition
:::
## Phase Transitions
The most striking finding from toy model experiments is the **phase transition**.
As you vary the sparsity parameter $S$ (how often features are active), the network's behavior changes *discontinuously*.
### High Sparsity (Rare Features)
When $S$ is small (say, 0.01—each feature active 1% of the time):
- Features almost never co-occur
- Interference is rare and worth accepting
- Network uses heavy superposition
- All 5 features get directions, arranged as a pentagon
- Reconstruction quality is high except in rare cases
### Low Sparsity (Common Features)
When $S$ is large (say, 0.3—each feature active 30% of the time):
- Features frequently co-occur
- Interference happens constantly
- Superposition is too costly
- Network dedicates dimensions to the most important features
- Less important features are ignored entirely
### The Transition
Between these regimes, there's a **sharp threshold**. At some critical sparsity $S^*$, the network's strategy flips:
- Below $S^*$: superposed representation (pentagon)
- Above $S^*$: dedicated representation (subset of features)
The transition isn't gradual. Small changes in sparsity near the threshold cause the network to completely reorganize its representation.
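You can run this experiment yourself with the training loop from the setup section. The sketch below is illustrative: the 0.5 norm threshold for counting a feature as "represented" is an arbitrary cut, and a careful sweep would average over random seeds and a finer grid of sparsities.

```{python}
#| eval: false
# Sweep sparsity and count represented features (column norm above a threshold).
import numpy as np

def train_toy_model(S, n=5, d=2, steps=20_000, batch=1024, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.normal(0, 0.1, size=(d, n))
    b = np.zeros((n, 1))
    for _ in range(steps):
        X = (rng.random((n, batch)) < S).astype(float)
        A = W.T @ W @ X + b
        G = 2 * (np.maximum(A, 0) - X) * (A > 0) / batch
        W -= lr * W @ (G @ X.T + X @ G.T)
        b -= lr * G.sum(axis=1, keepdims=True)
    return W

for S in [0.01, 0.05, 0.10, 0.20, 0.30]:
    W = train_toy_model(S)
    represented = int((np.linalg.norm(W, axis=0) > 0.5).sum())
    print(f"S = {S:.2f}: {represented} features represented")
```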
```{python}
#| label: fig-phase-transition
#| fig-cap: "Phase transition in superposition: as sparsity decreases (features become more common), the network suddenly switches from superposed representation to dedicated representation."
#| code-fold: true
import numpy as np
import matplotlib.pyplot as plt
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(11, 4))
# Simulated phase transition curve
sparsity = np.linspace(0.01, 0.5, 100)
# Sigmoid-like sharp transition around S* = 0.15
features_represented = 5 / (1 + np.exp(25 * (sparsity - 0.15))) + 2 * (1 - 1/(1 + np.exp(25 * (sparsity - 0.15))))
ax1.plot(sparsity, features_represented, 'b-', lw=2.5)
ax1.axvline(x=0.15, color='red', linestyle='--', alpha=0.7, label='Critical sparsity S*')
ax1.fill_between(sparsity[sparsity < 0.15], 0, 5.5, alpha=0.1, color='green')
ax1.fill_between(sparsity[sparsity >= 0.15], 0, 5.5, alpha=0.1, color='orange')
ax1.text(0.07, 4.8, 'Superposition\n(all 5 features)', fontsize=9, ha='center', color='green')
ax1.text(0.32, 2.5, 'Dedicated\n(2 features)', fontsize=9, ha='center', color='orange')
ax1.set_xlabel('Feature Activation Probability (S)', fontsize=11)
ax1.set_ylabel('Features Represented', fontsize=11)
ax1.set_title('Phase Transition: Features vs Sparsity', fontsize=12)
ax1.set_ylim(0, 5.5)
ax1.set_xlim(0, 0.5)
ax1.legend(loc='upper right')
ax1.grid(True, alpha=0.3)
# Reconstruction loss
loss_superposed = 2 * sparsity**2   # interference loss grows like S^2 (pairwise co-occurrence)
loss_dedicated = 0.3 * sparsity     # 3 dropped features, each active with probability S
optimal_loss = np.minimum(loss_superposed, loss_dedicated)
ax2.plot(sparsity, loss_superposed, 'g--', lw=2, label='Superposition strategy', alpha=0.7)
ax2.plot(sparsity, loss_dedicated, 'orange', linestyle='--', lw=2, label='Dedicated strategy', alpha=0.7)
ax2.plot(sparsity, optimal_loss, 'b-', lw=2.5, label='Optimal (network chooses)')
ax2.axvline(x=0.15, color='red', linestyle='--', alpha=0.7)
ax2.set_xlabel('Feature Activation Probability (S)', fontsize=11)
ax2.set_ylabel('Reconstruction Loss', fontsize=11)
ax2.set_title('Why the Transition Happens', fontsize=12)
ax2.set_xlim(0, 0.5)
ax2.legend(loc='upper left', fontsize=9)
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
```
::: {.callout-tip}
## A Physical Analogy
Phase transitions in toy models are like phase transitions in physics—water freezing, or magnets aligning. The system doesn't gradually shift; it snaps from one qualitative state to another. The underlying mathematics is similar: optimization landscapes with multiple local minima, and sudden switches between them.
:::
## Polytope Arrangements
The pentagon is just one example. The general pattern: networks arrange features as vertices of **regular polytopes**—the higher-dimensional generalizations of regular polygons.
### Examples
**4 features in 2D**: Square (90° between adjacent features)
**6 features in 2D**: Hexagon (60° between adjacent features)
**4 features in 3D**: Tetrahedron (109.5° between all pairs)
**6 features in 3D**: Octahedron
The pattern is consistent: maximize the minimum angle between any two features. This minimizes worst-case interference.
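A quick numerical check of "maximize the minimum angle": for 6 directions in 2D, the angular gaps around the circle must sum to 360°, so no arrangement can beat the hexagon's uniform 60° minimum. The sketch below compares the hexagon against random arrangements.

```{python}
#| eval: false
# Minimum pairwise angle: regular hexagon vs. random arrangements in 2D.
import numpy as np

rng = np.random.default_rng(0)

def min_pairwise_angle(angles):
    # smallest angular gap between any two directions around the circle
    a = np.sort(np.mod(angles, 2 * np.pi))
    gaps = np.diff(np.append(a, a[0] + 2 * np.pi))
    return np.degrees(gaps.min())

hexagon = 2 * np.pi * np.arange(6) / 6
best_random = max(min_pairwise_angle(rng.uniform(0, 2 * np.pi, 6))
                  for _ in range(1000))
print(min_pairwise_angle(hexagon))   # 60.0
print(round(best_random, 1))         # always below 60
```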
### Why Polytopes?
The loss function penalizes reconstruction error. When features interfere (because they're not orthogonal), reconstruction suffers. The network minimizes total interference by spreading features as far apart as possible.
For $k$ features in $d$ dimensions where $k > d$, the optimal arrangement is typically the vertex set of a regular polytope (or a closely related spherical code) with $k$ vertices. Such packing problems are well studied in geometric optimization, and neural networks rediscover their solutions through gradient descent.
::: {.callout-note}
## Emergent Optimality
No one programs these arrangements. The network simply tries to minimize reconstruction error, and the geometry falls out. This suggests that optimization pressure is enough to find elegant mathematical solutions—a recurring theme in deep learning.
:::
## The Mathematics
Let's be more precise about interference and why polytopes are optimal.
### Interference Cost
When two features have unit-norm directions $\mathbf{w}_1$ and $\mathbf{w}_2$ with cosine similarity $c = \mathbf{w}_1 \cdot \mathbf{w}_2$, the interference when both are active is proportional to $c$.
If feature 1 is active and feature 2 should be inactive, but they share direction, the reconstruction will incorrectly activate feature 2 by amount $\approx c$.
The expected interference cost depends on:
- How often features co-occur ($\propto S^2$ if independent)
- How aligned their directions are ($\propto c^2$ for squared error)
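Combining the two factors gives a back-of-envelope estimate: expected interference is roughly $\sum_{i \neq j} S^2 c_{ij}^2$. The sketch below evaluates this for $k$ unit-norm features evenly spaced in 2D, assuming independent activations.

```{python}
#| eval: false
# Expected interference for k evenly spaced features in 2D:
# co-occurrence probability S^2 times squared cosine similarity, summed over pairs.
import numpy as np

def expected_interference(k, S):
    angles = 2 * np.pi * np.arange(k) / k
    c = np.cos(angles[:, None] - angles[None, :])   # pairwise cosine similarities
    np.fill_diagonal(c, 0)                          # skip self-pairs
    return float((S**2 * c**2).sum())

for k in [3, 5, 6, 10]:
    print(f"k = {k:2d}: {expected_interference(k, S=0.05):.5f}")
```

The cost rises with $k$: each added feature both shrinks the angles and adds more interfering pairs.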
### The Trade-off
The network optimizes:
$$\text{Total Loss} = \text{Capacity Loss} + \text{Interference Loss}$$
- **Capacity Loss**: Features that don't get represented at all contribute to error whenever they're active
- **Interference Loss**: Superposed features contribute error when they co-occur
At high sparsity, interference loss is low (features rarely co-occur), so the network superimposes aggressively to minimize capacity loss.
At low sparsity, interference loss dominates, so the network reserves dimensions for important features and ignores others.
### Optimal Angles
For $k$ features superposed in $d$ dimensions, the optimal arrangement has:
- All features equidistant from each other (in angle)
- Minimum pairwise cosine similarity
For the pentagon in 2D: $\cos(72°) \approx 0.31$. For the hexagon: $\cos(60°) = 0.5$.
More features means smaller angles, which means more interference. Eventually, the interference cost exceeds the benefit, and it's better to drop features than to pack more.
## Visualizing Superposition
One of the great virtues of toy models is that we can *see* what's happening.
### Weight Matrix Visualization
The weight matrix $W$ maps features to hidden dimensions. Each column of $W$ is a feature's direction in the hidden space.
For 5 features in 2D, $W$ is a 2×5 matrix. Plotting the 5 columns as vectors in 2D directly shows the pentagon arrangement.
### Activation Patterns
When a specific input (say, features 1 and 2 both active) passes through the network:
1. The hidden activation is the sum of feature 1's direction and feature 2's direction
2. The reconstruction projects this back to all 5 features
3. Features 1 and 2 reconstruct strongly, while every other feature also picks up a nonzero contribution in the linear readout; this is interference (mostly negative here, so the ReLU and learned bias clip much of it)
You can visualize which hidden dimensions "light up" for which features, and see the polysemanticity directly: each hidden dimension participates in multiple features.
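Here is a sketch of that trace, assuming an ideal unit-norm pentagon for $W$ and the linear readout $W^\top W x$; a trained model's ReLU and negative bias then clip most of the negative interference terms.

```{python}
#| eval: false
# Forward pass with features 1 and 2 active: ideal pentagon, linear readout.
import numpy as np

angles = np.pi / 2 + 2 * np.pi * np.arange(5) / 5
W = np.stack([np.cos(angles), np.sin(angles)])   # (2, 5) pentagon directions

x = np.array([1., 1., 0., 0., 0.])               # features 1 and 2 on
h = W @ x                                        # 2-D hidden activation
y = W.T @ h                                      # linear readout, before ReLU/bias
print(y.round(2))  # [ 1.31  1.31 -0.5  -1.62 -0.5 ]: actives strongest,
                   # off-target entries nonzero = interference
```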
### Training Dynamics
Watching training in real-time reveals how the network discovers the optimal geometry:
1. Initially, feature directions are random
2. Gradient descent pushes features apart (to reduce interference)
3. Features rotate toward the polytope arrangement
4. The arrangement stabilizes when each feature is equidistant from its neighbors
The optimization process is a geometric dance, and toy models let us watch every step.
## What Transfers to Real Models
The toy model findings aren't just academic curiosities. They predict real phenomena in large language models.
### Superposition Ratio
Anthropic's work on Claude 3 Sonnet found over 2 million interpretable features in a 4,096-dimensional layer. That's a ratio of ~500 features per dimension.
Toy models predict exactly this: when features are sparse enough (as they are in natural language), you can pack exponentially more features than dimensions.
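A rough illustration of why such packing is geometrically possible: in high dimensions, even random unit vectors are nearly orthogonal. The sketch below uses 2,000 random directions in 4,096 dimensions (far fewer than a real feature dictionary, and random rather than learned, but enough to show the effect).

```{python}
#| eval: false
# Near-orthogonality of random directions in high dimensions.
import numpy as np

rng = np.random.default_rng(0)
d, k = 4096, 2000
V = rng.normal(size=(k, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)   # unit vectors

sims = V @ V.T
np.fill_diagonal(sims, 0)
print(float(np.abs(sims).max()))  # typically below 0.1: almost orthogonal
```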
### Sparsity Matches
Real features are sparse. The "Golden Gate Bridge" feature activates on ~0.05% of inputs. Most semantic features activate on less than 1% of inputs.
This matches the toy model regime where superposition is optimal—sparsity below ~1% enables heavy compression without unbearable interference.
### Polysemanticity Explained
The polysemantic neuron from Chapter 5 (cat faces + car fronts) is exactly what toy models predict. Each neuron participates in multiple feature directions. Looking at a single neuron shows a confusing mix; looking at the feature direction shows a clean concept.
### Steering Works
Adding a feature direction to activations shifts behavior predictably. This only works because features are linear directions—exactly the geometry that toy models reveal.
::: {.callout-tip}
## The Transfer Principle
Toy models operate by the same mathematical principles as large models: gradient descent finding efficient representations of sparse features in limited dimensions. The numbers differ (5 vs. 2 million features), but the geometry is the same.
:::
## What Doesn't Transfer
Toy models are deliberately simple. Some aspects don't generalize.
### Ground Truth
In toy models, we know the true features—we created them. In real models, we don't know which features the model "should" have. We're inferring features from behavior, not checking against ground truth.
### Feature Complexity
Toy model features are binary (on/off) and independent. Real features are continuous, context-dependent, and hierarchically organized. "French" doesn't just turn on and off; it has degrees and interacts with other concepts.
### Architecture
Toy models are simple autoencoders. Real transformers have attention, layer norms, residual connections, and dozens of layers. The compositional structure of transformers creates phenomena (like induction heads, which we'll cover in Chapter 13) that don't appear in toy models.
### Scale
5 features in 2 dimensions is tractable. 2 million features in 4,096 dimensions is a different beast. The principles may transfer, but the practical challenges of discovery and verification are much harder.
### Reconstruction Errors
When sparse autoencoders are trained on real model activations, they show 10-40% reconstruction error. This suggests the clean superposition picture from toy models is only approximate in practice. Real representations are messier.
::: {.callout-important}
## The Limitation
Toy models teach principles, not details. They explain *why* superposition exists and *how* it works geometrically. They don't tell us exactly which features a real model uses or how to find them efficiently.
:::
## Pedagogy: Why This Matters
Toy models aren't just research tools. They're essential for building intuition.
### Making the Abstract Concrete
"Features are directions in activation space" is abstract. Seeing a pentagon of 5 features in 2D makes it concrete. You can plot it, measure angles, verify predictions.
### Demystifying Optimization
"Gradient descent finds efficient representations" sounds magical. Watching a toy model rotate its weight vectors toward the optimal polytope demystifies it. Optimization is just following the gradient, and the gradient points toward good geometry.
### Building Transferable Intuition
Once you understand that 5 features form a pentagon in 2D, you have a mental model for what 2 million features might look like in 4,096D. You can't visualize it directly, but you understand the *kind* of structure that exists.
### Enabling Experimentation
Anyone can train a toy model in an afternoon. You can vary sparsity, count features, change dimensions, and observe results. This hands-on experimentation builds intuition that reading papers cannot.
## Polya's Perspective: The Simpler Problem
This chapter is a direct application of Polya's heuristic: **if you can't solve a problem, solve a simpler problem first**.
Superposition in GPT-4 is too complex to study directly. But superposition in a 5-feature, 2-dimensional autoencoder is fully tractable. We can:
- Verify that superposition occurs
- Observe the optimal geometry
- Measure the phase transitions
- Predict when superposition will vs. won't happen
Then we take these insights and apply them to the hard problem. We know what to look for. We know the geometry should be polytope-like. We know sparsity is the key parameter. We know polysemanticity is a cross-section of superposed directions.
::: {.callout-tip}
## Polya's Insight
"If you can't solve the problem, look for a related problem that you *can* solve." Toy models are that related problem. They share the essential structure (superposition) while removing the incidental complexity (billions of parameters, unknown features, compositional architecture).
:::
## Looking Ahead
We've now seen superposition in action—concrete, visual, fully understood.
But features in isolation don't explain how models *compute*. A language model doesn't just represent "French" and "cooking" as separate features; it combines them to process "French cuisine" and predict appropriate continuations.
How do features compose? How do multiple components work together to perform a computation? This is the question of **circuits**—the subject of the next chapter.
If features are atoms, circuits are molecules. And just as understanding atoms prepared chemists to understand molecules, understanding features prepares us to understand how they combine into computational structures.
---
## Further Reading
1. **Toy Models of Superposition** — [Anthropic](https://transformer-circuits.pub/2022/toy_model/index.html): The foundational paper. Essential reading with interactive visualizations.
2. **Interactive Toy Model** — [Mitchell Gordon](https://mitchgordon.me/ToyModels/): Train toy models in your browser and watch superposition emerge in real time.
3. **Toy Models of Superposition: A Walkthrough** — [Neel Nanda](https://www.neelnanda.io/mechanistic-interpretability/toy-model-walkthrough): Step-by-step guide to understanding and reproducing the experiments.
4. **The Persian Rug: Solving Toy Models Analytically** — [arXiv:2410.12101](https://arxiv.org/abs/2410.12101): Mathematical analysis proving optimality of polytope arrangements.
5. **Dynamical versus Bayesian Phase Transitions** — [arXiv:2310.06301](https://arxiv.org/abs/2310.06301): Deep dive into the phase transition phenomenon and its mathematical structure.
6. **Scaling Monosemanticity** — [Anthropic](https://transformer-circuits.pub/2024/scaling-monosemanticity/index.html): Applying toy model insights to real models, finding millions of features in Claude.