13  Arc II Summary

What you’ve learned about features, superposition, and circuits

Tip: Congratulations!

You’ve completed Arc II: Core Theory. Before moving to Arc III (Techniques), consolidate what you’ve learned.

13.1 The Big Picture

You now understand what networks represent and how they compute:

Arc II (Core Theory): Chapter 5 (Features) → Chapter 6 (Superposition) → Chapter 7 (Toy Models) → Chapter 8 (Circuits)

13.2 Key Concepts to Remember

13.2.1 From Chapter 5: Features

  • Features are directions in activation space, not neurons
  • Neurons are polysemantic (respond to multiple unrelated concepts)
  • The goal: decompose polysemantic neurons into monosemantic features
  • Evidence: linear probes work, steering works, vector arithmetic works
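
To make "features are directions, not neurons" concrete, here is a minimal numpy sketch. The vectors are random stand-ins (a hypothetical `feature_direction` and `activation`), not data from any real model, but the arithmetic is exactly what probing and steering rely on:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 512                                   # hypothetical activation width

# A feature is a unit-norm direction in activation space, not a single neuron.
feature_direction = rng.normal(size=d_model)
feature_direction /= np.linalg.norm(feature_direction)

activation = rng.normal(size=d_model)           # stand-in activation vector

# "Linear probes work": the feature's strength is just a projection.
strength = activation @ feature_direction

# "Steering works": adding the direction strengthens the feature's expression.
steered = activation + 3.0 * feature_direction
print(strength, steered @ feature_direction)    # second value is larger by ~3.0
```

Nothing in this sketch privileges the neuron basis: the same dot-product and vector-addition arithmetic works for any direction in the space, which is the point.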

13.2.2 From Chapter 6: Superposition

  • Networks pack more features than dimensions using almost-orthogonal directions
  • Sparsity enables this: if features rarely co-occur, interference is tolerable
  • Phase transition: at critical sparsity, networks switch from dedicated → superposed
  • This is why polysemanticity exists and interpretation is hard
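
A quick numpy experiment (illustrative sizes only) shows why "almost orthogonal" plus sparsity is enough: random unit directions in d dimensions overlap by roughly 1/sqrt(d), so a handful of simultaneously active features creates only modest interference:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 256, 2000                     # 2000 features squeezed into 256 dimensions

# Random unit vectors are almost orthogonal: pairwise overlaps concentrate near 0.
W = rng.normal(size=(n, d))
W /= np.linalg.norm(W, axis=1, keepdims=True)
overlaps = W @ W.T
print(np.abs(overlaps[~np.eye(n, dtype=bool)]).mean())      # ~0.05, on the order of 1/sqrt(d)

# Sparsity makes the interference tolerable: only a few features fire at once.
active = rng.choice(n, size=5, replace=False)
x = W[active].sum(axis=0)            # superposed activation of 5 active features
readout = W @ x                      # decode every feature by projection
print(readout[active].round(2))      # active features read out near 1
print(np.abs(np.delete(readout, active)).mean().round(2))    # inactive ones stay small
```

If most of the 2000 features were active at once, the same readout would be swamped by interference, which is the dense-feature regime where dedicated dimensions win.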

13.2.3 From Chapter 7: Toy Models

  • We can study superposition in controlled settings where we know ground truth
  • Optimal arrangements form regular polytopes (pentagons in 2D, etc.)
  • Toy models confirm: networks discover mathematically optimal geometric arrangements
  • What we learn transfers (imperfectly) to real models
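
Here is a compressed sketch of the kind of toy model this refers to, loosely in the spirit of Anthropic's "Toy Models of Superposition" setup: a few sparse synthetic features, a 2-dimensional bottleneck, and a ReLU reconstruction. The hyperparameters and training details are illustrative choices, not the paper's:

```python
import torch

torch.manual_seed(0)
n_features, d_hidden = 5, 2          # 5 ground-truth features, 2-D bottleneck
feature_prob = 0.05                  # high sparsity: each feature is usually off

W = torch.nn.Parameter(0.1 * torch.randn(n_features, d_hidden))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for _ in range(10_000):
    # Sparse synthetic data where we know the ground-truth features exactly.
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < feature_prob)
    h = x @ W                        # compress into 2 dimensions
    x_hat = torch.relu(h @ W.T + b)  # reconstruct through the shared weights
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each row of W is one feature's direction in the 2-D hidden space. At high
# sparsity the rows tend toward equal lengths and equal angular spacing,
# i.e. the pentagon-like "regular polytope" arrangement.
angles = torch.rad2deg(torch.atan2(W[:, 1], W[:, 0])).detach().sort().values
print(angles.diff())                 # gaps approach ~72 degrees when a pentagon forms
```

Lowering the sparsity in this sketch tends to collapse the solution to roughly as many represented features as there are dimensions, which is the dedicated-representation regime from Chapter 6.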

13.2.4 From Chapter 8: Circuits

  • Features are atoms; circuits are molecules (compositions of features)
  • A circuit: identifiable subnetwork that performs a specific computation
  • IOI circuit: 26 heads in GPT-2 Small for indirect object identification
  • Composition types: Q-composition, K-composition, V-composition
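
One way to make "composition" quantitative is the composition score from the Transformer Circuits framework: how much of one head's OV output survives multiplication into another head's QK (or OV) circuit. A toy numpy version, with random low-rank matrices standing in for real attention weights (all shapes and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8

def low_rank(d_out, d_in):
    """Random rank-d_head matrix, standing in for a head's OV or QK circuit."""
    return rng.normal(size=(d_out, d_head)) @ rng.normal(size=(d_head, d_in))

W_OV_head1 = low_rank(d_model, d_model)   # what head 1 writes into the residual stream
W_QK_head2 = low_rank(d_model, d_model)   # how head 2 scores queries against keys

def composition_score(A, B):
    """||A @ B||_F / (||A||_F * ||B||_F): how strongly B's output feeds into A."""
    return np.linalg.norm(A @ B) / (np.linalg.norm(A) * np.linalg.norm(B))

# K-composition: head 1's output enters head 2's computation on the key side,
# so we measure how much of W_OV_head1's output survives W_QK_head2.
print(composition_score(W_QK_head2, W_OV_head1))
```

In real analyses this score is compared against the value obtained from random matrices of the same shape, since low-rank random matrices already give a nonzero baseline.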

13.3 The Theory You’ve Built

  • Features as directions: We know what to look for
  • Superposition: We understand why interpretation is hard
  • Phase transitions: Representation strategy depends on sparsity
  • Circuits: Features compose into interpretable algorithms

13.4 Self-Test: Can You Answer These?

How can a network represent more concepts than it has dimensions?

Superposition. Networks use almost-orthogonal directions that are close enough to perpendicular that interference is manageable. The key insight is that sparsity makes this work: if features rarely co-occur, the interference when they do overlap is acceptable.

What is the difference between a polysemantic neuron and a monosemantic feature?

  • Polysemantic neuron: Responds to multiple unrelated concepts (cat faces AND car fronts)
  • Monosemantic feature: A direction in activation space corresponding to one interpretable concept

Features are the meaningful unit; neurons are just one arbitrary basis for the space.

What is K-composition, and how do induction heads use it?

K-composition: One head’s output modifies another head’s keys, changing what the second head attends to.

For induction heads: The previous token head writes “B follows A” into the residual stream. This information becomes part of the keys that the induction head searches. The induction head can then find positions where the previous token matches the current token—enabling pattern completion.
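
Written out as ordinary code, the behavior this circuit implements is simply "find the last time you saw the current token and predict whatever followed it." A plain-Python imitation of that behavior (not model code):

```python
def induction_prediction(tokens):
    """[A][B] ... [A] -> predict [B]: copy whatever followed the current
    token the last time it appeared."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan earlier positions, newest first
        if tokens[i] == current:               # a previous occurrence of the current token
            return tokens[i + 1]               # predict the token that followed it
    return None                                # no earlier occurrence: no prediction

print(induction_prediction(["The", "cat", "sat", ".", "The"]))   # -> "cat"
```

The attention-head version splits this loop across two heads: the previous-token head tags each position with what preceded it, and the induction head's K-composition lets the current token match against those tags.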

13.5 Achievements Unlocked

Tip: You Can Now…

After completing Arc II, you have new abilities:

  • Distinguish neurons from features—and explain why features are directions, not individual units
  • Explain superposition at a cocktail party: “Networks pack more concepts than dimensions by using almost-orthogonal directions”
  • Predict when superposition works: sparse features enable heavy compression; dense features force dedicated representations
  • Understand polysemanticity as a symptom of superposition, not a bug
  • Read circuits papers and understand what “head composition” means (Q/K/V composition)

You have a theoretical framework, and you understand how networks organize their representations. Arc III gives you the tools to find those features and circuits in real models.

13.6 What’s Next

Arc III introduces the techniques:

  • Chapter 9 (SAEs): Sparse autoencoders, the tool for extracting features from superposition
  • Chapter 10 (Attribution): Which components contributed to the output?
  • Chapter 11 (Patching): Does this component cause the behavior?
  • Chapter 12 (Ablation): What happens if we remove it?

You have the theory. Now you’ll learn how to apply it.