```mermaid
flowchart LR
  subgraph ARC2["Arc II: The Theory"]
    A5["Chapter 5<br/>Features"] --> A6["Chapter 6<br/>Superposition"]
    A6 --> A7["Chapter 7<br/>Toy Models"]
    A7 --> A8["Chapter 8<br/>Circuits"]
  end
```
13 Arc II Summary
What you’ve learned about features, superposition, and circuits
You’ve completed Arc II: Core Theory. Before moving to Arc III (Techniques), consolidate what you’ve learned.
13.1 The Big Picture
You now understand what networks represent and how they compute: features as directions (Chapter 5), packed into superposition (Chapter 6), studied in toy models (Chapter 7), and composed into circuits (Chapter 8).
13.2 Key Concepts to Remember
13.2.1 From Chapter 5: Features
- Features are directions in activation space, not neurons
- Neurons are polysemantic (respond to multiple unrelated concepts)
- The goal: decompose polysemantic neurons into monosemantic features
- Evidence: linear probes work, steering works, vector arithmetic works (see the probe sketch below)
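To make "features are directions" concrete, here is a minimal linear-probe sketch. Everything in it is synthetic: the activations, labels, and planted concept direction are stand-ins for data you would actually collect from a model layer. The point is simply that a linear classifier's weight vector is an estimate of a feature direction.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for real data: activations from some layer (n_examples x d_model)
# plus binary labels for a concept. In practice you would collect these from a model.
rng = np.random.default_rng(0)
n, d_model = 2000, 512
true_direction = rng.normal(size=d_model)              # planted "concept" direction
true_direction /= np.linalg.norm(true_direction)
labels = rng.integers(0, 2, size=n)                    # 1 = concept present
activations = rng.normal(size=(n, d_model)) + 3.0 * np.outer(labels, true_direction)

# Linear probe: logistic regression straight on the activations.
probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("probe accuracy:", probe.score(activations, labels))

# The probe's weight vector approximates the concept's direction in activation space.
learned = probe.coef_[0]
cosine = learned @ true_direction / np.linalg.norm(learned)
print("cosine with planted direction:", round(float(cosine), 3))
```

Steering is the converse experiment: add a multiple of that direction back into the activations and check whether behavior shifts.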
13.2.2 From Chapter 6: Superposition
- Networks pack more features than dimensions using almost-orthogonal directions
- Sparsity enables this: if features rarely co-occur, interference is tolerable
- Phase transition: at critical sparsity, networks switch from dedicated → superposed
- This is why polysemanticity exists and interpretation is hard
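A quick numerical illustration of the geometry (purely synthetic numbers, no real model involved): random unit vectors in even a moderately sized space are nearly orthogonal, so far more "feature directions" than dimensions can coexist with only small pairwise interference.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_features = 100, 1000                      # 10x more directions than dimensions

# Random unit vectors as stand-ins for feature directions.
dirs = rng.normal(size=(n_features, d))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

# Interference = |cosine similarity| between distinct directions.
cos = np.abs(dirs @ dirs.T)
np.fill_diagonal(cos, 0.0)
print("mean interference:", round(float(cos.mean()), 3))   # small compared to 1
print("max interference: ", round(float(cos.max()), 3))    # still well below 1
```

Sparsity is what makes the residual interference tolerable: if only a handful of these directions are active at once, a linear readout rarely has to pay the overlap cost.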
13.2.3 From Chapter 7: Toy Models
- We can study superposition in controlled settings where we know ground truth
- Optimal arrangements form regular polytopes (pentagons in 2D, etc.)
- Toy models confirm: networks discover mathematically optimal geometric arrangements
- What we learn transfers (imperfectly) to real models
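Below is a hedged PyTorch sketch in the spirit of the toy models from Chapter 7; the sizes, sparsity level, and training details are illustrative choices, not any paper's exact setup. Five sparse features are forced through a two-dimensional bottleneck via x̂ = ReLU(WᵀWx + b), and the learned per-feature directions are what you would plot to see the geometric arrangement.

```python
import torch

torch.manual_seed(0)
n_features, d_hidden, p_active = 5, 2, 0.05    # 5 features into 2 dims, each active 5% of the time

W = torch.nn.Parameter(0.1 * torch.randn(d_hidden, n_features))
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-2)

for step in range(5000):
    # Sparse synthetic features: each is independently active (uniform in [0,1]) with prob p_active.
    x = torch.rand(1024, n_features) * (torch.rand(1024, n_features) < p_active).float()

    x_hat = torch.relu(x @ W.T @ W + b)        # reconstruct through the bottleneck: ReLU(W^T W x + b)
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of W (row of W.T) is the 2D direction assigned to one feature; at high sparsity
# the directions tend toward a regular arrangement, e.g. five of them spread around a pentagon.
print(W.detach().T)
```

Sweeping p_active is one way to probe the dedicated-versus-superposed transition described in Chapter 6.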
13.2.4 From Chapter 8: Circuits
- Features are atoms; circuits are molecules (compositions of features)
- A circuit: identifiable subnetwork that performs a specific computation
- IOI circuit: 26 heads in GPT-2 Small for indirect object identification
- Composition types: Q-composition, K-composition, V-composition
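To give "head composition" a concrete form, here is a rough numpy sketch of the weight-space composition scores used in the transformer-circuits framework: how strongly one head's output subspace feeds another head's query, key, or value inputs. The weights below are random placeholders; in practice you would load a trained model's head matrices, and the random-matrix baseline correction used in practice is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head = 64, 8                       # illustrative sizes

def random_head():
    # Placeholder weights for one attention head; real ones would come from a trained model.
    return {
        "W_Q": rng.normal(size=(d_head, d_model)),
        "W_K": rng.normal(size=(d_head, d_model)),
        "W_V": rng.normal(size=(d_head, d_model)),
        "W_O": rng.normal(size=(d_model, d_head)),
    }

def composition_score(reader, writer):
    # Frobenius-norm ratio: how much `reader` picks up from the subspace `writer` writes into.
    return np.linalg.norm(reader @ writer) / (np.linalg.norm(reader) * np.linalg.norm(writer))

h1, h2 = random_head(), random_head()
W_OV_1 = h1["W_O"] @ h1["W_V"]                # what head 1 writes into the residual stream
W_OV_2 = h2["W_O"] @ h2["W_V"]
W_QK_2 = h2["W_Q"].T @ h2["W_K"]              # bilinear form behind head 2's attention scores

print("Q-composition:", composition_score(W_QK_2.T, W_OV_1))   # head 1 output -> head 2 queries
print("K-composition:", composition_score(W_QK_2, W_OV_1))     # head 1 output -> head 2 keys
print("V-composition:", composition_score(W_OV_2, W_OV_1))     # head 1 output -> head 2 values/output
```

Under this kind of scoring, an induction head shows high K-composition with a previous-token head, the quantitative version of the induction story revisited in the self-test below.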
13.3 The Theory You’ve Built
| Concept | Key Insight |
|---|---|
| Features as directions | We know what to look for |
| Superposition | We understand why interpretation is hard |
| Phase transitions | Representation strategy depends on sparsity |
| Circuits | Features compose into interpretable algorithms |
13.4 Self-Test: Can You Answer These?
Why can networks represent more features than they have dimensions?
Superposition. Networks use almost-orthogonal directions, accepting a small amount of interference between them. The key insight: sparsity makes this work. If features rarely co-occur, the interference when they do overlap is acceptable.
What is the difference between a polysemantic neuron and a monosemantic feature?
- Polysemantic neuron: Responds to multiple unrelated concepts (cat faces AND car fronts)
- Monosemantic feature: A direction in activation space corresponding to one interpretable concept
Features are the meaningful unit; neurons are just one arbitrary basis for the space.
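A tiny, entirely synthetic numpy illustration of that last point: when two perfectly interpretable feature directions are not aligned with the neuron basis, every neuron reads as polysemantic even though each feature direction is monosemantic.

```python
import numpy as np

# Two synthetic, interpretable feature directions in a 2-neuron space,
# deliberately rotated away from the neuron basis.
f_cat = np.array([0.8, 0.6])     # "cat face" direction (unit norm)
f_car = np.array([-0.6, 0.8])    # "car front" direction (unit norm, orthogonal to f_cat)

a_cat = 1.0 * f_cat              # activation when only the cat-face feature is present
a_car = 1.0 * f_car              # activation when only the car-front feature is present

# Reading individual neurons (basis coordinates): each neuron responds to BOTH concepts.
for i in range(2):
    print(f"neuron {i}: cat input -> {a_cat[i]:+.2f}, car input -> {a_car[i]:+.2f}")

# Reading the feature directions instead: each responds to exactly one concept.
print("cat direction:", round(f_cat @ a_cat, 2), round(f_cat @ a_car, 2))   # 1.0, 0.0
print("car direction:", round(f_car @ a_cat, 2), round(f_car @ a_car, 2))   # 0.0, 1.0
```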
What is K-composition, and how do induction heads use it?
K-composition: one head's output modifies another head's keys, changing what the second head attends to.
For induction heads: on a sequence "... A B ... A", the previous-token head writes "the token before me was A" into the residual stream at B's position. That information becomes part of the keys the induction head searches. When the current token is A again, the induction head's query matches those keys, so it attends to B and copies it, enabling pattern completion.
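As a sanity check on that story, here is the induction algorithm written as explicit Python rather than attention arithmetic. This is purely illustrative: a real induction head implements the same lookup softly, via K-composition, not with a loop.

```python
def induction_prediction(tokens):
    """Prefix matching and copying: find an earlier occurrence of the current token
    and predict whatever followed it (the induction-head algorithm made explicit)."""
    current = tokens[-1]
    # Scan backwards for a position whose *previous* token matches the current token
    # (the role the previous-token head's output plays in the induction head's keys)...
    for pos in range(len(tokens) - 2, 0, -1):
        if tokens[pos - 1] == current:
            # ...then attend there and copy its token as the prediction.
            return tokens[pos]
    return None  # nothing to copy: no earlier occurrence of the current token

# "... A B ... A" -> predict "B"
print(induction_prediction(["The", "cat", "sat", ".", "The"]))   # -> "cat"
```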
13.5 Achievements Unlocked
After completing Arc II, you have new abilities:
- Distinguish neurons from features—and explain why features are directions, not individual units
- Explain superposition at a cocktail party: “Networks pack more concepts than dimensions by using almost-orthogonal directions”
- Predict when superposition works: sparse features enable heavy compression; dense features force dedicated representations
- Understand polysemanticity as a symptom of superposition, not a bug
- Read circuits papers and understand what “head composition” means (Q/K/V composition)
You have a theoretical framework: you understand how networks organize their representations. Arc III gives you the tools to find those features and circuits in real models.
13.6 What’s Next
Arc III introduces the techniques:
- Chapter 9 (SAEs): The tool for extracting features from superposition
- Chapter 10 (Attribution): Which components contributed to the output?
- Chapter 11 (Patching): Does this component cause the behavior?
- Chapter 12 (Ablation): What happens if we remove it?
You have the theory. Now you’ll learn how to apply it.