8 Arc I Summary
What you’ve learned about the computational substrate
You’ve completed Arc I: Foundations. Before moving to Arc II (Core Theory), consolidate what you’ve learned.
8.1 The Big Picture
You now understand what we’re reverse-engineering:

```mermaid
flowchart LR
    subgraph ARC1["Arc I: What We're Studying"]
        A1["Chapter 1<br/>Why Interpret?"] --> A2["Chapter 2<br/>Transformers"]
        A2 --> A3["Chapter 3<br/>Residual Stream"]
        A3 --> A4["Chapter 4<br/>Geometry"]
    end
```
8.2 Key Concepts to Remember
8.2.1 From Chapter 1: Why Reverse-Engineer?
- Neural networks discover algorithms we didn’t teach them
- We have all the weights but don’t understand the algorithms
- Goal: Find the how behind the what
8.2.2 From Chapter 2: Transformers
- Transformers are matrix multiplication machines
- Attention = soft dictionary lookup (Q, K, V); see the sketch after this list
- MLPs = pattern-value memories storing knowledge
- Everything is linear algebra + a few nonlinearities (softmax, GELU)
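If the soft dictionary lookup feels abstract, here is a minimal NumPy sketch of single-head attention (causal masking and multiple heads omitted); the shapes and variable names are illustrative, not taken from any particular codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Single-head attention as a soft dictionary lookup over the sequence."""
    Q = X @ W_Q                              # queries: "what am I looking for?"
    K = X @ W_K                              # keys:    "what do I contain?"
    V = X @ W_V                              # values:  "what should I provide?"
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # match every query against every key
    weights = softmax(scores, axis=-1)       # each row sums to 1: a soft lookup
    return weights @ V                       # weighted sum of values

# Toy sizes: 5 tokens, d_model = 8, head dimension 4 (all illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
print(attention(X, W_Q, W_K, W_V).shape)     # (5, 4)
```

Because each row of `weights` sums to 1, every position’s output is literally a weighted average of value vectors, which is all the “soft lookup” slogan means.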
8.2.3 From Chapter 3: Residual Stream
- The residual stream is a shared workspace
- Output = embedding + Σ(all component contributions)
- Components read from and write to the stream—they never talk directly
- The additive structure enables decomposition (attribution); a short sketch follows this list
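To see why the additive structure matters, here is a small NumPy sketch with made-up sizes: because the stream is a running sum and the unembedding is linear, any logit splits exactly into per-component contributions. (Real transformers apply LayerNorm before unembedding, which complicates but does not break this picture.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_components, vocab = 16, 6, 100     # illustrative sizes

embedding = rng.normal(size=d_model)                      # the token's starting state
writes = 0.3 * rng.normal(size=(n_components, d_model))   # each head/MLP adds its output
W_U = rng.normal(size=(d_model, vocab))                   # unembedding matrix

# The residual stream is just a running sum: nothing ever overwrites anything.
final_stream = embedding + writes.sum(axis=0)

# Linearity of the unembedding means a logit decomposes into per-component terms.
token = 42
total_logit = final_stream @ W_U[:, token]
contribs = [embedding @ W_U[:, token]] + [w @ W_U[:, token] for w in writes]
assert np.isclose(sum(contribs), total_logit)

names = ["embedding"] + [f"component {i}" for i in range(n_components)]
for name, c in zip(names, contribs):
    print(f"{name:>12}: {c:+.3f}")
print(f"{'total':>12}: {total_logit:+.3f}")
```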
8.2.4 From Chapter 4: Geometry
- Activations are points in high-dimensional space
- Meaning has geometric structure: king - man + woman ≈ queen
- Linear representation hypothesis: features are directions
- High dimensions: distances become noisy, but near-orthogonality is abundant (see the sketch below)
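The near-orthogonality point is easy to check yourself; this short sketch (arbitrary sizes) samples random unit vectors and measures how the average |cosine similarity| shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, n_vectors=200):
    """Average |cosine similarity| between pairs of random unit vectors in R^d."""
    V = rng.normal(size=(n_vectors, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    cos = V @ V.T
    off_diag = cos[~np.eye(n_vectors, dtype=bool)]   # drop the self-similarities
    return np.abs(off_diag).mean()

for d in (2, 16, 128, 1024):
    print(f"d = {d:5d}   mean |cos| ≈ {mean_abs_cosine(d):.3f}")

# As d grows, random directions become nearly orthogonal, which is why a
# d-dimensional space has room for far more than d almost-distinct directions.
```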
8.3 The Foundation You’ve Built
| Concept | What It Means for Interpretability |
|---|---|
| Matrix multiplications | We can trace computation through linear algebra |
| Residual stream | We can decompose outputs into component contributions |
| Additive structure | Attribution is tractable |
| Geometric representations | Features are directions we can find and manipulate |
8.4 Self-Test: Can You Answer These?
Why can we decompose a transformer’s output into per-component contributions?
Because the final output is a sum of contributions. We can measure exactly how much each attention head and MLP contributed to any prediction. Without this additivity, decomposition wouldn’t work.
What roles do the query, key, and value vectors play in attention?
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information should I provide?”
Attention weight = softmax(Q · K^T / √d). The output is a weighted sum of values.
Why isn’t the individual neuron the right unit of analysis?
Because the network doesn’t know which axis is which—it just learns useful activation patterns. The meaningful structure is geometric (directions, distances, projections), not tied to the arbitrary neuron basis. A “cat” feature might involve many neurons, not just one; a short sketch follows.
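A small illustration of that last answer, with a made-up “cat” direction: the feature is read by projecting onto a direction spread across many neurons, not by inspecting any single coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# A hypothetical "cat" feature: a unit direction that mixes many neurons,
# not aligned with any single axis of the neuron basis.
cat_direction = rng.normal(size=d_model)
cat_direction /= np.linalg.norm(cat_direction)

# An activation carrying "2 units of cat" plus unrelated noise.
activation = 2.0 * cat_direction + 0.5 * rng.normal(size=d_model)

print("feature reading (projection):", activation @ cat_direction)   # ≈ 2
print("largest single neuron:", np.abs(activation).max())            # no one neuron tells the story
```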
8.5 Achievements Unlocked
After completing Arc I, you have new abilities:
- Read transformer code and understand what each component does (Q, K, V, MLP, residual connections)
- Visualize activations as points in high-dimensional space, not just lists of numbers
- Understand why “king - man + woman ≈ queen” works—and why this matters for interpretation
- Explain the residual stream as a shared workspace that enables component decomposition
- Appreciate the scale of high-dimensional spaces (near-orthogonality, concentration of measure)
You’ve built the mental model. You understand what we’re looking at. Arc II answers what it represents.
8.6 What’s Next
Arc II introduces the core theory:
- Chapter 5 (Features): What are the atoms of meaning in activation space?
- Chapter 6 (Superposition): How do networks pack more features than dimensions?
- Chapter 7 (Toy Models): How can we study superposition in controlled settings?
- Chapter 8 (Circuits): How do features compose into algorithms?
You have the substrate. Now you’ll learn what’s represented in it.
8.7 Discussion Questions (For Reading Groups)
If you’re reading this with others, here are questions worth debating:
Chapter 1 presented the case for reverse-engineering neural networks. But is this actually the best path to understanding and safety? What are the strongest arguments against this approach? Discuss:
- Could we achieve safety without understanding? (Testing, monitoring, constraints)
- Is understanding fundamentally intractable at scale?
- Are there other approaches (formal verification, structured architectures) that might work better?
The residual stream perspective treats components as independent readers/writers. But they’re not truly independent—they share weights, they’re co-trained, they might develop coordinated behaviors. What might we miss by thinking of them separately?
The linear representation hypothesis is elegant and has strong evidence. But convenience and truth aren’t the same thing. What would it mean for interpretability if:
- Features are actually curved manifolds, not straight lines?
- Features are context-dependent (different direction in different contexts)?
- High-level reasoning doesn’t decompose into features at all?
Each person: name one thing you currently believe about interpretability that could be wrong. What evidence would convince you it’s wrong?
8.8 Partner Exercise
If you have a coding partner, try this 20-minute exercise:
Person A: Use the logit lens to analyze “The capital of Germany is ” (a minimal starter sketch appears at the end of this section)
Person B: Use the logit lens to analyze “2 + 2 = ”
Compare your findings:
- Which layer does the prediction “lock in”?
- Does factual knowledge work differently from arithmetic?
- What would you investigate next?
Then swap prompts and see if your partner found the same patterns.
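If you want a starting point, here is a minimal logit-lens sketch for Person A’s prompt using Hugging Face’s GPT-2; whatever tooling you’ve been using for the exercises may differ. It decodes the residual stream after every block through the model’s final LayerNorm and unembedding.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The capital of Germany is"   # Person A's prompt (trailing space dropped)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] is the residual
# stream after block i. Decode each through the final LayerNorm + unembedding.
for i, h in enumerate(out.hidden_states):
    resid = model.transformer.ln_f(h[:, -1])            # last token position only
    top_id = model.lm_head(resid).argmax(dim=-1).item()
    label = "embeddings" if i == 0 else f"after block {i:2d}"
    print(f"{label}: {tok.decode(top_id)!r}")
```

Watch for the layer where the top prediction stops changing; that is the “lock in” point the exercise asks about.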