8 Arc I Summary
What you’ve learned about the computational substrate
You’ve completed Arc I: Foundations. Before moving to Arc II (Core Theory), consolidate what you’ve learned.
8.1 The Big Picture
You now understand what we’re reverse-engineering:

```mermaid
flowchart LR
    subgraph ARC1["Arc I: What We're Studying"]
        A1["Chapter 1<br/>Why Interpret?"] --> A2["Chapter 2<br/>Transformers"]
        A2 --> A3["Chapter 3<br/>Residual Stream"]
        A3 --> A4["Chapter 4<br/>Geometry"]
    end
```
8.2 Key Concepts to Remember
8.2.1 From Chapter 1: Why Reverse-Engineer?
- Neural networks discover algorithms we didn’t teach them
- We have all the weights but don’t understand the algorithms
- Goal: Find the how behind the what
8.2.2 From Chapter 2: Transformers
- Transformers are matrix multiplication machines
- Attention = soft dictionary lookup (Q, K, V); see the sketch after this list
- MLPs = pattern-value memories storing knowledge
- Everything is linear algebra + a few nonlinearities (softmax, GELU)
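If the soft dictionary lookup feels abstract, here is a minimal NumPy sketch of single-head attention (causal masking and multiple heads omitted); the shapes and variable names are illustrative, not taken from any particular codebase.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    """Single-head attention as a soft dictionary lookup over the sequence."""
    Q = X @ W_Q                              # queries: "what am I looking for?"
    K = X @ W_K                              # keys:    "what do I contain?"
    V = X @ W_V                              # values:  "what should I provide?"
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # match every query against every key
    weights = softmax(scores, axis=-1)       # each row sums to 1: a soft lookup
    return weights @ V                       # weighted sum of values

# Toy sizes: 5 tokens, d_model = 8, head dimension 4 (all illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 4)) for _ in range(3))
print(attention(X, W_Q, W_K, W_V).shape)     # (5, 4)
```

Because each row of `weights` sums to 1, every position’s output is literally a weighted average of value vectors, which is all the “soft lookup” slogan means.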
8.2.3 From Chapter 3: Residual Stream
- The residual stream is a shared workspace
- Output = embedding + Σ(all component contributions)
- Components read from and write to the stream—they never talk directly
- The additive structure enables decomposition (attribution); a short sketch follows this list
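To see why the additive structure matters, here is a small NumPy sketch with made-up sizes: because the stream is a running sum and the unembedding is linear, any logit splits exactly into per-component contributions. (Real transformers apply LayerNorm before unembedding, which complicates but does not break this picture.)

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_components, vocab = 16, 6, 100     # illustrative sizes

embedding = rng.normal(size=d_model)                      # the token's starting state
writes = 0.3 * rng.normal(size=(n_components, d_model))   # each head/MLP adds its output
W_U = rng.normal(size=(d_model, vocab))                   # unembedding matrix

# The residual stream is just a running sum: nothing ever overwrites anything.
final_stream = embedding + writes.sum(axis=0)

# Linearity of the unembedding means a logit decomposes into per-component terms.
token = 42
total_logit = final_stream @ W_U[:, token]
contribs = [embedding @ W_U[:, token]] + [w @ W_U[:, token] for w in writes]
assert np.isclose(sum(contribs), total_logit)

names = ["embedding"] + [f"component {i}" for i in range(n_components)]
for name, c in zip(names, contribs):
    print(f"{name:>12}: {c:+.3f}")
print(f"{'total':>12}: {total_logit:+.3f}")
```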
8.2.4 From Chapter 4: Geometry
- Activations are points in high-dimensional space
- Meaning has geometric structure: king - man + woman ≈ queen
- Linear representation hypothesis: features are directions
- High dimensions: distances become noisy, but near-orthogonality is abundant (see the sketch below)
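The near-orthogonality point is easy to check yourself; this short sketch (arbitrary sizes) samples random unit vectors and measures how the average |cosine similarity| shrinks as the dimension grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, n_vectors=200):
    """Average |cosine similarity| between pairs of random unit vectors in R^d."""
    V = rng.normal(size=(n_vectors, d))
    V /= np.linalg.norm(V, axis=1, keepdims=True)
    cos = V @ V.T
    off_diag = cos[~np.eye(n_vectors, dtype=bool)]   # drop the self-similarities
    return np.abs(off_diag).mean()

for d in (2, 16, 128, 1024):
    print(f"d = {d:5d}   mean |cos| ≈ {mean_abs_cosine(d):.3f}")

# As d grows, random directions become nearly orthogonal, which is why a
# d-dimensional space has room for far more than d almost-distinct directions.
```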
8.3 The Foundation You’ve Built
| Concept | What It Means for Interpretability |
|---|---|
| Matrix multiplications | We can trace computation through linear algebra |
| Residual stream | We can decompose outputs into component contributions |
| Additive structure | Attribution is tractable |
| Geometric representations | Features are directions we can find and manipulate |
8.4 Self-Test: Can You Answer These?
Why can we decompose a transformer’s output into per-component contributions?
Because the final output is a sum of contributions. We can measure exactly how much each attention head and MLP contributed to any prediction. Without this additivity, decomposition wouldn’t work.
What roles do the query, key, and value vectors play in attention?
- Query (Q): “What am I looking for?”
- Key (K): “What do I contain?”
- Value (V): “What information should I provide?”
Attention weight = softmax(Q · K^T / √d). The output is a weighted sum of values.
Why isn’t the individual neuron the right unit of analysis?
Because the network doesn’t know which axis is which—it just learns useful activation patterns. The meaningful structure is geometric (directions, distances, projections), not tied to the arbitrary neuron basis. A “cat” feature might involve many neurons, not just one; a short sketch follows.
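A small illustration of that last answer, with a made-up “cat” direction: the feature is read by projecting onto a direction spread across many neurons, not by inspecting any single coordinate.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# A hypothetical "cat" feature: a unit direction that mixes many neurons,
# not aligned with any single axis of the neuron basis.
cat_direction = rng.normal(size=d_model)
cat_direction /= np.linalg.norm(cat_direction)

# An activation carrying "2 units of cat" plus unrelated noise.
activation = 2.0 * cat_direction + 0.5 * rng.normal(size=d_model)

print("feature reading (projection):", activation @ cat_direction)   # ≈ 2
print("largest single neuron:", np.abs(activation).max())            # no one neuron tells the story
```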
8.5 Achievements Unlocked
After completing Arc I, you have new abilities:
- Read transformer code and understand what each component does (Q, K, V, MLP, residual connections)
- Visualize activations as points in high-dimensional space, not just lists of numbers
- Understand why “king - man + woman ≈ queen” works—and why this matters for interpretation
- Explain the residual stream as a shared workspace that enables component decomposition
- Appreciate the scale of high-dimensional spaces (near-orthogonality, concentration of measure)
You’ve built the mental model. You understand what we’re looking at. Arc II answers what it represents.
8.6 What’s Next
Arc II introduces the core theory:
- Chapter 5 (Features): What are the atoms of meaning in activation space?
- Chapter 6 (Superposition): How do networks pack more features than dimensions?
- Chapter 7 (Toy Models): How can we study superposition in controlled settings?
- Chapter 8 (Circuits): How do features compose into algorithms?
You have the substrate. Now you’ll learn what’s represented in it.
8.7 Discussion Questions (For Reading Groups)
If you’re reading this with others, here are questions worth debating:
Chapter 1 presented the case for reverse-engineering neural networks. But is this actually the best path to understanding and safety? What are the strongest arguments against this approach? Discuss:
- Could we achieve safety without understanding? (Testing, monitoring, constraints)
- Is understanding fundamentally intractable at scale?
- Are there other approaches (formal verification, structured architectures) that might work better?
The residual stream perspective treats components as independent readers/writers. But they’re not truly independent—they share weights, they’re co-trained, they might develop coordinated behaviors. What might we miss by thinking of them separately?
The linear representation hypothesis is elegant and has strong evidence. But convenience and truth aren’t the same thing. What would it mean for interpretability if:
- Features are actually curved manifolds, not straight lines?
- Features are context-dependent (different direction in different contexts)?
- High-level reasoning doesn’t decompose into features at all?
Each person: name one thing you currently believe about interpretability that could be wrong. What evidence would convince you it’s wrong?
8.8 Partner Exercise
If you have a coding partner, try this 20-minute exercise:
Person A: Use the logit lens to analyze “The capital of Germany is ” (a minimal starter sketch appears at the end of this section)
Person B: Use the logit lens to analyze “2 + 2 = ”
Compare your findings:
- Which layer does the prediction “lock in”?
- Does factual knowledge work differently from arithmetic?
- What would you investigate next?
Then swap prompts and see if your partner found the same patterns.
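If you want a starting point, here is a minimal logit-lens sketch for Person A’s prompt using Hugging Face’s GPT-2; whatever tooling you’ve been using for the exercises may differ. It decodes the residual stream after every block through the model’s final LayerNorm and unembedding.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

prompt = "The capital of Germany is"   # Person A's prompt (trailing space dropped)
inputs = tok(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[i] is the residual
# stream after block i. Decode each through the final LayerNorm + unembedding.
for i, h in enumerate(out.hidden_states):
    resid = model.transformer.ln_f(h[:, -1])            # last token position only
    top_id = model.lm_head(resid).argmax(dim=-1).item()
    label = "embeddings" if i == 0 else f"after block {i:2d}"
    print(f"{label}: {tok.decode(top_id)!r}")
```

Watch for the layer where the top prediction stops changing; that is the “lock in” point the exercise asks about.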