flowchart LR
X0["x₀<br/>(embedding)"] --> X1["x₁ = x₀ + attn₁ + mlp₁"]
X1 --> X2["x₂ = x₁ + attn₂ + mlp₂"]
X2 --> X3["..."]
X3 --> XL["xₗ<br/>(final)"]
H1["Attn 1.0"] -.-> X1
H2["Attn 1.1"] -.-> X1
M1["MLP 1"] -.-> X1
H3["Attn 2.0"] -.-> X2
M2["MLP 2"] -.-> X2
6 The Residual Stream
A shared workspace for computation
- Why the “residual stream” perspective changes everything for interpretability
- How components read from and write to a shared workspace
- Why the additive structure enables decomposition
- The logit lens technique for reading intermediate predictions
Required: Chapter 2: Transformers — understanding attention and MLP layers
From Chapter 2, recall:
- What are Q, K, V in attention? (Query, Key, Value — for matching and retrieving information)
- What do MLPs do? (Store and retrieve knowledge via pattern-value associations)
- What’s the key insight about transformers? (They’re matrix multiplication machines)
6.1 A Shift in Perspective
In the previous chapter, we saw that transformers are matrix multiplication machines. Attention routes information between positions; MLPs store and retrieve knowledge. Layer by layer, the network transforms token embeddings into predictions.
Now we ask: is “layer-by-layer processing” really the right way to think about this? There is a different way to describe what’s happening, and it turns out to be far more useful for mechanistic interpretability.
Instead of thinking “layer 1 processes the input, then layer 2 processes layer 1’s output, then layer 3 processes layer 2’s output…”, think of it this way:
All components—every attention head and every MLP—read from and write to a single shared workspace called the residual stream.
This isn’t just a metaphor. It’s a precise description of what the architecture computes. And it fundamentally changes how we approach interpretation.
6.2 The Architecture, Revisited
Let’s look at what actually happens in a transformer forward pass. After embedding, each token is represented as a vector. Let’s call this initial vector \(x_0\).
Now, what does “layer 1” do? In the standard telling: it takes \(x_0\) as input and produces some output \(x_1\).
But look more carefully at the equations:
attention_out = Attention(x_0)
x_0.5 = x_0 + attention_out
mlp_out = MLP(x_0.5)
x_1 = x_0.5 + mlp_out
See those plus signs? The attention output isn’t replacing \(x_0\)—it’s being added to it. Same for the MLP output.
Expanding this out:
x_1 = x_0 + attention_out + mlp_out
The vector after layer 1 is the sum of the original embedding plus contributions from attention plus contributions from the MLP.
This continues through all layers:
x_L = x_0 + Σ(attention contributions) + Σ(MLP contributions)
The final representation is the original embedding plus accumulated contributions from every attention head and every MLP across the entire network.
The transformer doesn’t transform representations through a sequence of functions. It accumulates contributions from many components into a shared vector that flows through the network. This vector is the residual stream.
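To make the accumulation concrete, here is a minimal sketch in PyTorch. The attention and MLP blocks are stubbed out as random linear maps and LayerNorm is omitted, so this illustrates only the bookkeeping of a residual architecture, not a real transformer: we record everything written to the stream and check that the final state equals the sum of all contributions.
import torch

torch.manual_seed(0)
d_model, n_layers = 16, 4

# Stand-ins for real attention and MLP blocks (random linear maps).
attn_blocks = [torch.nn.Linear(d_model, d_model) for _ in range(n_layers)]
mlp_blocks = [torch.nn.Linear(d_model, d_model) for _ in range(n_layers)]

x = torch.randn(d_model)        # x_0: the token embedding
contributions = [x.clone()]     # everything ever written to the stream

for layer in range(n_layers):
    attn_out = attn_blocks[layer](x)   # read the stream and compute...
    x = x + attn_out                   # ...then write (add) the result back
    contributions.append(attn_out)

    mlp_out = mlp_blocks[layer](x)
    x = x + mlp_out
    contributions.append(mlp_out)

# The layer-by-layer result equals the sum of everything ever written.
assert torch.allclose(x, torch.stack(contributions).sum(dim=0), atol=1e-5)
The final line is the whole point: the layer-by-layer loop and the sum-of-contributions view are the same computation, just bracketed differently.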
Imagine a team solving a problem on a shared whiteboard. Each team member (attention head, MLP) can read what’s on the board, do some thinking, and write their contribution. No one erases—they only add. At the end, the answer is everything on the whiteboard combined.
This is the residual stream. Components don’t talk to each other directly; they communicate through the shared workspace. Understanding what’s “on the whiteboard” at each point is the core of mechanistic interpretability.
6.3 Components as Readers and Writers
Let’s make the reading-and-writing metaphor precise.
Each component—whether an attention head or an MLP—does three things:
- Read from the residual stream (take the current vector as input)
- Compute something (apply its learned function)
- Write to the residual stream (add its output to the vector)
The “residual stream” is just the vector that carries information through the network. At any point, it contains:
- The original token embedding
- Plus everything that all previous components have written
Here’s the crucial part: components don’t talk to each other directly. Attention head 3.7 (layer 3, head 7) never sends a message directly to MLP 5. Instead:
- Head 3.7 writes something to the residual stream
- MLP 5 reads from the residual stream
- If MLP 5 uses information from head 3.7, it’s because that information is sitting in the residual stream
The residual stream is a communication channel. It’s the only way components can interact.
Think of the residual stream as a shared whiteboard in a meeting room:
┌──────────────────────────────────────────────────────┐
│                  📋 RESIDUAL STREAM                   │
│               (The Shared Whiteboard)                 │
│                                                      │
│   "Paris" + "capital" + "France" + "answer needed"   │
│                                                      │
└──────────────────────────────────────────────────────┘
     ↑ write      ↑ write      ↑ write      ↓ read
     │            │            │            │
┌────┴───┐   ┌────┴───┐   ┌────┴───┐   ┌────┴───┐
│Head 1.3│   │Head 4.2│   │ MLP 6  │   │Head 9.1│
│"I found│   │"This is│   │"France │   │"Let me │
│ Paris" │   │capital"│   │→Paris" │   │read..."│
└────────┘   └────────┘   └────────┘   └────────┘
- Each component reads what’s on the whiteboard
- Each component adds its contribution (never erases!)
- The final answer is the sum of everything written
6.4 Why This Matters
This perspective has profound implications for interpretability.
6.4.1 Decomposition is Possible
Because the final output is a sum of contributions, we can ask: “How much did each component contribute?”
The output logits (before softmax) for predicting the next token are computed by multiplying the final residual stream by an unembedding matrix:
logits = x_L @ W_unembed
But \(x_L\) is a sum:
logits = (x_0 + head_1_out + head_2_out + ... + mlp_1_out + ...) @ W_unembed
Matrix multiplication distributes over addition:
logits = x_0 @ W_unembed + head_1_out @ W_unembed + head_2_out @ W_unembed + ...
Each term is the contribution of that component to the final prediction. We can literally add up how much each attention head and each MLP contributed to the probability of any given token.
This is the foundation of attribution methods in mechanistic interpretability.
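Here is a tiny sanity check of that algebra, using toy tensors with made-up shapes rather than a real model: unembedding the summed residual stream gives exactly the sum of the per-component logit contributions.
import torch

torch.manual_seed(0)
d_model, d_vocab, n_components = 16, 50, 7

# Hypothetical per-component writes: the embedding plus what each head/MLP added.
component_outs = torch.randn(n_components, d_model)
W_unembed = torch.randn(d_model, d_vocab)

# Logits computed from the full residual stream...
x_final = component_outs.sum(dim=0)
logits = x_final @ W_unembed

# ...equal the sum of each component's individual logit contribution.
per_component_logits = component_outs @ W_unembed   # [n_components, d_vocab]
assert torch.allclose(logits, per_component_logits.sum(dim=0), atol=1e-4)

# Attribution for one token: how much each component pushed that logit.
token_id = 3
print(per_component_logits[:, token_id])
In a real model, a final LayerNorm sits between the residual stream and the unembedding, so attribution tooling has to fold in or approximate that normalization; the additive structure is what makes the bookkeeping possible at all.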
6.4.2 Path Analysis
The residual stream creates a notion of paths through the network.
Consider: head 5.3 writes to the residual stream at layer 5. MLP 8 reads from the residual stream at layer 8. There’s a “path” from head 5.3 to MLP 8—the information flows through the residual stream.
We can think of the transformer as computing many paths simultaneously:
- Direct path: embedding → unembedding (straight through)
- Single-component paths: embedding → head 2.1 → unembedding
- Multi-component paths: embedding → head 1.0 → head 3.4 → MLP 7 → unembedding
- And exponentially many more…
The final output is the sum of contributions from all paths.
A model with L layers, H heads per layer, and an MLP in each layer has on the order of \((H+2)^L\) paths per token position: at each layer, a path can route through one of the H attention heads, through the MLP, or straight through the skip connection. For GPT-2 small (12 layers, 12 heads), that’s roughly \(14^{12} \approx 6 \times 10^{13}\) paths. In practice, most paths contribute negligibly, and the art of interpretation is finding the ones that matter.
6.4.3 Composition Becomes Visible
The residual stream is how attention heads compose with each other.
Consider the famous induction head circuit (explored fully in Chapter 13). An induction head performs in-context learning: if it sees “…Harry Potter… Harry” it predicts “Potter” because it saw that pattern before.
This requires two heads working together:
- A previous token head (in an early layer) that copies information about what token came before each position
- An induction head (in a later layer) that looks for previous occurrences of the current token and retrieves what followed
Here’s how they compose through the residual stream:
- The previous token head, operating at the position of “Potter” (say position 15), writes “the token before me was ‘Harry’” into that position’s residual stream
- This information sits in the stream at position 15
- Much later, the induction head at the current “Harry” (say position 100) forms a query asking “which positions were preceded by ‘Harry’?”
- That query matches the key at position 15, so the head attends back to it
- Its value vector carries the identity of the token at position 15 (“Potter”)
- It writes this to the current position’s residual stream, pushing the prediction toward “Potter”
Without the residual stream perspective, this looks like mysterious layer-to-layer processing. With it, we see two components communicating through a shared workspace. The circuit becomes visible.
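You can already catch a glimpse of this composition with TransformerLens. The sketch below is our own rough diagnostic (the 0.4 cutoff is arbitrary): it feeds the model a random token sequence repeated twice and scores each head by how much attention it pays from each token back to the position just after that token’s first occurrence, which is exactly the induction pattern.
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

# A random token sequence repeated twice, with a BOS token in front. On the
# second copy, an induction head should attend from each token back to the
# position just after that token's first occurrence.
seq_len = 50
rand = torch.randint(1000, 10000, (1, seq_len))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=1)

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, query_pos, key_pos]
    # Attention from position i back to position i - (seq_len - 1), i.e. the
    # token after the query token's previous occurrence.
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores = stripe.mean(dim=-1)
    for head in range(model.cfg.n_heads):
        if scores[head] > 0.4:  # arbitrary cutoff, just for the printout
            print(f"Head {layer}.{head} attends like an induction head "
                  f"(score {scores[head]:.2f})")
On GPT-2 small this typically flags a handful of heads in the middle layers; Chapter 13 dissects how the previous token head and the induction head implement the behavior together.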
6.5 Reading the Stream: The Logit Lens
If the residual stream accumulates information toward a final prediction, can we peek at it midway? Can we see what the model is “thinking” at intermediate layers?
Yes. The technique is called the logit lens.
The idea is simple: at any layer, take the current residual stream vector and project it to vocabulary space as if it were the final layer. Pretend you’re at the end of the network and decode what token would be predicted.
What we see is fascinating: predictions refine through layers.
For a prompt like “The Eiffel Tower is located in”, early layers might vaguely predict location-related tokens. Middle layers might narrow to cities. Late layers converge on “Paris.”
The residual stream tells a story of progressive refinement. Early components write rough information; later components refine it. The logit lens lets us watch this process unfold.
The basic logit lens can be noisy because early residual stream representations aren’t aligned with the unembedding matrix. The tuned lens improves on this by learning a small affine transformation for each layer that better maps intermediate representations to vocabulary space. Same concept, cleaner signal.
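Schematically, the tuned lens just inserts one learned affine map before the unembedding. The sketch below uses placeholder shapes and an untrained translator purely to show where that map sits; fitting the translators to real intermediate activations is the part the tuned lens paper actually contributes.
import torch

# Placeholder shapes and tensors, standing in for a real model's weights
# and a cached intermediate residual stream vector.
d_model, d_vocab, n_layers = 768, 50257, 12
W_U = torch.randn(d_model, d_vocab)    # unembedding matrix
resid_mid = torch.randn(d_model)       # residual stream at some layer

# Logit lens: unembed the intermediate state directly.
logit_lens_logits = resid_mid @ W_U

# Tuned lens: first pass the state through a learned per-layer affine
# "translator" (training omitted here), then unembed.
translators = torch.nn.ModuleList(
    torch.nn.Linear(d_model, d_model) for _ in range(n_layers)
)
tuned_lens_logits = translators[5](resid_mid) @ W_U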
6.6 What the Stream Contains
So what’s actually in the residual stream at any point?
The answer isn’t simple. The stream contains:
- Token identity: What token is at this position
- Positional information: Where in the sequence this token appears
- Contextual information: What the attention heads have gathered from other positions
- Semantic features: Concepts, relationships, patterns the model has detected
- Predictions in progress: Proto-predictions that will be refined by later layers
All of this is packed into a vector of perhaps 768 or 4096 dimensions. Different “features” occupy different directions in this space. We’ll explore this geometric perspective fully in the next chapter.
For now, the key point is: the residual stream is dense with information. It’s not just a pipeline carrying data forward—it’s a rich representational space where components store and retrieve information.
6.7 A Concrete Example
Let’s trace what might happen for a single token in the prompt “The capital of France is”.
After embedding: The residual stream at position 5 (“France”) contains the France embedding—a 768-dimensional vector that encodes properties of the token.
After layer 1 attention: Attention heads notice “capital of” precedes this token. They write information encoding “this is the object of ‘capital of’” to the stream.
After layer 1 MLP: The MLP recognizes this pattern and might strengthen features related to “country” or “nation.”
After layer 4 attention: Heads attend to the full context and write information encoding “we’re asking about a capital city.”
After layer 6 MLP: The MLP retrieves associated knowledge, strengthening the “Paris” feature direction.
After layer 8: At this point, the logit lens might already show “Paris” as a top prediction.
Final layers: Refine the representation, handling edge cases, strengthening the prediction.
The residual stream at position 5 has transformed from “France token embedding” to “France token embedding + context about capital cities + knowledge retrieval activating Paris + prediction sharpening.”
Each layer’s contribution is added to what came before. Nothing is overwritten—information accumulates.
6.8 Implications for Interpretation
The residual stream perspective changes how we approach mechanistic interpretability.
6.8.1 Localization
We can ask: “Which components are responsible for this behavior?” By looking at what each component writes to the stream, we can identify the attention heads and MLPs that produce specific predictions.
6.8.2 Circuits
We can trace information flow: “How does information get from the input to the output?” The residual stream makes explicit the paths through which computation happens.
6.8.3 Interventions
We can test hypotheses: “What if we remove this component’s contribution?” By subtracting a component’s output from the residual stream, we can see if our understanding of its role is correct.
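For example, with TransformerLens we can zero out what a single attention head writes and watch how the prediction shifts. Below is a minimal sketch that zero-ablates an arbitrarily chosen head (9.6); zero-ablation is the crudest such intervention, and later chapters use gentler ones like mean-ablation and activation patching.
import transformer_lens as tl
from transformer_lens import utils

model = tl.HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("The Eiffel Tower is located in")
paris = model.to_single_token(" Paris")

def zero_head(value, hook, head_index=6):
    # value: [batch, pos, head, d_head] -- this layer's per-head outputs
    value[:, :, head_index, :] = 0.0
    return value

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", 9), zero_head)],  # zero head 9.6
)

print("Paris logit (clean):  ", clean_logits[0, -1, paris].item())
print("Paris logit (ablated):", ablated_logits[0, -1, paris].item())
If the “Paris” logit barely moves, head 9.6 probably isn’t load-bearing for this prediction; if it drops sharply, we have localized part of the circuit.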
6.8.4 Decomposition
We can break down predictions: “Why did the model predict this token?” By decomposing the logits into per-component contributions, we can attribute the prediction to specific parts of the network.
The residual stream is also a bottleneck. With only 768 or 4096 dimensions, it must represent everything the model knows and is computing. This constraint forces superposition—the representation of more features than there are dimensions. We’ll explore this phenomenon in Arc II.
6.9 From Stream to Geometry
We’ve established what the residual stream is and why it matters. But we’ve been vague about what’s in it—talking about “features” and “directions” without precision.
The residual stream is a vector space. At each position, for each layer, we have a point in \(\mathbb{R}^{768}\) (or whatever the model dimension is). The “features” we’ve mentioned are directions in this space. The “predictions” are projections onto vocabulary directions.
To understand what the model represents, we need to understand this geometry. What do the directions mean? How are features organized? How can we find them?
This geometric perspective is the subject of the next chapter. Once we understand activations as geometry, we’ll be ready to tackle the core questions: What are features? How do they compose? And why is interpretation both possible and hard?
6.10 Mini Case Study: Watching a Prediction Form
Let’s apply the residual stream perspective to a real example. This is your first taste of the analysis workflow—we’ll go deeper in Arc III.
The task: GPT-2 predicts the next token for “The Eiffel Tower is located in”. We want to see when the model commits to predicting “Paris.”
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
prompt = "The Eiffel Tower is located in"
tokens = model.to_tokens(prompt)

# Get the token ID for " Paris"
paris_token = model.to_single_token(" Paris")

# Run with cache to get all intermediate states
logits, cache = model.run_with_cache(tokens)

# Apply logit lens at each layer
print("Layer | Logit for 'Paris' | Top prediction")
print("-" * 50)
for layer in range(model.cfg.n_layers):
    # Get the residual stream after this layer, at the last position
    # (where the next token is predicted)
    resid = cache["resid_post", layer][0, -1, :]
    # Apply the final layer norm, then project to vocabulary (logit lens)
    layer_logits = model.ln_final(resid) @ model.W_U
    # Get Paris logit and top prediction
    paris_logit = layer_logits[paris_token].item()
    top_token = layer_logits.argmax().item()
    top_word = model.tokenizer.decode(top_token)
    print(f" {layer:2d} | {paris_logit:+.2f} | {top_word}")

Typical output (approximate):
Layer | Logit for 'Paris' | Top prediction
--------------------------------------------------
0 | -2.34 | the
1 | -1.89 | the
2 | -0.45 | France
3 | +1.23 | France
4 | +3.45 | Paris
5 | +5.67 | Paris
...
11 | +12.34 | Paris
What we see:
- Early layers (0-2): The model predicts common words. It hasn’t “decided” yet.
- Middle layers (3-4): The model starts predicting relevant words (“France”). Information is being retrieved.
- Later layers (5-11): The model commits to “Paris” with increasing confidence.
The prediction doesn’t appear instantly—it develops through layers. Each layer’s components read from the residual stream, compute something, and write back. We can watch this process unfold.
Questions this raises (that we’ll answer in later chapters):
- Which components are responsible for the jump at layer 4?
- What information is being retrieved, and from where?
- Could we intervene to change the prediction?
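As a first rough pass at the first of these questions, the same cached run lets us measure how much each layer’s attention and MLP writes push the “Paris” logit directly. This sketch ignores the final layer norm, so treat the numbers as approximate.
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("The Eiffel Tower is located in")
paris_token = model.to_single_token(" Paris")
_, cache = model.run_with_cache(tokens)

# Direct contribution of each layer's attention and MLP writes to the
# "Paris" logit: dot product with the Paris unembedding direction.
paris_dir = model.W_U[:, paris_token]
for layer in range(model.cfg.n_layers):
    attn_write = cache["attn_out", layer][0, -1]
    mlp_write = cache["mlp_out", layer][0, -1]
    print(f"Layer {layer:2d}: attn {attn_write @ paris_dir:+.2f}, "
          f"mlp {mlp_write @ paris_dir:+.2f}")
Drilling down from whole layers to individual heads, and handling the normalization properly, is exactly what the attribution techniques of Arc III provide.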
This is a taste of the residual stream in action. We’ll develop these techniques systematically in Arc III.
6.11 Looking Ahead
The residual stream reframes the transformer from “sequential layer processing” to “parallel component contributions.” This isn’t just a pedagogical shift—it’s the foundation for mechanistic interpretability.
With this framework, we can:
- Decompose predictions into component contributions
- Trace information flow through paths
- Identify circuits by their reading and writing patterns
- Test hypotheses by intervening on specific components
In the next chapter, we’ll zoom in on the geometry of the residual stream. What does it mean for features to be “directions”? How are concepts arranged in this high-dimensional space? And what tools do we have for navigating it?
The residual stream gives us the architecture. Geometry will give us the language.
6.12 Key Takeaways
┌────────────────────────────────────────────────────────────┐
│                    THE RESIDUAL STREAM                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ WHAT IT IS:  A shared vector that flows through the        │
│              entire network, accumulating information      │
│                                                            │
│ KEY INSIGHT: Output = embedding + Σ(all contributions)     │
│              Everything is ADDITIVE, not sequential        │
│                                                            │
│ WHY IT MATTERS:                                            │
│   ✓ Enables decomposition (who contributed what?)          │
│   ✓ Enables intervention (what if we change this?)         │
│   ✓ Enables path tracing (how did info flow?)              │
│                                                            │
│ MENTAL MODEL: Shared whiteboard - components read          │
│               from it and write to it, never erasing       │
│                                                            │
└────────────────────────────────────────────────────────────┘
6.13 Check Your Understanding
Question: Why can the final logits be decomposed into per-component contributions?
Answer: Because the residual stream uses addition. Each component’s output is added to the stream, and matrix multiplication (used for the final unembedding) distributes over addition. So:
logits = (embedding + head1 + head2 + ... + mlp1 + ...) @ W_unembed
becomes:
logits = embedding @ W_unembed + head1 @ W_unembed + head2 @ W_unembed + ...
Each term is one component’s contribution.
Question: Head 5.3 writes at layer 5 and MLP 8 reads at layer 8. How can MLP 8 use Head 5.3’s output?
Answer: They communicate through the residual stream. Head 5.3 writes its output (a vector) by adding it to the stream. That information persists in the stream. When MLP 8 reads from the stream three layers later, Head 5.3’s contribution is still there as part of the accumulated sum. Components never communicate directly—only through this shared workspace.
Question: What does the logit lens show, and why does it work?
Answer: The logit lens shows what the model would predict if we stopped at an intermediate layer and decoded immediately. It works, approximately, because intermediate residual stream states already live in roughly the same representational space that the final unembedding reads from. Early layers show vague predictions; later layers show refined, confident predictions. This reveals the progressive refinement of information as it flows through the network.
6.14 Further Reading
A Mathematical Framework for Transformer Circuits — Anthropic: The foundational paper introducing the residual stream perspective and path analysis.
In-context Learning and Induction Heads — Anthropic: Deep dive into induction heads, the canonical example of head composition through the residual stream.
AXRP Episode 19: Mechanistic Interpretability with Neel Nanda — AXRP: Neel Nanda explains the residual stream, composition, and interpretability techniques.
Logit Lens — LessWrong: The original post introducing the logit lens technique for reading intermediate residual stream states.
Eliciting Latent Predictions from Transformers with the Tuned Lens — arXiv:2303.08112: Improving on the logit lens with learned per-layer transformations.
Exploring the Residual Stream of Transformers — arXiv:2312.12141: Recent work on interpreting residual stream contributions to predictions.