flowchart LR
X0["x₀<br/>(embedding)"] --> X1["x₁ = x₀ + attn₁ + mlp₁"]
X1 --> X2["x₂ = x₁ + attn₂ + mlp₂"]
X2 --> X3["..."]
X3 --> XL["xₗ<br/>(final)"]
H1["Attn 1.0"] -.-> X1
H2["Attn 1.1"] -.-> X1
M1["MLP 1"] -.-> X1
H3["Attn 2.0"] -.-> X2
M2["MLP 2"] -.-> X2
6 The Residual Stream
A shared workspace for computation
- Why the “residual stream” perspective changes everything for interpretability
- How components read from and write to a shared workspace
- Why the additive structure enables decomposition
- The logit lens technique for reading intermediate predictions
Required: Chapter 2: Transformers — understanding attention and MLP layers
From Chapter 2, recall:
- What are Q, K, V in attention? (Query, Key, Value — for matching and retrieving information)
- What do MLPs do? (Store and retrieve knowledge via pattern-value associations)
- What’s the key insight about transformers? (They’re matrix multiplication machines)
6.1 A Shift in Perspective
In the previous chapter, we saw that transformers are matrix multiplication machines. Attention routes information between positions; MLPs store and retrieve knowledge. Layer by layer, the network transforms token embeddings into predictions.
Now we ask: is “layer-by-layer processing” really the right way to think about this? There is a different way to describe what’s happening, and it turns out to be far more useful for mechanistic interpretability.
Instead of thinking “layer 1 processes the input, then layer 2 processes layer 1’s output, then layer 3 processes layer 2’s output…”, think of it this way:
All components—every attention head and every MLP—read from and write to a single shared workspace called the residual stream.
This isn’t just a metaphor. It’s a precise description of what the architecture computes. And it fundamentally changes how we approach interpretation.
6.2 The Architecture, Revisited
Let’s look at what actually happens in a transformer forward pass. After embedding, each token is represented as a vector. Let’s call this initial vector \(x_0\).
Now, what does “layer 1” do? In the standard telling: it takes \(x_0\) as input and produces some output \(x_1\).
But look more carefully at the equations:
attention_out = Attention(x_0)
x_0.5 = x_0 + attention_out
mlp_out = MLP(x_0.5)
x_1 = x_0.5 + mlp_out
See those plus signs? The attention output isn’t replacing \(x_0\)—it’s being added to it. Same for the MLP output.
Expanding this out:
x_1 = x_0 + attention_out + mlp_out
The vector after layer 1 is the sum of the original embedding plus contributions from attention plus contributions from the MLP.
This continues through all layers:
x_L = x_0 + Σ(attention contributions) + Σ(MLP contributions)
The final representation is the original embedding plus accumulated contributions from every attention head and every MLP across the entire network.
The transformer doesn’t transform representations through a sequence of functions. It accumulates contributions from many components into a shared vector that flows through the network. This vector is the residual stream.
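To make the accumulation concrete, here is a minimal sketch in PyTorch. The attention and MLP blocks are stubbed out as random linear maps and LayerNorm is omitted, so this illustrates only the bookkeeping of a residual architecture, not a real transformer: we record everything written to the stream and check that the final state equals the sum of all contributions.
import torch

torch.manual_seed(0)
d_model, n_layers = 16, 4

# Stand-ins for real attention and MLP blocks (random linear maps).
attn_blocks = [torch.nn.Linear(d_model, d_model) for _ in range(n_layers)]
mlp_blocks = [torch.nn.Linear(d_model, d_model) for _ in range(n_layers)]

x = torch.randn(d_model)        # x_0: the token embedding
contributions = [x.clone()]     # everything ever written to the stream

for layer in range(n_layers):
    attn_out = attn_blocks[layer](x)   # read the stream and compute...
    x = x + attn_out                   # ...then write (add) the result back
    contributions.append(attn_out)

    mlp_out = mlp_blocks[layer](x)
    x = x + mlp_out
    contributions.append(mlp_out)

# The layer-by-layer result equals the sum of everything ever written.
assert torch.allclose(x, torch.stack(contributions).sum(dim=0), atol=1e-5)
The final line is the whole point: the layer-by-layer loop and the sum-of-contributions view are the same computation, just bracketed differently.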
Imagine a team solving a problem on a shared whiteboard. Each team member (attention head, MLP) can read what’s on the board, do some thinking, and write their contribution. No one erases—they only add. At the end, the answer is everything on the whiteboard combined.
This is the residual stream. Components don’t talk to each other directly; they communicate through the shared workspace. Understanding what’s “on the whiteboard” at each point is the core of mechanistic interpretability.
6.3 Components as Readers and Writers
Let’s make the reading-and-writing metaphor precise.
Each component—whether an attention head or an MLP—does three things:
- Read from the residual stream (take the current vector as input)
- Compute something (apply its learned function)
- Write to the residual stream (add its output to the vector)
The “residual stream” is just the vector that carries information through the network. At any point, it contains:
- The original token embedding
- Plus everything that all previous components have written
Here’s the crucial part: components don’t talk to each other directly. Attention head 3.7 (layer 3, head 7) never sends a message directly to MLP 5. Instead:
- Head 3.7 writes something to the residual stream
- MLP 5 reads from the residual stream
- If MLP 5 uses information from head 3.7, it’s because that information is sitting in the residual stream
The residual stream is a communication channel. It’s the only way components can interact.
Think of the residual stream as a shared whiteboard in a meeting room:
┌──────────────────────────────────────────────────────┐
│                  📋 RESIDUAL STREAM                   │
│               (The Shared Whiteboard)                 │
│                                                      │
│   "Paris" + "capital" + "France" + "answer needed"   │
│                                                      │
└──────────────────────────────────────────────────────┘
     ↑ write      ↑ write      ↑ write      ↓ read
     │            │            │            │
┌────┴───┐   ┌────┴───┐   ┌────┴───┐   ┌────┴───┐
│Head 1.3│   │Head 4.2│   │ MLP 6  │   │Head 9.1│
│"I found│   │"This is│   │"France │   │"Let me │
│ Paris" │   │capital"│   │→Paris" │   │read..."│
└────────┘   └────────┘   └────────┘   └────────┘
- Each component reads what’s on the whiteboard
- Each component adds its contribution (never erases!)
- The final answer is the sum of everything written
6.4 Why This Matters
This perspective has profound implications for interpretability.
6.4.1 Decomposition is Possible
Because the final output is a sum of contributions, we can ask: “How much did each component contribute?”
The output logits (before softmax) for predicting the next token are computed by multiplying the final residual stream by an unembedding matrix:
logits = x_L @ W_unembed
But \(x_L\) is a sum:
logits = (x_0 + head_1_out + head_2_out + ... + mlp_1_out + ...) @ W_unembed
Matrix multiplication distributes over addition:
logits = x_0 @ W_unembed + head_1_out @ W_unembed + head_2_out @ W_unembed + ...
Each term is the contribution of that component to the final prediction. We can literally add up how much each attention head and each MLP contributed to the probability of any given token.
This is the foundation of attribution methods in mechanistic interpretability.
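Here is a tiny sanity check of that algebra, using toy tensors with made-up shapes rather than a real model: unembedding the summed residual stream gives exactly the sum of the per-component logit contributions.
import torch

torch.manual_seed(0)
d_model, d_vocab, n_components = 16, 50, 7

# Hypothetical per-component writes: the embedding plus what each head/MLP added.
component_outs = torch.randn(n_components, d_model)
W_unembed = torch.randn(d_model, d_vocab)

# Logits computed from the full residual stream...
x_final = component_outs.sum(dim=0)
logits = x_final @ W_unembed

# ...equal the sum of each component's individual logit contribution.
per_component_logits = component_outs @ W_unembed   # [n_components, d_vocab]
assert torch.allclose(logits, per_component_logits.sum(dim=0), atol=1e-4)

# Attribution for one token: how much each component pushed that logit.
token_id = 3
print(per_component_logits[:, token_id])
In a real model, a final LayerNorm sits between the residual stream and the unembedding, so attribution tooling has to fold in or approximate that normalization; the additive structure is what makes the bookkeeping possible at all.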
6.4.2 Path Analysis
The residual stream creates a notion of paths through the network.
Consider: head 5.3 writes to the residual stream at layer 5. MLP 8 reads from the residual stream at layer 8. There’s a “path” from head 5.3 to MLP 8—the information flows through the residual stream.
We can think of the transformer as computing many paths simultaneously:
- Direct path: embedding → unembedding (straight through)
- Single-component paths: embedding → head 2.1 → unembedding
- Multi-component paths: embedding → head 1.0 → head 3.4 → MLP 7 → unembedding
- And exponentially many more…
The final output is the sum of contributions from all paths.
A model with L layers, H heads per layer, and an MLP in each layer has on the order of \((H+2)^L\) paths per token position: at each layer, a path can route through one of the H attention heads, through the MLP, or straight through the skip connection. For GPT-2 small (12 layers, 12 heads), that’s roughly \(14^{12} \approx 6 \times 10^{13}\) paths. In practice, most paths contribute negligibly, and the art of interpretation is finding the ones that matter.
6.4.3 Composition Becomes Visible
The residual stream is how attention heads compose with each other.
Consider the famous induction head circuit (explored fully in Chapter 13). An induction head performs in-context learning: if it sees “…Harry Potter… Harry” it predicts “Potter” because it saw that pattern before.
This requires two heads working together:
- A previous token head (in an early layer) that copies information about what token came before each position
- An induction head (in a later layer) that looks for previous occurrences of the current token and retrieves what followed
Here’s how they compose through the residual stream:
- The previous token head, operating at the position of “Potter” (say position 15), writes “the token before me was ‘Harry’” into that position’s residual stream
- This information sits in the stream at position 15
- Much later, the induction head at the current “Harry” (say position 100) forms a query asking “which positions were preceded by ‘Harry’?”
- That query matches the key at position 15, so the head attends back to it
- Its value vector carries the identity of the token at position 15 (“Potter”)
- It writes this to the current position’s residual stream, pushing the prediction toward “Potter”
Without the residual stream perspective, this looks like mysterious layer-to-layer processing. With it, we see two components communicating through a shared workspace. The circuit becomes visible.
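You can already catch a glimpse of this composition with TransformerLens. The sketch below is our own rough diagnostic (the 0.4 cutoff is arbitrary): it feeds the model a random token sequence repeated twice and scores each head by how much attention it pays from each token back to the position just after that token’s first occurrence, which is exactly the induction pattern.
import torch
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")

# A random token sequence repeated twice, with a BOS token in front. On the
# second copy, an induction head should attend from each token back to the
# position just after that token's first occurrence.
seq_len = 50
rand = torch.randint(1000, 10000, (1, seq_len))
bos = torch.tensor([[model.tokenizer.bos_token_id]])
tokens = torch.cat([bos, rand, rand], dim=1)

_, cache = model.run_with_cache(tokens)

for layer in range(model.cfg.n_layers):
    pattern = cache["pattern", layer][0]  # [head, query_pos, key_pos]
    # Attention from position i back to position i - (seq_len - 1), i.e. the
    # token after the query token's previous occurrence.
    stripe = pattern.diagonal(offset=-(seq_len - 1), dim1=-2, dim2=-1)
    scores = stripe.mean(dim=-1)
    for head in range(model.cfg.n_heads):
        if scores[head] > 0.4:  # arbitrary cutoff, just for the printout
            print(f"Head {layer}.{head} attends like an induction head "
                  f"(score {scores[head]:.2f})")
On GPT-2 small this typically flags a handful of heads in the middle layers; Chapter 13 dissects how the previous token head and the induction head implement the behavior together.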
6.5 Reading the Stream: The Logit Lens
If the residual stream accumulates information toward a final prediction, can we peek at it midway? Can we see what the model is “thinking” at intermediate layers?
Yes. The technique is called the logit lens.
The idea is simple: at any layer, take the current residual stream vector and project it to vocabulary space as if it were the final layer. Pretend you’re at the end of the network and decode what token would be predicted.
What we see is fascinating: predictions refine through layers.
For a prompt like “The Eiffel Tower is located in”, early layers might vaguely predict location-related tokens. Middle layers might narrow to cities. Late layers converge on “Paris.”
The residual stream tells a story of progressive refinement. Early components write rough information; later components refine it. The logit lens lets us watch this process unfold.
The basic logit lens can be noisy because early residual stream representations aren’t aligned with the unembedding matrix. The tuned lens improves on this by learning a small affine transformation for each layer that better maps intermediate representations to vocabulary space. Same concept, cleaner signal.
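Schematically, the tuned lens just inserts one learned affine map before the unembedding. The sketch below uses placeholder shapes and an untrained translator purely to show where that map sits; fitting the translators to real intermediate activations is the part the tuned lens paper actually contributes.
import torch

# Placeholder shapes and tensors, standing in for a real model's weights
# and a cached intermediate residual stream vector.
d_model, d_vocab, n_layers = 768, 50257, 12
W_U = torch.randn(d_model, d_vocab)    # unembedding matrix
resid_mid = torch.randn(d_model)       # residual stream at some layer

# Logit lens: unembed the intermediate state directly.
logit_lens_logits = resid_mid @ W_U

# Tuned lens: first pass the state through a learned per-layer affine
# "translator" (training omitted here), then unembed.
translators = torch.nn.ModuleList(
    torch.nn.Linear(d_model, d_model) for _ in range(n_layers)
)
tuned_lens_logits = translators[5](resid_mid) @ W_U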
6.6 What the Stream Contains
So what’s actually in the residual stream at any point?
The answer isn’t simple. The stream contains:
- Token identity: What token is at this position
- Positional information: Where in the sequence this token appears
- Contextual information: What the attention heads have gathered from other positions
- Semantic features: Concepts, relationships, patterns the model has detected
- Predictions in progress: Proto-predictions that will be refined by later layers
All of this is packed into a vector of perhaps 768 or 4096 dimensions. Different “features” occupy different directions in this space. We’ll explore this geometric perspective fully in the next chapter.
For now, the key point is: the residual stream is dense with information. It’s not just a pipeline carrying data forward—it’s a rich representational space where components store and retrieve information.
6.7 A Concrete Example
Let’s trace what might happen for a single token in the prompt “The capital of France is”.
After embedding: The residual stream at position 5 (“France”) contains the France embedding—a 768-dimensional vector that encodes properties of the token.
After layer 1 attention: Attention heads notice “capital of” precedes this token. They write information encoding “this is the object of ‘capital of’” to the stream.
After layer 1 MLP: The MLP recognizes this pattern and might strengthen features related to “country” or “nation.”
After layer 4 attention: Heads attend to the full context and write information encoding “we’re asking about a capital city.”
After layer 6 MLP: The MLP retrieves associated knowledge, strengthening the “Paris” feature direction.
After layer 8: At this point, the logit lens might already show “Paris” as a top prediction.
Final layers: Refine the representation, handling edge cases, strengthening the prediction.
The residual stream at position 5 has transformed from “France token embedding” to “France token embedding + context about capital cities + knowledge retrieval activating Paris + prediction sharpening.”
Each layer’s contribution is added to what came before. Nothing is overwritten—information accumulates.
6.8 Implications for Interpretation
The residual stream perspective changes how we approach mechanistic interpretability.
6.8.1 Localization
We can ask: “Which components are responsible for this behavior?” By looking at what each component writes to the stream, we can identify the attention heads and MLPs that produce specific predictions.
6.8.2 Circuits
We can trace information flow: “How does information get from the input to the output?” The residual stream makes explicit the paths through which computation happens.
6.8.3 Interventions
We can test hypotheses: “What if we remove this component’s contribution?” By subtracting a component’s output from the residual stream, we can see if our understanding of its role is correct.
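For example, with TransformerLens we can zero out what a single attention head writes and watch how the prediction shifts. Below is a minimal sketch that zero-ablates an arbitrarily chosen head (9.6); zero-ablation is the crudest such intervention, and later chapters use gentler ones like mean-ablation and activation patching.
import transformer_lens as tl
from transformer_lens import utils

model = tl.HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("The Eiffel Tower is located in")
paris = model.to_single_token(" Paris")

def zero_head(value, hook, head_index=6):
    # value: [batch, pos, head, d_head] -- this layer's per-head outputs
    value[:, :, head_index, :] = 0.0
    return value

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", 9), zero_head)],  # zero head 9.6
)

print("Paris logit (clean):  ", clean_logits[0, -1, paris].item())
print("Paris logit (ablated):", ablated_logits[0, -1, paris].item())
If the “Paris” logit barely moves, head 9.6 probably isn’t load-bearing for this prediction; if it drops sharply, we have localized part of the circuit.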
6.8.4 Decomposition
We can break down predictions: “Why did the model predict this token?” By decomposing the logits into per-component contributions, we can attribute the prediction to specific parts of the network.
The residual stream is also a bottleneck. With only 768 or 4096 dimensions, it must represent everything the model knows and is computing. This constraint forces superposition—the representation of more features than there are dimensions. We’ll explore this phenomenon in Arc II.
6.9 From Stream to Geometry
We’ve established what the residual stream is and why it matters. But we’ve been vague about what’s in it—talking about “features” and “directions” without precision.
The residual stream is a vector space. At each position, for each layer, we have a point in \(\mathbb{R}^{768}\) (or whatever the model dimension is). The “features” we’ve mentioned are directions in this space. The “predictions” are projections onto vocabulary directions.
To understand what the model represents, we need to understand this geometry. What do the directions mean? How are features organized? How can we find them?
This geometric perspective is the subject of the next chapter. Once we understand activations as geometry, we’ll be ready to tackle the core questions: What are features? How do they compose? And why is interpretation both possible and hard?
6.10 Mini Case Study: Watching a Prediction Form
Let’s apply the residual stream perspective to a real example. This is your first taste of the analysis workflow—we’ll go deeper in Arc III.
The task: GPT-2 predicts the next token for “The Eiffel Tower is located in”. We want to see when the model commits to predicting “Paris.”
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
prompt = "The Eiffel Tower is located in"
tokens = model.to_tokens(prompt)

# Get the token ID for " Paris"
paris_token = model.to_single_token(" Paris")

# Run with cache to get all intermediate states
logits, cache = model.run_with_cache(tokens)

# Apply logit lens at each layer
print("Layer | Logit for 'Paris' | Top prediction")
print("-" * 50)
for layer in range(model.cfg.n_layers):
    # Get the residual stream after this layer, at the last position
    # (where the next token is predicted)
    resid = cache["resid_post", layer][0, -1, :]
    # Apply the final layer norm, then project to vocabulary (logit lens)
    layer_logits = model.ln_final(resid) @ model.W_U
    # Get Paris logit and top prediction
    paris_logit = layer_logits[paris_token].item()
    top_token = layer_logits.argmax().item()
    top_word = model.tokenizer.decode(top_token)
    print(f" {layer:2d} | {paris_logit:+.2f} | {top_word}")

Typical output (approximate):
Layer | Logit for 'Paris' | Top prediction
--------------------------------------------------
0 | -2.34 | the
1 | -1.89 | the
2 | -0.45 | France
3 | +1.23 | France
4 | +3.45 | Paris
5 | +5.67 | Paris
...
11 | +12.34 | Paris
What we see:
- Early layers (0-2): The model predicts common words. It hasn’t “decided” yet.
- Middle layers (3-4): The model starts predicting relevant words (“France”). Information is being retrieved.
- Later layers (5-11): The model commits to “Paris” with increasing confidence.
The prediction doesn’t appear instantly—it develops through layers. Each layer’s components read from the residual stream, compute something, and write back. We can watch this process unfold.
Questions this raises (that we’ll answer in later chapters):
- Which components are responsible for the jump at layer 4?
- What information is being retrieved, and from where?
- Could we intervene to change the prediction?
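As a first rough pass at the first of these questions, the same cached run lets us measure how much each layer’s attention and MLP writes push the “Paris” logit directly. This sketch ignores the final layer norm, so treat the numbers as approximate.
import transformer_lens as tl

model = tl.HookedTransformer.from_pretrained("gpt2-small")
tokens = model.to_tokens("The Eiffel Tower is located in")
paris_token = model.to_single_token(" Paris")
_, cache = model.run_with_cache(tokens)

# Direct contribution of each layer's attention and MLP writes to the
# "Paris" logit: dot product with the Paris unembedding direction.
paris_dir = model.W_U[:, paris_token]
for layer in range(model.cfg.n_layers):
    attn_write = cache["attn_out", layer][0, -1]
    mlp_write = cache["mlp_out", layer][0, -1]
    print(f"Layer {layer:2d}: attn {attn_write @ paris_dir:+.2f}, "
          f"mlp {mlp_write @ paris_dir:+.2f}")
Drilling down from whole layers to individual heads, and handling the normalization properly, is exactly what the attribution techniques of Arc III provide.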
This is a taste of the residual stream in action. We’ll develop these techniques systematically in Arc III.
6.11 Looking Ahead
The residual stream reframes the transformer from “sequential layer processing” to “parallel component contributions.” This isn’t just a pedagogical shift—it’s the foundation for mechanistic interpretability.
With this framework, we can:
- Decompose predictions into component contributions
- Trace information flow through paths
- Identify circuits by their reading and writing patterns
- Test hypotheses by intervening on specific components
In the next chapter, we’ll zoom in on the geometry of the residual stream. What does it mean for features to be “directions”? How are concepts arranged in this high-dimensional space? And what tools do we have for navigating it?
The residual stream gives us the architecture. Geometry will give us the language.
6.12 Key Takeaways
┌────────────────────────────────────────────────────────────┐
│                    THE RESIDUAL STREAM                     │
├────────────────────────────────────────────────────────────┤
│                                                            │
│ WHAT IT IS:  A shared vector that flows through the        │
│              entire network, accumulating information      │
│                                                            │
│ KEY INSIGHT: Output = embedding + Σ(all contributions)     │
│              Everything is ADDITIVE, not sequential        │
│                                                            │
│ WHY IT MATTERS:                                            │
│   ✓ Enables decomposition (who contributed what?)          │
│   ✓ Enables intervention (what if we change this?)         │
│   ✓ Enables path tracing (how did info flow?)              │
│                                                            │
│ MENTAL MODEL: Shared whiteboard - components read          │
│               from it and write to it, never erasing       │
│                                                            │
└────────────────────────────────────────────────────────────┘
6.13 Check Your Understanding
Question: Why can the final logits be decomposed into per-component contributions?
Answer: Because the residual stream uses addition. Each component’s output is added to the stream, and matrix multiplication (used for the final unembedding) distributes over addition. So:
logits = (embedding + head1 + head2 + ... + mlp1 + ...) @ W_unembed
becomes:
logits = embedding @ W_unembed + head1 @ W_unembed + head2 @ W_unembed + ...
Each term is one component’s contribution.
Question: Head 5.3 writes at layer 5 and MLP 8 reads at layer 8. How can MLP 8 use Head 5.3’s output?
Answer: They communicate through the residual stream. Head 5.3 writes its output (a vector) by adding it to the stream. That information persists in the stream. When MLP 8 reads from the stream three layers later, Head 5.3’s contribution is still there as part of the accumulated sum. Components never communicate directly—only through this shared workspace.
Question: What does the logit lens show, and why does it work?
Answer: The logit lens shows what the model would predict if we stopped at an intermediate layer and decoded immediately. It works, approximately, because intermediate residual stream states already live in roughly the same representational space that the final unembedding reads from. Early layers show vague predictions; later layers show refined, confident predictions. This reveals the progressive refinement of information as it flows through the network.
6.14 Further Reading
A Mathematical Framework for Transformer Circuits — Anthropic: The foundational paper introducing the residual stream perspective and path analysis.
In-context Learning and Induction Heads — Anthropic: Deep dive into induction heads, the canonical example of head composition through the residual stream.
AXRP Episode 19: Mechanistic Interpretability with Neel Nanda — AXRP: Neel Nanda explains the residual stream, composition, and interpretability techniques.
Logit Lens — LessWrong: The original post introducing the logit lens technique for reading intermediate residual stream states.
Eliciting Latent Predictions from Transformers with the Tuned Lens — arXiv:2303.08112: Improving on the logit lens with learned per-layer transformations.
Exploring the Residual Stream of Transformers — arXiv:2312.12141: Recent work on interpreting residual stream contributions to predictions.