```mermaid
flowchart LR
E["Embedding<br/>+0.5"] --> SUM(("Σ"))
H1["Head 5.2<br/>+0.9"] --> SUM
H2["Head 7.4<br/>+1.8"] --> SUM
M1["MLP 8<br/>+2.3"] --> SUM
H3["Head 6.1<br/>-0.3"] --> SUM
OTHER["Other<br/>+0.1"] --> SUM
SUM --> OUT["Logit for 'Paris'<br/>= 5.3"]
```
15 Attribution
Working backwards from the output
- How the residual stream’s additive structure enables attribution
- Logit attribution: measuring per-component contributions to predictions
- Attention pattern analysis: what does each head look at?
- The limitations of attribution (correlation ≠ causation)
- Required: Chapter 3: Residual Stream — understanding additive contributions
- Recommended: Chapter 8: Circuits — understanding what we’re trying to find
From prior chapters, recall:
- The residual stream is a sum of contributions from all components (Chapter 3)
- We can decompose this sum to see what each component contributed
- SAEs extract interpretable features from superposed representations (Chapter 9)
We have features. Now we ask: Which features contributed to a specific output?
15.1 Starting from the Answer
In Chapter 9, we learned how to extract features from superposition using sparse autoencoders. We now have a vocabulary: monosemantic features that correspond to interpretable concepts.
But having features isn’t the same as understanding how they produce outputs. When the model predicts “Paris” after “The capital of France is”, which features were involved? Which attention heads mattered? Which layers did the real work?
To answer these questions, we use attribution: the technique of decomposing an output into contributions from individual components.
Attribution embodies one of Polya’s most powerful problem-solving strategies: start from the answer and work backwards. Instead of tracing forward from input to output (which is computationally expensive and conceptually overwhelming), we start from the output we want to explain and ask: “What contributed to this?”
15.2 The Residual Stream Enables Attribution
Attribution is possible because of the residual stream architecture we covered in Chapter 3.
Recall: the final output is a sum of contributions. Each attention head and each MLP adds its output to the residual stream. The final logits are computed by multiplying this accumulated stream by the unembedding matrix:
\[\text{logits} = x_L \cdot W_{\text{unembed}}\]
But \(x_L\) is a sum:
\[x_L = x_0 + \sum_{\text{heads}} h_i + \sum_{\text{MLPs}} m_j\]
Since matrix multiplication distributes over addition:
\[\text{logits} = x_0 \cdot W_{\text{unembed}} + \sum_{\text{heads}} h_i \cdot W_{\text{unembed}} + \sum_{\text{MLPs}} m_j \cdot W_{\text{unembed}}\]
Each term is a component’s contribution to the logits. We can literally measure how much each attention head and each MLP pushed the prediction toward or away from any particular token.
Because transformers accumulate contributions additively, we can decompose the output into per-component contributions. This is what makes attribution tractable.
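To make this concrete, here is a toy check of the decomposition with random tensors standing in for the component outputs (all names, shapes, and values are hypothetical; a real transformer also applies a final LayerNorm before unembedding, which this sketch omits):

```python
import torch

torch.manual_seed(0)
d_model, d_vocab = 16, 50

x0 = torch.randn(d_model)                          # embedding contribution
heads = [torch.randn(d_model) for _ in range(4)]   # attention head outputs
mlps = [torch.randn(d_model) for _ in range(2)]    # MLP outputs
W_unembed = torch.randn(d_model, d_vocab)

# The final residual stream state is the sum of all contributions.
x_L = x0 + sum(heads) + sum(mlps)

# Unembedding the summed stream (what the model actually does)...
logits_direct = x_L @ W_unembed

# ...equals unembedding each contribution and summing, because matrix
# multiplication distributes over addition.
logits_decomposed = (
    x0 @ W_unembed
    + sum(h @ W_unembed for h in heads)
    + sum(m @ W_unembed for m in mlps)
)

print(torch.allclose(logits_direct, logits_decomposed, atol=1e-5))  # True
```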
15.3 Logit Attribution
The most basic form of attribution: for a given output token, measure how much each component contributed to that token’s logit.
15.3.1 The Procedure
- Run a forward pass, caching each component’s output
- For each component (attention head or MLP), compute: \[\text{contribution}_i = h_i \cdot W_{\text{unembed}}[:, \text{token}]\] This is the component’s output projected onto the direction that increases the target token’s logit
- Rank components by contribution magnitude
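A minimal sketch of the procedure, assuming you have already cached each component’s output at the final token position (the component labels, shapes, and the caching step itself are placeholders rather than any particular library’s API):

```python
import torch

def logit_attribution(component_outputs, W_unembed, token_id):
    """Rank components by their contribution to one token's logit.

    component_outputs: dict mapping a component label (e.g. "head 7.4",
    "mlp 8") to that component's output vector at the final position.
    W_unembed: [d_model, d_vocab] unembedding matrix.
    token_id: vocabulary index of the target token.
    """
    direction = W_unembed[:, token_id]       # the "increase this token" direction
    contributions = {
        name: float(vec @ direction)         # dot product = contribution to the logit
        for name, vec in component_outputs.items()
    }
    # Sort by absolute magnitude so strong negative contributors surface too.
    return sorted(contributions.items(), key=lambda kv: -abs(kv[1]))

# Hypothetical usage with toy tensors:
torch.manual_seed(0)
d_model, d_vocab = 16, 50
cached = {
    "embed":    torch.randn(d_model),
    "head 5.2": torch.randn(d_model),
    "head 7.4": torch.randn(d_model),
    "mlp 8":    torch.randn(d_model),
}
W_U = torch.randn(d_model, d_vocab)
for name, value in logit_attribution(cached, W_U, token_id=7):
    print(f"{name:10s} {value:+.2f}")
```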
15.3.2 What It Shows
If the model predicts “Paris”, logit attribution might show:

- MLP layer 8: contributed +2.3 to “Paris” logit
- Attention head 7.4: contributed +1.8
- Attention head 5.2: contributed +0.9
- Attention head 6.1: contributed -0.3 (pushed away from “Paris”)
This immediately tells you where to look: MLP 8 and head 7.4 are the major contributors.
15.3.3 A Concrete Example
For the prompt “The capital of France is ___”:
| Component | Contribution to “Paris” |
|---|---|
| MLP Layer 8 | +2.31 |
| Head 7.4 | +1.84 |
| Head 5.2 | +0.92 |
| Embedding | +0.45 |
| Head 6.1 | -0.28 |
| … | … |
The model’s confidence in “Paris” is the sum of all contributions. We can now ask: what is MLP 8 doing? What pattern is head 7.4 attending to?
A component has high attribution for “Paris.” Does this prove it’s responsible for the model knowing Paris is France’s capital? Or could something else explain the high attribution?
Hint: Think about the difference between correlation and causation.
Let’s trace through the calculation for one component (Head 7.4):
Step 1: Get the head’s output vector

```
h_7.4 = [0.23, -0.15, 0.87, ..., 0.42]   # 768 dimensions
```

Step 2: Get the “Paris” column from the unembedding matrix

```
W_unembed[:, paris_id] = [0.12, 0.34, 0.56, ..., 0.78]   # 768 dimensions
```

Step 3: Compute the dot product (projection)

```
contribution = h_7.4 · W_unembed[:, paris_id]
             = (0.23 × 0.12) + (-0.15 × 0.34) + (0.87 × 0.56) + ... + (0.42 × 0.78)
             = +1.84
```
Interpretation: Head 7.4’s output, when projected onto the “Paris” direction in vocabulary space, contributes +1.84 to the logit. A positive value means this head is pushing toward predicting “Paris”.
The same calculation is repeated for every component. The final logit is the sum:
logit("Paris") = Σ all contributions = 2.31 + 1.84 + 0.92 + 0.45 - 0.28 + ... = 5.24
15.4 The Logit Lens: Peeking at Intermediate Predictions
Attribution tells you which components contributed. But when did the model “know” the answer?
The logit lens lets you peek at the model’s “best guess” at each layer.
15.4.1 How It Works
At any layer \(L\), take the current residual stream state and apply the unembedding matrix—as if this were the final layer:
\[\text{logits}_L = x_L \cdot W_{\text{unembed}}\]
This gives you: “If we stopped at layer \(L\) and predicted right now, what would the model say?”
15.4.2 What It Reveals
For “The capital of France is ___”:
| Layer | Top Prediction | Probability |
|---|---|---|
| 0-2 | “the” | 0.12 |
| 3-5 | “France” | 0.18 |
| 6-8 | “Paris” | 0.45 |
| 9-11 | “Paris” | 0.72 |
| 12 | “Paris” | 0.89 |
The model’s prediction refines through layers. Early layers don’t know the answer. Middle layers start to guess. Late layers are confident.
This reveals that knowledge retrieval happens in the middle layers (6-8 in this example), and later layers sharpen the prediction.
15.4.3 Implementation
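A library-agnostic sketch, assuming you have already cached the residual stream state at every layer for the final token position (how you cache it, e.g. with forward hooks, depends on your tooling):

```python
import torch

def logit_lens(residual_states, W_unembed, final_ln=None):
    """Apply the unembedding to each intermediate residual state, as if
    that layer were the last one.

    residual_states: list of [d_model] tensors, one per layer, taken at
    the final token position (cached during the forward pass).
    final_ln: optionally the model's final LayerNorm; many logit-lens
    implementations apply it before unembedding, a detail omitted from
    the formula above.
    """
    predictions = []
    for layer, x in enumerate(residual_states):
        if final_ln is not None:
            x = final_ln(x)
        probs = (x @ W_unembed).softmax(dim=-1)     # [d_vocab]
        top_prob, top_id = probs.max(dim=-1)
        predictions.append((layer, int(top_id), float(top_prob)))
    return predictions

# Decoding each top_id with the model's tokenizer gives the
# layer-by-layer table shown above.
```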
15.5 The Tuned Lens: A Refinement
The basic logit lens has a problem: the unembedding matrix is trained to work on final layer representations, not intermediate ones. Intermediate layers have different statistical properties, so applying the unembedding directly is noisy.
The tuned lens fixes this by learning a small transformation for each layer:
\[\text{logits}_L = (x_L \cdot A_L + b_L) \cdot W_{\text{unembed}}\]
where \(A_L\) and \(b_L\) are learned per-layer parameters that “translate” the intermediate representation into the format the unembedding expects.
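A sketch of the translator, assuming one affine map per layer initialised to the identity; the actual tuned lens trains these maps to match the model’s own final-layer output distribution (training loop omitted here):

```python
import torch
from torch import nn

class TunedLens(nn.Module):
    """One learned affine map (A_L, b_L) per layer, followed by the
    frozen unembedding. A sketch of the idea, not the reference
    implementation."""

    def __init__(self, n_layers, d_model, W_unembed):
        super().__init__()
        self.translators = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_layers)]
        )
        for t in self.translators:
            nn.init.eye_(t.weight)    # start as the plain logit lens
            nn.init.zeros_(t.bias)
        self.register_buffer("W_unembed", W_unembed)   # kept frozen

    def forward(self, x, layer):
        # logits_L = (x_L @ A_L + b_L) @ W_unembed
        return self.translators[layer](x) @ self.W_unembed
```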
15.5.1 The Improvement
With the tuned lens:

- Intermediate predictions are cleaner and more reliable
- You see sharper transitions between “don’t know” and “know”
- Layer-by-layer progression is more interpretable
15.5.2 What It Learns
The learned matrices reveal something about each layer’s representational format—how each layer “stores” information and what transformation is needed to read it.
15.6 Head-Level Attribution
Beyond layer-level analysis, we can break down contributions by individual attention heads.
Each transformer layer has multiple attention heads (often 12 or more). Each head adds its own contribution to the residual stream. We can measure each head’s contribution separately.
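A sketch of the per-head decomposition for a single attention layer, assuming the usual shape conventions and ignoring biases and the final LayerNorm (the cached per-head outputs `z` are placeholders for whatever your tooling provides):

```python
import torch

def per_head_logit_attribution(z, W_O, W_unembed, token_id):
    """Split one attention layer's write into per-head contributions to
    a target token's logit.

    z:   [n_heads, d_head]  per-head outputs at the final token position
         (cached during the forward pass; the caching step is assumed).
    W_O: [n_heads, d_head, d_model]  output projection viewed as one
         block per head, so head h writes z[h] @ W_O[h] to the stream.
    """
    direction = W_unembed[:, token_id]                  # [d_model]
    head_writes = torch.einsum("hd,hdm->hm", z, W_O)    # [n_heads, d_model]
    return head_writes @ direction                      # [n_heads] contributions

# Toy usage: contributions[h] is head h's push toward the target token.
n_heads, d_head, d_model, d_vocab = 12, 64, 768, 1000
z = torch.randn(n_heads, d_head)
W_O = torch.randn(n_heads, d_head, d_model)
W_U = torch.randn(d_model, d_vocab)
contributions = per_head_logit_attribution(z, W_O, W_U, token_id=0)
```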
15.6.1 Why This Matters
Different heads specialize in different tasks:

- Copy heads: Move information from one position to another
- Induction heads: Detect and continue patterns
- Inhibition heads: Suppress incorrect predictions
- Name mover heads: Move names from mentioned positions to output
Head-level attribution identifies which heads are active and in what direction they push the prediction.
15.6.2 Patterns Across Inputs
Running attribution across many inputs reveals head specialization:

- Head 7.4 consistently has high attribution on factual recall tasks
- Head 5.2 activates mainly for syntactic pattern completion
- Head 3.1 seems to matter only for specific domains (code, math)
This guides circuit discovery: heads with consistently high attribution for a task are candidates for that task’s circuit.
15.7 Feature Attribution with SAEs
With sparse autoencoders (Chapter 9), we can go beyond heads to features.
Instead of asking “which heads contributed?”, we ask “which concepts contributed?”
15.7.1 The Procedure
- Run the activation through a trained SAE to get feature activations
- Each feature has a direction and an activation magnitude
- Compute each feature’s contribution: \[\text{contribution}_f = \text{activation}_f \times (d_f \cdot W_{\text{unembed}}[:, \text{token}])\] where \(d_f\) is the feature’s direction (decoder column)
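A sketch of this computation, assuming a trained SAE whose encoder returns sparse feature activations and whose decoder rows are the feature directions \(d_f\) (the interface names here are placeholders):

```python
import torch

def feature_logit_attribution(activation, encode, W_dec, W_unembed, token_id, k=5):
    """contribution_f = activation_f * (d_f . unembed direction).

    activation: [d_model] vector the SAE was trained on (e.g. the
                residual stream or an MLP output at one position).
    encode:     the SAE's encoder, mapping [d_model] -> [n_features]
                sparse feature activations.
    W_dec:      [n_features, d_model] decoder whose rows d_f are the
                feature directions.
    """
    direction = W_unembed[:, token_id]              # [d_model]
    feats = encode(activation)                      # [n_features], mostly zeros
    contributions = feats * (W_dec @ direction)     # [n_features]
    top = torch.topk(contributions, k)
    return list(zip(top.indices.tolist(), top.values.tolist()))

# Mapping the returned feature indices to human-readable labels
# ("France-capital", "European cities", ...) is the interpretation
# step from Chapter 9.
```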
15.7.2 What It Shows
Instead of “MLP layer 8 contributed +2.3”, we might see:

- “France-capital” feature: +1.2
- “European cities” feature: +0.8
- “Proper noun” feature: +0.3
This is interpretable attribution: we can understand the concepts that drove the prediction, not just the components.
15.7.3 The Power of Feature Attribution
Feature attribution enables:

- Understanding what the model is thinking, not just where computation happens
- Finding when the model uses unexpected features (potential errors or biases)
- Tracing how concepts flow through the network
Feature attribution is like semantic profiling. Regular profiling tells you “this function took 50ms.” Semantic profiling would tell you “50ms were spent computing customer discounts.” Feature attribution does the semantic version for neural networks: not just where computation happens, but what it’s about.
15.8 The Critical Limitation: Correlation vs. Causation
Here’s the most important thing to understand about attribution:
Attribution shows correlation, not causation.
When we find that head 7.4 contributed +1.84 to the “Paris” logit, we’ve learned:

- Head 7.4’s output, when projected to vocabulary space, points toward “Paris”
- This contribution was added to the residual stream
We have not learned:

- Whether head 7.4 is necessary for predicting “Paris”
- Whether removing head 7.4 would change the prediction
- Whether the same information might flow through other paths
15.8.1 Why This Matters
Consider two scenarios:
Scenario A: Head 7.4 computes the answer and writes it to the residual stream. Without head 7.4, the model would fail.
Scenario B: Head 7.4 reads the answer from the residual stream (computed by something earlier) and passes it along. Without head 7.4, other heads would still carry the answer.
In both scenarios, head 7.4 has high attribution. But in scenario A, it’s causally necessary; in scenario B, it’s redundant.
Attribution can’t distinguish these cases.
Wrong: “The component with highest attribution is the most important for this behavior.”
Right: High attribution means the component’s output correlates with the prediction—it points in the same direction. But the same information may flow through multiple redundant paths. The “most important” component is often the one that’s necessary (which requires patching or ablation to determine), not just the one with the highest correlation.
Attribution is like a profiler. It tells you where the compute goes—not why it goes there, or what would happen if you changed it. To understand causation, you need to intervene, not just observe. This is the subject of the next chapter.
15.9 The Profiling Analogy
If you’ve done performance engineering, attribution should feel familiar.
| Performance Profiling | Attribution |
|---|---|
| “Function F used 30% of CPU time” | “Head H contributed 30% to the logit” |
| Shows where time is spent | Shows what contributes to output |
| Doesn’t explain why it’s slow | Doesn’t explain why it matters |
| Might reflect redundant work | Might reflect redundant paths |
| Requires follow-up investigation | Requires follow-up intervention |
The profiling mindset applies directly:
- Measure first: Attribution gives you the lay of the land
- Form hypotheses: “Head 7.4 seems to be the key contributor”
- Don’t stop there: Profiling can lie; attribution can mislead
- Test with intervention: Actually modify the system and measure impact
15.9.1 Never Trust Unvalidated Attribution
Just as an experienced performance engineer knows that profiler output requires interpretation, an interpretability researcher knows that high attribution doesn’t prove importance.
The hot function in a profile might be:

- Actually slow (optimize it!)
- Called redundantly (fix the caller)
- Waiting on I/O (not CPU-bound at all)
The high-attribution head might be:

- Actually necessary (it computes the answer)
- Redundant (other paths carry the same info)
- A passthrough (it reads and forwards, doesn’t compute)
You need intervention to distinguish these cases.
15.10 From Attribution to Patching
Attribution narrows your search space. Instead of investigating 144 attention heads, you focus on the 5-10 with high attribution.
But attribution is a hypothesis, not a conclusion.
The next chapter introduces activation patching: the technique for testing whether attributed components are actually causally necessary. Patching is to attribution what controlled experiments are to observation.
Together, they form a complete methodology:

1. Attribution: What did contribute? (observation)
2. Patching: What must contribute? (intervention)
3. Ablation: What happens if we remove it? (Chapter 12)
In practice, you never use attribution alone. Here’s the standard workflow:
```mermaid
flowchart LR
A["Run attribution<br/>(cheap, ~1 forward pass)"] --> B["Identify top-10<br/>components"]
B --> C["Run patching on<br/>candidates<br/>(expensive, many passes)"]
C --> D{"Does patching<br/>confirm attribution?"}
D -->|Yes| E["Component is<br/>causally necessary"]
D -->|No| F["Component is<br/>redundant or<br/>correlated"]
```
15.10.1 Why This Order?
| Step | Cost | Purpose |
|---|---|---|
| Attribution | 1-2 forward passes | Generate hypotheses—find candidates |
| Patching | ~100 forward passes per component | Test hypotheses—verify causality |
Attribution is cheap; patching is expensive. Use attribution to narrow your search, then patch only the promising candidates.
15.10.2 Practical Example: “The capital of France is ___”
Step 1: Attribution (5 minutes)
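What Step 1 might look like in code, reusing the `logit_attribution` sketch from earlier in the chapter (`component_outputs`, `W_U`, and `paris_id` are placeholders for your cached component outputs, unembedding matrix, and target token id):

```python
# Rank all components for the prompt "The capital of France is",
# then keep the top-10 as candidates for patching in Step 2.
ranked = logit_attribution(component_outputs, W_U, token_id=paris_id)
candidates = [name for name, _ in ranked[:10]]
for name, value in ranked[:10]:
    print(f"{name:12s} {value:+.2f}")
```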
Step 2: Patching on top candidates (30 minutes)
Step 3: Interpret results

- MLP 8: patching recovers 78% → causally necessary
- Head 7.4: patching recovers 65% → causally important
- Head 5.2: patching recovers 8% → correlated but redundant
15.10.3 When Attribution ≠ Patching
Sometimes high attribution doesn’t survive patching:
| Scenario | High Attribution? | Patching Effect? | Interpretation |
|---|---|---|---|
| Sole path | Yes | High | This component computes the answer |
| Redundant path | Yes | Low | Other paths also carry the answer |
| Passthrough | Yes | Medium | Reads from earlier, writes to later |
| Suppression | Negative | High | Actively suppresses wrong answers |
The “redundant path” case is common in large models with backup circuits. Attribution sees contribution; patching reveals redundancy.
15.10.4 Cost-Benefit Summary
- 1 task, unknown circuit: Use attribution to find candidates, then patch top-10
- Known circuit, new inputs: Skip attribution, patch directly
- Exploratory research: Attribution across 100 inputs to find consistent patterns
It’s tempting to “just run patching on everything.” Don’t. For a model with 144 attention heads + 12 MLPs, full patching requires ~15,000 forward passes. Attribution gives you the same hypothesis in 2 passes.
Attribution is your cheap filter. Use it.
15.11 Polya’s Perspective: Working Backwards
Attribution is Polya’s “work backwards” heuristic in action.
The forward problem—trace information from input to output—is overwhelming. There are billions of paths through the network, and most are irrelevant.
The backward problem—start from the output and ask “what caused this?”—is tractable. Attribution tells us which components matter for this specific output, cutting the search space dramatically.
“Start from what you want to prove.” In mathematics, this means assuming the result and working backwards to the premises. In interpretability, it means starting from the output and tracing back to the components. Both approaches tame otherwise intractable forward searches.
15.12 Looking Ahead
Attribution gives us hypotheses: “Head 7.4 matters for this prediction.”
But hypotheses need testing. The next chapter covers activation patching—the technique for interventional verification. We’ll learn to ask not just “what correlates with the output?” but “what causes the output?”
This is the difference between observational and experimental science. Attribution is observation. Patching is experiment.
15.13 Further Reading
A Mathematical Framework for Transformer Circuits — Anthropic: The foundational paper on logit attribution and residual stream decomposition.
Eliciting Latent Predictions with the Tuned Lens — arXiv:2303.08112: The definitive paper on improving the logit lens with learned transformations.
Interpreting GPT: The Logit Lens — LessWrong: The original post introducing the logit lens technique.
Attribution Patching — Neel Nanda: How to scale from attribution to patching efficiently.
An Adversarial Example for Direct Logit Attribution — arXiv:2310.07325: Important paper showing limitations and failure modes of naive attribution.
TransformerLens Documentation — GitHub: The primary library for running attribution experiments on transformers.