15  Attribution

Working backwards from the output

Categories: techniques, attribution

Author: Taras Tsugrii

Published: January 5, 2025

Tip: What You’ll Learn
  • How the residual stream’s additive structure enables attribution
  • Logit attribution: measuring per-component contributions to predictions
  • Attention pattern analysis: what does each head look at?
  • The limitations of attribution (correlation ≠ causation)
Warning: Prerequisites

Required: Chapter 3: Residual Stream — understanding additive contributions
Recommended: Chapter 8: Circuits — understanding what we’re trying to find

Note: Before You Read: Recall

From prior chapters, recall:

  • The residual stream is a sum of contributions from all components (Chapter 3)
  • We can decompose this sum to see what each component contributed
  • SAEs extract interpretable features from superposed representations (Chapter 9)

We have features. Now we ask: Which features contributed to a specific output?

15.1 Starting from the Answer

In Chapter 9, we learned how to extract features from superposition using sparse autoencoders. We now have a vocabulary: monosemantic features that correspond to interpretable concepts.

But having features isn’t the same as understanding how they produce outputs. When the model predicts “Paris” after “The capital of France is”, which features were involved? Which attention heads mattered? Which layers did the real work?

To answer these questions, we use attribution: the technique of decomposing an output into contributions from individual components.

Note: Polya’s Heuristic: Work Backwards

Attribution embodies one of Polya’s most powerful problem-solving strategies: start from the answer and work backwards. Instead of tracing forward from input to output (which is computationally expensive and conceptually overwhelming), we start from the output we want to explain and ask: “What contributed to this?”

15.2 The Residual Stream Enables Attribution

Attribution is possible because of the residual stream architecture we covered in Chapter 3.

Recall: the final output is a sum of contributions. Each attention head and each MLP adds its output to the residual stream. The final logits are computed by multiplying this accumulated stream by the unembedding matrix:

\[\text{logits} = x_L \cdot W_{\text{unembed}}\]

But \(x_L\) is a sum:

\[x_L = x_0 + \sum_{\text{heads}} h_i + \sum_{\text{MLPs}} m_j\]

Since matrix multiplication distributes over addition:

\[\text{logits} = x_0 \cdot W_{\text{unembed}} + \sum_{\text{heads}} h_i \cdot W_{\text{unembed}} + \sum_{\text{MLPs}} m_j \cdot W_{\text{unembed}}\]

Each term is a component’s contribution to the logits. We can literally measure how much each attention head and each MLP pushed the prediction toward or away from any particular token.
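To see the distributive step concretely, here is a minimal numpy sketch. Every number and dimension in it is made up (tiny, illustrative sizes rather than real model weights); it simply checks that projecting the summed residual stream gives the same logits as summing the per-component projections.

import numpy as np

rng = np.random.default_rng(0)
d_model, d_vocab = 16, 100                    # tiny illustrative dimensions

# Pretend contributions written to the residual stream by a few components
x0 = rng.standard_normal(d_model)             # embedding
heads = rng.standard_normal((4, d_model))     # four attention heads
mlps = rng.standard_normal((2, d_model))      # two MLPs
W_unembed = rng.standard_normal((d_model, d_vocab))

x_L = x0 + heads.sum(axis=0) + mlps.sum(axis=0)   # accumulated residual stream

logits_from_sum = x_L @ W_unembed                               # project the sum
per_component = np.vstack([x0[None], heads, mlps]) @ W_unembed  # project each part
logits_from_parts = per_component.sum(axis=0)                   # sum the projections

assert np.allclose(logits_from_sum, logits_from_parts)

Each row of per_component holds one component’s contribution to every token’s logit; reading down a single column gives the decomposition of that token’s logit.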

Important: The Key Insight

Because transformers accumulate contributions additively, we can decompose the output into per-component contributions. This is what makes attribution tractable.

flowchart LR
    E["Embedding<br/>+0.5"] --> SUM(("Σ"))
    H1["Head 5.2<br/>+0.9"] --> SUM
    H2["Head 7.4<br/>+1.8"] --> SUM
    M1["MLP 8<br/>+2.3"] --> SUM
    H3["Head 6.1<br/>-0.3"] --> SUM
    OTHER["Other<br/>+0.1"] --> SUM
    SUM --> OUT["Logit for 'Paris'<br/>= 5.3"]

Attribution decomposes the final logit into per-component contributions. Each component’s output projects onto the vocabulary direction.

15.3 Logit Attribution

The most basic form of attribution: for a given output token, measure how much each component contributed to that token’s logit.

15.3.1 The Procedure

  1. Run a forward pass, caching each component’s output
  2. For each component (attention head or MLP), compute: \[\text{contribution}_i = h_i \cdot W_{\text{unembed}}[:, \text{token}]\] This is the component’s output projected onto the direction that increases the target token’s logit
  3. Rank components by contribution magnitude (a minimal code sketch of this procedure follows below)
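As a rough illustration of the procedure, here is a sketch using the TransformerLens library (see Further Reading). It is a simplification, not a polished recipe: it projects raw component outputs onto the target token’s unembedding column and ignores the final LayerNorm, so the resulting numbers are approximate.

from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
model.set_use_attn_result(True)              # cache per-head outputs, not just per-layer

prompt = "The capital of France is"
target = model.to_single_token(" Paris")     # id of the " Paris" token

_, cache = model.run_with_cache(prompt)
paris_dir = model.W_U[:, target]             # "Paris" column of the unembedding matrix

# NOTE: this ignores the final LayerNorm, so the values are approximate.
contributions = {}
for layer in range(model.cfg.n_layers):
    head_out = cache["result", layer][0, -1]     # [n_heads, d_model] at the final position
    for head in range(model.cfg.n_heads):
        contributions[f"head {layer}.{head}"] = (head_out[head] @ paris_dir).item()
    mlp_out = cache["mlp_out", layer][0, -1]     # [d_model] at the final position
    contributions[f"mlp {layer}"] = (mlp_out @ paris_dir).item()

# Rank components by their signed contribution to the "Paris" logit
for name, value in sorted(contributions.items(), key=lambda kv: kv[1], reverse=True)[:5]:
    print(f"{name}: {value:+.2f}")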

15.3.2 What It Shows

If the model predicts “Paris”, logit attribution might show:

  • MLP layer 8: contributed +2.3 to the “Paris” logit
  • Attention head 7.4: contributed +1.8
  • Attention head 5.2: contributed +0.9
  • Attention head 6.1: contributed -0.3 (pushed away from “Paris”)

This immediately tells you where to look: MLP 8 and head 7.4 are the major contributors.

15.3.3 A Concrete Example

For the prompt “The capital of France is ___”:

| Component   | Contribution to “Paris” |
|-------------|-------------------------|
| MLP Layer 8 | +2.31                   |
| Head 7.4    | +1.84                   |
| Head 5.2    | +0.92                   |
| Embedding   | +0.45                   |
| Head 6.1    | -0.28                   |

The model’s confidence in “Paris” is the sum of all contributions. We can now ask: what is MLP 8 doing? What pattern is head 7.4 attending to?

Warning: Pause and Think

A component has high attribution for “Paris.” Does this prove it’s responsible for the model knowing Paris is France’s capital? Or could something else explain the high attribution?

Hint: Think about the difference between correlation and causation.

Let’s trace through the calculation for one component (Head 7.4):

Step 1: Get the head’s output vector

h_7.4 = [0.23, -0.15, 0.87, ..., 0.42]  # 768 dimensions

Step 2: Get the “Paris” column from the unembedding matrix

W_unembed[:, paris_id] = [0.12, 0.34, 0.56, ..., 0.78]  # 768 dimensions

Step 3: Compute the dot product (projection)

contribution = h_7.4 · W_unembed[:, paris_id]
             = (0.23 × 0.12) + (-0.15 × 0.34) + (0.87 × 0.56) + ... + (0.42 × 0.78)
             = +1.84

Interpretation: Head 7.4’s output, when projected onto the “Paris” direction in vocabulary space, contributes +1.84 to the logit. A positive value means this head is pushing toward predicting “Paris”.

The same calculation is repeated for every component. The final logit is the sum:

logit("Paris") = Σ all contributions = 2.31 + 1.84 + 0.92 + 0.45 - 0.28 + ... = 5.24

15.4 The Logit Lens: Peeking at Intermediate Predictions

Attribution tells you which components contributed. But when did the model “know” the answer?

The logit lens lets you peek at the model’s “best guess” at each layer.

15.4.1 How It Works

At any layer \(L\), take the current residual stream state and apply the unembedding matrix—as if this were the final layer:

\[\text{logits}_L = x_L \cdot W_{\text{unembed}}\]

This gives you: “If we stopped at layer \(L\) and predicted right now, what would the model say?”

15.4.2 What It Reveals

For “The capital of France is ___”:

| Layer | Top Prediction | Probability |
|-------|----------------|-------------|
| 0-2   | “the”          | 0.12        |
| 3-5   | “France”       | 0.18        |
| 6-8   | “Paris”        | 0.45        |
| 9-11  | “Paris”        | 0.72        |
| 12    | “Paris”        | 0.89        |

The model’s prediction refines through layers. Early layers don’t know the answer. Middle layers start to guess. Late layers are confident.

This reveals that knowledge retrieval happens in the middle layers (6-8 in this example), and later layers sharpen the prediction.

15.4.3 Implementation

# Logit lens: decode each layer's residual stream with the final unembedding matrix
for layer in range(n_layers):
    # residual_stream[layer]: hidden state at the position being predicted
    intermediate_logits = residual_stream[layer] @ unembed_matrix
    top_token = intermediate_logits.argmax()
    print(f"Layer {layer}: predicts '{tokenizer.decode(top_token)}'")

15.5 The Tuned Lens: A Refinement

The basic logit lens has a problem: the unembedding matrix is trained to work on final layer representations, not intermediate ones. Intermediate layers have different statistical properties, so applying the unembedding directly is noisy.

The tuned lens fixes this by learning a small transformation for each layer:

\[\text{logits}_L = (x_L \cdot A_L + b_L) \cdot W_{\text{unembed}}\]

where \(A_L\) and \(b_L\) are learned per-layer parameters that “translate” the intermediate representation into the format the unembedding expects.
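A minimal PyTorch sketch of the idea (illustrative only, not the actual tuned-lens implementation referenced in Further Reading): one learned affine translator per layer, initialized to the identity, decoded with the frozen unembedding.

import torch
import torch.nn as nn

class TunedLens(nn.Module):
    """One learned affine translator per layer; the unembedding stays frozen."""
    def __init__(self, n_layers: int, d_model: int, W_unembed: torch.Tensor):
        super().__init__()
        # A_L and b_L from the formula above, one pair per layer, initialized to identity
        self.translators = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_layers)]
        )
        for t in self.translators:
            nn.init.eye_(t.weight)
            nn.init.zeros_(t.bias)
        self.register_buffer("W_unembed", W_unembed)   # [d_model, d_vocab], frozen

    def forward(self, x_L: torch.Tensor, layer: int) -> torch.Tensor:
        # logits_L = (x_L * A_L + b_L) @ W_unembed
        return self.translators[layer](x_L) @ self.W_unembed

In practice the translators are trained to match the model’s own final-layer distribution, for example by minimizing the KL divergence between logits_L and the true final logits, so each lens learns how that layer “encodes” its current best guess.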

15.5.1 The Improvement

With the tuned lens:

  • Intermediate predictions are cleaner and more reliable
  • You see sharper transitions between “don’t know” and “know”
  • Layer-by-layer progression is more interpretable

15.5.2 What It Learns

The learned matrices reveal something about each layer’s representational format—how each layer “stores” information and what transformation is needed to read it.

15.6 Head-Level Attribution

Beyond layer-level analysis, we can break down contributions by individual attention heads.

Each transformer layer has multiple attention heads (often 12 or more). Each head adds its own contribution to the residual stream. We can measure each head’s contribution separately.

15.6.1 Why This Matters

Different heads specialize in different tasks:

  • Copy heads: Move information from one position to another
  • Induction heads: Detect and continue patterns
  • Inhibition heads: Suppress incorrect predictions
  • Name mover heads: Move names from mentioned positions to the output

Head-level attribution identifies which heads are active and in what direction they push the prediction.

15.6.2 Patterns Across Inputs

Running attribution across many inputs reveals head specialization:

  • Head 7.4 consistently has high attribution on factual recall tasks
  • Head 5.2 activates mainly for syntactic pattern completion
  • Head 3.1 seems to matter only for specific domains (code, math)

This guides circuit discovery: heads with consistently high attribution for a task are candidates for that task’s circuit.
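A sketch of what such a survey might look like. The aggregation function is generic; the per-prompt attribution dictionaries would come from a loop like the one sketched in Section 15.3, and the numbers below are made up for illustration.

from collections import defaultdict

def mean_attribution(per_prompt_attributions):
    """Average each head's contribution across a set of prompts.

    per_prompt_attributions: list of {head_name: contribution} dicts, one per prompt.
    Returns (head_name, mean_contribution) pairs sorted from largest to smallest.
    """
    totals = defaultdict(float)
    for attributions in per_prompt_attributions:
        for head, value in attributions.items():
            totals[head] += value
    n = len(per_prompt_attributions)
    return sorted(((h, t / n) for h, t in totals.items()),
                  key=lambda kv: kv[1], reverse=True)

# Illustrative (made-up) numbers for three factual-recall prompts
example = [
    {"head 7.4": 1.8, "head 5.2": 0.9, "head 3.1": 0.1},
    {"head 7.4": 1.6, "head 5.2": 0.2, "head 3.1": 0.0},
    {"head 7.4": 1.9, "head 5.2": 0.1, "head 3.1": 0.7},
]
print(mean_attribution(example)[:3])   # head 7.4 dominates across prompts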

15.7 Feature Attribution with SAEs

With sparse autoencoders (Chapter 9), we can go beyond heads to features.

Instead of asking “which heads contributed?”, we ask “which concepts contributed?”

15.7.1 The Procedure

  1. Run the activation through a trained SAE to get feature activations
  2. Each feature has a direction and an activation magnitude
  3. Compute each feature’s contribution: \[\text{contribution}_f = \text{activation}_f \times (d_f \cdot W_{\text{unembed}}[:, \text{token}])\] where \(d_f\) is the feature’s direction (decoder column). A minimal sketch of this computation follows below.
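The sketch below assumes you already have a trained SAE’s encoder (as a callable) and its decoder matrix for the activation site being analyzed; every name and shape here is illustrative rather than tied to a specific SAE library.

import torch

def feature_attribution(activation, sae_encode, W_decoder, W_unembed, token_id, top_k=5):
    """Rank SAE features by their contribution to one token's logit.

    activation: [d_model] residual-stream (or MLP-out) vector at the final position
    sae_encode: callable mapping activation -> [n_features] sparse feature activations
    W_decoder:  [d_model, n_features] decoder matrix (column f is feature f's direction)
    W_unembed:  [d_model, d_vocab] unembedding matrix
    """
    feature_acts = sae_encode(activation)        # activation_f for every feature
    token_dir = W_unembed[:, token_id]           # direction for the target token
    # contribution_f = activation_f * (d_f . W_unembed[:, token])
    contributions = feature_acts * (W_decoder.T @ token_dir)
    top = torch.topk(contributions, top_k)
    return list(zip(top.indices.tolist(), top.values.tolist()))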

15.7.2 What It Shows

Instead of “MLP layer 8 contributed +2.3”, we might see:

  • “France-capital” feature: +1.2
  • “European cities” feature: +0.8
  • “Proper noun” feature: +0.3

This is interpretable attribution: we can understand the concepts that drove the prediction, not just the components.

15.7.3 The Power of Feature Attribution

Feature attribution enables:

  • Understanding what the model is thinking, not just where computation happens
  • Finding when the model uses unexpected features (potential errors or biases)
  • Tracing how concepts flow through the network

Tip: A Performance Engineering Parallel

Feature attribution is like semantic profiling. Regular profiling tells you “this function took 50ms.” Semantic profiling would tell you “50ms were spent computing customer discounts.” Feature attribution does the semantic version for neural networks: not just where computation happens, but what it’s about.

15.8 The Critical Limitation: Correlation vs. Causation

Here’s the most important thing to understand about attribution:

Attribution shows correlation, not causation.

When we find that head 7.4 contributed +1.84 to the “Paris” logit, we’ve learned:

  • Head 7.4’s output, when projected to vocabulary space, points toward “Paris”
  • This contribution was added to the residual stream

We have not learned:

  • Whether head 7.4 is necessary for predicting “Paris”
  • Whether removing head 7.4 would change the prediction
  • Whether the same information might flow through other paths

15.8.1 Why This Matters

Consider two scenarios:

Scenario A: Head 7.4 computes the answer and writes it to the residual stream. Without head 7.4, the model would fail.

Scenario B: Head 7.4 reads the answer from the residual stream (computed by something earlier) and passes it along. Without head 7.4, other heads would still carry the answer.

In both scenarios, head 7.4 has high attribution. But in scenario A, it’s causally necessary; in scenario B, it’s redundant.

Attribution can’t distinguish these cases.

Caution: Common Misconception: High Attribution = High Importance

Wrong: “The component with highest attribution is the most important for this behavior.”

Right: High attribution means the component’s output correlates with the prediction—it points in the same direction. But the same information may flow through multiple redundant paths. The “most important” component is often the one that’s necessary (which requires patching or ablation to determine), not just the one with the highest correlation.

Important: The Fundamental Limitation

Attribution is like a profiler. It tells you where the compute goes—not why it goes there, or what would happen if you changed it. To understand causation, you need to intervene, not just observe. This is the subject of the next chapter.

15.9 The Profiling Analogy

If you’ve done performance engineering, attribution should feel familiar.

| Performance Profiling                | Attribution                            |
|--------------------------------------|----------------------------------------|
| “Function F used 30% of CPU time”    | “Head H contributed 30% to the logit”  |
| Shows where time is spent            | Shows what contributes to output       |
| Doesn’t explain why it’s slow        | Doesn’t explain why it matters         |
| Might reflect redundant work         | Might reflect redundant paths          |
| Requires follow-up investigation     | Requires follow-up intervention        |

The profiling mindset applies directly:

  1. Measure first: Attribution gives you the lay of the land
  2. Form hypotheses: “Head 7.4 seems to be the key contributor”
  3. Don’t stop there: Profiling can lie; attribution can mislead
  4. Test with intervention: Actually modify the system and measure impact

15.9.1 Never Trust Unvalidated Attribution

Just as an experienced performance engineer knows that profiler output requires interpretation, an interpretability researcher knows that high attribution doesn’t prove importance.

The hot function in a profile might be:

  • Actually slow (optimize it!)
  • Called redundantly (fix the caller)
  • Waiting on I/O (not CPU-bound at all)

The high-attribution head might be:

  • Actually necessary (it computes the answer)
  • Redundant (other paths carry the same info)
  • A passthrough (it reads and forwards, doesn’t compute)

You need intervention to distinguish these cases.

15.10 From Attribution to Patching

Attribution narrows your search space. Instead of investigating 144 attention heads, you focus on the 5-10 with high attribution.

But attribution is a hypothesis, not a conclusion.

The next chapter introduces activation patching: the technique for testing whether attributed components are actually causally necessary. Patching is to attribution what controlled experiments are to observation.

Together, they form a complete methodology:

  1. Attribution: What did contribute? (observation)
  2. Patching: What must contribute? (intervention)
  3. Ablation: What happens if we remove it? (Chapter 12)

In practice, you never use attribution alone. Here’s the standard workflow:

flowchart LR
    A["Run attribution<br/>(cheap, ~1 forward pass)"] --> B["Identify top-10<br/>components"]
    B --> C["Run patching on<br/>candidates<br/>(expensive, many passes)"]
    C --> D{"Does patching<br/>confirm attribution?"}
    D -->|Yes| E["Component is<br/>causally necessary"]
    D -->|No| F["Component is<br/>redundant or<br/>correlated"]

15.10.1 Why This Order?

| Step        | Cost                              | Purpose                             |
|-------------|-----------------------------------|-------------------------------------|
| Attribution | 1-2 forward passes                | Generate hypotheses—find candidates |
| Patching    | ~100 forward passes per component | Test hypotheses—verify causality    |

Attribution is cheap; patching is expensive. Use attribution to narrow your search, then patch only the promising candidates.

15.10.2 Practical Example: “The capital of France is ___”

Step 1: Attribution (5 minutes)

# Run once, get the contribution of every component to the "Paris" logit
# (hypothetical helper returning a {component: contribution} mapping)
contributions = model.logit_attribution(prompt, target="Paris")
# Keep the ten components with the largest contributions
top_components = sorted(contributions, key=contributions.get, reverse=True)[:10]
# Result: MLP 8 (+2.3), Head 7.4 (+1.8), Head 5.2 (+0.9), ...

Step 2: Patching on top candidates (30 minutes)

# For each top component, patch from corrupted input
for component in top_components:
    recovery = patch_from_corrupted(component)
    # If recovery > 0.5, this component is causally important

Step 3: Interpret results

  • MLP 8: patching recovers 78% → causally necessary
  • Head 7.4: patching recovers 65% → causally important
  • Head 5.2: patching recovers 8% → correlated but redundant

15.10.3 When Attribution ≠ Patching

Sometimes high attribution doesn’t survive patching:

| Scenario       | High Attribution? | Patching Effect? | Interpretation                       |
|----------------|-------------------|------------------|--------------------------------------|
| Sole path      | Yes               | High             | This component computes the answer   |
| Redundant path | Yes               | Low              | Other paths also carry the answer    |
| Passthrough    | Yes               | Medium           | Reads from earlier, writes to later  |
| Suppression    | Negative          | High             | Actively suppresses wrong answers    |

The “redundant path” case is common in large models with backup circuits. Attribution sees contribution; patching reveals redundancy.

15.10.4 Cost-Benefit Summary

  • 1 task, unknown circuit: Use attribution to find candidates, then patch top-10
  • Known circuit, new inputs: Skip attribution, patch directly
  • Exploratory research: Attribution across 100 inputs to find consistent patterns
Warning: Common Mistake: Skipping to Patching

It’s tempting to “just run patching on everything.” Don’t. For a model with 144 attention heads + 12 MLPs, full patching requires roughly 15,000 forward passes (156 components × ~100 passes each). Attribution gives you the same hypotheses in 2 passes.

Attribution is your cheap filter. Use it.

15.11 Polya’s Perspective: Working Backwards

Attribution is Polya’s “work backwards” heuristic in action.

The forward problem—trace information from input to output—is overwhelming. There are billions of paths through the network, and most are irrelevant.

The backward problem—start from the output and ask “what caused this?”—is tractable. Attribution tells us which components matter for this specific output, cutting the search space dramatically.

Tip: Polya’s Insight

“Start from what you want to prove.” In mathematics, this means assuming the result and working backwards to the premises. In interpretability, it means starting from the output and tracing back to the components. Both approaches tame otherwise intractable forward searches.

15.12 Looking Ahead

Attribution gives us hypotheses: “Head 7.4 matters for this prediction.”

But hypotheses need testing. The next chapter covers activation patching—the technique for interventional verification. We’ll learn to ask not just “what correlates with the output?” but “what causes the output?”

This is the difference between observational and experimental science. Attribution is observation. Patching is experiment.


15.13 Further Reading

  1. A Mathematical Framework for Transformer Circuits (Anthropic): The foundational paper on logit attribution and residual stream decomposition.

  2. Eliciting Latent Predictions with the Tuned Lens (arXiv:2303.08112): The definitive paper on improving the logit lens with learned transformations.

  3. Interpreting GPT: The Logit Lens (LessWrong): The original post introducing the logit lens technique.

  4. Attribution Patching (Neel Nanda): How to scale from attribution to patching efficiently.

  5. An Adversarial Example for Direct Logit Attribution (arXiv:2310.07325): Important paper showing limitations and failure modes of naive attribution.

  6. TransformerLens Documentation (GitHub): The primary library for running attribution experiments on transformers.