```mermaid
flowchart LR
  subgraph Clean
    C1["Input A"] --> C2["Layer 2"] --> C3["Output: Mary ✓"]
  end
  subgraph Patched
    P1["Input B"] --> P2["Layer 2<br/>(from Clean)"] --> P3["Output: Mary? ✓"]
  end
  C2 -.->|"copy"| P2
```
16 Activation Patching
Causal intervention in neural networks
- The clean/corrupted paradigm for causal intervention
- How activation patching isolates component contributions
- Path patching: tracing specific information flows
- Why patching proves causation where attribution only shows correlation
Required: Chapter 10: Attribution — understanding how to identify candidate components
From Chapter 10 (Attribution), recall:
- Attribution shows which components contributed to the output (correlation)
- The logit is a sum of per-component contributions
- But high attribution doesn’t prove a component is necessary
- Other paths might carry the same information (redundancy)
Attribution gives us hypotheses. Now we ask: How do we test whether a component is causally necessary?
16.1 From Observation to Experiment
In Chapter 10, we learned to ask: “What contributed to this output?” Attribution decomposes the logit into per-component contributions, showing which attention heads and MLPs pushed the prediction in which direction.
But attribution has a fundamental limitation: it shows correlation, not causation.
When head 7.4 has high attribution for predicting “Paris,” we know its output aligns with that prediction. We don’t know whether head 7.4 is necessary—whether the model would fail without it, or whether the same information flows through redundant paths.
To distinguish correlation from causation, we need to intervene—to modify the system and observe what happens. This is what activation patching provides: a way to surgically replace activations and measure the causal effect.
Imagine a patient has heart disease and lives in a polluted city. Both correlate with their condition. To prove the heart is the problem, you don’t just observe—you transplant a healthy heart and see if they recover.
Activation patching is the same. You “transplant” an activation from a healthy (clean) run into a sick (corrupted) run. If behavior recovers, you’ve proven that activation is causally responsible—not just correlated.
Activation patching replaces a component’s activation from one input with its activation from a different input, then measures how the output changes. If the output changes significantly, that component is causally important for the behavior.
16.2 The Clean/Corrupted Paradigm
The standard patching setup uses two inputs:
- Clean input: Produces the behavior we want to understand
- Corrupted input: Produces a different (usually wrong) behavior
For the IOI task from Chapter 8:
- Clean: “When John and Mary went to the store, John gave the bag to” → predicts “Mary”
- Corrupted: “When John and Mary went to the store, Mary gave the bag to” → predicts “John”
The corrupted input is carefully constructed: it’s minimally different from the clean input but produces a different output. This ensures that any differences in activation are due to the specific computation we’re studying, not irrelevant factors.
Constructing effective clean/corrupted pairs is an art. Here’s a systematic approach:
16.2.1 The Checklist
1. Minimal difference: Change as few tokens as possible
   - ✓ Good: Swap “John gave to Mary” → “Mary gave to John”
   - ✗ Bad: Completely different sentence about giving
2. Same structure: Preserve length, syntax, and format
   - ✓ Good: “The capital of France is” → “The capital of Germany is”
   - ✗ Bad: “The capital of France is” → “Berlin is a city in Germany”
3. Matched statistics: Similar token frequencies and positions
   - ✓ Good: Replace “Paris” with “Berlin” (both city names, similar frequency)
   - ✗ Bad: Replace “Paris” with “xyzzy” (nonsense token, very different statistics)
4. Opposite output: Clean and corrupted should predict different (ideally opposite) targets
   - ✓ Good: Clean predicts “Mary”, corrupted predicts “John”
   - ✗ Bad: Both predict “Mary” with slightly different confidence
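As a concrete sketch of a pair that satisfies this checklist, here is how the IOI example might be set up with a TransformerLens-style model; the model choice and prompts are illustrative, not prescriptive.

```python
# A minimal clean/corrupted pair for the IOI task (illustrative prompts).
# Assumes a TransformerLens HookedTransformer; any similar interface works.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

clean_prompt = "When John and Mary went to the store, John gave the bag to"
corrupted_prompt = "When John and Mary went to the store, Mary gave the bag to"

clean_tokens = model.to_tokens(clean_prompt)
corrupted_tokens = model.to_tokens(corrupted_prompt)

# Both prompts must tokenize to the same length, or positions won't line up
assert clean_tokens.shape == corrupted_tokens.shape

# The two candidate answers; the leading space makes each a single GPT-2 token
mary_token = model.to_single_token(" Mary")
john_token = model.to_single_token(" John")
```

The length assertion is worth keeping: if the two prompts tokenize differently, every position-specific comparison later in this chapter breaks.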
16.2.2 Examples by Task Type
| Task | Clean Input | Corrupted Input | What Changes |
|---|---|---|---|
| IOI | “John and Mary… John gave to ___” | “John and Mary… Mary gave to ___” | Subject/object swap |
| Factual recall | “The capital of France is” | “The capital of Germany is” | Country name |
| Greater-than | “The war lasted from 1914 to 19” | “The war lasted from 1918 to 19” | Start year |
| Sentiment | “This movie was absolutely wonderful” | “This movie was absolutely terrible” | Sentiment word |
16.2.3 Common Mistakes
Mistake 1: Too many differences. If your corrupted input differs in 10 ways, you can’t isolate which difference matters. Keep it to 1-2 changes.
Mistake 2: Distribution shift. If corrupted tokens are rare or unusual, activations will be out-of-distribution. Use common tokens.
Mistake 3: No ground truth. You need to know what the “correct” output is for both inputs. Avoid ambiguous cases.
Mistake 4: Forgetting position. Token position matters! If you swap “John” and “Mary,” make sure the positions are comparable.
You patch an attention head’s output and the model’s prediction changes from wrong to right. Does this prove the head causes the correct behavior? What alternative explanation might exist?
Hint: Consider what information the patched activation might contain besides the “target” information.
16.2.4 When Construction Is Hard
Some tasks don’t have natural clean/corrupted pairs:
- Creative generation (no single “correct” output)
- Open-ended reasoning (many valid paths)
- Multi-step tasks (which step do you corrupt?)
For these, consider:
- Using ablation instead of patching
- Creating synthetic tasks with known structure
- Focusing on sub-components of the task
16.2.5 The Patching Procedure
1. Run both clean and corrupted inputs through the model
2. Cache all intermediate activations for both runs
3. At a specific location (layer, position, component), replace the corrupted activation with the clean activation
4. Continue the forward pass with this patched activation
5. Measure how the output changes
If patching a component restores the clean output, that component carries information necessary for the behavior.
```python
# Activation patching (a sketch, assuming a TransformerLens-style model)
def patch_and_measure(model, clean_input, corrupted_input, patch_location, target_token):
    # Run the clean input and cache every intermediate activation
    clean_logits, clean_cache = model.run_with_cache(clean_input)

    # Hook: replace the corrupted activation at patch_location with the clean one
    def patch_hook(activation, hook):
        return clean_cache[hook.name]

    # Run the corrupted input with the patch applied at the chosen location
    patched_logits = model.run_with_hooks(
        corrupted_input,
        fwd_hooks=[(patch_location, patch_hook)],
    )

    # Compare the target token's logit at the final position across the three runs
    corrupted_logits = model(corrupted_input)
    clean_logit = clean_logits[0, -1, target_token]
    corrupted_logit = corrupted_logits[0, -1, target_token]
    patched_logit = patched_logits[0, -1, target_token]

    # Fraction of the clean-corrupted gap that the patch recovers
    recovery = (patched_logit - corrupted_logit) / (clean_logit - corrupted_logit)
    return recovery
```
16.2.6 Measuring Recovery
The key metric is logit difference recovery: how much does patching restore the clean behavior?
\[\text{Recovery} = \frac{\text{logit}_{\text{patched}} - \text{logit}_{\text{corrupted}}}{\text{logit}_{\text{clean}} - \text{logit}_{\text{corrupted}}}\]
- Recovery ≈ 0%: Patching this component has no effect; it’s not causally important
- Recovery ≈ 100%: Patching this component fully restores the correct behavior; it’s critically important
- Recovery between 0-100%: The component contributes but isn’t solely responsible
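As a usage sketch, reusing the tokens from the pair-construction example above; the layer and hook name are arbitrary illustrations of TransformerLens naming, not the “right” place to patch.

```python
# Patch the residual stream at layer 8 and measure recovery of the " Mary" logit.
recovery = patch_and_measure(
    model,
    clean_tokens,
    corrupted_tokens,
    patch_location="blocks.8.hook_resid_post",
    target_token=mary_token,
)
print(f"Recovery: {recovery.item():.1%}")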
Unlike attribution (which observes correlations), patching performs an intervention. If changing a component changes the output, that component is causally involved. This is the difference between observational and experimental science.
16.3 Types of Patching
Different patching targets reveal different aspects of the computation.
16.3.1 Residual Stream Patching
Patch the residual stream at a specific layer and position:
Layer \(L\), Position \(P\): Replace \(x_L^{P}(\text{corrupted})\) with \(x_L^{P}(\text{clean})\)
This tests: “Does the residual stream at this layer/position carry information necessary for the behavior?”
What it reveals: Where in the network (which layer, which position) the critical information exists.
Limitation: The residual stream is a sum of all prior contributions. High recovery at layer L could mean:
- Layer L itself computes the answer
- An earlier layer computed it, and layer L just carries it forward
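A minimal sketch of this layer-by-layer scan, assuming the `patch_and_measure` helper above and TransformerLens hook names; drilling down to individual positions is covered in Section 16.3.4.

```python
# Patch the full residual stream at each layer and record how much of the
# clean behavior is recovered (a sketch; hook names follow TransformerLens).
def resid_patching_scan(model, clean_tokens, corrupted_tokens, target_token):
    recoveries = []
    for layer in range(model.cfg.n_layers):
        recovery = patch_and_measure(
            model,
            clean_tokens,
            corrupted_tokens,
            patch_location=f"blocks.{layer}.hook_resid_post",
            target_token=target_token,
        )
        recoveries.append(recovery.item())
    return recoveries  # one recovery value per layer
```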
16.3.2 Attention Head Patching
Patch the output of a specific attention head:
Head \(H\) at Layer \(L\): Replace \(h_H^{L}(\text{corrupted})\) with \(h_H^{L}(\text{clean})\)
This tests: “Is this specific attention head causally necessary?”
What it reveals: Which heads are critical for the behavior.
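A sketch of per-head patching, assuming TransformerLens naming: `blocks.{layer}.attn.hook_z` stores per-head outputs with shape [batch, position, head, d_head], so a hook can overwrite a single head’s slice.

```python
# Patch one attention head's output (its slice of hook_z) from the clean run.
def patch_head(model, clean_tokens, corrupted_tokens, layer, head, target_token):
    _, clean_cache = model.run_with_cache(clean_tokens)
    hook_name = f"blocks.{layer}.attn.hook_z"  # [batch, pos, head, d_head]

    def head_patch_hook(z, hook):
        # Replace only this head's output; all other heads stay corrupted
        z[:, :, head, :] = clean_cache[hook.name][:, :, head, :]
        return z

    patched_logits = model.run_with_hooks(
        corrupted_tokens, fwd_hooks=[(hook_name, head_patch_hook)]
    )
    return patched_logits[0, -1, target_token]  # patched logit for the target
```

Looping this over every (layer, head) pair and normalizing by the clean-corrupted gap gives the familiar head-patching heatmap.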
The IOI Discovery: Patching revealed that only 26 out of 144 heads in GPT-2 Small are necessary for indirect object identification. The other 118 heads can be corrupted without affecting the task.
16.3.3 MLP Patching
Patch the output of an MLP layer:
MLP at Layer \(L\): Replace \(m^{L}(\text{corrupted})\) with \(m^{L}(\text{clean})\)
This tests whether the nonlinear computations at this layer matter.
16.3.4 Position-Specific Patching
Patch only at specific token positions:
For “When John and Mary went to the store, John gave the bag to ___”:
- Patch only at the “Mary” position (position 4)
- Patch only at the second “John” position (position 9)
- Patch only at the final position (where prediction happens)
What it reveals: Which positions carry the critical information at which layers.
Common Finding: Information flows from source positions (where names appear) to the final position (where prediction happens) through specific layers.
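A sketch of restricting a patch to chosen token positions, reusing the clean/corrupted tokens from earlier; the layer and hook name are again illustrative.

```python
# Patch the residual stream only at selected token positions.
def make_position_patch_hook(clean_cache, positions):
    def hook_fn(resid, hook):
        # resid: [batch, pos, d_model]; overwrite only the chosen positions
        resid[:, positions, :] = clean_cache[hook.name][:, positions, :]
        return resid
    return hook_fn

# Example: patch layer 8's residual stream only at the final (prediction) position
_, clean_cache = model.run_with_cache(clean_tokens)
patched_logits = model.run_with_hooks(
    corrupted_tokens,
    fwd_hooks=[("blocks.8.hook_resid_post",
                make_position_patch_hook(clean_cache, positions=[-1]))],
)
```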
16.3.5 Feature Patching with SAEs
Using sparse autoencoders (Chapter 9), we can patch individual features:
Feature \(F\): Replace \(\text{activation}_F(\text{corrupted})\) with \(\text{activation}_F(\text{clean})\)
This tests: “Is this specific concept causally necessary?”
Instead of “is head 7.4 necessary?” we ask “is the ‘France-capital’ feature necessary?”
What it reveals: Which interpretable features drive the behavior, not just which components.
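A sketch of feature-level patching; the `sae` object and its `encode`/`decode` methods are placeholders for whatever sparse autoencoder library you use (the exact API varies), and `feature_idx` is the feature under test.

```python
# Patch a single SAE feature: encode both runs, swap one feature, decode back.
# The sae.encode / sae.decode calls are placeholders, not a specific library API.
def make_feature_patch_hook(sae, clean_cache, feature_idx):
    def hook_fn(activation, hook):
        corrupted_feats = sae.encode(activation)          # placeholder API
        clean_feats = sae.encode(clean_cache[hook.name])  # placeholder API

        # Copy only the chosen feature's value from the clean run
        corrupted_feats[..., feature_idx] = clean_feats[..., feature_idx]

        # Decode back into the activation space the model expects
        return sae.decode(corrupted_feats)                # placeholder API
    return hook_fn
```

Note that decoding replaces the activation with the SAE’s reconstruction, so reconstruction error is introduced alongside the patch; more careful implementations add the reconstruction error term back in.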
16.4 Noising vs. Denoising
There are two directions to patch:
16.4.1 Denoising (Corrupted → Clean)
Start with corrupted behavior, patch in clean activations, measure recovery.
Question answered: “What information is sufficient to restore the correct behavior?”
Interpretation: High recovery means the patched component carries enough information to fix the behavior.
16.4.2 Noising (Clean → Corrupted)
Start with clean behavior, patch in corrupted activations, measure degradation.
Question answered: “What information is necessary for the correct behavior?”
Interpretation: High degradation means the patched component is essential; corrupting it breaks the behavior.
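In code, the two directions differ only in which run supplies the patch and which run is patched; a minimal sketch reusing the earlier helper and tokens (the location is arbitrary):

```python
patch_location = "blocks.8.hook_resid_post"  # arbitrary example location

# Denoising: clean activations patched into the corrupted run (tests sufficiency)
denoise_recovery = patch_and_measure(
    model, clean_tokens, corrupted_tokens, patch_location, mary_token
)

# Noising: corrupted activations patched into the clean run (tests necessity).
# With the arguments swapped, the same formula measures how far the output is
# dragged toward the corrupted behavior, i.e. degradation rather than recovery.
noise_degradation = patch_and_measure(
    model, corrupted_tokens, clean_tokens, patch_location, mary_token
)
```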
16.4.3 Why Both Matter
| Scenario | Denoising (corrupt→clean) | Noising (clean→corrupt) | Interpretation |
|---|---|---|---|
| A | High recovery | High degradation | Component is both necessary and sufficient |
| B | High recovery | Low degradation | Component is sufficient but redundant |
| C | Low recovery | High degradation | Component is necessary but not sufficient alone |
| D | Low recovery | Low degradation | Component is irrelevant |
The Full Picture: A component that shows high denoising recovery might still be redundant if other paths carry the same information. Noising reveals whether those backup paths exist.
Denoising is like fixing a bug by replacing a component—“if I use the correct version, does it work?” Noising is like introducing a bug—“if I break this component, does everything fail?” Both tests give different information about the system’s dependencies.
16.5 Path Patching
Beyond single-component patching, path patching traces how information flows between components.
16.5.1 The Idea
Instead of patching a component’s output directly, patch the effect of one component on another specific component.
For example: “Does head 5.2’s output affect head 7.4’s computation?”
16.5.2 The Procedure
- Run clean and corrupted inputs
- At head 7.4, when it reads from the residual stream, replace only the contribution from head 5.2
- Measure whether this targeted patch changes the output
If patching head 5.2’s contribution to head 7.4 matters, there’s a causal path from 5.2 to 7.4 to the output.
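A sketch of a simplified, direct-edge version of this procedure, assuming TransformerLens configuration flags that expose per-head outputs (`use_attn_result`) and per-head inputs (`use_split_qkv_input`); the full recipe used in the IOI work additionally freezes other components so that only the chosen path changes.

```python
# Direct-edge path patching: does the sender head's output, delivered directly
# to the receiver head's query input, matter? (A simplified sketch.)
model.set_use_attn_result(True)       # exposes blocks.{l}.attn.hook_result
model.set_use_split_qkv_input(True)   # exposes blocks.{l}.hook_q_input

def path_patch(model, clean_tokens, corrupted_tokens,
               sender_layer, sender_head, recv_layer, recv_head, target_token):
    # Requires sender_layer < recv_layer
    _, clean_cache = model.run_with_cache(clean_tokens)
    _, corrupted_cache = model.run_with_cache(corrupted_tokens)

    sender_name = f"blocks.{sender_layer}.attn.hook_result"  # [batch, pos, head, d_model]
    recv_name = f"blocks.{recv_layer}.hook_q_input"          # [batch, pos, head, d_model]

    # Clean-minus-corrupted difference in the sender head's residual contribution
    delta = (clean_cache[sender_name][:, :, sender_head]
             - corrupted_cache[sender_name][:, :, sender_head])

    def recv_hook(q_input, hook):
        # Swap in the sender's clean contribution for this receiver head only;
        # every other head and every indirect path stays corrupted
        q_input[:, :, recv_head] = q_input[:, :, recv_head] + delta
        return q_input

    patched_logits = model.run_with_hooks(
        corrupted_tokens, fwd_hooks=[(recv_name, recv_hook)]
    )
    return patched_logits[0, -1, target_token]
```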
16.5.3 Building Circuit Diagrams
Path patching constructs the circuit graph:
- Use regular patching to identify important components (nodes)
- Use path patching to identify important connections (edges)
- The result is a circuit diagram showing how information flows
The IOI circuit (Chapter 8) was discovered using exactly this methodology:
- Regular patching identified the 26 important heads
- Path patching revealed how they connect (e.g., S-Inhibition heads suppress Name Mover heads)
- The result was a complete circuit diagram with labeled components and connections
16.5.4 Computational Cost
Path patching is expensive: for \(n\) components, there are \(O(n^2)\) possible paths to test. Researchers use heuristics:
- Only test paths between components with high individual patching effects
- Use gradient-based approximations (attribution patching) to prioritize
- Apply ACDC-style automated discovery
16.6 The Connection to Causal Inference
Activation patching is an application of causal intervention from statistics.
16.6.1 The Do-Operator
In causal inference notation, patching corresponds to the \(\text{do}()\) operator:
\[P(Y \mid \text{do}(X = x))\]
This asks: “What is the distribution of \(Y\) if we set \(X\) to value \(x\), rather than merely observing that \(X = x\)?”
Regular attribution computes \(P(Y \mid X = x)\)—the correlation between component values and outputs.
Patching computes \(P(Y \mid \text{do}(X = x))\)—the causal effect of setting a component to a specific value.
16.6.2 Why Intervention Differs from Observation
Observational data conflates multiple causal mechanisms:
- \(X\) causes \(Y\) (direct effect)
- \(Y\) causes \(X\) (reverse causation)
- \(Z\) causes both \(X\) and \(Y\) (confounding)
Intervention breaks these dependencies by setting \(X\), eliminating reverse causation and confounding.
In neural networks: - Attribution observes: “This component’s value correlates with the output” - Patching intervenes: “Setting this component to value \(x\) causes output \(y\)”
Pearl’s “ladder of causation” distinguishes three levels:
1. Association: Seeing/observing (what attribution does)
2. Intervention: Doing/acting (what patching does)
3. Counterfactual: Imagining/reasoning (what would have happened)
Patching elevates interpretability from level 1 to level 2—from correlation to causation.
16.7 Practical Considerations
16.7.1 Constructing Good Corrupted Inputs
The quality of patching results depends on the corrupted input:
Minimal changes: The corrupted input should differ minimally from the clean input, changing only what’s necessary to flip the behavior.
Matched statistics: Token lengths, positions, and structure should be preserved. “When John and Mary went…” vs “When Mary and John went…” is better than “When John and Mary went…” vs “The cat sat on the mat.”
Multiple corruptions: Test with multiple corrupted inputs to ensure results generalize. A single bad corruption choice could mislead.
16.7.2 Choosing What to Patch
With hundreds of components, you can’t patch everything. Strategies:
Use attribution first: Patch components with high attribution scores. Attribution is cheap; use it to narrow the search.
Layer-by-layer scan: Patch the entire residual stream at each layer to find where critical information appears, then drill down.
Known hypotheses: If you hypothesize that a specific head matters, test it directly.
16.7.3 Interpreting Results
Beware of indirect effects: Patching a component might affect the output indirectly by changing what downstream components receive.
Beware of backup circuits: Low recovery from patching doesn’t prove unimportance—backup circuits might compensate. Noising tests help detect this.
Beware of distribution shift: Patched activations might be “out of distribution” for downstream components, causing unpredictable effects.
16.8 Patching Validates Attribution
Patching and attribution are complementary:
| Attribution | Patching |
|---|---|
| Shows what did contribute | Shows what must contribute |
| Cheap (one forward pass) | Expensive (many forward passes) |
| Correlational | Causal |
| Can mislead (redundancy) | Detects redundancy |
| Good for hypothesis generation | Good for hypothesis testing |
The workflow:
1. Run attribution to identify candidate components
2. Patch the top candidates to verify causal importance
3. Use path patching to trace connections between verified components
4. Build a circuit diagram from verified causal paths
Attribution is like profiler sampling—cheap, informative, but potentially misleading. Patching is like targeted benchmarking—you isolate a component and measure its actual impact. Good performance engineers use both: sampling to find candidates, benchmarking to verify.
16.9 Limitations
16.9.1 The Linearity Assumption
Patching assumes effects are roughly additive—that replacing one component’s activation has a predictable effect. But neural networks are nonlinear. The effect of patching might depend on the values of other components.
16.9.2 Computational Cost
Full patching analysis requires \(O(n)\) forward passes for \(n\) components, or \(O(n^2)\) for path patching. For large models with thousands of components, this becomes expensive.
Attribution patching (using gradients to approximate patching effects) reduces cost dramatically but trades accuracy for speed.
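A minimal sketch of that approximation, assuming TransformerLens-style forward and backward hooks; `metric` is any differentiable scalar of the logits (for IOI, the logit difference between the two names).

```python
# Attribution patching: a first-order (gradient) estimate of the effect of
# patching a given location, from one forward and one backward pass.
def attribution_patch_estimate(model, clean_tokens, corrupted_tokens, location, metric):
    # Cache the clean activations for the comparison term
    _, clean_cache = model.run_with_cache(clean_tokens)

    corrupted_acts, grads = {}, {}

    def save_act(act, hook):
        corrupted_acts[hook.name] = act.detach()

    def save_grad(grad, hook):
        grads[hook.name] = grad.detach()

    model.reset_hooks()
    model.add_hook(location, save_act, dir="fwd")
    model.add_hook(location, save_grad, dir="bwd")

    # Forward and backward on the corrupted run to get the metric's gradient
    logits = model(corrupted_tokens)
    metric(logits).backward()
    model.reset_hooks()

    # Linear estimate: (clean - corrupted) activation dotted with the gradient
    delta = clean_cache[location] - corrupted_acts[location]
    return (delta * grads[location]).sum()
```

Registering the hooks at many locations at once yields estimates for all of them from a single forward and backward pass, which is what makes the approximation so much cheaper than exact patching.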
16.9.3 Clean/Corrupted Design
Results depend heavily on how you construct clean/corrupted pairs. Poor choices lead to:
- Missing important components (if the corruption doesn’t exercise them)
- Finding spurious importance (if the corruption changes irrelevant factors)
There’s no universal recipe for good corrupted inputs—it requires understanding the task.
16.9.4 Distribution Shift
Patching creates activations that the network never saw during training. A patched activation might be “out of distribution,” causing downstream components to behave unpredictably.
This is especially problematic for feature patching: forcing a feature to an unusual value might break assumptions the network relies on.
16.10 Polya’s Perspective: The Experimental Method
Patching embodies the scientific method: form hypotheses, then test them experimentally.
Attribution gives hypotheses: “Head 7.4 seems important for this behavior.”
Patching tests hypotheses: “If we intervene on head 7.4, does the behavior change?”
This is Polya’s heuristic of using all available data—not just observational data (what the network does), but experimental data (what happens when we change it).
“Use all the data.” Observation alone leaves ambiguity—many causal structures produce the same correlations. Intervention resolves ambiguity by actively manipulating the system. Patching gives us experimental data that observation cannot provide.
16.11 Looking Ahead
Patching tells us whether a component is necessary: “If we break this, does the behavior fail?”
But there’s a related question: “What happens if we completely remove this component?” This is ablation—the subject of the next chapter.
While patching swaps activations between inputs, ablation removes components entirely (setting them to zero, or to their mean value). Ablation reveals:
- What the network can do without a component
- Whether backup circuits compensate for removal
- The “minimal sufficient circuit” for a behavior
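For a flavor of the contrast, here is a minimal sketch of the two most common ablation hooks in the same hook style used above; the mean activation is assumed to be precomputed over a reference dataset.

```python
import torch

# Zero ablation: remove the component's contribution entirely
def zero_ablation_hook(activation, hook):
    return torch.zeros_like(activation)

# Mean ablation: replace the component's output with its average over a
# reference dataset (mean_activation is assumed to be precomputed)
def make_mean_ablation_hook(mean_activation):
    def hook_fn(activation, hook):
        return mean_activation.expand_as(activation)
    return hook_fn
```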
Together, attribution, patching, and ablation form a complete toolkit for understanding neural network computations—observation, intervention, and removal.
16.12 Common Confusions
“If patching a component restores the correct behavior, that component must be necessary.” Not quite. High recovery from denoising patching means the component is sufficient to convey the information. But other components might also be sufficient (redundancy). Always combine with noising experiments to test necessity.
“If patching a component has little effect, it doesn’t matter.” Low effect could mean: (1) the component genuinely doesn’t matter, (2) backup circuits compensate, or (3) your corrupted input was poorly chosen. If attribution says a component matters but patching says it doesn’t, investigate backup circuits or try different corruptions.
“Patched activations behave like natural activations.” Not necessarily. Patching creates out-of-distribution activations. If you patch a head output to an extreme value, downstream components may behave unpredictably. Results are most reliable when patched activations are realistic (come from a related input).
“Noising and denoising are interchangeable.” They answer different questions. Noising tests necessity (“is this component needed?”). Denoising tests sufficiency (“is this component enough?”). A component whose corruption under noising breaks the behavior is necessary; one that also restores the behavior under denoising is both necessary and sufficient.
16.13 Further Reading
Interpretability in the Wild: IOI — arXiv:2211.00593: The paper that established patching methodology by reverse-engineering the IOI circuit.
Causal Scrubbing — Redwood Research: A rigorous framework for validating causal hypotheses about circuits.
Attribution Patching — Neel Nanda: Using gradients to efficiently approximate patching effects.
Towards Automated Circuit Discovery (ACDC) — arXiv:2304.14997: Automated methods for finding minimal circuits using iterative patching.
Activation Patching in TransformerLens — GitHub: Implementation guide for running patching experiments.
Causal Inference in Statistics: A Primer — Pearl, Glymour, Jewell: The theoretical foundation for understanding patching as causal intervention.