24 Zoo of Known Circuits
A catalog of discovered mechanisms
A reference of documented circuits and mechanisms in language models, organized by function.
24.1 Attention Patterns
24.1.1 Induction Heads
Function: Pattern completion. Given [A][B]…[A], predict [B].
Location: First characterized in two-layer attention-only transformers (previous-token heads in layer 0 feeding induction heads in layer 1); analogous heads appear in larger models such as GPT-2 Small
Components:
- Previous token heads (layer 0): copy information from position i into position i+1
- Induction heads (layer 1): match the current token against earlier context and attend to the position just after its previous occurrence
Verification: Ablating induction heads sharply reduces in-context learning ability
Reference: In-context Learning and Induction Heads
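The induction signature can be checked directly from a head's attention matrix. A minimal sketch in numpy, using a hand-built "perfect" pattern as a stand-in for a real head's attention on a repeated random sequence:

```python
import numpy as np

def induction_score(attn: np.ndarray, period: int) -> float:
    """Mean attention from each position in the repeated half to the
    token one step after its previous occurrence (offset = period - 1)."""
    seq_len = attn.shape[0]
    scores = [attn[i, i - period + 1] for i in range(period, seq_len)]
    return float(np.mean(scores))

# Ideal induction head on two copies of a length-8 block: position i
# attends entirely to position i - 7 (one after the prior occurrence).
T = 8
attn = np.zeros((2 * T, 2 * T))
for i in range(T, 2 * T):
    attn[i, i - T + 1] = 1.0
print(induction_score(attn, period=T))  # → 1.0
```

In practice the score is averaged over many repeated random sequences; heads scoring near 1 are induction-head candidates.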
24.1.2 Previous Token Heads
Function: Attend primarily to the immediately preceding token
Location: Early layers (L0-1)
Signature: Attention pattern shows diagonal stripe (position n attends to n-1)
Use: Enables copying, serves as component in induction heads
24.1.3 Duplicate Token Heads
Function: Attend to previous occurrences of the current token
Location: Early-mid layers
Signature: High attention to positions with matching tokens
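Both the previous-token and duplicate-token signatures reduce to simple statistics of the attention matrix. A sketch, assuming the pattern is available as a [seq, seq] array along with the token ids (the toy uniform-causal pattern below is illustrative only):

```python
import numpy as np

def prev_token_score(attn: np.ndarray) -> float:
    """Mean attention from position i to position i-1 (the diagonal stripe)."""
    seq_len = attn.shape[0]
    return float(np.mean([attn[i, i - 1] for i in range(1, seq_len)]))

def duplicate_token_score(attn: np.ndarray, tokens: list[int]) -> float:
    """Mean attention mass a position sends to earlier copies of its own token."""
    scores = []
    for i, tok in enumerate(tokens):
        prev = [j for j in range(i) if tokens[j] == tok]
        if prev:
            scores.append(attn[i, prev].sum())
    return float(np.mean(scores)) if scores else 0.0

# Toy pattern: uniform attention over all earlier positions (plus self).
tokens = [5, 7, 5, 9, 7]
n = len(tokens)
attn = np.tril(np.ones((n, n)))
attn /= attn.sum(axis=1, keepdims=True)
print(prev_token_score(attn), duplicate_token_score(attn, tokens))
```

A genuine previous-token head scores near 1 on the first statistic; a duplicate-token head scores high on the second for sequences containing repeats.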
24.2 Language Structure
24.2.1 IOI (Indirect Object Identification)
Function: Complete “John gave Mary the ball. Mary gave the ball to ___” → “John”
Location: 26 attention heads across 7 functional groups
Components:
1. Duplicate Token Heads: detect the repeated name
2. S-Inhibition Heads: suppress attention to the repeated subject
3. Name Mover Heads: copy the correct name to the output position
4. Backup Name Mover Heads: take over when Name Movers are ablated (redundancy)
5. Negative Name Mover Heads: write against the name, calibrating the prediction
Size: ~2.8M of GPT-2 Small’s 124M parameters are involved
Reference: Interpretability in the Wild: IOI
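The IOI paper's headline metric is the logit difference between the indirect object and the repeated subject. A minimal sketch (the token ids and logit values below are illustrative, not GPT-2's actual vocabulary or outputs):

```python
import numpy as np

def logit_diff(logits: np.ndarray, io_token: int, s_token: int) -> float:
    """IOI metric: logit of the correct indirect object ("John") minus the
    logit of the repeated subject ("Mary"). Positive = circuit working."""
    return float(logits[io_token] - logits[s_token])

# Hypothetical vocabulary ids, for illustration only.
JOHN, MARY = 1000, 2000
logits = np.random.default_rng(0).normal(size=50257)
logits[JOHN], logits[MARY] = 12.0, 9.5
print(logit_diff(logits, JOHN, MARY))  # → 2.5
```

Circuit experiments then track how this difference changes when components are ablated or patched.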
24.2.2 Greater-Than Circuit
Function: Compare numbers, e.g., “The war lasted from 1732 to 17___” → put probability on two-digit endings greater than 32
Location: MLPs in layers 9-11 (GPT-2 Small)
Mechanism: MLPs encode ordinal relationships between year tokens
Reference: How does GPT-2 compute greater-than? (Hanna et al.)
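The circuit's success is usually measured as the probability mass placed on valid two-digit continuations. A sketch, assuming logits over the 100 candidate endings "00" through "99" (the toy logits below are hand-built, not model outputs):

```python
import numpy as np

def prob_mass_greater_than(year_logits: np.ndarray, cutoff: int) -> float:
    """Probability assigned to two-digit endings YY with YY > cutoff,
    given logits over the 100 candidates '00'..'99'."""
    probs = np.exp(year_logits - year_logits.max())   # stable softmax
    probs /= probs.sum()
    return float(probs[cutoff + 1:].sum())

# Toy logits favoring endings after 32, as the circuit should produce.
logits = np.full(100, -2.0)
logits[33:] = 3.0
print(round(prob_mass_greater_than(logits, cutoff=32), 3))
```

A working circuit yields mass close to 1; ablating the relevant MLPs flattens this distribution.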
24.3 Knowledge Retrieval
24.3.1 Factual Recall (MLP-based)
Function: Complete factual statements like “The Eiffel Tower is in ___” → “Paris”
Location: Mid-layer MLPs (the ROME study localized recall to mid-layer MLPs in GPT-2 XL)
Mechanism:
1. Attention heads gather context about the subject
2. Mid-layer MLPs act as key-value stores (subject → attribute)
3. Late layers refine and commit to the prediction
Evidence: Patching MLPs changes retrieved facts; ablating key MLPs disrupts recall
Reference: Locating and Editing Factual Associations (ROME paper)
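The key-value picture of a factual-recall MLP can be reproduced in miniature: rows of the input matrix act as subject keys, columns of the output matrix as attribute values. A toy sketch with random stand-in directions (not trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Hypothetical subject keys and attribute values (random stand-ins).
key_eiffel = rng.normal(size=d)
key_colosseum = rng.normal(size=d)
val_paris = rng.normal(size=d)
val_rome = rng.normal(size=d)

# MLP as linear associative memory: W_in rows are keys, W_out columns values.
W_in = np.stack([key_eiffel, key_colosseum])      # (2, d)
W_out = np.stack([val_paris, val_rome], axis=1)   # (d, 2)

def mlp(x: np.ndarray) -> np.ndarray:
    return W_out @ np.maximum(W_in @ x, 0.0)      # ReLU key match → value

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

out = mlp(key_eiffel)  # "query" the memory with the Eiffel Tower subject
print(cosine(out, val_paris) > cosine(out, val_rome))  # → True
```

ROME exploits exactly this structure: editing one key-value pair in a mid-layer MLP rewrites the associated fact.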
24.3.2 Copy Suppression
Function: Suppress copying when it would be wrong
Location: Late attention layers
Mechanism: The head attends to the token that would be incorrectly copied and writes a negative copy of it, lowering that token’s logit
Example: In “The cat sat. The dog ___”, suppress copying “sat” despite its recency
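A caricature of this mechanism: the head attends to the earlier token and writes a negative multiple of its embedding into the residual stream, pushing that token's logit down. A sketch with a tied random embedding/unembedding (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
d, vocab = 32, 10
E = rng.normal(size=(vocab, d))   # tied embedding / unembedding matrix

resid = rng.normal(size=d)        # residual stream before the head acts
sat = 3                           # illustrative token id for "sat"

# The head attends to the earlier "sat" and writes a negative copy of it.
alpha = 0.8
resid_after = resid - alpha * E[sat]

logits_before = E @ resid
logits_after = E @ resid_after
print(logits_before[sat] > logits_after[sat])  # → True
```

The subtraction specifically lowers the attended token's logit, which is the suppression effect observed in real heads.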
24.4 Modular Arithmetic (Toy Models)
24.4.1 Grokking Circuits
Function: Compute modular addition, (a + b) mod p
Location: One-layer transformer trained on modular arithmetic
Mechanism:
- Fourier basis representation of inputs
- Rotation in 2D embedding space
- Discrete Fourier transform on residual stream
Significance: Shows neural networks can learn exact algorithms, not just heuristics
Reference: Progress Measures for Grokking via Mechanistic Interpretability
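The learned algorithm is simple enough to reimplement by hand: represent a and b as cosines and sines at a few frequencies, combine them with trig identities, and score each candidate c by how close w(a + b - c) is to a multiple of 2π. A sketch (the frequency choice here is arbitrary; trained models select their own key frequencies):

```python
import numpy as np

p = 113
freqs = [1, 5, 17]  # a handful of frequencies, chosen for illustration

def mod_add_fourier(a: int, b: int) -> int:
    """Score each candidate c by sum_w cos(w*(a+b-c)); the true answer
    scores exactly len(freqs), every other c strictly less (p prime)."""
    w = 2 * np.pi * np.array(freqs) / p
    # Trig identities combine a and b without ever forming a + b directly:
    cos_ab = np.cos(w * a) * np.cos(w * b) - np.sin(w * a) * np.sin(w * b)
    sin_ab = np.sin(w * a) * np.cos(w * b) + np.cos(w * a) * np.sin(w * b)
    c = np.arange(p)
    scores = (cos_ab[:, None] * np.cos(w[:, None] * c)
              + sin_ab[:, None] * np.sin(w[:, None] * c)).sum(axis=0)
    return int(scores.argmax())

print(mod_add_fourier(100, 50))  # → 37, i.e. (100 + 50) mod 113
```

This mirrors the trained one-layer transformer: embeddings supply the cos/sin terms, the MLP computes the products, and the unembedding performs the cosine scoring over candidates.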
24.5 Safety-Relevant Features
24.5.1 Sycophancy Features
What: Features that activate when the model agrees with a user’s stated position even when its internal representations favor a different answer
Found in: Sparse autoencoder (SAE) analyses of Claude
Use: Detecting sycophantic drift during deployment
24.5.2 Refusal Features
What: Features associated with declining harmful requests
Found in: Multiple RLHF-trained models
Mechanism: Safety training increases activation of these features for harmful prompts
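Monitoring such a feature at deployment time amounts to reading one coordinate of the SAE encoder's output. A generic sketch, where the encoder weights and the feature index are hypothetical stand-ins (a real setup would load trained SAE weights and a feature index found via max-activating examples):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_features = 64, 256

# Hypothetical trained SAE encoder (random stand-in weights here).
W_enc = rng.normal(size=(n_features, d_model)) / np.sqrt(d_model)
b_enc = np.zeros(n_features)
REFUSAL_FEATURE = 42  # hypothetical feature index, for illustration

def feature_activation(resid: np.ndarray, idx: int) -> float:
    """Standard SAE encoder: acts = ReLU(W_enc @ x + b_enc); read one feature."""
    return float(np.maximum(W_enc @ resid + b_enc, 0.0)[idx])

# Monitoring: flag prompts whose residual stream activates the feature
# above a threshold calibrated on known harmful/benign examples.
resid = rng.normal(size=d_model)
print(feature_activation(resid, REFUSAL_FEATURE) >= 0.0)  # → True (ReLU is nonnegative)
```

The same read-out works for sycophancy features; only the feature index and threshold change.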
24.6 Emerging Patterns
24.6.1 Multi-Step Reasoning Chains
Status: Partially understood
What we know:
- Reasoning happens incrementally across layers
- Information flows from question tokens through intermediate computations
- Chain-of-thought helps by externalizing intermediate steps
What’s unclear:
- Exact circuits for different reasoning types
- How context length affects reasoning depth
24.6.2 Code Generation
Status: Poorly understood
Observations:
- Different syntax structures activate different layer patterns
- Function names and variable names processed differently
- Indentation tracking involves specialized attention patterns
Challenge: Code behavior is diverse, making circuit isolation hard
24.7 How to Discover New Circuits
- Pick a narrow behavior: Choose something testable and specific
- Find contributing components: Use attribution to identify important heads/MLPs
- Verify causality: Use patching to confirm components are necessary
- Understand mechanism: Study attention patterns, MLP activations
- Test generalization: Check if circuit works on variations
- Document thoroughly: Share findings with clear methodology
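The causality-verification step above can be illustrated end to end on a hand-rolled two-layer network: run a clean and a corrupted input, splice the clean hidden activation into the corrupted forward pass, and check how much of the clean output is restored:

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(8, 4)), rng.normal(size=(2, 8))

def forward(x, patch_hidden=None):
    """Two-layer net; optionally overwrite the hidden layer (the 'patch')."""
    h = np.maximum(W1 @ x, 0.0)
    if patch_hidden is not None:
        h = patch_hidden
    return W2 @ h

x_clean = np.array([1.0, 0.0, 0.0, 0.0])
x_corrupt = np.array([0.0, 1.0, 0.0, 0.0])

h_clean = np.maximum(W1 @ x_clean, 0.0)   # cache the clean activation
out_clean = forward(x_clean)
out_patched = forward(x_corrupt, patch_hidden=h_clean)

# Patching the hidden layer fully restores the clean output in this toy,
# because the hidden layer carries all information about the input.
print(np.allclose(out_patched, out_clean))  # → True
```

In a real model the patch targets one head or MLP at a time (via hooks), and partial restoration of the behavior metric localizes the circuit.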
24.8 Adding to This Zoo
If you discover a new circuit:
- Minimum requirements:
- Clear description of the behavior
- Components involved (heads, MLPs, layers)
- Causal evidence (patching or ablation results)
- Reproducible code
- Ideal additions:
- Multiple verification methods
- Transfer to other models
- Connections to other known circuits
Open an issue or PR on the repository with your findings.
24.9 Further Reading
- Transformer Circuits Thread — Anthropic’s ongoing circuit discoveries
- 200 Concrete Open Problems in Mechanistic Interpretability — Research questions, including many circuit-hunting problems
- ACDC: Automated Circuit Discovery — Tools for scaling circuit discovery