24  Zoo of Known Circuits

A catalog of discovered mechanisms

A reference of documented circuits and mechanisms in language models, organized by function.


24.1 Attention Patterns

24.1.1 Induction Heads

Function: Pattern completion. Given [A][B]…[A], predict [B].

Location: A two-layer circuit; first isolated in two-layer attention-only transformers, and found across many layers in larger models such as GPT-2 Small

Components:

  • Previous token heads (first layer of the circuit): copy information from position i into position i+1
  • Induction heads (second layer): match the current token against earlier context and attend to the position immediately after its previous occurrence

Verification: Ablating induction heads sharply degrades in-context learning

Reference: In-context Learning and Induction Heads
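The standard diagnostic for induction heads can be sketched without any framework: on a sequence containing a repeat, measure how much attention each position places on the token just after the previous occurrence of its own token. A minimal stdlib-only sketch (the function name and toy attention matrix are illustrative):

```python
def induction_score(tokens, attn):
    """Average attention from each query position to the position one past
    the previous occurrence of the same token (the induction target).

    tokens: list of token ids
    attn:   attn[i][j] = attention weight from query position i to key j
    """
    total, count = 0.0, 0
    for i, tok in enumerate(tokens):
        prev = [j for j in range(i) if tokens[j] == tok]
        if not prev:
            continue  # no earlier occurrence, so no induction target
        target = prev[-1] + 1
        if target <= i:  # respect the causal mask
            total += attn[i][target]
            count += 1
    return total / count if count else 0.0

# A perfect induction head on [A][B][A]: the second A attends entirely
# to the position after the first A, giving a score of 1.0.
attn = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.0, 1.0, 0.0]]
print(induction_score([7, 9, 7], attn))  # → 1.0
```

In real models the same score is computed over attention patterns on repeated random token sequences, where high scores single out induction heads.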


24.1.2 Previous Token Heads

Function: Attend primarily to the immediately preceding token

Location: Early layers (L0-1)

Signature: Attention pattern shows diagonal stripe (position n attends to n-1)

Use: Enables copying, serves as component in induction heads
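The diagonal-stripe signature is easy to quantify: average the attention weight that each position places on its immediate predecessor. A small illustrative sketch (names are hypothetical):

```python
def prev_token_score(attn):
    """Average attention from position n to position n-1 across all
    positions that have a predecessor; near 1.0 for a previous token head."""
    rows = attn[1:]  # position 0 has no predecessor
    return sum(row[n] for n, row in enumerate(rows)) / len(rows)

# A head that always attends to the preceding token scores 1.0.
attn = [[1.0, 0.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 1.0, 0.0]]
print(prev_token_score(attn))  # → 1.0
```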


24.1.3 Duplicate Token Heads

Function: Attend to previous occurrences of the current token

Location: Early-mid layers

Signature: High attention to positions with matching tokens
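The same style of check applies here: sum the attention each position sends to earlier positions holding an identical token (an illustrative sketch):

```python
def duplicate_token_score(tokens, attn):
    """Average attention mass on earlier occurrences of the current token;
    near 1.0 for a duplicate token head."""
    total, count = 0.0, 0
    for i, tok in enumerate(tokens):
        matches = [j for j in range(i) if tokens[j] == tok]
        if matches:
            total += sum(attn[i][j] for j in matches)
            count += 1
    return total / count if count else 0.0

# The second occurrence of token 5 attends straight back to the first.
attn = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [1.0, 0.0, 0.0]]
print(duplicate_token_score([5, 6, 5], attn))  # → 1.0
```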


24.2 Language Structure

24.2.1 IOI (Indirect Object Identification)

Function: Complete “When Mary and John went to the store, John gave a drink to ___” → “Mary”

Location: 26 attention heads across 7 functional groups

Components:

  1. Duplicate Token Heads: detect that a name is repeated
  2. S-Inhibition Heads: suppress attention to the repeated subject
  3. Name Mover Heads: copy the correct name to the output
  4. Backup Name Mover Heads: take over if name movers are ablated
  5. Negative Name Mover Heads: write against the predicted name, calibrating confidence

Size: roughly 2.8M of GPT-2 Small’s 124M parameters are implicated

Reference: Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small
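The IOI circuit is evaluated with the logit difference between the correct indirect object and the repeated subject. A hedged sketch (token strings and values are made up):

```python
def logit_diff(logits, io_name, s_name):
    """IOI metric: logit of the indirect object minus logit of the subject.
    Positive values mean the model prefers the correct name."""
    return logits[io_name] - logits[s_name]

# Hypothetical final-position logits for the two candidate names.
logits = {" Mary": 14.2, " John": 10.9}
print(logit_diff(logits, " Mary", " John"))  # positive → the IO name wins
```

Ablating or patching individual heads and re-measuring this difference is how the seven functional groups were identified.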


24.2.2 Greater-Than Circuit

Function: Compare numbers, e.g., “The war lasted from 1732 to 17___” → predict two-digit endings greater than 32

Location: Late-layer MLPs (roughly layers 8-11 in GPT-2 Small), fed by mid-layer attention heads

Mechanism: MLPs encode ordinal relationships between year tokens

Reference: How Does GPT-2 Compute Greater-Than? Interpreting Mathematical Abilities in a Pre-Trained Language Model
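This circuit is scored with a probability-difference metric: mass on valid (greater) two-digit endings minus mass on invalid ones. A sketch over a toy output distribution (values hypothetical):

```python
def prob_diff(probs, start_yy):
    """Greater-than metric: probability mass on two-digit endings strictly
    greater than the span's start, minus mass on invalid (<=) endings.

    probs: dict mapping two-digit strings "00".."99" to probabilities.
    """
    valid = sum(p for yy, p in probs.items() if int(yy) > start_yy)
    invalid = sum(p for yy, p in probs.items() if int(yy) <= start_yy)
    return valid - invalid

# Toy distribution after "The war lasted from 1732 to 17": most mass on
# endings above 32, as the circuit predicts.
probs = {"31": 0.125, "32": 0.125, "45": 0.5, "89": 0.25}
print(prob_diff(probs, 32))  # → 0.5
```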


24.3 Knowledge Retrieval

24.3.1 Factual Recall (MLP-based)

Function: Complete factual statements like “The Eiffel Tower is in ___” → “Paris”

Location: Mid-layer MLPs (localized by causal tracing in GPT-2 XL)

Mechanism:

  1. Attention heads gather context about the subject
  2. Mid-layer MLPs act as key-value stores (subject → attribute)
  3. Late layers refine and commit to the prediction

Evidence: Patching MLPs changes retrieved facts; ablating key MLPs disrupts recall

Reference: Locating and Editing Factual Associations (ROME paper)
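The key-value picture can be made concrete with a toy linear associative memory: store each (subject, attribute) pair as an outer product, then retrieve by multiplying with the subject key. This illustrates the metaphor only, not ROME's actual editing procedure; all vectors here are made up.

```python
def store(W, key, value):
    """Superpose one association into W: W += value ⊗ key."""
    return [[w + v * k for w, k in zip(row, key)]
            for row, v in zip(W, value)]

def retrieve(W, key):
    """W @ key: recovers the stored value when keys are orthonormal."""
    return [sum(w * k for w, k in zip(row, key)) for row in W]

# Orthonormal "subject" keys and their "attribute" values (toy vectors).
key_eiffel, key_louvre = [1.0, 0.0], [0.0, 1.0]
val_landmark, val_museum = [1.0, 0.0], [0.0, 1.0]

W = [[0.0, 0.0], [0.0, 0.0]]
W = store(W, key_eiffel, val_landmark)
W = store(W, key_louvre, val_museum)
print(retrieve(W, key_eiffel))  # → [1.0, 0.0], the stored attribute
```

Editing a fact in this toy corresponds to changing the stored value for one key, which is the intuition behind rank-one model editing.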


24.3.2 Copy Suppression

Function: Suppress copying when it would be wrong

Location: Late attention layers

Mechanism: Negative attention to positions that would be incorrectly copied

Example: In “The cat sat. The dog ___”, suppress copying “sat” despite its recency


24.4 Modular Arithmetic (Toy Models)

24.4.1 Grokking Circuits

Function: Compute modular addition, (a + b) mod p

Location: One-layer transformer trained on modular arithmetic

Mechanism:

  • Fourier basis representation of inputs
  • Rotation in 2D embedding space
  • Discrete Fourier transform on residual stream

Significance: Shows neural networks can learn exact algorithms, not just heuristics

Reference: Progress Measures for Grokking via Mechanistic Interpretability
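The recovered “clock” algorithm can be reproduced numerically: represent residues as phases and, for each candidate answer, sum cosines over frequencies; constructive interference picks out exactly (a + b) mod p. A stdlib-only sketch:

```python
import math

def mod_add_via_fourier(a, b, p):
    """Recover (a + b) % p by constructive interference: each frequency k
    contributes cos(2*pi*k*(a + b - c)/p), and all terms align (cos 0 = 1)
    only when c ≡ a + b (mod p)."""
    def score(c):
        return sum(math.cos(2 * math.pi * k * (a + b - c) / p)
                   for k in range(1, p))
    return max(range(p), key=score)

print(mod_add_via_fourier(58, 91, 113))  # → 36, i.e. (58 + 91) % 113
```

The correct answer scores p − 1 (every cosine equals 1) while every other candidate scores −1, which is why the trained network's Fourier-basis solution is exact rather than approximate.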


24.5 Safety-Relevant Features

24.5.1 Sycophancy Features

What: Features that activate when the model agrees with a user’s claim despite internally representing it as incorrect

Found in: Claude SAE analysis

Use: Detecting sycophantic drift during deployment


24.5.2 Refusal Features

What: Features associated with declining harmful requests

Found in: Multiple RLHF’d models

Mechanism: Safety training increases activation of these features for harmful prompts


24.6 Emerging Patterns

24.6.1 Multi-Step Reasoning Chains

Status: Partially understood

What we know:

  • Reasoning happens incrementally across layers
  • Information flows from question tokens through intermediate computations
  • Chain-of-thought helps by externalizing intermediate steps

What’s unclear:

  • Exact circuits for different reasoning types
  • How context length affects reasoning depth

24.6.2 Code Generation

Status: Poorly understood

Observations:

  • Different syntax structures activate different layer patterns
  • Function names and variable names processed differently
  • Indentation tracking involves specialized attention patterns

Challenge: Code behavior is diverse, making circuit isolation hard


24.7 How to Discover New Circuits

  1. Pick a narrow behavior: Choose something testable and specific
  2. Find contributing components: Use attribution to identify important heads/MLPs
  3. Verify causality: Use patching to confirm components are necessary
  4. Understand mechanism: Study attention patterns, MLP activations
  5. Test generalization: Check if circuit works on variations
  6. Document thoroughly: Share findings with clear methodology
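Steps 2-4 can be sketched end to end on a toy “model” whose components are named intermediate values: patch the clean activation of one component into a corrupted run and measure how much of the output recovers, which is exactly the causal test in step 3. Everything below is illustrative; real experiments would hook a transformer with a library such as TransformerLens.

```python
def run(x, patch=None):
    """A toy two-component 'model' with named activations. `patch` maps an
    activation name to a value that overrides the computed one, as in
    activation patching; downstream components see the patched value."""
    patch = patch or {}
    acts = {}
    def compute(name, value):
        acts[name] = patch.get(name, value)
        return acts[name]
    h = compute("head", x * 2)   # stands in for an attention head output
    m = compute("mlp", h + 1)    # stands in for an MLP block
    return compute("logit", m * 3), acts

clean_out, clean_acts = run(1)   # clean input
corr_out, _ = run(0)             # corrupted input
# Patch the clean "head" activation into the corrupted run.
patched_out, _ = run(0, {"head": clean_acts["head"]})
recovery = (patched_out - corr_out) / (clean_out - corr_out)
print(recovery)  # → 1.0: this component fully mediates the behavior
```

A recovery near 1.0 marks a component as causally necessary for the behavior; repeating the patch for every component (step 2) yields the candidate circuit to analyze in step 4.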

24.8 Adding to This Zoo

If you discover a new circuit:

  1. Minimum requirements:
    • Clear description of the behavior
    • Components involved (heads, MLPs, layers)
    • Causal evidence (patching or ablation results)
    • Reproducible code
  2. Ideal additions:
    • Multiple verification methods
    • Transfer to other models
    • Connections to other known circuits

Open an issue or PR on the repository with your findings.


24.9 Further Reading