29 Historical Timeline
The evolution of mechanistic interpretability (2017-2025)
Understanding where the field came from helps you understand where it’s going. This timeline traces the key discoveries, papers, and paradigm shifts that shaped modern mechanistic interpretability.
29.1 The Visual Timeline
```mermaid
timeline
    title Evolution of Mechanistic Interpretability
    section Foundations (2017-2019)
        2017 : Attention Is All You Need
             : Feature Visualization (Olah et al.)
        2018 : BERT released
             : GPT-1 released
             : Building Blocks (Olah)
        2019 : GPT-2 released
    section Early Circuits (2020)
        2020 : Zoom In (Circuits paper)
             : Curve Detectors
             : High-Low Frequency Detectors
    section The Framework (2021)
        2021 : Mathematical Framework
             : Residual stream perspective
             : TransformerLens created
    section Core Discoveries (2022)
        2022 : Induction Heads paper
             : Toy Models of Superposition
             : Grokking mechanistic analysis
    section Scaling Up (2023)
        2023 : Towards Monosemanticity
             : Dictionary learning at scale
             : Neuronpedia launched
    section Current Era (2024-2025)
        2024 : Scaling Monosemanticity
             : Anthropic Golden Gate Claude
             : Automated circuit discovery
        2025 : SAEBench evaluation
             : Multi-phase emergence
             : Feature absorption discovered
```
29.2 Detailed Timeline
29.2.1 2017: The Transformer and Feature Visualization
June 2017 — Attention Is All You Need Vaswani et al. introduce the Transformer architecture. At the time, nobody imagined we’d be reverse-engineering these systems. The paper focuses on translation performance, not interpretability.
November 2017 — Feature Visualization Chris Olah, Alexander Mordvintsev, and Ludwig Schubert publish Feature Visualization on Distill. This establishes that neural network features can be visualized and (sometimes) interpreted. The focus is on vision models, but the conceptual framework—features as meaningful units—will prove foundational.
29.2.2 2018-2019: The Language Model Era Begins
2018 — BERT, GPT-1, and Building Blocks Large pretrained language models arrive: BERT shows that pretraining works, and GPT-1 shows that autoregressive generation works. Interpretability research is sparse; most work focuses on making models bigger and better. The same year, Chris Olah and collaborators publish The Building Blocks of Interpretability on Distill, introducing key ideas about how to visualize and understand neural network internals. Still focused on vision, but the groundwork for language model interpretability is being laid.
2019 — GPT-2 OpenAI releases GPT-2, demonstrating that scaling language models leads to qualitative capability jumps.
29.2.3 2020: The Circuits Era Begins
March 2020 — Zoom In: An Introduction to Circuits The Circuits thread launches on Distill. This is the conceptual birth of mechanistic interpretability as we know it.
Key ideas introduced:
- Circuits: Subnetworks implementing identifiable algorithms
- Features: The fundamental units of representation
- Universality: Similar circuits across different networks
The focus is still on vision models (InceptionV1), but the framework is general.
2020 — Curve Detectors and High-Low Frequency Detectors A series of papers documents specific circuits in vision models:
- Curve detector circuits
- High-low frequency detector circuits
- Branch specialization
These establish the methodology: identify a behavior, isolate the responsible components, understand the algorithm.
29.2.4 2021: The Mathematical Framework
December 2021 — A Mathematical Framework for Transformer Circuits This paper from Anthropic changes everything. It provides:
- The residual stream perspective: Viewing the transformer as components reading from and writing to a shared workspace
- Virtual attention heads: Understanding how heads compose across layers
- Precise mathematical language for describing transformer computations
This paper establishes the conceptual vocabulary still used today: residual stream, QK circuit, OV circuit, composition.
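To make the reading-and-writing picture concrete, here is a minimal sketch of the residual stream view (hypothetical helper names, no particular library): every component computes a delta from the current stream and adds it back in.
```python
# Minimal sketch of the residual stream perspective. `embed`, `unembed`, and the
# per-layer `attn`/`mlp` callables are hypothetical stand-ins, not a real API.
def transformer_forward(tokens, embed, blocks, unembed):
    resid = embed(tokens)              # the shared workspace: one vector per position
    for attn, mlp in blocks:           # blocks = [(attn_fn, mlp_fn), ...]
        resid = resid + attn(resid)    # attention reads the stream and writes a delta
        resid = resid + mlp(resid)     # the MLP reads the updated stream, writes another delta
    return unembed(resid)              # logits are a linear readout of the final stream
```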
2021 — TransformerLens Created Neel Nanda creates TransformerLens, the standard library for mechanistic interpretability research. It provides:
- Easy access to all intermediate activations
- Hooks for interventions
- Support for many model architectures
This dramatically lowers the barrier to entry for interpretability research.
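As a flavour of what this looks like in practice, here is a small sketch using the TransformerLens API as commonly documented (hook names can differ between versions, so treat the details as approximate): one call caches every activation, and a hook edits one of them mid-forward-pass.
```python
# Sketch of typical TransformerLens usage; hook names may vary across versions.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The quick brown fox jumps over the lazy dog")

# One forward pass that caches every intermediate activation.
logits, cache = model.run_with_cache(tokens)
print(cache["blocks.0.attn.hook_pattern"].shape)   # layer-0 attention patterns

# Intervention via a hook: zero-ablate layer 0's attention pattern.
def zero_ablate(pattern, hook):
    return pattern * 0.0

ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[("blocks.0.attn.hook_pattern", zero_ablate)],
)
```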
29.2.5 2022: The Big Discoveries
March 2022 — In-Context Learning and Induction Heads This paper documents the first complete, validated circuit in a language model.
Key findings:
- Induction heads implement in-context learning
- They emerge via a phase transition during training
- The circuit requires two layers (previous token head → induction head)
This becomes the canonical example of mechanistic interpretability done well.
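The algorithm is simple enough to write out directly. The toy function below is purely illustrative: it operates on token IDs rather than attention weights, but it mirrors the match-then-copy behaviour the two heads implement together.
```python
# Illustrative re-implementation of the induction *algorithm*, not the circuit itself.
def induction_prediction(tokens):
    """If the current token appeared earlier, predict whatever followed it then."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # most recent earlier occurrence first
        if tokens[i] == current:               # matching step (previous-token information)
            return tokens[i + 1]               # induction step: copy what came next last time
    return None                                # no earlier occurrence: nothing to copy

print(induction_prediction([7, 3, 9, 5, 3]))   # -> 9, because "... 3 9 ..." occurred earlier
```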
September 2022 — Toy Models of Superposition This paper explains why neurons are polysemantic:
- Networks pack more features than dimensions (superposition)
- Sparsity enables superposition
- Phase transitions govern when superposition occurs
This paper explains why interpretability is hard and points toward solutions (sparse autoencoders).
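A stripped-down version of the toy setup fits in a few lines. The sketch below trains the paper's ReLU-output model on sparse synthetic features; dimensions, sparsity, and training details are illustrative choices, and the per-feature importance weights from the paper are omitted.
```python
# Toy superposition sketch: reconstruct n_features sparse features through a
# d_hidden-dimensional bottleneck (illustrative hyperparameters, not the paper's).
import torch

n_features, d_hidden, sparsity = 20, 5, 0.95
W = torch.nn.Parameter(torch.randn(d_hidden, n_features) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for _ in range(5000):
    # Each feature is active with probability 1 - sparsity, with a random magnitude.
    x = torch.rand(256, n_features) * (torch.rand(256, n_features) > sparsity)
    h = x @ W.T                                # compress into d_hidden dimensions
    x_hat = torch.relu(h @ W + b)              # ReLU-output reconstruction of all features
    loss = ((x - x_hat) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# At high sparsity the learned columns of W are not orthogonal: the model stores
# 20 features in 5 dimensions by superposing them.
```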
2022 — Grokking Analysis Neel Nanda and colleagues analyze grokking—sudden generalization after prolonged memorization. They find:
- Grokking corresponds to learning efficient algorithms
- The model transitions from memorization circuits to generalization circuits
- Mechanistic analysis can predict when grokking occurs
29.2.6 2023: Scaling Dictionary Learning
October 2023 — Towards Monosemanticity This paper from Anthropic demonstrates that sparse autoencoders work at scale:
- Training SAEs on a 1-layer transformer
- Finding interpretable, monosemantic features
- Demonstrating that features can be used for steering
This proves the concept and sets the stage for larger-scale work.
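The core recipe is a one-hidden-layer autoencoder with an L1 sparsity penalty, trained on cached activations. The sketch below shows the standard formulation; the paper's exact initialisation, normalisation, and training tricks are omitted.
```python
# Minimal sparse autoencoder sketch (standard L1-penalised formulation).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model, d_features, l1_coeff=1e-3):
        super().__init__()
        self.enc = nn.Linear(d_model, d_features)   # overcomplete: d_features >> d_model
        self.dec = nn.Linear(d_features, d_model)
        self.l1_coeff = l1_coeff

    def forward(self, acts):
        feats = torch.relu(self.enc(acts))           # sparse, non-negative feature activations
        recon = self.dec(feats)                      # reconstruction of the original activations
        loss = ((recon - acts) ** 2).mean() + self.l1_coeff * feats.abs().mean()
        return feats, recon, loss

sae = SparseAutoencoder(d_model=512, d_features=8192)
acts = torch.randn(64, 512)                          # stand-in for cached MLP activations
feats, recon, loss = sae(acts)
```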
2023 — Neuronpedia Launched Neuronpedia launches, providing:
- Interactive exploration of SAE features
- Crowdsourced feature interpretations
- A shared vocabulary for discussing features
This transforms SAE research from individual exploration to community science.
29.2.7 2024: The Scaling Era
May 2024 — Scaling Monosemanticity This paper scales SAEs to Claude 3 Sonnet:
- 34 million features extracted
- Abstract features discovered (e.g., “deception,” “sycophancy”)
- Features can be used to steer model behavior
The “Golden Gate Claude” demonstration shows that adding the “Golden Gate Bridge” feature makes Claude obsess about the bridge—dramatic proof that features are real.
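Mechanically, steering amounts to adding a scaled copy of a feature's decoder direction into the residual stream during the forward pass. The sketch below is conceptual; the hook point and scale are illustrative, and the Golden Gate demo itself used Anthropic-internal tooling.
```python
# Conceptual feature-steering sketch: amplify one feature by adding its decoder
# direction to the residual stream at every position (hook point is illustrative).
def make_steering_hook(feature_direction, scale=10.0):
    """feature_direction: (d_model,) decoder column for the feature to amplify."""
    def hook(resid, hook_obj):
        return resid + scale * feature_direction   # broadcast over batch and sequence
    return hook

# e.g., with the TransformerLens model from the earlier sketch:
# direction = sae.dec.weight[:, feature_idx]        # from the SAE sketch above
# model.run_with_hooks(tokens,
#     fwd_hooks=[("blocks.6.hook_resid_post", make_steering_hook(direction))])
```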
2024 — Automated Circuit Discovery Multiple papers develop automated methods for finding circuits:
- ACDC (Automated Circuit DisCovery)
- Subnetwork Probing
- Attribution Patching at scale
The goal: move from artisanal circuit analysis to systematic discovery.
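Attribution patching, in particular, replaces many separate patched forward passes with a first-order approximation: the effect of patching one activation is estimated as a gradient-times-difference term. The sketch below shows the core idea under that assumption, with `metric_fn` as a hypothetical stand-in for the rest of the forward pass.
```python
# Sketch of the attribution-patching approximation (conceptual, not a faithful
# reproduction of any particular implementation).
import torch

def attribution_patch_score(metric_fn, corrupt_act, clean_act):
    """First-order estimate of how the metric would change if clean_act were patched in."""
    corrupt_act = corrupt_act.clone().requires_grad_(True)
    metric = metric_fn(corrupt_act)       # continues the forward pass from this activation
    metric.backward()                     # gradient of the metric w.r.t. the activation
    return (corrupt_act.grad * (clean_act - corrupt_act).detach()).sum()
```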
2024 — Attention Pattern Taxonomy Researchers develop taxonomies of attention head behaviors:
- Previous token heads
- Induction heads
- Duplicate token heads
- Positional heads
This moves toward a “parts list” for transformer internals.
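Such taxonomies are typically built by scoring heads against simple behavioural signatures. As one hedged example, a head is a candidate previous-token head if most of its attention mass sits on the immediately preceding position, as in the heuristic below (threshold and inputs are illustrative).
```python
# Heuristic previous-token-head detector: score each head by how much attention
# it puts on the immediately preceding position (illustrative, not a standard API).
import torch

def previous_token_scores(pattern):
    """pattern: (n_heads, seq, seq) attention probabilities for a single prompt."""
    seq = pattern.shape[-1]
    dest = torch.arange(1, seq)                    # destination positions 1..seq-1
    src = torch.arange(0, seq - 1)                 # their immediate predecessors
    return pattern[:, dest, src].mean(dim=-1)      # near 1.0 suggests a previous-token head

# e.g., scores = previous_token_scores(cache["blocks.4.attn.hook_pattern"][0])
```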
29.2.8 2025: Current Frontiers
2025 — SAEBench SAEBench provides standardized evaluation for SAEs:
- Benchmark tasks for assessing feature quality
- Discovery that proxy metrics (L0, reconstruction) don’t predict downstream utility
- Push for task-specific evaluation
2025 — Multi-Phase Emergence Research reveals that capability emergence is more complex than single phase transitions:
- Multiple distinct phases during training
- Different capabilities emerge at different times
- Implications for understanding training dynamics
2025 — Feature Absorption Researchers discover that scaling SAEs has limits:
- Common features “absorb” related features
- Hierarchical concepts don’t decompose cleanly
- New architectures (matryoshka SAEs) proposed as solutions
29.3 Key Paradigm Shifts
29.3.1 Shift 1: Neurons → Features (2017-2022)
Old view: Neurons are the unit of analysis
New view: Features (directions in activation space) are the unit of analysis
29.3.2 Shift 2: Behavior → Mechanism (2020-2022)
Old view: Understand what the model does
New view: Understand how the model does it
29.3.3 Shift 3: Single Component → Circuit (2021-2022)
Old view: Find the component responsible for a behavior
New view: Find the circuit (connected components) that implements the behavior
29.3.4 Shift 4: Artisanal → Automated (2023-2025)
Old view: Manually analyze individual circuits
New view: Develop automated tools for systematic analysis
29.3.5 Shift 5: Proof of Concept → Rigorous Evaluation (2024-2025)
Old view: Demonstrate that interpretability is possible
New view: Develop metrics and benchmarks for interpretability quality
29.4 The People
Major contributors to the field:
Chris Olah — Pioneered feature visualization, the circuits framework, and much of the conceptual vocabulary. Co-founded Anthropic’s interpretability team.
Neel Nanda — Created TransformerLens, wrote foundational tutorials, discovered grokking mechanisms. Made interpretability accessible.
Catherine Olsson — Co-authored the Mathematical Framework and Induction Heads papers. Foundational technical contributions.
Nelson Elhage — Lead author on Toy Models of Superposition. Technical lead on much of Anthropic’s interpretability work.
Tom Brown, Sam McCandlish — Anthropic researchers contributing to scaling SAEs and other foundational work.
Arthur Conmy — Developed ACDC for automated circuit discovery. Key contributor to tooling.
29.5 Reading the History
If you’re new to the field, read these papers in order:
- Zoom In (2020) — The vision and conceptual framework
- Mathematical Framework (2021) — The technical vocabulary for transformers
- Induction Heads (2022) — The first complete circuit
- Toy Models of Superposition (2022) — Why interpretability is hard
- Towards Monosemanticity (2023) — The SAE solution
- Scaling Monosemanticity (2024) — SAEs at scale
These six papers contain 90% of what you need to understand modern mechanistic interpretability.
29.6 What’s Next?
Open questions the field is working on:
- Scaling interpretability — Can we analyze GPT-4-scale models?
- Automated discovery — Can we find circuits without manual analysis?
- Causal interventions — Can we reliably steer model behavior?
- Training dynamics — Can we understand how capabilities emerge?
- Universality — Do the same circuits appear across models?
The field is young. The major discoveries are likely still ahead of us.