# First Principles of Mechanistic Interpretability

## Welcome
This is a first-principles exploration of mechanistic interpretability — the project of reverse engineering neural networks to understand how they work, not just that they work.
## 0.1 What This Book Is
A 15-chapter journey from foundations to practice:
- Arc I: Foundations (Chapters 1-4) — What are we trying to understand? The transformer architecture, residual stream, and geometric structure of representations.
- Arc II: Core Theory (Chapters 5-8) — Features, superposition, toy models, and circuits. The conceptual framework for interpretation.
- Arc III: Techniques (Chapters 9-12) — Sparse autoencoders, attribution, activation patching, and ablation. The tools of the trade.
- Arc IV: Synthesis (Chapters 13-15) — Induction heads as a complete case study, open problems in the field, and a practical guide to doing research.
## 0.2 How the Chapters Connect
The diagram below shows how concepts build on each other across the four arcs:
```mermaid
flowchart TB
    subgraph ARC1["Arc I: Foundations"]
        A1["Ch 1: Why Reverse Engineer?"]
        A2["Ch 2: Transformers"]
        A3["Ch 3: Residual Stream"]
        A4["Ch 4: Geometry"]
    end
    subgraph ARC2["Arc II: Core Theory"]
        A5["Ch 5: Features"]
        A6["Ch 6: Superposition"]
        A7["Ch 7: Toy Models"]
        A8["Ch 8: Circuits"]
    end
    subgraph ARC3["Arc III: Techniques"]
        A9["Ch 9: SAEs"]
        A10["Ch 10: Attribution"]
        A11["Ch 11: Patching"]
        A12["Ch 12: Ablation"]
    end
    subgraph ARC4["Arc IV: Synthesis"]
        A13["Ch 13: Induction Heads"]
        A14["Ch 14: Open Problems"]
        A15["Ch 15: Practice Regime"]
    end
    %% Arc I flow
    A1 --> A2
    A2 --> A3
    A3 --> A4
    %% Arc I to Arc II
    A4 --> A5
    A3 --> A5
    %% Arc II flow
    A5 --> A6
    A6 --> A7
    A5 --> A8
    A7 --> A8
    %% Arc II to Arc III
    A6 --> A9
    A3 --> A10
    A8 --> A11
    A8 --> A12
    %% Arc III flow
    A9 --> A10
    A10 --> A11
    A11 --> A12
    %% All techniques feed into synthesis
    A12 --> A13
    A9 --> A13
    A8 --> A13
    %% Synthesis flow
    A13 --> A14
    A14 --> A15
    %% Styling
    style ARC1 fill:#e3f2fd,stroke:#1976d2
    style ARC2 fill:#f3e5f5,stroke:#7b1fa2
    style ARC3 fill:#e8f5e9,stroke:#388e3c
    style ARC4 fill:#fff3e0,stroke:#f57c00
```
Then choose a reading path based on your background and goals:
- Complete Learning Journey (5 hours)
  - Chapters 1→15 in order. Best for deep understanding.
- ML Practitioner Fast Track (3 hours)
  - Skip to Ch 5→8 (theory), then Ch 9→12 (techniques), then Ch 13 (case study).
  - Assumes: You know transformers, attention, MLPs.
- Safety-Focused Path (2.5 hours)
  - Ch 1 (motivation) → Ch 5-6 (features, superposition) → Ch 9 (SAEs) → Ch 14 (open problems)
  - Goal: Understand what interpretability can and can’t do for AI safety.
- Hands-On Researcher (4 hours)
  - Setup → First Analysis → Ch 9-12 (techniques) → Ch 15 (practice) → Exercises
  - Goal: Start doing interpretability research quickly.
- Reference User
  - Use Quick Reference as a cheat sheet, Zoo of Circuits for known mechanisms, and Running Example to see all techniques applied to one behavior.
If you prefer learning by doing:
- Environment Setup (5 min) — Get TransformerLens running in Colab
- Your First Analysis (25 min) — Complete walkthrough analyzing a real behavior
You’ll have hands-on experience with attribution and patching before reading any theory.
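If you want a preview of what that setup involves, here is a minimal smoke test (a sketch, assuming a recent `transformer_lens` release; the Environment Setup page has the authoritative, pinned instructions):

```python
# A minimal TransformerLens smoke test. In Colab, first run:
#   !pip install transformer_lens
from transformer_lens import HookedTransformer

# Load GPT-2 Small (d_model = 768, 12 layers, 12 heads per layer).
model = HookedTransformer.from_pretrained("gpt2")

# Run a prompt and look at the model's next-token prediction.
prompt = "The Eiffel Tower is located in the city of"
logits = model(prompt)               # shape: [batch, seq, vocab]
next_token = logits[0, -1].argmax()  # highest-logit token at the last position
print(model.to_string(next_token))   # expected: " Paris"
```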
Want to see what interpretability reveals before diving in?
- Explore real features (5 min): Visit Neuronpedia and click “Random Feature.” Look at the max-activating examples. Can you guess what concept the feature represents? Try 3-4 features.
- See attention patterns (5 min): Visit the Transformer Explainer to visualize how attention moves information between tokens.
You’ve now seen the two core phenomena this book explains: features (what networks represent) and attention (how they move information).
| Arc | Chapters | Reading Time |
|---|---|---|
| Arc I: Foundations | 1-4 + Summary | ~1 hour |
| Arc II: Core Theory | 5-8 + Summary | ~1.5 hours |
| Arc III: Techniques | 9-12 + Summary | ~1.5 hours |
| Arc IV: Synthesis | 13-15 | ~1 hour |
| Total | 15 chapters + 3 summaries | ~5 hours |
Each chapter is 12-22 minutes. Take breaks between arcs—the summaries are designed as natural stopping points.
## 0.3 Who This Is For
- Software engineers curious about ML internals
- ML practitioners who want deeper understanding
- Performance engineers interested in AI
- Anyone who values understanding why over just how
## 0.4 Background Assumed
This book is designed to be accessible, but some background helps:
- Linear algebra basics: Vectors, matrices, matrix multiplication, dot products
- High school math: Exponentials, logarithms, basic trigonometry
- No calculus required: Gradient descent is explained conceptually
If you can multiply a matrix by a vector and know that cos(90°) = 0, you have enough math (there’s a short self-check after this list).
- Python familiarity: Code examples use Python and PyTorch
- No ML experience required: We explain transformers from scratch
- Helpful but optional: Prior exposure to neural networks accelerates Arc I
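Here is that self-check as a quick PyTorch sketch (illustrative only; the values are made up). If each line makes sense, you have both the math and the Python background assumed here:

```python
import torch

# Matrix-vector multiplication: the single most-used operation in this book.
W = torch.tensor([[1., 0.],
                  [0., 2.]])        # a 2x2 weight matrix
x = torch.tensor([3., 4.])          # a 2-dim vector
y = W @ x                           # -> tensor([3., 8.])

# Dot product: perpendicular vectors have dot product 0 (cos(90°) = 0).
a = torch.tensor([1., 0.])
b = torch.tensor([0., 1.])
print(y, a @ b)                     # tensor([3., 8.]) tensor(0.)
```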
## 0.5 Key Notation
Quick reference for notation used throughout the series:
| Symbol | Meaning | Example |
|---|---|---|
| \(x\) | Activation vector (residual stream state) | \(x \in \mathbb{R}^{768}\) |
| \(W\) | Weight matrix | \(W_Q\), \(W_K\), \(W_V\) for attention |
| \(W_E\) | Embedding matrix (tokens → vectors) | Maps “Paris” to a 768-dim vector |
| \(W_U\) | Unembedding matrix (vectors → logits) | Projects residual stream to vocabulary |
| \(d\) | Model dimension (hidden size) | GPT-2 Small: \(d = 768\) |
| \(n\) | Number of features | Often \(n \gg d\) due to superposition |
| \(L\) | Layer index | \(x^{(L)}\) = residual stream at layer \(L\) |
| \(h\) | Attention head index | Head \(h\) in layer \(L\) |
| \(\text{softmax}\) | Converts scores to probabilities | \(\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\) |
Common terms explained:
- Logits: Raw (unnormalized) scores before softmax. Higher logit = model thinks token is more likely.
- Embedding: Converting discrete tokens into continuous vectors the model can process.
- Unembedding: The reverse—projecting internal vectors back to vocabulary-sized predictions.
- Hook: (TransformerLens) A callback that lets you read or modify activations during a forward pass.
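To make the notation concrete, here is a short sketch of how these symbols surface in TransformerLens (a sketch assuming GPT-2 Small; `blocks.5.hook_resid_pre` is the standard TransformerLens hook name for the residual stream entering layer 5):

```python
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

# W_E maps tokens to vectors; W_U maps vectors back to vocabulary logits.
print(model.W_E.shape)   # [vocab, d] = [50257, 768]
print(model.W_U.shape)   # [d, vocab] = [768, 50257]

# Logits are raw scores; softmax turns them into probabilities.
logits = model("The capital of France is")[0, -1]   # shape: [50257]
probs = torch.softmax(logits, dim=-1)
print(model.to_string(probs.argmax()))              # most likely next token

# Hook: read an activation during the forward pass. Here we grab the
# residual stream x^(L) entering layer L = 5 (shape: [batch, seq, 768]).
def save_resid(activation, hook):
    print(hook.name, activation.shape)

model.run_with_hooks(
    "The capital of France is",
    fwd_hooks=[("blocks.5.hook_resid_pre", save_resid)],
)
```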
## 0.6 The Approach
We use Pólya’s problem-solving framework throughout: understand the problem before devising solutions, verify your understanding through intervention, and always ask “what would make this explanation wrong?”
We also bring a performance engineering mindset: you can’t optimize what you don’t understand, measure before you interpret, and never trust unvalidated claims.
## 0.7 Getting Started
Option 1: Start with theory — Begin with Chapter 1: Why Reverse Engineer Neural Networks? to understand the motivation and scope of the project.
Option 2: Start with code — Jump to Environment Setup and Your First Analysis to get hands-on experience immediately.
## 0.8 Also Published On
This series is also available on Software Bits on Substack.