First Principles of Mechanistic Interpretability

Author

Taras Tsugrii

Published

January 14, 2026

Welcome

This is a first-principles exploration of mechanistic interpretability — the project of reverse engineering neural networks to understand how they work, not just that they work.

0.1 What This Book Is

A 15-chapter journey from foundations to practice:

  • Arc I: Foundations (Chapters 1-4) — What are we trying to understand? The transformer architecture, residual stream, and geometric structure of representations.
  • Arc II: Core Theory (Chapters 5-8) — Features, superposition, toy models, and circuits. The conceptual framework for interpretation.
  • Arc III: Techniques (Chapters 9-12) — Sparse autoencoders, attribution, activation patching, and ablation. The tools of the trade.
  • Arc IV: Synthesis (Chapters 13-15) — Induction heads as a complete case study, open problems in the field, and a practical guide to doing research.

0.2 How the Chapters Connect

The diagram below shows how concepts build on each other across the four arcs:

flowchart TB
    subgraph ARC1["Arc I: Foundations"]
        A1["Ch 1: Why Reverse Engineer?"]
        A2["Ch 2: Transformers"]
        A3["Ch 3: Residual Stream"]
        A4["Ch 4: Geometry"]
    end

    subgraph ARC2["Arc II: Core Theory"]
        A5["Ch 5: Features"]
        A6["Ch 6: Superposition"]
        A7["Ch 7: Toy Models"]
        A8["Ch 8: Circuits"]
    end

    subgraph ARC3["Arc III: Techniques"]
        A9["Ch 9: SAEs"]
        A10["Ch 10: Attribution"]
        A11["Ch 11: Patching"]
        A12["Ch 12: Ablation"]
    end

    subgraph ARC4["Arc IV: Synthesis"]
        A13["Ch 13: Induction Heads"]
        A14["Ch 14: Open Problems"]
        A15["Ch 15: Practical Guide"]
    end

    %% Arc I flow
    A1 --> A2
    A2 --> A3
    A3 --> A4

    %% Arc I to Arc II
    A4 --> A5
    A3 --> A5

    %% Arc II flow
    A5 --> A6
    A6 --> A7
    A5 --> A8
    A7 --> A8

    %% Arc II to Arc III
    A6 --> A9
    A3 --> A10
    A8 --> A11
    A8 --> A12

    %% Arc III flow
    A9 --> A10
    A10 --> A11
    A11 --> A12

    %% All techniques feed into synthesis
    A12 --> A13
    A9 --> A13
    A8 --> A13

    %% Synthesis flow
    A13 --> A14
    A14 --> A15

    %% Styling
    style ARC1 fill:#e3f2fd,stroke:#1976d2
    style ARC2 fill:#f3e5f5,stroke:#7b1fa2
    style ARC3 fill:#e8f5e9,stroke:#388e3c
    style ARC4 fill:#fff3e0,stroke:#f57c00

Concept map showing how chapters build on each other. Arrows indicate conceptual dependencies.

Tip: Reading Paths

Choose based on your background and goals:

Complete Learning Journey (5 hours)
Chapters 1→15 in order. Best for deep understanding.
ML Practitioner Fast Track (3 hours)
Skip to Ch 5→8 (theory), then Ch 9→12 (techniques), then Ch 13 (case study).
Assumes: You know transformers, attention, MLPs.
Safety-Focused Path (2.5 hours)
Ch 1 (motivation) → Ch 5-6 (features, superposition) → Ch 9 (SAEs) → Ch 14 (open problems)
Goal: Understand what interpretability can and can’t do for AI safety.
Hands-On Researcher (4 hours)
Setup → First Analysis → Ch 9-12 (techniques) → Ch 15 (practice) → Exercises
Goal: Start doing interpretability research quickly.
Reference User
Use Quick Reference as a cheat sheet, Zoo of Circuits for known mechanisms, Running Example to see all techniques applied to one behavior.
Important: Want to Start Coding? (30 minutes)

If you prefer learning by doing:

  1. Environment Setup (5 min) — Get TransformerLens running in Colab
  2. Your First Analysis (25 min) — Complete walkthrough analyzing a real behavior

You’ll have hands-on experience with attribution and patching before reading any theory.

Tip: Want a Quick Win? (10 minutes)

Want to see what interpretability reveals before diving in?

  1. Explore real features (5 min): Visit Neuronpedia and click “Random Feature.” Look at the max-activating examples. Can you guess what concept the feature represents? Try 3-4 features.

  2. See attention patterns (5 min): Visit the Transformer Explainer to visualize how attention moves information between tokens.

You’ve now seen the two core phenomena this book explains: features (what networks represent) and attention (how they move information).

Note: Estimated Reading Times
| Arc | Chapters | Total Reading Time |
|---|---|---|
| Arc I: Foundations | 1-4 + Summary | ~1 hour |
| Arc II: Core Theory | 5-8 + Summary | ~1.5 hours |
| Arc III: Techniques | 9-12 + Summary | ~1.5 hours |
| Arc IV: Synthesis | 13-15 | ~1 hour |
| Total | 15 chapters + 3 summaries | ~5 hours |

Each chapter is 12-22 minutes. Take breaks between arcs—the summaries are designed as natural stopping points.

0.3 Who This Is For

  • Software engineers curious about ML internals
  • ML practitioners who want deeper understanding
  • Performance engineers interested in AI
  • Anyone who values understanding why over just how

0.4 Background Assumed

This book is designed to be accessible, but some background helps:

Note: Mathematical Background
  • Linear algebra basics: Vectors, matrices, matrix multiplication, dot products
  • High school math: Exponentials, logarithms, basic trigonometry
  • No calculus required: Gradient descent is explained conceptually

If you can multiply a matrix by a vector and know that cos(90°) = 0, you have enough math.

Note: Programming Background
  • Python familiarity: Code examples use Python and PyTorch
  • No ML experience required: We explain transformers from scratch
  • Helpful but optional: Prior exposure to neural networks accelerates Arc I

0.5 Key Notation

Quick reference for notation used throughout the series:

| Symbol | Meaning | Example |
|---|---|---|
| \(x\) | Activation vector (residual stream state) | \(x \in \mathbb{R}^{768}\) |
| \(W\) | Weight matrix | \(W_Q\), \(W_K\), \(W_V\) for attention |
| \(W_E\) | Embedding matrix (tokens → vectors) | Maps “Paris” to a 768-dim vector |
| \(W_U\) | Unembedding matrix (vectors → logits) | Projects residual stream to vocabulary |
| \(d\) | Model dimension (hidden size) | GPT-2 Small: \(d = 768\) |
| \(n\) | Number of features | Often \(n \gg d\) due to superposition |
| \(L\) | Layer index | \(x^{(L)}\) = residual stream at layer \(L\) |
| \(h\) | Attention head index | Head \(h\) in layer \(L\) |
| \(\text{softmax}\) | Converts scores to probabilities | \(\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\) |
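To make the notation concrete, here is a toy sketch of the embed → unembed → softmax pipeline in plain NumPy. The shapes and weights are invented for illustration (a real model interposes attention and MLP layers between \(W_E\) and \(W_U\), and GPT-2 Small uses \(d = 768\) with a ~50k-token vocabulary):

```python
import numpy as np

vocab_size, d = 10, 4
rng = np.random.default_rng(0)

W_E = rng.normal(size=(vocab_size, d))   # embedding: tokens -> vectors
W_U = rng.normal(size=(d, vocab_size))   # unembedding: vectors -> logits

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

token_id = 3
x = W_E[token_id]                        # residual stream state x in R^d
logits = x @ W_U                         # raw (unnormalized) scores over the vocabulary
probs = softmax(logits)                  # next-token distribution; sums to 1
```

Reading a row of \(W_E\) gives the embedding of a token; multiplying by \(W_U\) projects back to vocabulary space, exactly as in the table above.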

Common terms explained:

  • Logits: Raw (unnormalized) scores before softmax. Higher logit = model thinks token is more likely.
  • Embedding: Converting discrete tokens into continuous vectors the model can process.
  • Unembedding: The reverse—projecting internal vectors back to vocabulary-sized predictions.
  • Hook: (TransformerLens) A callback that lets you read or modify activations during a forward pass.
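The hook idea can be sketched in a few lines of plain Python. This is a hypothetical micro-framework for illustration, not the actual TransformerLens API (which registers hooks by activation name, e.g. `blocks.0.hook_resid_post`):

```python
def forward(x, layers, hooks=None):
    """Run x through named layers, invoking any hook registered for a layer."""
    hooks = hooks or {}
    for name, layer in layers:
        x = layer(x)
        if name in hooks:
            result = hooks[name](x)      # a hook may return a modified activation
            if result is not None:
                x = result               # intervention: overwrite the activation
    return x

# Two toy "layers" standing in for embedding and an MLP.
layers = [("embed", lambda x: x + 1), ("mlp0", lambda x: x * 2)]

captured = {}
def cache_hook(act):
    captured["mlp0"] = act               # read-only hook: record, don't modify

out = forward(3, layers, hooks={"mlp0": cache_hook})
# captured["mlp0"] == 8 and out == 8: the hook observed the activation in flight
```

Returning a value from the hook instead of `None` would patch the activation mid-forward-pass, which is the basis of the intervention techniques in Arc III.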

0.6 The Approach

We use Polya’s problem-solving framework throughout: understand the problem before devising solutions, verify your understanding through intervention, and always ask “what would make this explanation wrong?”

We also bring a performance engineering mindset: you can’t optimize what you don’t understand, measure before you interpret, and never trust unvalidated claims.

0.7 Getting Started

Option 1: Start with theory — Begin with Chapter 1: Why Reverse Engineer Neural Networks? to understand the motivation and scope of the project.

Option 2: Start with code — Jump to Environment Setup and Your First Analysis to get hands-on experience immediately.

0.8 Also Published On

This series is also available on Software Bits on Substack.


0.9 More from the Author

0.9.1 The First Principles Trilogy

This book is part of a series teaching ML fundamentals from first principles:

📘 Building LLMs from First Principles Learn how transformers work by building them from scratch—full math derivations, working code, and comprehensive test suites. From Markov chains to GPT.

🔬 Mechanistic Interpretability from First Principles (You are here) Reverse-engineer neural networks to understand their internal algorithms. Features, superposition, circuits, and sparse autoencoders explained from the ground up.

The Algebra of Speed Mathematical foundations of computational performance. Why FlashAttention, LoRA, and quantization work—and how to recognize when similar optimizations apply to your problems.

0.9.2 Blog

✍️ Software Bits — Short, focused essays on performance, ML, and computer science fundamentals. Subscribe for updates.

💻 GitHub: perf-bits — Blog posts with full code and interactive demos.


Last updated: January 2025