First Principles of Mechanistic Interpretability

Author

Taras Tsugrii

Published

January 14, 2026

Welcome

This is a first-principles exploration of mechanistic interpretability — the project of reverse engineering neural networks to understand how they work, not just that they work.

0.1 What This Book Is

A 15-chapter journey from foundations to practice:

  • Arc I: Foundations (Chapters 1-4) — What are we trying to understand? The transformer architecture, residual stream, and geometric structure of representations.
  • Arc II: Core Theory (Chapters 5-8) — Features, superposition, toy models, and circuits. The conceptual framework for interpretation.
  • Arc III: Techniques (Chapters 9-12) — Sparse autoencoders, attribution, activation patching, and ablation. The tools of the trade.
  • Arc IV: Synthesis (Chapters 13-15) — Induction heads as a complete case study, open problems in the field, and a practical guide to doing research.

0.2 How the Chapters Connect

The diagram below shows how concepts build on each other across the four arcs:

flowchart TB
    subgraph ARC1["Arc I: Foundations"]
        A1["Ch 1: Why Reverse Engineer?"]
        A2["Ch 2: Transformers"]
        A3["Ch 3: Residual Stream"]
        A4["Ch 4: Geometry"]
    end

    subgraph ARC2["Arc II: Core Theory"]
        A5["Ch 5: Features"]
        A6["Ch 6: Superposition"]
        A7["Ch 7: Toy Models"]
        A8["Ch 8: Circuits"]
    end

    subgraph ARC3["Arc III: Techniques"]
        A9["Ch 9: SAEs"]
        A10["Ch 10: Attribution"]
        A11["Ch 11: Patching"]
        A12["Ch 12: Ablation"]
    end

    subgraph ARC4["Arc IV: Synthesis"]
        A13["Ch 13: Induction Heads"]
        A14["Ch 14: Open Problems"]
        A15["Ch 15: Practical Guide"]
    end

    %% Arc I flow
    A1 --> A2
    A2 --> A3
    A3 --> A4

    %% Arc I to Arc II
    A4 --> A5
    A3 --> A5

    %% Arc II flow
    A5 --> A6
    A6 --> A7
    A5 --> A8
    A7 --> A8

    %% Arc II to Arc III
    A6 --> A9
    A3 --> A10
    A8 --> A11
    A8 --> A12

    %% Arc III flow
    A9 --> A10
    A10 --> A11
    A11 --> A12

    %% All techniques feed into synthesis
    A12 --> A13
    A9 --> A13
    A8 --> A13

    %% Synthesis flow
    A13 --> A14
    A14 --> A15

    %% Styling
    style ARC1 fill:#e3f2fd,stroke:#1976d2
    style ARC2 fill:#f3e5f5,stroke:#7b1fa2
    style ARC3 fill:#e8f5e9,stroke:#388e3c
    style ARC4 fill:#fff3e0,stroke:#f57c00

Concept map showing how chapters build on each other. Arrows indicate conceptual dependencies.

Tip: Reading Paths

Choose based on your background and goals:

Complete Learning Journey (5 hours)
Chapters 1→15 in order. Best for deep understanding.
ML Practitioner Fast Track (3 hours)
Skip to Ch 5→8 (theory), then Ch 9→12 (techniques), then Ch 13 (case study).
Assumes: You know transformers, attention, MLPs.
Safety-Focused Path (2.5 hours)
Ch 1 (motivation) → Ch 5-6 (features, superposition) → Ch 9 (SAEs) → Ch 14 (open problems)
Goal: Understand what interpretability can and can’t do for AI safety.
Hands-On Researcher (4 hours)
Setup → First Analysis → Ch 9-12 (techniques) → Ch 15 (practice) → Exercises
Goal: Start doing interpretability research quickly.
Reference User
Use Quick Reference as a cheat sheet, Zoo of Circuits for known mechanisms, Running Example to see all techniques applied to one behavior.
Important: Want to Start Coding? (30 minutes)

If you prefer learning by doing:

  1. Environment Setup (5 min) — Get TransformerLens running in Colab
  2. Your First Analysis (25 min) — Complete walkthrough analyzing a real behavior

You’ll have hands-on experience with attribution and patching before reading any theory.

Tip: Want a Quick Win? (10 minutes)

Want to see what interpretability reveals before diving in?

  1. Explore real features (5 min): Visit Neuronpedia and click “Random Feature.” Look at the max-activating examples. Can you guess what concept the feature represents? Try 3-4 features.

  2. See attention patterns (5 min): Visit the Transformer Explainer to visualize how attention moves information between tokens.

You’ve now seen the two core phenomena this book explains: features (what networks represent) and attention (how they move information).

Note: Estimated Reading Times
| Arc | Chapters | Total Reading Time |
|---|---|---|
| Arc I: Foundations | 1-4 + Summary | ~1 hour |
| Arc II: Core Theory | 5-8 + Summary | ~1.5 hours |
| Arc III: Techniques | 9-12 + Summary | ~1.5 hours |
| Arc IV: Synthesis | 13-15 | ~1 hour |
| Total | 15 chapters + 3 summaries | ~5 hours |

Each chapter is 12-22 minutes. Take breaks between arcs—the summaries are designed as natural stopping points.

0.3 Who This Is For

  • Software engineers curious about ML internals
  • ML practitioners who want deeper understanding
  • Performance engineers interested in AI
  • Anyone who values understanding why over just how

0.4 Background Assumed

This book is designed to be accessible, but some background helps:

Note: Mathematical Background
  • Linear algebra basics: Vectors, matrices, matrix multiplication, dot products
  • High school math: Exponentials, logarithms, basic trigonometry
  • No calculus required: Gradient descent is explained conceptually

If you can multiply a matrix by a vector and know that cos(90°) = 0, you have enough math.

Note: Programming Background
  • Python familiarity: Code examples use Python and PyTorch
  • No ML experience required: We explain transformers from scratch
  • Helpful but optional: Prior exposure to neural networks accelerates Arc I

0.5 Key Notation

Quick reference for notation used throughout the series:

| Symbol | Meaning | Example |
|---|---|---|
| \(x\) | Activation vector (residual stream state) | \(x \in \mathbb{R}^{768}\) |
| \(W\) | Weight matrix | \(W_Q\), \(W_K\), \(W_V\) for attention |
| \(W_E\) | Embedding matrix (tokens → vectors) | Maps “Paris” to a 768-dim vector |
| \(W_U\) | Unembedding matrix (vectors → logits) | Projects residual stream to vocabulary |
| \(d\) | Model dimension (hidden size) | GPT-2 Small: \(d = 768\) |
| \(n\) | Number of features | Often \(n \gg d\) due to superposition |
| \(L\) | Layer index | \(x^{(L)}\) = residual stream at layer \(L\) |
| \(h\) | Attention head index | Head \(h\) in layer \(L\) |
| \(\text{softmax}\) | Converts scores to probabilities | \(\text{softmax}(z)_i = \frac{e^{z_i}}{\sum_j e^{z_j}}\) |
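To make the notation concrete, here is a toy sketch of the embed → unembed → softmax pipeline in plain NumPy. The shapes and weights are invented for illustration (a real model interposes attention and MLP layers between \(W_E\) and \(W_U\), and GPT-2 Small uses \(d = 768\) with a ~50k-token vocabulary):

```python
import numpy as np

vocab_size, d = 10, 4
rng = np.random.default_rng(0)

W_E = rng.normal(size=(vocab_size, d))   # embedding: tokens -> vectors
W_U = rng.normal(size=(d, vocab_size))   # unembedding: vectors -> logits

def softmax(z):
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

token_id = 3
x = W_E[token_id]                        # residual stream state x in R^d
logits = x @ W_U                         # raw (unnormalized) scores over the vocabulary
probs = softmax(logits)                  # next-token distribution; sums to 1
```

Reading a row of \(W_E\) gives the embedding of a token; multiplying by \(W_U\) projects back to vocabulary space, exactly as in the table above.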

Common terms explained:

  • Logits: Raw (unnormalized) scores before softmax. Higher logit = model thinks token is more likely.
  • Embedding: Converting discrete tokens into continuous vectors the model can process.
  • Unembedding: The reverse—projecting internal vectors back to vocabulary-sized predictions.
  • Hook: (TransformerLens) A callback that lets you read or modify activations during a forward pass.
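The hook idea can be sketched in a few lines of plain Python. This is a hypothetical micro-framework for illustration, not the actual TransformerLens API (which registers hooks by activation name, e.g. `blocks.0.hook_resid_post`):

```python
def forward(x, layers, hooks=None):
    """Run x through named layers, invoking any hook registered for a layer."""
    hooks = hooks or {}
    for name, layer in layers:
        x = layer(x)
        if name in hooks:
            result = hooks[name](x)      # a hook may return a modified activation
            if result is not None:
                x = result               # intervention: overwrite the activation
    return x

# Two toy "layers" standing in for embedding and an MLP.
layers = [("embed", lambda x: x + 1), ("mlp0", lambda x: x * 2)]

captured = {}
def cache_hook(act):
    captured["mlp0"] = act               # read-only hook: record, don't modify

out = forward(3, layers, hooks={"mlp0": cache_hook})
# captured["mlp0"] == 8 and out == 8: the hook observed the activation in flight
```

Returning a value from the hook instead of `None` would patch the activation mid-forward-pass, which is the basis of the intervention techniques in Arc III.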

0.6 The Approach

We use Polya’s problem-solving framework throughout: understand the problem before devising solutions, verify your understanding through intervention, and always ask “what would make this explanation wrong?”

We also bring a performance engineering mindset: you can’t optimize what you don’t understand, measure before you interpret, and never trust unvalidated claims.

0.7 Getting Started

Option 1: Start with theory — Begin with Chapter 1: Why Reverse Engineer Neural Networks? to understand the motivation and scope of the project.

Option 2: Start with code — Jump to Environment Setup and Your First Analysis to get hands-on experience immediately.

0.8 Also Published On

This series is also available on Software Bits on Substack.


0.9 More from the Author

0.9.1 The First Principles Trilogy

This book is part of a series teaching ML fundamentals from first principles:

📘 Building LLMs from First Principles Learn how transformers work by building them from scratch—full math derivations, working code, and comprehensive test suites. From Markov chains to GPT.

🔬 Mechanistic Interpretability from First Principles (You are here) Reverse-engineer neural networks to understand their internal algorithms. Features, superposition, circuits, and sparse autoencoders explained from the ground up.

The Algebra of Speed Mathematical foundations of computational performance. Why FlashAttention, LoRA, and quantization work—and how to recognize when similar optimizations apply to your problems.

0.9.2 Blog

✍️ Software Bits — Short, focused essays on performance, ML, and computer science fundamentals. Subscribe for updates.

💻 GitHub: perf-bits — Blog posts with full code and interactive demos.


Last updated: January 2025