30  Glossary

Key terms in mechanistic interpretability

A reference guide to the core terminology used throughout this book. Terms are organized by the arc where they’re introduced.

30.1 Arc I: Foundations

Transformer
A neural network architecture that processes sequences using attention mechanisms and MLPs. The dominant architecture for modern language models. (Chapter 2)
Attention
A mechanism that routes information between positions in a sequence. Each position computes query, key, and value vectors; attention weights are determined by query-key similarity. (Chapter 2)
Attention Pattern
The matrix of attention weights showing how much each position attends to each other position. Visualized as a heatmap where row i, column j shows how much position i attends to position j. (Chapter 2)
Query, Key, Value (Q, K, V)
The three projections computed for each token in attention. Queries ask “what am I looking for?”, keys say “what do I contain?”, and values say “what information should I provide?”. Attention weights = softmax(QKᵀ / √d_k), where d_k is the key (head) dimension. (Chapter 2)
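A minimal PyTorch sketch of this computation for a single head, using random placeholder tensors and omitting the causal mask:

```python
import torch

def attention_pattern(Q, K):
    """Scaled dot-product attention weights: softmax(Q Kᵀ / √d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq, seq) query-key similarities
    return torch.softmax(scores, dim=-1)           # each row sums to 1
    # (causal masking omitted for brevity)

seq_len, d_head = 4, 8
Q, K, V = (torch.randn(seq_len, d_head) for _ in range(3))
pattern = attention_pattern(Q, K)  # row i: how position i attends to each position j
output = pattern @ V               # weighted mix of value vectors
```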
Softmax
A function that converts a vector of arbitrary scores into a probability distribution. For input z, softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). Makes all values positive and sum to 1. (Chapter 2)
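A direct translation of the formula (the numerically stable variant subtracts the maximum before exponentiating), checked against PyTorch's built-in:

```python
import torch

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = torch.exp(z - z.max())
    return e / e.sum()

z = torch.tensor([2.0, 1.0, 0.1])
print(softmax(z))               # tensor([0.6590, 0.2424, 0.0986]), sums to 1
print(torch.softmax(z, dim=0))  # same result via the built-in
```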
Logits
Raw, unnormalized scores before applying softmax. In language models, logits are the model’s “confidence” for each vocabulary token before normalization to probabilities. Higher logit = more likely prediction. (Chapter 2)
Embedding
Converting discrete tokens (words, subwords) into continuous vectors the model can process. The embedding matrix W_E maps each token ID to a d-dimensional vector. (Chapter 2)
Unembedding
The reverse of embedding—projecting internal vectors back to vocabulary-sized predictions. The unembedding matrix W_U converts the final residual stream into logits over the vocabulary. (Chapter 2)
Layer Normalization
A technique that normalizes activations to have zero mean and unit variance. Stabilizes training and appears before attention and MLP blocks in most transformers. (Chapter 2)
MLP (Multi-Layer Perceptron)
The feedforward component of a transformer layer. Applies learned nonlinear transformations to each position independently. Often hypothesized to store factual knowledge. (Chapter 2)
Residual Stream
The vector that flows through a transformer, accumulating contributions from each attention head and MLP. All components read from and write to this shared workspace. (Chapter 3)
Logit Lens
A technique for reading intermediate predictions by projecting the residual stream at any layer to vocabulary space using the unembedding matrix. Shows how predictions refine through layers. (Chapter 3)
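A minimal sketch of the projection step with placeholder tensors; in practice resid would come from a cached forward pass and W_U from the model (e.g. model.W_U in TransformerLens):

```python
import torch

# Placeholder tensors standing in for real model components: resid is the
# residual stream at some intermediate layer (one position), W_U is the
# unembedding matrix, and ln_final is the model's final layer norm.
d_model, d_vocab = 768, 50257
resid = torch.randn(d_model)
W_U = torch.randn(d_model, d_vocab)
ln_final = torch.nn.LayerNorm(d_model)

intermediate_logits = ln_final(resid) @ W_U        # project to vocabulary space
top_guesses = intermediate_logits.topk(5).indices  # the model's "guess so far" at this layer
```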
Linear Representation Hypothesis
The hypothesis that semantic properties are encoded as linear directions in activation space, enabling simple extraction and manipulation of features. (Chapter 4)
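One common way to operationalize this, sketched with random placeholder activations: estimate a concept direction as a difference of mean activations, then read the property off a new activation by projection.

```python
import torch

# Toy illustration: if a property is encoded as a direction, the difference of
# mean activations between inputs with and without the property estimates it.
acts_with = torch.randn(100, 768)     # activations on inputs that have the property
acts_without = torch.randn(100, 768)  # activations on inputs that lack it
direction = acts_with.mean(0) - acts_without.mean(0)

new_act = torch.randn(768)
score = new_act @ direction           # projection reads the property off a new activation
```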
Cosine Similarity
A measure of the angle between two vectors, ranging from -1 (opposite) to 1 (same direction). Used to quantify how related two representations are. (Chapter 4)
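In code, with placeholder vectors:

```python
import torch

def cosine_similarity(a, b):
    """cos of the angle between a and b: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return torch.dot(a, b) / (a.norm() * b.norm())

u, v = torch.randn(768), torch.randn(768)  # placeholder representation vectors
print(cosine_similarity(u, v))             # random high-dimensional vectors land near 0
```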

30.2 Arc II: Core Theory

Feature
A property of the input that a network represents internally. Features are directions in activation space, not individual neurons. The fundamental unit of meaning in neural representations. (Chapter 5)
Monosemantic
A neuron or feature that corresponds to a single, interpretable concept. The ideal case for interpretation. (Chapter 5)
Polysemantic
A neuron that responds to multiple unrelated concepts (e.g., “cat faces” AND “car fronts”). The common case in trained networks, caused by superposition. (Chapter 5)
Superposition
The phenomenon where networks pack more features than dimensions by using almost-orthogonal directions. Enables efficient representation of sparse features but creates polysemanticity. (Chapter 6)
Sparsity
The property that most features activate rarely (e.g., <1% of inputs). Sparsity enables superposition because features rarely co-occur and interfere. (Chapter 6)
Phase Transition
The sharp change in representation strategy at a critical sparsity threshold. Networks switch from dedicated dimensions to heavy superposition as sparsity increases. (Chapter 6)
Toy Model
A simplified network designed to exhibit superposition in a controlled, fully-analyzable setting. Typically an autoencoder compressing n features into d < n dimensions. (Chapter 7)
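A minimal sketch of this setup in PyTorch, in the style of the toy-models-of-superposition experiments; the feature count, hidden size, and sparsity level are arbitrary choices:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Tied-weight autoencoder that must squeeze n sparse features into d < n dims."""
    def __init__(self, n_features=20, d_hidden=5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                         # x: (batch, n_features), mostly zeros
        h = x @ self.W                            # compress into d_hidden dimensions
        return torch.relu(h @ self.W.T + self.b)  # reconstruct with tied weights

# Sparse synthetic data: each feature is active (uniform in [0, 1]) with probability 0.05.
x = torch.rand(256, 20) * (torch.rand(256, 20) < 0.05)
model = ToyModel()
loss = (model(x) - x).pow(2).mean()  # train to reconstruct despite the bottleneck
```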
Circuit
A subnetwork that performs an identifiable computation by connecting features through learned weights. The “molecules” built from feature “atoms.” (Chapter 8)
K-Composition
When one attention head’s output modifies another head’s keys, changing what the second head attends to. The mechanism underlying induction head circuits. (Chapter 8)
Q-Composition
When one attention head’s output modifies another head’s queries, changing what the second head searches for. (Chapter 8)
V-Composition
When one attention head’s output modifies another head’s values, changing what information the second head copies. (Chapter 8)

30.3 Arc III: Techniques

Sparse Autoencoder (SAE)
A neural network trained to decompose activations into a sparse set of interpretable features. The primary tool for extracting monosemantic features from superposed representations. (Chapter 9)
Reconstruction Loss
The component of SAE training that encourages accurate reconstruction of the original activation. Trades off against sparsity loss. (Chapter 9)
Sparsity Loss (L1)
The component of SAE training that encourages most latent features to be zero. Higher sparsity penalty yields fewer active features per input. (Chapter 9)
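Putting the two terms together, a minimal sketch of a vanilla SAE and its training objective; the dictionary size and l1_coeff value are illustrative, and real training recipes (e.g. in SAELens) add further details:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into a wide, sparse latent space."""
    def __init__(self, d_model=768, d_sae=16384):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, acts):
        f = torch.relu((acts - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        recon = f @ self.W_dec + self.b_dec                            # reconstruction
        return recon, f

sae = SparseAutoencoder()
acts = torch.randn(32, 768)  # a batch of residual-stream activations
recon, f = sae(acts)
l1_coeff = 1e-3              # sparsity penalty strength (illustrative value)
loss = (recon - acts).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()
```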
Attribution
The technique of decomposing an output into per-component contributions. Measures how much each attention head and MLP pushed toward a particular prediction. (Chapter 10)
Logit Attribution
Measuring each component’s contribution to the logit of a specific output token by projecting its output onto the unembedding direction. (Chapter 10)
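A sketch of the core computation with placeholder tensors; in practice each row of component_outputs would be a head's or MLP's write to the residual stream at the final position, and the final LayerNorm scale needs to be folded in:

```python
import torch

d_model, d_vocab, n_components = 768, 50257, 50
component_outputs = torch.randn(n_components, d_model)  # per-component residual-stream writes
W_U = torch.randn(d_model, d_vocab)                     # unembedding matrix

target_token = 1234                                  # vocab index of the prediction of interest
logit_direction = W_U[:, target_token]               # unembedding column for that token
attributions = component_outputs @ logit_direction   # each component's contribution to the logit
```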
Activation Patching
A causal intervention technique that replaces a component’s activation from one input with its activation from a different input, measuring the effect on output. (Chapter 11)
Clean/Corrupted Paradigm
The standard patching setup using two inputs: a “clean” input producing the target behavior and a “corrupted” input producing different behavior. (Chapter 11)
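A minimal sketch of a clean-to-corrupted (denoising) patch, assuming the TransformerLens hook API (Chapter 15); the prompts and patched layer are illustrative:

```python
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

# Run the corrupted prompt, but overwrite one layer's residual stream with its
# value from the clean run, and see how much of the clean behavior returns.
model = HookedTransformer.from_pretrained("gpt2")
clean_tokens = model.to_tokens("The capital of France is")
corrupt_tokens = model.to_tokens("The capital of Italy is")

_, clean_cache = model.run_with_cache(clean_tokens)
hook_name = utils.get_act_name("resid_pre", 6)

def patch_resid(activation, hook):
    return clean_cache[hook_name]  # replace with the cached clean activation

patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)]
)
```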
Path Patching
Patching along specific computational paths (e.g., head A’s contribution to head B’s keys) to isolate information flow between components. (Chapter 11)
Ablation
Removing a component’s contribution entirely (setting to zero, mean, or resampled value) to test whether it’s necessary for a behavior. (Chapter 12)
Zero Ablation
Setting a component’s output to zero. Simple but can cause distribution shift. (Chapter 12)
Mean Ablation
Replacing a component’s output with its average value across a dataset. Reduces distribution shift but removes all input-dependent information. (Chapter 12)
Resample Ablation
Replacing a component’s output with its value from a random different input. Preserves marginal distribution but breaks input-specific computation. (Chapter 12)
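The three variants differ only in what the component's output is replaced with. A minimal sketch, where the tensor arguments are placeholders for values gathered elsewhere and the replacement would normally happen inside a forward hook:

```python
import torch

def ablate(component_output, mode, dataset_mean=None, resampled_output=None):
    """Replace a component's output tensor according to the chosen ablation variant."""
    if mode == "zero":
        return torch.zeros_like(component_output)
    if mode == "mean":
        # dataset_mean: e.g. shape (d_model,), averaged over a reference dataset
        return dataset_mean.expand_as(component_output)
    if mode == "resample":
        # resampled_output: the same component's output on a random different input
        return resampled_output
    raise ValueError(f"unknown ablation mode: {mode}")
```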
Steering
Modifying model behavior by adding or subtracting feature directions from the residual stream during inference. More targeted than prompting, more reversible than fine-tuning. (Chapter 9)
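A minimal sketch of the intervention itself; the feature direction would come from, e.g., an SAE decoder row or a difference-of-means probe, and the coefficient is a tuning knob:

```python
import torch

def steer(resid, feature_direction, coefficient=5.0):
    """Add a scaled, unit-norm feature direction to the residual stream."""
    return resid + coefficient * feature_direction / feature_direction.norm()

# In practice this runs inside a forward hook at a chosen layer during generation;
# a negative coefficient subtracts the feature instead of adding it.
```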
Feature Absorption
A failure mode where increasing SAE dictionary size causes common features to “absorb” related features, preventing them from activating. A fundamental challenge for hierarchical concepts. (Chapter 9)
Dead Features
SAE latent dimensions that never activate during training or inference. Represent wasted capacity; rates of 5-20% are common. (Chapter 9)
Out-of-Distribution (OOD)
Inputs that differ significantly from the training distribution. Interventions like patching and ablation can create OOD activations, potentially confounding results. (Chapter 11)

30.4 Arc IV: Synthesis

Induction Head
An attention head that implements in-context learning by finding previous occurrences of the current token and predicting what followed. Part of a two-layer circuit with a previous-token head. (Chapter 13)
Previous Token Head
An attention head that copies information about the previous token at each position. Enables induction heads to search for matching patterns. (Chapter 13)
In-Context Learning (ICL)
The ability to learn from examples in the prompt without weight updates. Enabled primarily by induction heads. (Chapter 13)
Grokking
The phenomenon where a network suddenly generalizes after extended training past the point of memorization. Often involves discovering efficient algorithms like induction. (Chapter 1)

30.5 Tools & Libraries

TransformerLens
The standard Python library for mechanistic interpretability, providing hooks for accessing intermediate activations and running interventions. (Chapter 15)
SAELens
A library for training and analyzing sparse autoencoders on transformer activations. (Chapter 15)
Neuronpedia
An interactive platform for exploring SAE features, visualizing their activating examples, and sharing interpretations. (Chapter 15)

Quick Reference

Logits: Raw scores before softmax; higher = more likely
Softmax: Converts scores to probabilities (sum to 1)
Residual Stream: The shared vector all components read from and write to
Feature: A direction in activation space encoding a semantic property
Superposition: Packing more features than dimensions using near-orthogonality
Polysemantic: A neuron responding to multiple unrelated concepts
Circuit: A subnetwork implementing an identifiable algorithm
SAE: Tool for decomposing polysemantic neurons into monosemantic features
Steering: Adding/subtracting feature directions to modify behavior
Attribution: Measuring per-component contributions to outputs
Patching: Causal intervention by swapping activations between inputs
Ablation: Testing necessity by removing component contributions
Induction Head: Attention head enabling in-context pattern completion
Out-of-Distribution: Inputs unlike training data; can confound interventions