30 Glossary
Key terms in mechanistic interpretability
A reference guide to the core terminology used throughout this book. Terms are organized by the arc where they’re introduced.
30.1 Arc I: Foundations
- Transformer
- A neural network architecture that processes sequences using attention mechanisms and MLPs. The dominant architecture for modern language models. (Chapter 2)
- Attention
- A mechanism that routes information between positions in a sequence. Each position computes query, key, and value vectors; attention weights are determined by query-key similarity. (Chapter 2)
- Attention Pattern
- The matrix of attention weights showing how much each position attends to each other position. Visualized as a heatmap where row i, column j shows how much position i attends to position j. (Chapter 2)
- Query, Key, Value (Q, K, V)
- The three projections computed for each token in attention. Queries ask “what am I looking for?”, keys say “what do I contain?”, values say “what information should I provide?”. Attention weights = softmax(Q Kᵀ / √dₖ), where dₖ is the key dimension. (Chapter 2; see the attention sketch at the end of this section.)
- Softmax
- A function that converts a vector of arbitrary scores into a probability distribution. For input z, softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). The outputs are all positive and sum to 1. (Chapter 2)
- Logits
- Raw, unnormalized scores before applying softmax. In language models, logits are the model’s “confidence” for each vocabulary token before normalization to probabilities. Higher logit = more likely prediction. (Chapter 2)
- Embedding
- Converting discrete tokens (words, subwords) into continuous vectors the model can process. The embedding matrix W_E maps each token ID to a d-dimensional vector. (Chapter 2)
- Unembedding
- The reverse of embedding—projecting internal vectors back to vocabulary-sized predictions. The unembedding matrix W_U converts the final residual stream into logits over the vocabulary. (Chapter 2)
- Layer Normalization
- A technique that normalizes activations to have zero mean and unit variance. Stabilizes training and appears before attention and MLP blocks in most transformers. (Chapter 2)
- MLP (Multi-Layer Perceptron)
- The feedforward component of a transformer layer. Applies learned nonlinear transformations to each position independently. Often hypothesized to store factual knowledge. (Chapter 2)
- Residual Stream
- The vector that flows through a transformer, accumulating contributions from each attention head and MLP. All components read from and write to this shared workspace. (Chapter 3)
- Logit Lens
- A technique for reading intermediate predictions by projecting the residual stream at any layer to vocabulary space using the unembedding matrix. Shows how predictions refine through layers. (Chapter 3; see the logit-lens sketch at the end of this section.)
- Linear Representation Hypothesis
- The hypothesis that semantic properties are encoded as linear directions in activation space, enabling simple extraction and manipulation of features. (Chapter 4)
- Cosine Similarity
- A measure of the angle between two vectors, ranging from -1 (opposite directions) through 0 (orthogonal) to 1 (same direction). Used to quantify how related two representations are. (Chapter 4; see the one-line sketch at the end of this section.)
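
The attention-weight and softmax formulas above can be made concrete in a few lines. A minimal PyTorch sketch for a single attention head, with random tensors standing in for learned weights (the sizes and variable names are illustrative, not taken from any particular model):

```python
import torch

torch.manual_seed(0)
seq_len, d_model, d_k = 5, 16, 8

x = torch.randn(seq_len, d_model)          # toy residual-stream inputs
W_Q = torch.randn(d_model, d_k)            # stand-in learned query projection
W_K = torch.randn(d_model, d_k)            # stand-in learned key projection

Q = x @ W_Q                                # "what am I looking for?"
K = x @ W_K                                # "what do I contain?"

scores = Q @ K.T / d_k ** 0.5              # query-key similarity, scaled by sqrt(d_k)

# Causal mask: position i may only attend to positions j <= i.
mask = torch.triu(torch.ones(seq_len, seq_len), diagonal=1).bool()
scores = scores.masked_fill(mask, float("-inf"))

# Softmax turns each row into a probability distribution: row i, column j
# is how much position i attends to position j (the attention pattern).
pattern = torch.softmax(scores, dim=-1)
print(pattern.sum(dim=-1))                 # every row sums to 1
```

The same softmax appears at the model's output, where it converts logits into next-token probabilities.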
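The logit lens is similarly compact. A sketch assuming TransformerLens's HookedTransformer API; the model name, cache keys, and attribute names follow that library's conventions as best I know them and should be checked against its current documentation:

```python
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")     # small model for illustration
tokens = model.to_tokens("The Eiffel Tower is in the city of")
logits, cache = model.run_with_cache(tokens)

# Project the residual stream at each layer through the final layer norm
# and the unembedding matrix W_U, then read off the top prediction.
for layer in range(model.cfg.n_layers):
    resid = cache["resid_post", layer][0, -1]         # final position of the prompt
    layer_logits = model.ln_final(resid) @ model.W_U
    top_token = model.tokenizer.decode(layer_logits.argmax().item())
    print(f"layer {layer:2d}: {top_token!r}")
```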
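Cosine similarity, used throughout to compare representation directions, is a one-liner:

```python
import torch

u, v = torch.randn(768), torch.randn(768)   # e.g. two residual-stream vectors
cos = torch.dot(u, v) / (u.norm() * v.norm())
# Equivalent: torch.nn.functional.cosine_similarity(u, v, dim=0)
```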
30.2 Arc II: Core Theory
- Feature
- A property of the input that a network represents internally. Features are directions in activation space, not individual neurons. The fundamental unit of meaning in neural representations. (Chapter 5)
- Monosemantic
- A neuron or feature that corresponds to a single, interpretable concept. The ideal case for interpretation. (Chapter 5)
- Polysemantic
- A neuron that responds to multiple unrelated concepts (e.g., “cat faces” AND “car fronts”). The common case in trained networks, caused by superposition. (Chapter 5)
- Superposition
- The phenomenon where networks pack more features than dimensions by using almost-orthogonal directions. Enables efficient representation of sparse features but creates polysemanticity. (Chapter 6)
- Sparsity
- The property that most features activate rarely (e.g., on <1% of inputs). Sparsity enables superposition because features rarely co-occur, so interference between them is rarely costly. (Chapter 6)
- Phase Transition
- The sharp change in representation strategy at a critical sparsity threshold. Networks switch from dedicated dimensions to heavy superposition as sparsity increases. (Chapter 6)
- Toy Model
- A simplified network designed to exhibit superposition in a controlled, fully analyzable setting. Typically an autoencoder compressing n features into d < n dimensions. (Chapter 7; see the code sketch at the end of this section.)
- Circuit
- A subnetwork that performs an identifiable computation by connecting features through learned weights. The “molecules” built from feature “atoms.” (Chapter 8)
- K-Composition
- When one attention head’s output modifies another head’s keys, changing what the second head attends to. The mechanism underlying induction head circuits. (Chapter 8)
- Q-Composition
- When one attention head’s output modifies another head’s queries, changing what the second head searches for. (Chapter 8)
- V-Composition
- When one attention head’s output modifies another head’s values, changing what information the second head copies. (Chapter 8)
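
The toy-model setup above can be reproduced directly. A minimal sketch in the spirit of the ReLU-output toy models of superposition; the dimensions, sparsity level, and training schedule are illustrative choices rather than canonical values:

```python
import torch

torch.manual_seed(0)
n_features, d_hidden, batch = 20, 5, 1024
sparsity = 0.95                              # each feature is zero on ~95% of inputs

W = torch.nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
b = torch.nn.Parameter(torch.zeros(n_features))
opt = torch.optim.Adam([W, b], lr=1e-3)

for step in range(5000):
    # Synthetic sparse features in [0, 1].
    x = torch.rand(batch, n_features)
    x = x * (torch.rand(batch, n_features) > sparsity)

    h = x @ W                                # compress: n_features -> d_hidden
    x_hat = torch.relu(h @ W.T + b)          # reconstruct with tied weights
    loss = ((x - x_hat) ** 2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()

# Off-diagonal structure in W @ W.T shows features sharing dimensions,
# i.e. superposition; rows of W with non-trivial norm count represented features.
print((W @ W.T).detach().round(decimals=2))
```

At high sparsity, more than d_hidden rows of W typically end up with substantial norm, packed at near-orthogonal angles; at low sparsity only about d_hidden features get represented at all.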
30.3 Arc III: Techniques
- Sparse Autoencoder (SAE)
- A neural network trained to decompose activations into a sparse set of interpretable features. The primary tool for extracting monosemantic features from superposed representations. (Chapter 9; see the SAE sketch at the end of this section.)
- Reconstruction Loss
- The component of SAE training that encourages accurate reconstruction of the original activation. Trades off against sparsity loss. (Chapter 9)
- Sparsity Loss (L1)
- The component of SAE training that encourages most latent features to be zero. Higher sparsity penalty yields fewer active features per input. (Chapter 9)
- Attribution
- The technique of decomposing an output into per-component contributions. Measures how much each attention head and MLP pushed toward a particular prediction. (Chapter 10)
- Logit Attribution
- Measuring each component’s contribution to the logit of a specific output token by projecting its output onto the unembedding direction. (Chapter 10; see the attribution sketch at the end of this section.)
- Activation Patching
- A causal intervention technique that replaces a component’s activation from one input with its activation from a different input, measuring the effect on output. (Chapter 11; see the patching sketch at the end of this section.)
- Clean/Corrupted Paradigm
- The standard patching setup using two inputs: a “clean” input producing the target behavior and a “corrupted” input producing different behavior. (Chapter 11)
- Path Patching
- Patching along specific computational paths (e.g., head A’s contribution to head B’s keys) to isolate information flow between components. (Chapter 11)
- Ablation
- Removing a component’s contribution entirely (setting it to zero, its mean, or a resampled value) to test whether it’s necessary for a behavior. (Chapter 12; see the intervention sketch at the end of this section.)
- Zero Ablation
- Setting a component’s output to zero. Simple but can cause distribution shift. (Chapter 12)
- Mean Ablation
- Replacing a component’s output with its average value across a dataset. Reduces distribution shift but removes all input-dependent information. (Chapter 12)
- Resample Ablation
- Replacing a component’s output with its value from a random different input. Preserves marginal distribution but breaks input-specific computation. (Chapter 12)
- Steering
- Modifying model behavior by adding or subtracting feature directions from the residual stream during inference. More targeted than prompting, more reversible than fine-tuning. (Chapter 9; see the intervention sketch at the end of this section.)
- Feature Absorption
- A failure mode where a general feature’s direction is absorbed into more specific latents (often as dictionary size grows), so the general latent fails to activate on inputs it should cover. A fundamental challenge for hierarchical concepts. (Chapter 9)
- Dead Features
- SAE latent dimensions that never activate during training or inference. Represent wasted capacity; rates of 5-20% are common. (Chapter 9)
- Out-of-Distribution (OOD)
- Inputs that differ significantly from the training distribution. Interventions like patching and ablation can create OOD activations, potentially confounding results. (Chapter 11)
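
The SAE entries above (reconstruction loss, L1 sparsity loss) boil down to a short forward pass. A minimal sketch with stand-in weight tensors; the expansion factor and L1 coefficient are illustrative, and real implementations (e.g. in SAELens) add details such as decoder-norm constraints and resampling of dead features:

```python
import torch
import torch.nn.functional as F

d_model, d_sae = 768, 768 * 8                # expansion factor of 8 (illustrative)
W_enc = torch.randn(d_model, d_sae) * 0.01   # stand-ins for learned SAE weights
W_dec = torch.randn(d_sae, d_model) * 0.01
b_enc = torch.zeros(d_sae)
b_dec = torch.zeros(d_model)
l1_coeff = 1e-3                              # higher -> sparser, worse reconstruction

def sae_loss(acts):
    """acts: [batch, d_model] residual-stream activations."""
    latents = F.relu((acts - b_dec) @ W_enc + b_enc)    # sparse feature activations
    recon = latents @ W_dec + b_dec                     # reconstruction of the input
    recon_loss = (acts - recon).pow(2).mean()           # reconstruction loss
    sparsity_loss = latents.abs().sum(dim=-1).mean()    # L1 sparsity loss
    return recon_loss + l1_coeff * sparsity_loss, latents

loss, latents = sae_loss(torch.randn(32, d_model))
print(loss.item(), (latents > 0).float().mean().item())  # loss, fraction of active latents
```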
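Direct logit attribution is a dot product: project what a component wrote to the residual stream onto the unembedding direction of a target token, or onto the difference between two candidate tokens' directions. A sketch with hypothetical tensors and token ids; in practice the final layer norm is also folded in before this projection:

```python
import torch

d_model, d_vocab = 768, 50257
W_U = torch.randn(d_model, d_vocab)          # stand-in unembedding matrix
component_out = torch.randn(d_model)         # what one head/MLP wrote at the final position
correct_tok, wrong_tok = 1234, 5678          # hypothetical token ids

# This component's contribution to the logit difference "correct - wrong".
logit_dir = W_U[:, correct_tok] - W_U[:, wrong_tok]
attribution = component_out @ logit_dir
print(attribution.item())
```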
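Activation patching in the clean/corrupted paradigm is a single forward hook. A sketch assuming TransformerLens's hook-name conventions (get_act_name, hook_resid_pre); treat the exact names and the IOI-style prompts as illustrative:

```python
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")
clean = model.to_tokens("When John and Mary went to the store, John gave a drink to")
corrupt = model.to_tokens("When John and Mary went to the store, Mary gave a drink to")

_, clean_cache = model.run_with_cache(clean)

def patch_resid_pre(resid, hook):
    # Overwrite this layer's residual stream in the corrupted run
    # with the value cached from the clean run.
    return clean_cache[hook.name]

layer = 6                                    # illustrative choice of layer
patched_logits = model.run_with_hooks(
    corrupt,
    fwd_hooks=[(get_act_name("resid_pre", layer), patch_resid_pre)],
)
# Compare patched_logits against the clean and corrupted baselines (e.g. via the
# logit difference for " John" vs " Mary") to see how much this layer carries.
```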
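Ablation and steering are mechanically the same move, a hook that rewrites an activation mid-forward-pass; only the rewrite differs. A sketch reusing the same assumed TransformerLens conventions; the layer, head, scale, and feature_dir (in practice an SAE decoder row or learned direction) are all hypothetical:

```python
import torch
from transformer_lens import HookedTransformer
from transformer_lens.utils import get_act_name

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The capital of France is")
layer, head = 5, 1                           # illustrative component choices

def zero_ablate_head(z, hook):
    # z: [batch, pos, n_heads, d_head]; silence one head's output entirely.
    z[:, :, head, :] = 0.0
    return z

def mean_ablate_head(z, hook):
    # Replace the head's output with its mean over this batch and all positions
    # (a dataset-wide mean is the more standard choice).
    z[:, :, head, :] = z[:, :, head, :].mean(dim=(0, 1), keepdim=True)
    return z

# Hypothetical feature direction to steer along, unit-normalized.
feature_dir = torch.randn(model.cfg.d_model, device=model.W_U.device)
feature_dir = feature_dir / feature_dir.norm()

def steer(resid, hook, alpha=5.0):
    # Nudge the residual stream along the feature direction at every position.
    return resid + alpha * feature_dir

zero_ablated = model.run_with_hooks(tokens, fwd_hooks=[(get_act_name("z", layer), zero_ablate_head)])
mean_ablated = model.run_with_hooks(tokens, fwd_hooks=[(get_act_name("z", layer), mean_ablate_head)])
steered = model.run_with_hooks(tokens, fwd_hooks=[(get_act_name("resid_pre", layer), steer)])
```

Resample ablation would, in the same way, replace the head's output with its cached value from a different, randomly chosen input.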
30.4 Arc IV: Synthesis
- Induction Head
- An attention head that implements in-context learning by finding previous occurrences of the current token and predicting what followed. Part of a two-layer circuit with a previous-token head. (Chapter 13; see the sketch at the end of this section.)
- Previous Token Head
- An attention head that copies information about the previous token at each position. Enables induction heads to search for matching patterns. (Chapter 13)
- In-Context Learning (ICL)
- The ability to learn from examples in the prompt without weight updates. Enabled primarily by induction heads. (Chapter 13)
- Grokking
- The phenomenon where a network suddenly generalizes after extended training past the point of memorization. Often involves discovering efficient algorithms like induction. (Chapter 1)
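
The induction algorithm itself ("find where the current token occurred before, look at what followed it, and predict that") is simple enough to state as plain Python. This sketch describes the behavior the previous-token-head plus induction-head circuit implements, not the circuit itself:

```python
def induction_prediction(tokens: list[str]) -> str | None:
    """Predict the next token by pattern completion: if the current token
    appeared earlier, predict whatever followed its most recent occurrence."""
    current = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):   # scan earlier positions, most recent first
        if tokens[i] == current:               # a match for the current token
            return tokens[i + 1]               # copy forward the token that followed it
    return None

# "A B ... A" -> predict "B"
print(induction_prediction(["The", "cat", "sat", ".", "The"]))   # -> 'cat'
```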
30.5 Tools & Libraries
- TransformerLens
- The standard Python library for mechanistic interpretability, providing hooks for accessing intermediate activations and running interventions. (Chapter 15)
- SAELens
- A library for training and analyzing sparse autoencoders on transformer activations. (Chapter 15)
- Neuronpedia
- An interactive platform for exploring SAE features, visualizing their activating examples, and sharing interpretations. (Chapter 15)
Quick Reference
| Term | One-Line Definition |
|---|---|
| Logits | Raw scores before softmax; higher = more likely |
| Softmax | Converts scores to probabilities (sum to 1) |
| Residual Stream | The shared vector all components read from and write to |
| Feature | A direction in activation space encoding a semantic property |
| Superposition | Packing more features than dimensions using near-orthogonality |
| Polysemantic | A neuron responding to multiple unrelated concepts |
| Circuit | A subnetwork implementing an identifiable algorithm |
| SAE | Tool for decomposing polysemantic neurons into monosemantic features |
| Steering | Adding/subtracting feature directions to modify behavior |
| Attribution | Measuring per-component contributions to outputs |
| Patching | Causal intervention by swapping activations between inputs |
| Ablation | Testing necessity by removing component contributions |
| Induction Head | Attention head enabling in-context pattern completion |
| Out-of-Distribution | Inputs unlike training data; can confound interventions |