30  Glossary

Key terms in mechanistic interpretability

A reference guide to the core terminology used throughout this book. Terms are organized by the arc where they’re introduced.

30.1 Arc I: Foundations

Transformer
A neural network architecture that processes sequences using attention mechanisms and MLPs. The dominant architecture for modern language models. (Chapter 2)
Attention
A mechanism that routes information between positions in a sequence. Each position computes query, key, and value vectors; attention weights are determined by query-key similarity. (Chapter 2)
Attention Pattern
The matrix of attention weights showing how much each position attends to each other position. Visualized as a heatmap where row i, column j shows how much position i attends to position j. (Chapter 2)
Query, Key, Value (Q, K, V)
The three projections computed for each token in attention. Queries ask “what am I looking for?”, keys say “what do I contain?”, and values say “what information should I provide?”. Attention weights = softmax(QKᵀ / √d_k), where d_k is the key (head) dimension. (Chapter 2)
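A minimal PyTorch sketch of this computation for a single head, using random placeholder tensors and omitting the causal mask:

```python
import torch

def attention_pattern(Q, K):
    """Scaled dot-product attention weights: softmax(Q Kᵀ / √d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (seq, seq) query-key similarities
    return torch.softmax(scores, dim=-1)           # each row sums to 1
    # (causal masking omitted for brevity)

seq_len, d_head = 4, 8
Q, K, V = (torch.randn(seq_len, d_head) for _ in range(3))
pattern = attention_pattern(Q, K)  # row i: how position i attends to each position j
output = pattern @ V               # weighted mix of value vectors
```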
Softmax
A function that converts a vector of arbitrary scores into a probability distribution. For input z, softmax(z)ᵢ = exp(zᵢ) / Σⱼ exp(zⱼ). Makes all values positive and sum to 1. (Chapter 2)
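A direct translation of the formula (the numerically stable variant subtracts the maximum before exponentiating), checked against PyTorch's built-in:

```python
import torch

def softmax(z):
    """Numerically stable softmax: subtract the max before exponentiating."""
    e = torch.exp(z - z.max())
    return e / e.sum()

z = torch.tensor([2.0, 1.0, 0.1])
print(softmax(z))               # tensor([0.6590, 0.2424, 0.0986]), sums to 1
print(torch.softmax(z, dim=0))  # same result via the built-in
```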
Logits
Raw, unnormalized scores before applying softmax. In language models, logits are the model’s “confidence” for each vocabulary token before normalization to probabilities. Higher logit = more likely prediction. (Chapter 2)
Embedding
Converting discrete tokens (words, subwords) into continuous vectors the model can process. The embedding matrix W_E maps each token ID to a d-dimensional vector. (Chapter 2)
Unembedding
The reverse of embedding—projecting internal vectors back to vocabulary-sized predictions. The unembedding matrix W_U converts the final residual stream into logits over the vocabulary. (Chapter 2)
Layer Normalization
A technique that normalizes activations to have zero mean and unit variance. Stabilizes training and appears before attention and MLP blocks in most transformers. (Chapter 2)
MLP (Multi-Layer Perceptron)
The feedforward component of a transformer layer. Applies learned nonlinear transformations to each position independently. Often hypothesized to store factual knowledge. (Chapter 2)
Residual Stream
The vector that flows through a transformer, accumulating contributions from each attention head and MLP. All components read from and write to this shared workspace. (Chapter 3)
Logit Lens
A technique for reading intermediate predictions by projecting the residual stream at any layer to vocabulary space using the unembedding matrix. Shows how predictions refine through layers. (Chapter 3)
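A minimal sketch of the projection step with placeholder tensors; in practice resid would come from a cached forward pass and W_U from the model (e.g. model.W_U in TransformerLens):

```python
import torch

# Placeholder tensors standing in for real model components: resid is the
# residual stream at some intermediate layer (one position), W_U is the
# unembedding matrix, and ln_final is the model's final layer norm.
d_model, d_vocab = 768, 50257
resid = torch.randn(d_model)
W_U = torch.randn(d_model, d_vocab)
ln_final = torch.nn.LayerNorm(d_model)

intermediate_logits = ln_final(resid) @ W_U        # project to vocabulary space
top_guesses = intermediate_logits.topk(5).indices  # the model's "guess so far" at this layer
```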
Linear Representation Hypothesis
The hypothesis that semantic properties are encoded as linear directions in activation space, enabling simple extraction and manipulation of features. (Chapter 4)
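One common way to operationalize this, sketched with random placeholder activations: estimate a concept direction as a difference of mean activations, then read the property off a new activation by projection.

```python
import torch

# Toy illustration: if a property is encoded as a direction, the difference of
# mean activations between inputs with and without the property estimates it.
acts_with = torch.randn(100, 768)     # activations on inputs that have the property
acts_without = torch.randn(100, 768)  # activations on inputs that lack it
direction = acts_with.mean(0) - acts_without.mean(0)

new_act = torch.randn(768)
score = new_act @ direction           # projection reads the property off a new activation
```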
Cosine Similarity
A measure of the angle between two vectors, ranging from -1 (opposite) to 1 (same direction). Used to quantify how related two representations are. (Chapter 4)
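In code, with placeholder vectors:

```python
import torch

def cosine_similarity(a, b):
    """cos of the angle between a and b: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return torch.dot(a, b) / (a.norm() * b.norm())

u, v = torch.randn(768), torch.randn(768)  # placeholder representation vectors
print(cosine_similarity(u, v))             # random high-dimensional vectors land near 0
```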

30.2 Arc II: Core Theory

Feature
A property of the input that a network represents internally. Features are directions in activation space, not individual neurons. The fundamental unit of meaning in neural representations. (Chapter 5)
Monosemantic
A neuron or feature that corresponds to a single, interpretable concept. The ideal case for interpretation. (Chapter 5)
Polysemantic
A neuron that responds to multiple unrelated concepts (e.g., “cat faces” AND “car fronts”). The common case in trained networks, caused by superposition. (Chapter 5)
Superposition
The phenomenon where networks pack more features than dimensions by using almost-orthogonal directions. Enables efficient representation of sparse features but creates polysemanticity. (Chapter 6)
Sparsity
The property that most features activate rarely (e.g., <1% of inputs). Sparsity enables superposition because features rarely co-occur and interfere. (Chapter 6)
Phase Transition
The sharp change in representation strategy at a critical sparsity threshold. Networks switch from dedicated dimensions to heavy superposition as sparsity increases. (Chapter 6)
Toy Model
A simplified network designed to exhibit superposition in a controlled, fully-analyzable setting. Typically an autoencoder compressing n features into d < n dimensions. (Chapter 7)
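A minimal sketch of this setup in PyTorch, in the style of the toy-models-of-superposition experiments; the feature count, hidden size, and sparsity level are arbitrary choices:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    """Tied-weight autoencoder that must squeeze n sparse features into d < n dims."""
    def __init__(self, n_features=20, d_hidden=5):
        super().__init__()
        self.W = nn.Parameter(torch.randn(n_features, d_hidden) * 0.1)
        self.b = nn.Parameter(torch.zeros(n_features))

    def forward(self, x):                         # x: (batch, n_features), mostly zeros
        h = x @ self.W                            # compress into d_hidden dimensions
        return torch.relu(h @ self.W.T + self.b)  # reconstruct with tied weights

# Sparse synthetic data: each feature is active (uniform in [0, 1]) with probability 0.05.
x = torch.rand(256, 20) * (torch.rand(256, 20) < 0.05)
model = ToyModel()
loss = (model(x) - x).pow(2).mean()  # train to reconstruct despite the bottleneck
```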
Circuit
A subnetwork that performs an identifiable computation by connecting features through learned weights. The “molecules” built from feature “atoms.” (Chapter 8)
K-Composition
When one attention head’s output modifies another head’s keys, changing what the second head attends to. The mechanism underlying induction head circuits. (Chapter 8)
Q-Composition
When one attention head’s output modifies another head’s queries, changing what the second head searches for. (Chapter 8)
V-Composition
When one attention head’s output modifies another head’s values, changing what information the second head copies. (Chapter 8)

30.3 Arc III: Techniques

Sparse Autoencoder (SAE)
A neural network trained to decompose activations into a sparse set of interpretable features. The primary tool for extracting monosemantic features from superposed representations. (Chapter 9)
Reconstruction Loss
The component of SAE training that encourages accurate reconstruction of the original activation. Trades off against sparsity loss. (Chapter 9)
Sparsity Loss (L1)
The component of SAE training that encourages most latent features to be zero. Higher sparsity penalty yields fewer active features per input. (Chapter 9)
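Putting the two terms together, a minimal sketch of a vanilla SAE and its training objective; the dictionary size and l1_coeff value are illustrative, and real training recipes (e.g. in SAELens) add further details:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: encode activations into a wide, sparse latent space."""
    def __init__(self, d_model=768, d_sae=16384):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, acts):
        f = torch.relu((acts - self.b_dec) @ self.W_enc + self.b_enc)  # sparse feature activations
        recon = f @ self.W_dec + self.b_dec                            # reconstruction
        return recon, f

sae = SparseAutoencoder()
acts = torch.randn(32, 768)  # a batch of residual-stream activations
recon, f = sae(acts)
l1_coeff = 1e-3              # sparsity penalty strength (illustrative value)
loss = (recon - acts).pow(2).mean() + l1_coeff * f.abs().sum(-1).mean()
```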
Attribution
The technique of decomposing an output into per-component contributions. Measures how much each attention head and MLP pushed toward a particular prediction. (Chapter 10)
Logit Attribution
Measuring each component’s contribution to the logit of a specific output token by projecting its output onto the unembedding direction. (Chapter 10)
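A sketch of the core computation with placeholder tensors; in practice each row of component_outputs would be a head's or MLP's write to the residual stream at the final position, and the final LayerNorm scale needs to be folded in:

```python
import torch

d_model, d_vocab, n_components = 768, 50257, 50
component_outputs = torch.randn(n_components, d_model)  # per-component residual-stream writes
W_U = torch.randn(d_model, d_vocab)                     # unembedding matrix

target_token = 1234                                  # vocab index of the prediction of interest
logit_direction = W_U[:, target_token]               # unembedding column for that token
attributions = component_outputs @ logit_direction   # each component's contribution to the logit
```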
Activation Patching
A causal intervention technique that replaces a component’s activation from one input with its activation from a different input, measuring the effect on output. (Chapter 11)
Clean/Corrupted Paradigm
The standard patching setup using two inputs: a “clean” input producing the target behavior and a “corrupted” input producing different behavior. (Chapter 11)
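A minimal sketch of a clean-to-corrupted (denoising) patch, assuming the TransformerLens hook API (Chapter 15); the prompts and patched layer are illustrative:

```python
import torch
from transformer_lens import HookedTransformer
import transformer_lens.utils as utils

# Run the corrupted prompt, but overwrite one layer's residual stream with its
# value from the clean run, and see how much of the clean behavior returns.
model = HookedTransformer.from_pretrained("gpt2")
clean_tokens = model.to_tokens("The capital of France is")
corrupt_tokens = model.to_tokens("The capital of Italy is")

_, clean_cache = model.run_with_cache(clean_tokens)
hook_name = utils.get_act_name("resid_pre", 6)

def patch_resid(activation, hook):
    return clean_cache[hook_name]  # replace with the cached clean activation

patched_logits = model.run_with_hooks(
    corrupt_tokens, fwd_hooks=[(hook_name, patch_resid)]
)
```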
Path Patching
Patching along specific computational paths (e.g., head A’s contribution to head B’s keys) to isolate information flow between components. (Chapter 11)
Ablation
Removing a component’s contribution entirely (setting to zero, mean, or resampled value) to test whether it’s necessary for a behavior. (Chapter 12)
Zero Ablation
Setting a component’s output to zero. Simple but can cause distribution shift. (Chapter 12)
Mean Ablation
Replacing a component’s output with its average value across a dataset. Reduces distribution shift but removes all input-dependent information. (Chapter 12)
Resample Ablation
Replacing a component’s output with its value from a random different input. Preserves marginal distribution but breaks input-specific computation. (Chapter 12)
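The three variants differ only in what the component's output is replaced with. A minimal sketch, where the tensor arguments are placeholders for values gathered elsewhere and the replacement would normally happen inside a forward hook:

```python
import torch

def ablate(component_output, mode, dataset_mean=None, resampled_output=None):
    """Replace a component's output tensor according to the chosen ablation variant."""
    if mode == "zero":
        return torch.zeros_like(component_output)
    if mode == "mean":
        # dataset_mean: e.g. shape (d_model,), averaged over a reference dataset
        return dataset_mean.expand_as(component_output)
    if mode == "resample":
        # resampled_output: the same component's output on a random different input
        return resampled_output
    raise ValueError(f"unknown ablation mode: {mode}")
```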
Steering
Modifying model behavior by adding or subtracting feature directions from the residual stream during inference. More targeted than prompting, more reversible than fine-tuning. (Chapter 9)
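A minimal sketch of the intervention itself; the feature direction would come from, e.g., an SAE decoder row or a difference-of-means probe, and the coefficient is a tuning knob:

```python
import torch

def steer(resid, feature_direction, coefficient=5.0):
    """Add a scaled, unit-norm feature direction to the residual stream."""
    return resid + coefficient * feature_direction / feature_direction.norm()

# In practice this runs inside a forward hook at a chosen layer during generation;
# a negative coefficient subtracts the feature instead of adding it.
```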
Feature Absorption
A failure mode where increasing SAE dictionary size causes common features to “absorb” related features, preventing them from activating. A fundamental challenge for hierarchical concepts. (Chapter 9)
Dead Features
SAE latent dimensions that never activate during training or inference. Represent wasted capacity; rates of 5-20% are common. (Chapter 9)
Out-of-Distribution (OOD)
Inputs that differ significantly from the training distribution. Interventions like patching and ablation can create OOD activations, potentially confounding results. (Chapter 11)

30.4 Arc IV: Synthesis

Induction Head
An attention head that implements in-context learning by finding previous occurrences of the current token and predicting what followed. Part of a two-layer circuit with a previous-token head. (Chapter 13)
Previous Token Head
An attention head that copies information about the previous token at each position. Enables induction heads to search for matching patterns. (Chapter 13)
In-Context Learning (ICL)
The ability to learn from examples in the prompt without weight updates. Enabled primarily by induction heads. (Chapter 13)
Grokking
The phenomenon where a network suddenly generalizes after extended training past the point of memorization. Often involves discovering efficient algorithms like induction. (Chapter 1)

30.5 Tools & Libraries

TransformerLens
The standard Python library for mechanistic interpretability, providing hooks for accessing intermediate activations and running interventions. (Chapter 15)
SAELens
A library for training and analyzing sparse autoencoders on transformer activations. (Chapter 15)
Neuronpedia
An interactive platform for exploring SAE features, visualizing their activating examples, and sharing interpretations. (Chapter 15)

Quick Reference

Logits: Raw scores before softmax; higher = more likely
Softmax: Converts scores to probabilities (sum to 1)
Residual Stream: The shared vector all components read from and write to
Feature: A direction in activation space encoding a semantic property
Superposition: Packing more features than dimensions using near-orthogonality
Polysemantic: A neuron responding to multiple unrelated concepts
Circuit: A subnetwork implementing an identifiable algorithm
SAE: Tool for decomposing polysemantic neurons into monosemantic features
Steering: Adding/subtracting feature directions to modify behavior
Attribution: Measuring per-component contributions to outputs
Patching: Causal intervention by swapping activations between inputs
Ablation: Testing necessity by removing component contributions
Induction Head: Attention head enabling in-context pattern completion
Out-of-Distribution: Inputs unlike training data; can confound interventions