Glossary
This glossary defines key terms used throughout the book. Terms are organized alphabetically.
- Arithmetic Intensity
- The ratio of floating-point operations to bytes transferred from memory, measured in FLOPs per byte. High arithmetic intensity operations are compute-bound; low arithmetic intensity operations are memory-bound. See the Bandwidth chapter.
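As a sketch of the idea, the arithmetic intensity of a square FP32 matrix multiply can be computed by counting FLOPs and bytes moved, assuming each operand is read from memory exactly once (a best-case assumption; real kernels may move more data):

```python
# Arithmetic intensity of C = A @ B for n x n FP32 matrices.
# FLOPs: 2*n^3 (one multiply and one add per inner-product step).
# Bytes: 3 matrices * n^2 elements * 4 bytes, assuming each is
# transferred once (an idealized lower bound on traffic).
def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

# Intensity grows linearly with n (n/6 here), so large matmuls
# tend to be compute-bound while small ones are memory-bound.
assert matmul_intensity(6) == 1.0
```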
- Associativity
- A mathematical property where \((a \oplus b) \oplus c = a \oplus (b \oplus c)\) for some operation \(\oplus\). Enables chunked and parallel processing because results can be combined in any order. The foundation of streaming algorithms and online computations.
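A minimal demonstration of why associativity enables chunking: splitting a reduction into blocks and combining the partial results cannot change the answer, because the grouping of an associative operation is irrelevant.

```python
# Because + is associative, a reduction can be split into chunks,
# each chunk reduced independently (possibly in parallel), and the
# partial results combined -- the grouping does not change the answer.
data = list(range(100))

full = sum(data)
chunked = sum(sum(data[i:i + 10]) for i in range(0, 100, 10))
assert full == chunked  # (a + b) + c == a + (b + c)
```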
- Bandwidth
- The rate at which data can be transferred, measured in bytes per second. Memory bandwidth (HBM to compute units) is often the primary performance bottleneck for modern ML workloads.
- Cache Hit
- When requested data is found in a faster cache level, avoiding access to slower memory. Cache hit rate strongly affects performance for memory-bound workloads.
- Cache Line
- The minimum unit of data transfer between memory hierarchy levels. Typically 64 bytes on modern CPUs. Accessing a single byte loads an entire cache line.
- Cache Miss
- When requested data is not in cache and must be fetched from a slower memory level. The cost of a miss depends on which level the data is found (L2, L3, DRAM).
- Compute-Bound
- A workload where execution time is limited by the rate of arithmetic operations rather than memory bandwidth. Compute-bound kernels can approach peak FLOPS.
- Depthwise Separable Convolution
- A factorization of standard convolution into depthwise (spatial) and pointwise (channel-mixing) operations. Reduces parameters and compute by roughly a factor of the kernel area \(k^2\) when channel counts are large.
- DRAM
- Dynamic Random Access Memory. Main system memory, slower than SRAM caches but with much larger capacity. Typical latency: 50-100 nanoseconds.
- FlashAttention
- An algorithm for exact attention computation that uses tiling and online softmax to avoid materializing the O(n²) attention matrix. Reduces memory from O(n²) to O(n).
- FLOPS
- Floating-Point Operations Per Second. A measure of computational throughput. Modern GPUs achieve hundreds of teraFLOPS (10¹² operations/second).
- Fused Kernel
- A GPU kernel that combines multiple operations into a single pass over data, reducing memory traffic. Example: fusing ReLU into matrix multiply.
- Graph Break
- In torch.compile, a point where the compiler cannot continue tracing a single computational graph, forcing a split into separate compiled regions. Caused by data-dependent control flow, Python side effects, or unsupported operations. See the torch.compile Deep Dive chapter.
- GQA (Grouped-Query Attention)
- A variant of multi-head attention where K/V projections are shared across groups of query heads. A middle ground between standard MHA and Multi-Query Attention. Used in LLaMA-2 and most modern LLMs.
- HBM (High Bandwidth Memory)
- GPU memory with high bandwidth (2-5+ TB/s on modern GPUs) but limited capacity (40-192GB). The primary memory for large tensors.
- KV Cache
- Cached key and value tensors from previous tokens in autoregressive generation. Avoids recomputing attention over earlier tokens, reducing per-token generation cost from O(n²) to O(n). Size grows linearly with sequence length and is often the memory bottleneck for inference. See the Inference Optimization chapter.
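A back-of-envelope KV-cache size calculation; the model configuration below (32 layers, 32 KV heads, head dimension 128, FP16) is illustrative, not taken from any specific model in the book:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    # 2x for keys and values; one (seq_len, head_dim) tensor
    # per KV head per layer, per sequence in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class config in FP16 at 4K context:
gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
# The cache grows linearly with seq_len and batch size.
```

Note how GQA shrinks this directly: with 8 KV heads instead of 32, the same calculation gives a 4x smaller cache.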
- L1/L2/L3 Cache
- Hierarchical on-chip memory with decreasing speed and increasing size. L1 is fastest (1-4 cycles), L3 is slowest but largest.
- Latency
- Time to complete a single operation or request. Contrast with throughput (operations per second).
- Little’s Law
- A queueing identity: \(L = \lambda W\), relating average in-system requests \(L\), arrival rate \(\lambda\), and average time in system \(W\).
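The identity is simple enough to apply by hand; the numbers below are made up for illustration:

```python
# Little's Law: L = lambda * W. Given any two of the three
# quantities, the third follows.
arrival_rate = 50.0     # requests per second (lambda)
time_in_system = 0.2    # seconds per request, queueing included (W)

in_flight = arrival_rate * time_in_system  # average requests in system (L)
assert in_flight == 10.0  # a server at this load holds ~10 requests
```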
- Locality
- A property where a function’s output depends primarily on nearby inputs. Enables tiling, blocking, and cache-efficient algorithms. Temporal locality (reuse over time) and spatial locality (adjacent access) are the two main forms.
- LoRA (Low-Rank Adaptation)
- A fine-tuning technique that freezes pretrained weights and adds trainable low-rank matrices. Exploits the observation that weight updates during fine-tuning are approximately low-rank.
- Low-Rank Approximation
- Approximating a matrix \(W \approx AB\) where \(A\) and \(B\) have lower rank than \(W\). Reduces parameters from \(mn\) to \((m+n)r\) where \(r\) is the rank.
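The parameter-count arithmetic from the definition, made concrete (the 4096x4096, rank-16 shapes are illustrative):

```python
def lowrank_params(m: int, n: int, r: int):
    """Parameters of dense W (m x n) vs. factored A (m x r) @ B (r x n)."""
    return m * n, (m + n) * r

# A 4096 x 4096 weight at rank 16: ~16.8M dense parameters
# vs. ~131K for the two low-rank factors.
dense, factored = lowrank_params(4096, 4096, 16)
assert dense == 16_777_216
assert factored == 131_072
```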
- Memory-Bound
- A workload where execution time is limited by the rate of data transfer rather than computation. Most ML inference workloads are memory-bound.
- Memory Hierarchy
- The organization of memory into levels of decreasing speed and increasing capacity: registers → L1 → L2 → L3 → DRAM → SSD → disk.
- Mixture of Experts (MoE)
- An architecture where only a subset of model parameters are activated for each input. Enables larger models without proportional compute increase.
- Monoid
- An algebraic structure consisting of a set with a closed, associative binary operation and an identity element. The theoretical foundation for parallel reductions and streaming algorithms. Example: (integers, +, 0) is a monoid.
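A small sketch of a monoid as a reusable reduction: the associative operation plus its identity is exactly what a fold over possibly-empty, possibly-chunked data needs.

```python
from functools import reduce

# A monoid packages an associative operation with its identity element.
class Monoid:
    def __init__(self, op, identity):
        self.op, self.identity = op, identity

    def reduce(self, items):
        # The identity makes the empty reduction well-defined.
        return reduce(self.op, items, self.identity)

add = Monoid(lambda a, b: a + b, 0)       # (integers, +, 0)
maximum = Monoid(max, float("-inf"))      # (reals, max, -inf)

assert add.reduce([1, 2, 3, 4]) == 10
assert maximum.reduce([3, 1, 4, 1, 5]) == 5
assert add.reduce([]) == 0
```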
- NVLink
- High-speed interconnect between GPUs within a node. Provides 600-900 GB/s bidirectional bandwidth per GPU, much faster than PCIe. Critical for tensor parallelism.
- Operator Fusion
- Combining multiple operations into a single kernel to reduce memory traffic. A key optimization in deep learning compilers like torch.compile.
- Parallel Scan (Prefix Sum)
- An algorithm that computes all prefix reductions of a sequence using an associative operator. Runs in O(n) work and O(log n) depth. The parallel primitive underlying FlashAttention’s online softmax and SSM parallel computation.
- Prefetching
- Loading data into cache before it’s needed, hiding memory latency. Hardware prefetchers detect streaming access patterns; software can provide explicit hints.
- Quantization
- Representing numbers with fewer bits (e.g., FP16 instead of FP32, or INT8 instead of FP16). Reduces memory footprint and bandwidth, sometimes with specialized hardware support.
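A minimal sketch of symmetric INT8 quantization with a single per-tensor scale, assuming a simple absmax calibration (real schemes add per-channel scales, zero points, and clipping strategies):

```python
# Symmetric INT8 quantization: store 1 byte per value instead of 4 (FP32),
# mapping floats to integers in [-127, 127] via one scale factor.
def quantize_int8(xs):
    scale = max(abs(x) for x in xs) / 127  # absmax calibration
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

xs = [0.1, -0.5, 1.27, 0.0]
q, scale = quantize_int8(xs)
back = dequantize(q, scale)
# Round-trip error is bounded by scale/2 per element.
assert all(abs(a - b) <= scale / 2 for a, b in zip(xs, back))
```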
- Queueing
- The study of waiting lines. Provides models for how arrival rates, service times, and variability determine latency and tail behavior in serving systems.
- Redundancy
- A property where the information content of data is less than its representation size. Enables quantization, compression, caching, and approximation. Shannon’s entropy measures true information content.
- Reversibility
- A property where intermediate states can be reconstructed from outputs (or compact summaries), enabling memory savings via recomputation. Common in reversible networks and checkpointing schemes.
- Roofline Model
- A visual model showing maximum achievable performance as a function of arithmetic intensity. Performance is bounded by min(peak FLOPS, bandwidth × intensity).
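The roofline bound, min(peak FLOPS, bandwidth × intensity), can be computed directly; the 100 TFLOP/s and 2 TB/s figures below are an illustrative accelerator, not a specific device:

```python
# Attainable FLOP/s under the roofline model.
def roofline(intensity, peak_flops, bandwidth):
    return min(peak_flops, bandwidth * intensity)

PEAK, BW = 100e12, 2e12          # illustrative: 100 TFLOP/s, 2 TB/s
ridge = PEAK / BW                # 50 FLOPs/byte: the ridge point

assert roofline(1, PEAK, BW) == 2e12       # below the ridge: memory-bound
assert roofline(200, PEAK, BW) == 100e12   # above the ridge: compute-bound
```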
- Separability
- A property where a function can be decomposed into simpler, independent components. Enables factorization optimizations like separable filters and low-rank approximations.
- Shared Memory (GPU)
- Fast on-chip memory accessible to all threads within a GPU thread block. Used for inter-thread communication, data reuse, and manual caching. Distinct from system shared memory.
- SIMD
- Single Instruction, Multiple Data. Processing multiple data elements with a single instruction. Essential for efficient vectorized code.
- Softmax
- The function \(\text{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}\) that converts logits to probabilities. Its decomposable structure (the (max, sum, weighted_sum) state forms a monoid) enables online/streaming computation, which is key to FlashAttention.
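The decomposable structure can be sketched directly: an online softmax carries a (running max, running sum) state per block, and two such states combine associatively, so blocks can be processed in any order or in parallel.

```python
import math

# Online softmax state: (running max m, running sum s of exp(x - m)).
def state(xs):
    m = max(xs)
    return m, sum(math.exp(x - m) for x in xs)

# Combining two block states rescales each sum to the shared max.
def combine(a, b):
    (m1, s1), (m2, s2) = a, b
    m = max(m1, m2)
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

xs = [1.0, 3.0, 2.0, 5.0, 4.0, 0.5]
m_full, s_full = state(xs)
m_blk, s_blk = combine(state(xs[:3]), state(xs[3:]))

assert m_blk == m_full and abs(s_blk - s_full) < 1e-12
# softmax(x)_i = exp(x_i - m) / s, with no second pass over x needed.
```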
- Sparsity
- Having many zero or near-zero values in a matrix or tensor. Structured sparsity (e.g., 2:4 patterns) enables hardware acceleration.
- SRAM (Static RAM)
- Fast on-chip memory used for caches and GPU shared memory. Lower latency than DRAM but limited capacity. GPU shared memory is typically 100-200KB per SM.
- State-Space Model (SSM)
- A sequence model based on linear recurrences \(h_t = Ah_{t-1} + Bx_t\), \(y_t = Ch_t\). Enables O(n) sequence processing vs. attention’s O(n²). The recurrence’s associativity enables parallel computation via scan. See the State-Space Models chapter.
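A scalar sketch of why the recurrence admits a parallel scan: each step is an affine map \(h \mapsto a h + c\) (with the input term folded into \(c\)), and composing two such maps yields another affine map, associatively.

```python
# Compose two scalar SSM steps h -> a*h + c into one map of the same form:
# applying (a1, c1) then (a2, c2) equals applying (a2*a1, a2*c1 + c2).
def compose(first, second):
    a1, c1 = first
    a2, c2 = second
    return a2 * a1, a2 * c1 + c2

steps = [(0.5, 1.0), (0.9, 2.0), (0.8, -1.0)]  # illustrative coefficients

# Sequential recurrence from h0 = 4.0:
h = 4.0
for a, c in steps:
    h = a * h + c

# Same answer from one pre-composed map (what a parallel scan builds):
a, c = steps[0]
for step in steps[1:]:
    a, c = compose((a, c), step)
assert abs((a * 4.0 + c) - h) < 1e-12
```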
- Streaming Algorithm
- An algorithm that processes data in a single pass, maintaining only a small summary state. Enabled by associative operations with bounded state.
- Symmetry
- A property where a computation is invariant or equivariant under some transformation. Enables weight sharing (CNNs), Fourier methods, and group equivariant networks. Noether’s theorem connects symmetry to conservation laws.
- Tensor Core
- Specialized hardware for matrix multiply-accumulate operations. Achieves much higher throughput than standard FP32 cores but requires specific data formats (FP16, BF16, INT8, FP8).
- Tensor Parallelism
- Distributing individual tensor operations (e.g., a single matrix multiply) across multiple GPUs. Requires high-bandwidth interconnect (NVLink). See the Distributed Training chapter.
- Throughput
- The number of operations completed per unit time. Contrast with latency (time per operation).
- Tiling
- Dividing data into blocks that fit in fast memory, processing each tile entirely before moving to the next. Reduces memory traffic at the cost of code complexity.
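Tiling can be sketched as a blocked matrix multiply: the three outer loops walk tile-sized blocks, and the inner loops do a full block of work while its operands would be resident in fast memory (pure Python here, so the cache benefit is only schematic):

```python
# Tiled matrix multiply over n x n nested lists.
def matmul_tiled(A, B, tile):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # One tile of work: operands fit in fast memory together.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_tiled(A, B, tile=1) == [[19.0, 22.0], [43.0, 50.0]]
```

The loop structure is the same one GPU kernels use, with `tile` chosen so three tiles fit in shared memory.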
- Utilization
- The fraction of time a server is busy, \(\rho = \lambda / \mu\). As utilization approaches 1, queueing delay grows rapidly.
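The blow-up near \(\rho = 1\) can be seen with the standard M/M/1 result \(W = 1/(\mu - \lambda)\) for mean time in system (a textbook formula; the rates below are illustrative):

```python
# Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
def mm1_time_in_system(arrival_rate, service_rate):
    assert arrival_rate < service_rate, "unstable queue (rho >= 1)"
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # requests/s the server can handle
w50 = mm1_time_in_system(50.0, mu)   # rho = 0.50 -> 20 ms
w95 = mm1_time_in_system(95.0, mu)   # rho = 0.95 -> 200 ms
# Doubling load from 50% to 95% utilization inflates latency 10x.
assert abs(w95 / w50 - 10.0) < 1e-9
```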
- Warp
- A group of 32 threads on NVIDIA GPUs that execute in lockstep (SIMT). All threads in a warp must execute the same instruction; conditional branches cause “warp divergence” where some threads are masked.
- Working Set
- The set of data actively being accessed. Performance depends on whether the working set fits in each cache level.
Tip: Cross-Reference Guide
| Concept | Chapter |
|---|---|
| Array-oriented thinking | Thinking in Arrays |
| The six properties | The Algebraic Framework |
| Memory hierarchy | Memory Hierarchy |
| Bandwidth and roofline | Bandwidth |
| Parallelism | Parallelism |
| GPU architecture | GPU Architecture |
| Associativity | Associativity |
| Separability | Separability |
| Locality | Locality |
| Sparsity | Sparsity |
| Reversibility | Reversibility |
| Redundancy | Redundancy |
| Symmetry | Symmetry |
| Matrix multiply | Matrix Multiply |
| FlashAttention | FlashAttention |
| LoRA | LoRA |
| Quantization | Quantization |
| State-space models | State-Space Models |
| Distributed training | Distributed Training |
| Inference / KV caching | Inference Optimization |
| Mixture of Experts | Mixture of Experts |
| Long context / RoPE | Long-Context Attention |
| GPU memory management | Mastering GPU Memory |
| Queueing theory | Queueing Theory for Serving |
| Advanced serving | Advanced LLM Serving |
| Measurement | The Art of Measurement |
| Hypothesis-driven debugging | The Art of Hypothesis |
| Profiling tools | Profiling Tools |
| Operator fusion | Fusion |
| torch.compile | torch.compile Deep Dive |
| Triton programming | Writing Fast Kernels with Triton |
| Kernel frameworks | Modern GPU Kernel Frameworks |
| Hardware-specific opts | Hardware-Specific Optimization |
| Compiler optimizations | Compiler Optimizations |
| Numerical precision | Numerical Precision |
| Production case studies | Production Case Studies |