Glossary

This glossary defines key terms used throughout the book. Terms are organized alphabetically.


Arithmetic Intensity
The ratio of floating-point operations to bytes transferred from memory, measured in FLOPs per byte. Operations with high arithmetic intensity are compute-bound; operations with low arithmetic intensity are memory-bound. See ?sec-bandwidth.
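As a sketch, the arithmetic intensity of a square matrix multiply can be estimated by counting FLOPs against ideal memory traffic (a simplified model assuming each matrix is read or written exactly once):

```python
def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    """Arithmetic intensity (FLOPs/byte) of an n x n x n matmul.

    FLOPs: 2 * n^3 (one multiply and one add per inner-product term).
    Bytes: read A and B, write C -- 3 * n^2 elements under ideal reuse.
    """
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

# Intensity grows linearly with n, so large matmuls become compute-bound.
```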
Associativity
A mathematical property where \((a \oplus b) \oplus c = a \oplus (b \oplus c)\) for some operation \(\oplus\). Enables chunked and parallel processing because results can be combined in any order. The foundation of streaming algorithms and online computations.
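A minimal illustration: because addition is associative, a sum can be computed chunk by chunk and the partial results combined afterwards, which is exactly what chunked or parallel processing relies on:

```python
def chunked_sum(xs, chunk_size=4):
    """Sum each chunk independently (potentially in parallel), then
    combine the partials; associativity of + guarantees the result
    matches a single left-to-right pass."""
    partials = [sum(xs[i:i + chunk_size]) for i in range(0, len(xs), chunk_size)]
    return sum(partials)
```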
Bandwidth
The rate at which data can be transferred, measured in bytes per second. Memory bandwidth (HBM to compute units) is often the primary performance bottleneck for modern ML workloads.
Cache Hit
When requested data is found in a faster cache level, avoiding access to slower memory. Cache hit rate strongly affects performance for memory-bound workloads.
Cache Line
The minimum unit of data transfer between memory hierarchy levels. Typically 64 bytes on modern CPUs. Accessing a single byte loads an entire cache line.
Cache Miss
When requested data is not in cache and must be fetched from a slower memory level. The cost of a miss depends on which level the data is found (L2, L3, DRAM).
Compute-Bound
A workload where execution time is limited by the rate of arithmetic operations rather than memory bandwidth. Compute-bound kernels can approach peak FLOPS.
Depthwise Separable Convolution
A factorization of standard convolution into depthwise (spatial) and pointwise (channel-mixing) operations. Reduces parameters and compute by roughly the kernel area (a factor of about \(k^2\) for a \(k \times k\) kernel, when the channel count is large).
DRAM
Dynamic Random Access Memory. Main system memory, slower than SRAM caches but with much larger capacity. Typical latency: 50-100 nanoseconds.
FlashAttention
An algorithm for exact attention computation that uses tiling and online softmax to avoid materializing the O(n²) attention matrix. Reduces memory from O(n²) to O(n).
FLOPS
Floating-Point Operations Per Second. A measure of computational throughput. Modern GPUs achieve hundreds of teraFLOPS (10¹² operations/second).
Fused Kernel
A GPU kernel that combines multiple operations into a single pass over data, reducing memory traffic. Example: fusing ReLU into matrix multiply.
Graph Break
In torch.compile, a point where the compiler cannot continue tracing a single computational graph, forcing a split into separate compiled regions. Caused by data-dependent control flow, Python side effects, or unsupported operations. See ?sec-torch-compile.
GQA (Grouped-Query Attention)
A variant of multi-head attention where K/V projections are shared across groups of query heads. A middle ground between standard MHA and Multi-Query Attention. Used in LLaMA-2 and many other modern LLMs.
HBM (High Bandwidth Memory)
GPU memory with high bandwidth (2-5+ TB/s on modern GPUs) but limited capacity (40-192GB). The primary memory for large tensors.
KV Cache
Cached key and value tensors from previous tokens in autoregressive generation. Avoids redundant recomputation, reducing the per-token cost of generation from O(n²) to O(n). Size grows linearly with sequence length and is often the memory bottleneck for inference. See ?sec-inference.
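The cache's footprint follows directly from the definition; a back-of-envelope sizing helper (assuming FP16 K/V, so 2 bytes per element; parameter names are illustrative):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_elem=2):
    """Size of the KV cache: 2 tensors (K and V) per layer, each holding
    batch * n_kv_heads * seq_len * head_dim values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem
```

The linear growth in `seq_len` is visible directly in the formula.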
L1/L2/L3 Cache
Hierarchical on-chip memory with decreasing speed and increasing size. L1 is fastest (1-4 cycles), L3 is slowest but largest.
Latency
Time to complete a single operation or request. Contrast with throughput (operations per second).
Little’s Law
A queueing identity: \(L = \lambda W\), relating average in-system requests \(L\), arrival rate \(\lambda\), and average time in system \(W\).
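A quick numeric sketch of the identity:

```python
def littles_law_concurrency(arrival_rate, avg_time_in_system):
    """L = lambda * W: average number of requests in the system."""
    return arrival_rate * avg_time_in_system

# 200 requests/s, each spending 0.05 s in the system
# -> 10 requests in flight on average.
```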
Locality
A property where a function’s output depends primarily on nearby inputs. Enables tiling, blocking, and cache-efficient algorithms. Temporal locality (reuse over time) and spatial locality (adjacent access) are the two main forms.
LoRA (Low-Rank Adaptation)
A fine-tuning technique that freezes pretrained weights and adds trainable low-rank matrices. Exploits the observation that weight updates during fine-tuning are approximately low-rank.
Low-Rank Approximation
Approximating an \(m \times n\) matrix \(W \approx AB\), where \(A\) is \(m \times r\) and \(B\) is \(r \times n\) for some rank \(r \ll \min(m, n)\). Reduces parameters from \(mn\) to \((m+n)r\).
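The parameter saving is easy to verify; a sketch with illustrative layer sizes:

```python
def full_params(m, n):
    """Parameters in a dense m x n weight matrix."""
    return m * n

def lowrank_params(m, n, r):
    """A is m x r and B is r x n: (m + n) * r parameters total."""
    return (m + n) * r

# A 4096 x 4096 layer factored at rank 16 shrinks parameters by 128x.
```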
Memory-Bound
A workload where execution time is limited by the rate of data transfer rather than computation. Most ML inference workloads are memory-bound.
Memory Hierarchy
The organization of memory into levels of decreasing speed and increasing capacity: registers → L1 → L2 → L3 → DRAM → SSD → disk.
Mixture of Experts (MoE)
An architecture where only a subset of model parameters are activated for each input. Enables larger models without proportional compute increase.
Monoid
An algebraic structure consisting of a set with a closed, associative binary operation and an identity element. The theoretical foundation for parallel reductions and streaming algorithms. Example: (integers, +, 0) is a monoid.
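A small sketch: any monoid can be folded with the same generic reduce, and associativity is what would let that fold be split across workers and recombined:

```python
from functools import reduce

def mconcat(op, identity, xs):
    """Fold a monoid: op must be associative and identity neutral."""
    return reduce(op, xs, identity)

# (int, +, 0) and (numbers, max, -inf) are both monoids:
total = mconcat(lambda a, b: a + b, 0, [1, 2, 3, 4])    # 10
largest = mconcat(max, float("-inf"), [3, 1, 4, 1, 5])  # 5
```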
NVLink
High-speed interconnect between GPUs within a node. Provides 600-900 GB/s bidirectional bandwidth per GPU, much faster than PCIe. Critical for tensor parallelism.
Operator Fusion
Combining multiple operations into a single kernel to reduce memory traffic. A key optimization in deep learning compilers like torch.compile.
Prefetching
Loading data into cache before it’s needed, hiding memory latency. Hardware prefetchers detect streaming access patterns; software can provide explicit hints.
Parallel Scan (Prefix Sum)
An algorithm that computes all prefix reductions of a sequence using an associative operator. Runs in O(n) work and O(log n) depth. The parallel primitive underlying FlashAttention’s online softmax and SSM parallel computation.
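A sequential sketch of the recursive-doubling (Hillis-Steele) formulation; within each round every combine reads only the previous round's values, which is where the O(log n) parallel depth comes from:

```python
def inclusive_scan(xs, op):
    """Inclusive prefix scan by recursive doubling.

    ceil(log2(n)) rounds; each round's combines are independent of
    one another, so on parallel hardware they run simultaneously."""
    xs = list(xs)
    stride = 1
    while stride < len(xs):
        # Build the next round from the previous round's list.
        xs = [op(xs[i - stride], x) if i >= stride else x
              for i, x in enumerate(xs)]
        stride *= 2
    return xs
```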
Quantization
Representing numbers with fewer bits (e.g., FP16 instead of FP32, or INT8 instead of FP16). Reduces memory footprint and bandwidth, sometimes with specialized hardware support.
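A minimal sketch of symmetric INT8 quantization (the scale maps the largest magnitude to 127; real implementations also handle all-zero tensors, per-channel scales, and saturation):

```python
def quantize_int8(xs):
    """Symmetric quantization: x ~= q * scale, q an integer in [-127, 127]."""
    scale = max(abs(x) for x in xs) / 127.0
    return [round(x / scale) for x in xs], scale

def dequantize(qs, scale):
    return [q * scale for q in qs]
```

Round-tripping introduces at most about half a scale step of error per value.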
Queueing
The study of waiting lines. Provides models for how arrival rates, service times, and variability determine latency and tail behavior in serving systems.
Roofline Model
A visual model showing maximum achievable performance as a function of arithmetic intensity. Performance is bounded by min(peak FLOPS, bandwidth × intensity).
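The bound itself is one line; a sketch with illustrative hardware numbers:

```python
def attainable_flops(peak_flops, bandwidth_bytes_per_s, intensity_flops_per_byte):
    """Roofline: performance is capped by compute or by memory traffic,
    whichever is hit first."""
    return min(peak_flops, bandwidth_bytes_per_s * intensity_flops_per_byte)

# With 1 TFLOP/s peak and 100 GB/s bandwidth, the ridge point sits at
# 10 FLOPs/byte: below it a kernel is memory-bound, above it compute-bound.
```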
Redundancy
A property where the information content of data is less than its representation size. Enables quantization, compression, caching, and approximation. Shannon’s entropy measures true information content.
Reversibility
A property where intermediate states can be reconstructed from outputs (or compact summaries), enabling memory savings via recomputation. Common in reversible networks and checkpointing schemes.
Separability
A property where a function can be decomposed into simpler, independent components. Enables factorization optimizations like separable filters and low-rank approximations.
Shared Memory (GPU)
Fast on-chip memory accessible to all threads within a GPU thread block. Used for inter-thread communication, data reuse, and manual caching. Distinct from system shared memory.
SIMD
Single Instruction, Multiple Data. Processing multiple data elements with a single instruction. Essential for efficient vectorized code.
Softmax
The function \(\text{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}\) that converts logits to probabilities. Its decomposable structure (the (max, sum, weighted_sum) state forms a monoid) enables online/streaming computation, which is key to FlashAttention.
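A single-pass sketch of the online softmax state update, maintaining the same (running max, rescaled sum) bookkeeping that FlashAttention performs per tile:

```python
import math

def online_softmax(xs):
    """Maintain (running max m, rescaled sum s) in one streaming pass,
    then normalize. Numerically stable: exponents never exceed 0."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # Rescale the old sum to the new max before adding the new term.
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```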
Sparsity
Having many zero or near-zero values in a matrix or tensor. Structured sparsity (e.g., 2:4 patterns) enables hardware acceleration.
SRAM (Static RAM)
Fast on-chip memory used for caches and GPU shared memory. Lower latency than DRAM but limited capacity. GPU shared memory is typically 100-200KB per SM.
State-Space Model (SSM)
A sequence model based on linear recurrences \(h_t = Ah_{t-1} + Bx_t\), \(y_t = Ch_t\). Enables O(n) sequence processing vs. attention’s O(n²). The recurrence’s associativity enables parallel computation via scan. See ?sec-ssm.
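A scalar sketch of the recurrence (real SSMs use matrix \(A\) and vector \(B\), \(C\); the names here are illustrative):

```python
def ssm_sequential(a, b, c, xs):
    """h_t = a*h_{t-1} + b*x_t,  y_t = c*h_t, with h_0 = 0."""
    h, ys = 0.0, []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys

# An impulse decays geometrically through the state:
# ssm_sequential(0.5, 1.0, 2.0, [1, 0, 0]) -> [2.0, 1.0, 0.5]
```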
Streaming Algorithm
An algorithm that processes data in a single pass, maintaining only a small summary state. Enabled by associative operations with bounded state.
Symmetry
A property where a computation is invariant or equivariant under some transformation. Enables weight sharing (CNNs), Fourier methods, and group equivariant networks. Noether’s theorem connects symmetry to conservation laws.
Tensor Core
Specialized hardware for matrix multiply-accumulate operations. Achieves much higher throughput than standard FP32 cores but requires specific data formats (FP16, BF16, INT8, FP8).
Tensor Parallelism
Distributing individual tensor operations (e.g., a single matrix multiply) across multiple GPUs. Requires high-bandwidth interconnect (NVLink). See ?sec-distributed.
Tiling
Dividing data into blocks that fit in fast memory, processing each tile entirely before moving to the next. Reduces memory traffic at the cost of code complexity.
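A pure-Python sketch of a blocked matmul; the point is the loop structure (each tile of A and B is reused while it is "hot"), not speed:

```python
def tiled_matmul(A, B, tile=2):
    """C = A @ B for square lists-of-lists, computed block by block."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                # Process one tile x tile block fully before moving on.
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a_ik = A[i][k]  # reused across the whole j-tile
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += a_ik * B[k][j]
    return C
```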
Throughput
The number of operations completed per unit time. Contrast with latency (time per operation).
Utilization
The fraction of time a server is busy, \(\rho = \lambda / \mu\). As utilization approaches 1, queueing delay grows rapidly.
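The blow-up near \(\rho = 1\) is easiest to see in the M/M/1 formula for average time in system; a sketch:

```python
def mm1_time_in_system(arrival_rate, service_rate):
    """M/M/1 average time in system: W = 1 / (mu - lambda).
    Requires rho = lambda / mu < 1 for a stable queue."""
    assert arrival_rate < service_rate, "unstable queue (rho >= 1)"
    return 1.0 / (service_rate - arrival_rate)

# With mu = 100/s (0.01 s service time): at rho = 0.5, W is 2x the
# service time; at rho = 0.9, 10x; at rho = 0.99, 100x.
```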
Warp
A group of 32 threads on NVIDIA GPUs that execute in lockstep (SIMT). All threads in a warp must execute the same instruction; conditional branches cause “warp divergence” where some threads are masked.
Working Set
The set of data actively being accessed. Performance depends on whether the working set fits in each cache level.

Tip: Cross-Reference Guide
| Concept | Chapter |
|---|---|
| Array-oriented thinking | Thinking in Arrays |
| The six properties | The Algebraic Framework |
| Memory hierarchy | Memory Hierarchy |
| Bandwidth and roofline | Bandwidth |
| Parallelism | Parallelism |
| GPU architecture | GPU Architecture |
| Associativity | Associativity |
| Separability | Separability |
| Locality | Locality |
| Sparsity | Sparsity |
| Reversibility | Reversibility |
| Redundancy | Redundancy |
| Symmetry | Symmetry |
| Matrix multiply | Matrix Multiply |
| FlashAttention | FlashAttention |
| LoRA | LoRA |
| Quantization | Quantization |
| State-space models | State-Space Models |
| Distributed training | Distributed Training |
| Inference / KV caching | Inference Optimization |
| Mixture of Experts | Mixture of Experts |
| Long context / RoPE | Long-Context Attention |
| GPU memory management | Mastering GPU Memory |
| Queueing theory | Queueing Theory for Serving |
| Advanced serving | Advanced LLM Serving |
| Measurement | The Art of Measurement |
| Hypothesis-driven debugging | The Art of Hypothesis |
| Profiling tools | Profiling Tools |
| Operator fusion | Fusion |
| torch.compile | torch.compile Deep Dive |
| Triton programming | Writing Fast Kernels with Triton |
| Kernel frameworks | Modern GPU Kernel Frameworks |
| Hardware-specific opts | Hardware-Specific Optimization |
| Compiler optimizations | Compiler Optimizations |
| Numerical precision | Numerical Precision |
| Production case studies | Production Case Studies |