Glossary
This glossary defines key terms used throughout the book. Terms are organized alphabetically.
- Arithmetic Intensity
- The ratio of floating-point operations to bytes transferred from memory, measured in FLOPs per byte. High arithmetic intensity operations are compute-bound; low arithmetic intensity operations are memory-bound. See the Bandwidth chapter.
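As a sketch of the idea, the arithmetic intensity of a square FP32 matrix multiply can be computed by counting FLOPs and bytes moved, assuming each operand is read from memory exactly once (a best-case assumption; real kernels may move more data):

```python
# Arithmetic intensity of C = A @ B for n x n FP32 matrices.
# FLOPs: 2*n^3 (one multiply and one add per inner-product step).
# Bytes: 3 matrices * n^2 elements * 4 bytes, assuming each is
# transferred once (an idealized lower bound on traffic).
def matmul_intensity(n: int, bytes_per_elem: int = 4) -> float:
    flops = 2 * n ** 3
    bytes_moved = 3 * n * n * bytes_per_elem
    return flops / bytes_moved

# Intensity grows linearly with n (n/6 here), so large matmuls
# tend to be compute-bound while small ones are memory-bound.
assert matmul_intensity(6) == 1.0
```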
- Associativity
- A mathematical property where \((a \oplus b) \oplus c = a \oplus (b \oplus c)\) for some operation \(\oplus\). Enables chunked and parallel processing because results can be combined in any order. The foundation of streaming algorithms and online computations.
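A minimal demonstration of why associativity enables chunking: splitting a reduction into blocks and combining the partial results cannot change the answer, because the grouping of an associative operation is irrelevant.

```python
# Because + is associative, a reduction can be split into chunks,
# each chunk reduced independently (possibly in parallel), and the
# partial results combined -- the grouping does not change the answer.
data = list(range(100))

full = sum(data)
chunked = sum(sum(data[i:i + 10]) for i in range(0, 100, 10))
assert full == chunked  # (a + b) + c == a + (b + c)
```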
- Bandwidth
- The rate at which data can be transferred, measured in bytes per second. Memory bandwidth (HBM to compute units) is often the primary performance bottleneck for modern ML workloads.
- Cache Hit
- When requested data is found in a faster cache level, avoiding access to slower memory. Cache hit rate strongly affects performance for memory-bound workloads.
- Cache Line
- The minimum unit of data transfer between memory hierarchy levels. Typically 64 bytes on modern CPUs. Accessing a single byte loads an entire cache line.
- Cache Miss
- When requested data is not in cache and must be fetched from a slower memory level. The cost of a miss depends on which level the data is found (L2, L3, DRAM).
- Compute-Bound
- A workload where execution time is limited by the rate of arithmetic operations rather than memory bandwidth. Compute-bound kernels can approach peak FLOPS.
- Depthwise Separable Convolution
- A factorization of standard convolution into depthwise (spatial) and pointwise (channel-mixing) operations. Reduces parameters and compute by roughly a factor of the kernel area \(k^2\) when channel counts are large.
- DRAM
- Dynamic Random Access Memory. Main system memory, slower than SRAM caches but with much larger capacity. Typical latency: 50-100 nanoseconds.
- FlashAttention
- An algorithm for exact attention computation that uses tiling and online softmax to avoid materializing the O(n²) attention matrix. Reduces memory from O(n²) to O(n).
- FLOPS
- Floating-Point Operations Per Second. A measure of computational throughput. Modern GPUs achieve hundreds of teraFLOPS (10¹² operations/second).
- Fused Kernel
- A GPU kernel that combines multiple operations into a single pass over data, reducing memory traffic. Example: fusing ReLU into matrix multiply.
- Graph Break
- In torch.compile, a point where the compiler cannot continue tracing a single computational graph, forcing a split into separate compiled regions. Caused by data-dependent control flow, Python side effects, or unsupported operations. See the torch.compile Deep Dive chapter.
- GQA (Grouped-Query Attention)
- A variant of multi-head attention where K/V projections are shared across groups of query heads. A middle ground between standard MHA and Multi-Query Attention. Used in LLaMA-2 and most modern LLMs.
- HBM (High Bandwidth Memory)
- GPU memory with high bandwidth (2-5+ TB/s on modern GPUs) but limited capacity (40-192GB). The primary memory for large tensors.
- KV Cache
- Cached key and value tensors from previous tokens in autoregressive generation. Avoids recomputing attention over earlier tokens, reducing per-token generation cost from O(n²) to O(n). Size grows linearly with sequence length and is often the memory bottleneck for inference. See the Inference Optimization chapter.
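A back-of-envelope KV-cache size calculation; the model configuration below (32 layers, 32 KV heads, head dimension 128, FP16) is illustrative, not taken from any specific model in the book:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch,
                   bytes_per_elem=2):
    # 2x for keys and values; one (seq_len, head_dim) tensor
    # per KV head per layer, per sequence in the batch.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 7B-class config in FP16 at 4K context:
gb = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1) / 2**30
# The cache grows linearly with seq_len and batch size.
```

Note how GQA shrinks this directly: with 8 KV heads instead of 32, the same calculation gives a 4x smaller cache.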
- L1/L2/L3 Cache
- Hierarchical on-chip memory with decreasing speed and increasing size. L1 is fastest (1-4 cycles), L3 is slowest but largest.
- Latency
- Time to complete a single operation or request. Contrast with throughput (operations per second).
- Little’s Law
- A queueing identity: \(L = \lambda W\), relating average in-system requests \(L\), arrival rate \(\lambda\), and average time in system \(W\).
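The identity is simple enough to apply by hand; the numbers below are made up for illustration:

```python
# Little's Law: L = lambda * W. Given any two of the three
# quantities, the third follows.
arrival_rate = 50.0     # requests per second (lambda)
time_in_system = 0.2    # seconds per request, queueing included (W)

in_flight = arrival_rate * time_in_system  # average requests in system (L)
assert in_flight == 10.0  # a server at this load holds ~10 requests
```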
- Locality
- A property where a function’s output depends primarily on nearby inputs. Enables tiling, blocking, and cache-efficient algorithms. Temporal locality (reuse over time) and spatial locality (adjacent access) are the two main forms.
- LoRA (Low-Rank Adaptation)
- A fine-tuning technique that freezes pretrained weights and adds trainable low-rank matrices. Exploits the observation that weight updates during fine-tuning are approximately low-rank.
- Low-Rank Approximation
- Approximating a matrix \(W \approx AB\) where \(A\) and \(B\) have lower rank than \(W\). Reduces parameters from \(mn\) to \((m+n)r\) where \(r\) is the rank.
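The parameter-count arithmetic from the definition, made concrete (the 4096x4096, rank-16 shapes are illustrative):

```python
def lowrank_params(m: int, n: int, r: int):
    """Parameters of dense W (m x n) vs. factored A (m x r) @ B (r x n)."""
    return m * n, (m + n) * r

# A 4096 x 4096 weight at rank 16: ~16.8M dense parameters
# vs. ~131K for the two low-rank factors.
dense, factored = lowrank_params(4096, 4096, 16)
assert dense == 16_777_216
assert factored == 131_072
```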
- Memory-Bound
- A workload where execution time is limited by the rate of data transfer rather than computation. Most ML inference workloads are memory-bound.
- Memory Hierarchy
- The organization of memory into levels of decreasing speed and increasing capacity: registers → L1 → L2 → L3 → DRAM → SSD → disk.
- Mixture of Experts (MoE)
- An architecture where only a subset of model parameters are activated for each input. Enables larger models without proportional compute increase.
- Monoid
- An algebraic structure consisting of a set with a closed, associative binary operation and an identity element. The theoretical foundation for parallel reductions and streaming algorithms. Example: (integers, +, 0) is a monoid.
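A small sketch of a monoid as a reusable reduction: the associative operation plus its identity is exactly what a fold over possibly-empty, possibly-chunked data needs.

```python
from functools import reduce

# A monoid packages an associative operation with its identity element.
class Monoid:
    def __init__(self, op, identity):
        self.op, self.identity = op, identity

    def reduce(self, items):
        # The identity makes the empty reduction well-defined.
        return reduce(self.op, items, self.identity)

add = Monoid(lambda a, b: a + b, 0)       # (integers, +, 0)
maximum = Monoid(max, float("-inf"))      # (reals, max, -inf)

assert add.reduce([1, 2, 3, 4]) == 10
assert maximum.reduce([3, 1, 4, 1, 5]) == 5
assert add.reduce([]) == 0
```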
- NVLink
- High-speed interconnect between GPUs within a node. Provides 600-900 GB/s bidirectional bandwidth per GPU, much faster than PCIe. Critical for tensor parallelism.
- Operator Fusion
- Combining multiple operations into a single kernel to reduce memory traffic. A key optimization in deep learning compilers like torch.compile.
- Parallel Scan (Prefix Sum)
- An algorithm that computes all prefix reductions of a sequence using an associative operator. Runs in O(n) work and O(log n) depth. The parallel primitive underlying FlashAttention’s online softmax and SSM parallel computation.
- Prefetching
- Loading data into cache before it’s needed, hiding memory latency. Hardware prefetchers detect streaming access patterns; software can provide explicit hints.
- Quantization
- Representing numbers with fewer bits (e.g., FP16 instead of FP32, or INT8 instead of FP16). Reduces memory footprint and bandwidth, sometimes with specialized hardware support.
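A minimal sketch of symmetric INT8 quantization with a single per-tensor scale, assuming a simple absmax calibration (real schemes add per-channel scales, zero points, and clipping strategies):

```python
# Symmetric INT8 quantization: store 1 byte per value instead of 4 (FP32),
# mapping floats to integers in [-127, 127] via one scale factor.
def quantize_int8(xs):
    scale = max(abs(x) for x in xs) / 127  # absmax calibration
    q = [round(x / scale) for x in xs]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

xs = [0.1, -0.5, 1.27, 0.0]
q, scale = quantize_int8(xs)
back = dequantize(q, scale)
# Round-trip error is bounded by scale/2 per element.
assert all(abs(a - b) <= scale / 2 for a, b in zip(xs, back))
```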
- Queueing
- The study of waiting lines. Provides models for how arrival rates, service times, and variability determine latency and tail behavior in serving systems.
- Redundancy
- A property where the information content of data is less than its representation size. Enables quantization, compression, caching, and approximation. Shannon’s entropy measures true information content.
- Reversibility
- A property where intermediate states can be reconstructed from outputs (or compact summaries), enabling memory savings via recomputation. Common in reversible networks and checkpointing schemes.
- Roofline Model
- A visual model showing maximum achievable performance as a function of arithmetic intensity. Performance is bounded by min(peak FLOPS, bandwidth × intensity).
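The roofline bound, min(peak FLOPS, bandwidth × intensity), can be computed directly; the 100 TFLOP/s and 2 TB/s figures below are an illustrative accelerator, not a specific device:

```python
# Attainable FLOP/s under the roofline model.
def roofline(intensity, peak_flops, bandwidth):
    return min(peak_flops, bandwidth * intensity)

PEAK, BW = 100e12, 2e12          # illustrative: 100 TFLOP/s, 2 TB/s
ridge = PEAK / BW                # 50 FLOPs/byte: the ridge point

assert roofline(1, PEAK, BW) == 2e12       # below the ridge: memory-bound
assert roofline(200, PEAK, BW) == 100e12   # above the ridge: compute-bound
```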
- Separability
- A property where a function can be decomposed into simpler, independent components. Enables factorization optimizations like separable filters and low-rank approximations.
- Shared Memory (GPU)
- Fast on-chip memory accessible to all threads within a GPU thread block. Used for inter-thread communication, data reuse, and manual caching. Distinct from system shared memory.
- SIMD
- Single Instruction, Multiple Data. Processing multiple data elements with a single instruction. Essential for efficient vectorized code.
- Softmax
- The function \(\text{softmax}(x)_i = e^{x_i} / \sum_j e^{x_j}\) that converts logits to probabilities. Its decomposable structure (the (max, sum, weighted_sum) state forms a monoid) enables online/streaming computation, which is key to FlashAttention.
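The decomposable structure can be sketched directly: an online softmax carries a (running max, running sum) state per block, and two such states combine associatively, so blocks can be processed in any order or in parallel.

```python
import math

# Online softmax state: (running max m, running sum s of exp(x - m)).
def state(xs):
    m = max(xs)
    return m, sum(math.exp(x - m) for x in xs)

# Combining two block states rescales each sum to the shared max.
def combine(a, b):
    (m1, s1), (m2, s2) = a, b
    m = max(m1, m2)
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

xs = [1.0, 3.0, 2.0, 5.0, 4.0, 0.5]
m_full, s_full = state(xs)
m_blk, s_blk = combine(state(xs[:3]), state(xs[3:]))

assert m_blk == m_full and abs(s_blk - s_full) < 1e-12
# softmax(x)_i = exp(x_i - m) / s, with no second pass over x needed.
```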
- Sparsity
- Having many zero or near-zero values in a matrix or tensor. Structured sparsity (e.g., 2:4 patterns) enables hardware acceleration.
- SRAM (Static RAM)
- Fast on-chip memory used for caches and GPU shared memory. Lower latency than DRAM but limited capacity. GPU shared memory is typically 100-200KB per SM.
- State-Space Model (SSM)
- A sequence model based on linear recurrences \(h_t = Ah_{t-1} + Bx_t\), \(y_t = Ch_t\). Enables O(n) sequence processing vs. attention’s O(n²). The recurrence’s associativity enables parallel computation via scan. See the State-Space Models chapter.
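A scalar sketch of why the recurrence admits a parallel scan: each step is an affine map \(h \mapsto a h + c\) (with the input term folded into \(c\)), and composing two such maps yields another affine map, associatively.

```python
# Compose two scalar SSM steps h -> a*h + c into one map of the same form:
# applying (a1, c1) then (a2, c2) equals applying (a2*a1, a2*c1 + c2).
def compose(first, second):
    a1, c1 = first
    a2, c2 = second
    return a2 * a1, a2 * c1 + c2

steps = [(0.5, 1.0), (0.9, 2.0), (0.8, -1.0)]  # illustrative coefficients

# Sequential recurrence from h0 = 4.0:
h = 4.0
for a, c in steps:
    h = a * h + c

# Same answer from one pre-composed map (what a parallel scan builds):
a, c = steps[0]
for step in steps[1:]:
    a, c = compose((a, c), step)
assert abs((a * 4.0 + c) - h) < 1e-12
```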
- Streaming Algorithm
- An algorithm that processes data in a single pass, maintaining only a small summary state. Enabled by associative operations with bounded state.
- Symmetry
- A property where a computation is invariant or equivariant under some transformation. Enables weight sharing (CNNs), Fourier methods, and group equivariant networks. Noether’s theorem connects symmetry to conservation laws.
- Tensor Core
- Specialized hardware for matrix multiply-accumulate operations. Achieves much higher throughput than standard FP32 cores but requires specific data formats (FP16, BF16, INT8, FP8).
- Tensor Parallelism
- Distributing individual tensor operations (e.g., a single matrix multiply) across multiple GPUs. Requires high-bandwidth interconnect (NVLink). See the Distributed Training chapter.
- Throughput
- The number of operations completed per unit time. Contrast with latency (time per operation).
- Tiling
- Dividing data into blocks that fit in fast memory, processing each tile entirely before moving to the next. Reduces memory traffic at the cost of code complexity.
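Tiling can be sketched as a blocked matrix multiply: the three outer loops walk tile-sized blocks, and the inner loops do a full block of work while its operands would be resident in fast memory (pure Python here, so the cache benefit is only schematic):

```python
# Tiled matrix multiply over n x n nested lists.
def matmul_tiled(A, B, tile):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # One tile of work: operands fit in fast memory together.
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_tiled(A, B, tile=1) == [[19.0, 22.0], [43.0, 50.0]]
```

The loop structure is the same one GPU kernels use, with `tile` chosen so three tiles fit in shared memory.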
- Utilization
- The fraction of time a server is busy, \(\rho = \lambda / \mu\). As utilization approaches 1, queueing delay grows rapidly.
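The blow-up near \(\rho = 1\) can be seen with the standard M/M/1 result \(W = 1/(\mu - \lambda)\) for mean time in system (a textbook formula; the rates below are illustrative):

```python
# Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).
def mm1_time_in_system(arrival_rate, service_rate):
    assert arrival_rate < service_rate, "unstable queue (rho >= 1)"
    return 1.0 / (service_rate - arrival_rate)

mu = 100.0  # requests/s the server can handle
w50 = mm1_time_in_system(50.0, mu)   # rho = 0.50 -> 20 ms
w95 = mm1_time_in_system(95.0, mu)   # rho = 0.95 -> 200 ms
# Doubling load from 50% to 95% utilization inflates latency 10x.
assert abs(w95 / w50 - 10.0) < 1e-9
```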
- Warp
- A group of 32 threads on NVIDIA GPUs that execute in lockstep (SIMT). All threads in a warp must execute the same instruction; conditional branches cause “warp divergence” where some threads are masked.
- Working Set
- The set of data actively being accessed. Performance depends on whether the working set fits in each cache level.
Tip: Cross-Reference Guide
| Concept | Chapter |
|---|---|
| Array-oriented thinking | Thinking in Arrays |
| The six properties | The Algebraic Framework |
| Memory hierarchy | Memory Hierarchy |
| Bandwidth and roofline | Bandwidth |
| Parallelism | Parallelism |
| GPU architecture | GPU Architecture |
| Associativity | Associativity |
| Separability | Separability |
| Locality | Locality |
| Sparsity | Sparsity |
| Reversibility | Reversibility |
| Redundancy | Redundancy |
| Symmetry | Symmetry |
| Matrix multiply | Matrix Multiply |
| FlashAttention | FlashAttention |
| LoRA | LoRA |
| Quantization | Quantization |
| State-space models | State-Space Models |
| Distributed training | Distributed Training |
| Inference / KV caching | Inference Optimization |
| Mixture of Experts | Mixture of Experts |
| Long context / RoPE | Long-Context Attention |
| GPU memory management | Mastering GPU Memory |
| Queueing theory | Queueing Theory for Serving |
| Advanced serving | Advanced LLM Serving |
| Measurement | The Art of Measurement |
| Hypothesis-driven debugging | The Art of Hypothesis |
| Profiling tools | Profiling Tools |
| Operator fusion | Fusion |
| torch.compile | torch.compile Deep Dive |
| Triton programming | Writing Fast Kernels with Triton |
| Kernel frameworks | Modern GPU Kernel Frameworks |
| Hardware-specific opts | Hardware-Specific Optimization |
| Compiler optimizations | Compiler Optimizations |
| Numerical precision | Numerical Precision |
| Production case studies | Production Case Studies |