Glossary

This glossary defines key terms used throughout the book. Terms are organized alphabetically.


Arithmetic Intensity
The ratio of floating-point operations to bytes transferred from memory, measured in FLOPs per byte. High arithmetic intensity operations are compute-bound; low arithmetic intensity operations are memory-bound. See Chapter 2.
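As a rough back-of-the-envelope sketch (assuming FP32 operands and ignoring cache reuse), the arithmetic intensity of a square matrix multiply grows with the matrix dimension:

```python
def matmul_arithmetic_intensity(n: int, bytes_per_element: int = 4) -> float:
    """Rough arithmetic intensity of an n x n x n matrix multiply (FP32 assumed)."""
    flops = 2 * n**3                             # one multiply + one add per inner-product term
    bytes_moved = 3 * n**2 * bytes_per_element   # read A and B, write C (no cache reuse assumed)
    return flops / bytes_moved

print(matmul_arithmetic_intensity(4096))  # ~683 FLOPs/byte: firmly compute-bound
```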
Associativity
A mathematical property where \((a \oplus b) \oplus c = a \oplus (b \oplus c)\) for some operation \(\oplus\). Enables chunked and parallel processing because results can be combined in any order. The foundation of streaming algorithms and online computations.
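A minimal Python sketch of why associativity enables chunking: partial results computed over arbitrary chunk boundaries combine to the same answer.

```python
data = list(range(1, 101))

# Because + is associative, partial sums over any chunk boundaries
# can be combined in any order and still give the same result.
chunks_of_10 = sum(sum(data[i:i + 10]) for i in range(0, len(data), 10))
chunks_of_7 = sum(sum(data[i:i + 7]) for i in range(0, len(data), 7))
assert chunks_of_10 == chunks_of_7 == sum(data) == 5050
```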
Bandwidth
The rate at which data can be transferred, measured in bytes per second. Memory bandwidth (HBM to compute units) is often the primary performance bottleneck for modern ML workloads.
Cache Hit
When requested data is found in a faster cache level, avoiding access to slower memory. Cache hit rate strongly affects performance for memory-bound workloads.
Cache Line
The minimum unit of data transfer between memory hierarchy levels. Typically 64 bytes on modern CPUs. Accessing a single byte loads an entire cache line.
Cache Miss
When requested data is not in cache and must be fetched from a slower memory level. The cost of a miss depends on the level at which the data is found (L2, L3, or DRAM).
Compute-Bound
A workload where execution time is limited by the rate of arithmetic operations rather than memory bandwidth. Compute-bound kernels can approach peak FLOPS.
Depthwise Separable Convolution
A factorization of standard convolution into depthwise (spatial) and pointwise (channel-mixing) operations. Reduces parameters and compute by roughly a factor of the kernel size squared for typical channel counts.
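A back-of-the-envelope parameter count (the channel counts below are illustrative, not taken from the book's examples):

```python
def conv_params(c_in: int, c_out: int, k: int) -> int:
    """Parameters of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in: int, c_out: int, k: int) -> int:
    """Depthwise k x k conv (one filter per input channel) plus a 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out

c_in, c_out, k = 128, 128, 3
print(conv_params(c_in, c_out, k) / depthwise_separable_params(c_in, c_out, k))  # ~8.4x fewer
```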
DRAM
Dynamic Random Access Memory. Main system memory, slower than SRAM caches but with much larger capacity. Typical latency: 50-100 nanoseconds.
FlashAttention
An algorithm for exact attention computation that uses tiling and online softmax to avoid materializing the O(n²) attention matrix. Reduces memory from O(n²) to O(n).
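The key primitive is an online (streaming) softmax, sketched below in NumPy for a single row of scores; the function name and toy inputs are illustrative, and the real algorithm applies this update block by block on the GPU.

```python
import numpy as np

def online_softmax_weighted_sum(scores, values):
    """One-pass softmax-weighted sum with O(1) extra state:
    a running max, a running normalizer, and a running weighted accumulator."""
    m, l = -np.inf, 0.0
    acc = np.zeros_like(values[0], dtype=np.float64)
    for s, v in zip(scores, values):
        m_new = max(m, s)
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0  # rescale old state to the new max
        l = l * scale + np.exp(s - m_new)
        acc = acc * scale + np.exp(s - m_new) * v
        m = m_new
    return acc / l

scores = np.array([0.5, 2.0, -1.0])
values = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
weights = np.exp(scores - scores.max()) / np.exp(scores - scores.max()).sum()
assert np.allclose(online_softmax_weighted_sum(scores, values), weights @ values)
```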
FLOPS
Floating-Point Operations Per Second. A measure of computational throughput. Modern GPUs achieve hundreds of teraFLOPS (one teraFLOP/s = 10¹² operations per second).
Fused Kernel
A GPU kernel that combines multiple operations into a single pass over data, reducing memory traffic. Example: fusing ReLU into matrix multiply.
HBM (High Bandwidth Memory)
GPU memory with high bandwidth (3+ TB/s on modern GPUs) but limited capacity (40-80GB). The primary memory for large tensors.
L1/L2/L3 Cache
Hierarchical on-chip memory with decreasing speed and increasing size. L1 is fastest (1-4 cycles), L3 is slowest but largest.
Latency
Time to complete a single operation or request. Contrast with throughput (operations per second).
LoRA (Low-Rank Adaptation)
A fine-tuning technique that freezes pretrained weights and adds trainable low-rank matrices. Exploits the observation that weight updates during fine-tuning are approximately low-rank.
Low-Rank Approximation
Approximating an \(m \times n\) matrix as \(W \approx AB\), where \(A\) is \(m \times r\) and \(B\) is \(r \times n\) with \(r \ll \min(m, n)\). Reduces parameters from \(mn\) to \((m+n)r\).
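A small NumPy sketch of the parameter savings and of applying a LoRA-style low-rank update without ever materializing it (dimensions and rank are illustrative):

```python
import numpy as np

m, n, r = 4096, 4096, 16
dense_params = m * n              # 16,777,216 parameters
low_rank_params = (m + n) * r     # 131,072 parameters, a 128x reduction

rng = np.random.default_rng(0)
W = rng.standard_normal((m, n)).astype(np.float32)        # frozen pretrained weight
A = rng.standard_normal((m, r)).astype(np.float32) * 0.01
B = rng.standard_normal((r, n)).astype(np.float32) * 0.01
x = rng.standard_normal(n).astype(np.float32)

y = W @ x + A @ (B @ x)  # apply the rank-r update without forming the m x n matrix AB
```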
Memory-Bound
A workload where execution time is limited by the rate of data transfer rather than computation. Most ML inference workloads are memory-bound.
Memory Hierarchy
The organization of memory into levels of decreasing speed and increasing capacity: registers → L1 → L2 → L3 → DRAM → SSD → disk.
Mixture of Experts (MoE)
An architecture where only a subset of model parameters are activated for each input. Enables larger models without proportional compute increase.
Monoid
An algebraic structure with an associative operation and identity element. The theoretical foundation for parallel reductions and streaming algorithms.
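A tiny illustration in Python: max with identity \(-\infty\) forms a monoid, so chunked and left-to-right reductions agree.

```python
from functools import reduce

identity = float("-inf")          # identity element: max(x, -inf) == x
data = [3.0, 1.0, 4.0, 1.0, 5.0]

left_to_right = reduce(max, data, identity)
chunked = max(reduce(max, data[:2], identity), reduce(max, data[2:], identity))
assert left_to_right == chunked == 5.0   # associativity makes any grouping equivalent
```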
Operator Fusion
Combining multiple operations into a single kernel to reduce memory traffic. A key optimization in deep learning compilers like torch.compile.
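A minimal sketch of requesting fusion with torch.compile (assuming PyTorch 2.x; the function below is illustrative):

```python
import torch

def mlp(x, w1, w2):
    # Run eagerly, this launches separate kernels (matmul, GELU, matmul), each writing
    # its intermediate to HBM; a compiler can fuse the elementwise GELU into a neighbor.
    return torch.nn.functional.gelu(x @ w1) @ w2

compiled_mlp = torch.compile(mlp)  # compilation (and fusion) happens on the first call
```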
Prefetching
Loading data into cache before it’s needed, hiding memory latency. Hardware prefetchers detect streaming access patterns; software can provide explicit hints.
Quantization
Representing numbers with fewer bits (e.g., FP16 instead of FP32, or INT8 instead of FP16). Reduces memory footprint and bandwidth, sometimes with specialized hardware support.
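A minimal sketch of symmetric per-tensor INT8 quantization in NumPy (one scale for the whole tensor; real schemes are usually per-channel or per-group):

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: int8 values plus one FP32 scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

x = np.random.randn(1024).astype(np.float32)
q, scale = quantize_int8(x)
x_hat = q.astype(np.float32) * scale   # dequantize
print(x.nbytes, q.nbytes)              # 4096 vs 1024 bytes: 4x less memory traffic
```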
Roofline Model
A visual model showing maximum achievable performance as a function of arithmetic intensity. Performance is bounded by min(peak FLOPS, bandwidth × intensity).
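The bound itself is a one-liner; the peak-FLOPS and bandwidth numbers below are illustrative, not tied to any specific GPU:

```python
def roofline(intensity: float,
             peak_flops: float = 100e12,       # assumed 100 TFLOP/s compute peak
             bandwidth: float = 2e12) -> float:  # assumed 2 TB/s memory bandwidth
    """Attainable FLOP/s at a given arithmetic intensity (FLOPs/byte)."""
    return min(peak_flops, bandwidth * intensity)

ridge_point = 100e12 / 2e12   # 50 FLOPs/byte: below this a kernel is memory-bound
print(roofline(1.0), roofline(200.0))
```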
Separability
A property where a function can be decomposed into simpler, independent components. Enables factorization optimizations like separable filters and low-rank approximations.
SIMD
Single Instruction, Multiple Data. Processing multiple data elements with a single instruction. Essential for efficient vectorized code.
Sparsity
Having many zero or near-zero values in a matrix or tensor. Structured sparsity (e.g., 2:4 patterns) enables hardware acceleration.
SRAM (Static RAM)
Fast on-chip memory used for caches and GPU shared memory. Lower latency than DRAM but limited capacity. GPU shared memory is typically 100-200KB per SM.
Streaming Algorithm
An algorithm that processes data in a single pass, maintaining only a small summary state. Enabled by associative operations with bounded state.
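A minimal single-pass example: computing a mean with O(1) summary state.

```python
def streaming_mean(stream):
    """Single-pass mean: the summary state is just (count, running total)."""
    count, total = 0, 0.0
    for x in stream:
        count += 1
        total += x
    return total / count

assert streaming_mean(iter(range(1, 101))) == 50.5   # works on a one-shot iterator
```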
Tensor Core
Specialized hardware for matrix multiply-accumulate operations. Achieves much higher throughput than standard FP32 cores but requires specific data formats.
Tiling
Dividing data into blocks that fit in fast memory, processing each tile entirely before moving to the next. Reduces memory traffic at the cost of code complexity.
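A NumPy sketch of a blocked (tiled) matrix multiply; the tile size is illustrative and would normally be chosen to fit the target cache or shared memory:

```python
import numpy as np

def tiled_matmul(A: np.ndarray, B: np.ndarray, tile: int = 64) -> np.ndarray:
    """Blocked matmul: each tile of A, B, and C is reused while it sits in fast memory."""
    n = A.shape[0]
    C = np.zeros((n, n), dtype=A.dtype)
    for i in range(0, n, tile):
        for j in range(0, n, tile):
            for k in range(0, n, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, k:k+tile] @ B[k:k+tile, j:j+tile]
    return C

A = np.random.randn(256, 256)
B = np.random.randn(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)
```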
Throughput
The number of operations completed per unit time. Contrast with latency (time per operation).
Working Set
The set of data actively being accessed. Performance depends on whether the working set fits in each cache level.

Tip: Cross-Reference Guide

| Concept | Primary Chapter |
|---|---|
| Memory hierarchy | Chapter 1 |
| Bandwidth and roofline | Chapter 2 |
| Parallelism | Chapter 3 |
| Associativity and chunking | Chapter 4 |
| Separability and factoring | Chapter 5 |
| Sparsity and skipping | Chapter 6 |
| Locality and tiling | Chapter 7 |
| FlashAttention | Chapter 10 |
| LoRA | Chapter 12 |
| Quantization | Chapter 13 |
| Measurement | Chapter 15 |