Glossary
A
Activation Recomputation (also: Gradient Checkpointing): Technique that discards intermediate activations during the forward pass and recomputes them during the backward pass, trading ~33% extra compute for significant memory savings.
AllGather: Collective operation where each process contributes a shard; result is the concatenation of all shards on all processes.
AllReduce: Collective operation combining values from all processes and distributing the result to all. Equivalent to ReduceScatter followed by AllGather for associative/commutative reductions (algorithmic equivalence, not a semantic inverse).
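The decomposition above can be illustrated with an in-memory sketch (plain Python lists standing in for ranks; no real communication, sum reduction assumed):

```python
# Sketch: simulate AllReduce as ReduceScatter followed by AllGather
# to illustrate the algorithmic equivalence for a sum reduction.

def all_reduce(ranks):
    """Reference AllReduce: every rank ends with the elementwise sum."""
    total = [sum(vals) for vals in zip(*ranks)]
    return [total[:] for _ in ranks]

def reduce_scatter_then_all_gather(ranks):
    n = len(ranks)                      # number of ranks
    shard = len(ranks[0]) // n          # assume length divisible by n
    # ReduceScatter: rank i ends up with the reduced i-th shard.
    shards = [
        [sum(r[i * shard + j] for r in ranks) for j in range(shard)]
        for i in range(n)
    ]
    # AllGather: concatenate all reduced shards on every rank.
    full = [x for s in shards for x in s]
    return [full[:] for _ in ranks]

ranks = [[1, 2, 3, 4], [10, 20, 30, 40]]
assert all_reduce(ranks) == reduce_scatter_then_all_gather(ranks)
```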
All-to-All (AlltoAll): Collective operation performing a transpose; each process sends different data to each other process.
Arithmetic Intensity: Ratio of FLOPs to bytes accessed from memory. Determines whether an operation is compute-bound or memory-bound.
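A worked example, assuming fp16 operands and an idealized kernel that reads each input and writes the output exactly once (real kernels differ due to caching and re-reads):

```python
# Sketch: arithmetic intensity of an M×K @ K×N matmul in fp16
# (2 bytes/element), under ideal one-read/one-write memory traffic.

def matmul_intensity(M, K, N, bytes_per_el=2):
    flops = 2 * M * K * N                            # multiply-accumulate = 2 FLOPs
    bytes_moved = bytes_per_el * (M * K + K * N + M * N)
    return flops / bytes_moved

# Large square matmul: strongly compute-bound on most GPUs.
print(matmul_intensity(4096, 4096, 4096))   # ≈ 1365 FLOPs/byte
# "GEMV-like" matmul (batch 1): memory-bound.
print(matmul_intensity(1, 4096, 4096))      # ≈ 1 FLOP/byte
```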
B
Bubble: Idle time in pipeline parallelism caused by pipeline startup/teardown.
Bucket: Collection of small tensors aggregated for a single AllReduce to amortize latency.
Bus Bandwidth (busbw): Effective bandwidth metric that normalizes collective communication time by the algorithm-specific correction factor, allowing fair comparison across different collective operations and GPU counts.
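A minimal sketch of the busbw computation, using the correction factors from the nccl-tests conventions (the numeric inputs below are hypothetical):

```python
# Sketch: busbw = algbw * correction_factor, where algbw = size / time.
# Factors follow nccl-tests: AllReduce moves ~2(n-1)/n of the data per
# link; AllGather and ReduceScatter move (n-1)/n.

def busbw(size_bytes, time_s, n_ranks, collective):
    algbw = size_bytes / time_s
    factor = {
        "allreduce":     2 * (n_ranks - 1) / n_ranks,
        "allgather":     (n_ranks - 1) / n_ranks,
        "reducescatter": (n_ranks - 1) / n_ranks,
    }[collective]
    return algbw * factor

# 1 GiB AllReduce across 8 GPUs in 5 ms (hypothetical timing):
print(busbw(1 << 30, 5e-3, 8, "allreduce") / 1e9)  # ≈ 375.8 GB/s
```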
C
Capacity Factor: In MoE models, the ratio of expert buffer size to the expected number of tokens per expert. A capacity factor of 1.0 means each expert can handle exactly its fair share; values >1.0 allow for routing imbalance at the cost of memory.
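The buffer-size arithmetic can be sketched as follows, assuming top-k routing of a batch of tokens over a fixed expert count:

```python
import math

# Sketch: expert buffer slots from the capacity factor. Each of
# `tokens` tokens is routed to `top_k` experts, so the fair share per
# expert is tokens * top_k / n_experts; the capacity factor adds slack.

def expert_capacity(tokens, n_experts, top_k, capacity_factor):
    return math.ceil(capacity_factor * tokens * top_k / n_experts)

# 8192 tokens, 64 experts, top-2 routing, 25% slack for imbalance:
print(expert_capacity(8192, 64, 2, 1.25))  # 320 slots per expert
```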
Chinchilla Scaling: Compute-optimal training where tokens ≈ 20× parameters.
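The rule of thumb, combined with the common C ≈ 6·N·D training-FLOP approximation, gives a quick budget estimate (a sketch; real compute-optimal exponents vary by study):

```python
# Sketch: Chinchilla-style token budget (tokens ≈ 20 × params) and the
# implied training compute via C ≈ 6·N·D (forward + backward).

def chinchilla_budget(params):
    tokens = 20 * params
    flops = 6 * params * tokens
    return tokens, flops

tokens, flops = chinchilla_budget(70e9)   # a 70B-parameter model
print(f"{tokens:.2e} tokens, {flops:.2e} FLOPs")  # 1.40e+12 tokens, 5.88e+23 FLOPs
```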
Communication Intensity: Ratio of FLOPs to bytes communicated over network. Determines whether an operation is network-bound.
Context Parallelism (CP): Parallelism strategy splitting the sequence dimension across devices specifically for the attention computation, enabling long-context training (e.g., via Ring Attention).
Critical Batch Size: Batch size at which gradient noise equals gradient signal. Beyond this, larger batches yield diminishing returns.
D
Data Parallelism (DP): Parallelism strategy replicating the model across devices, splitting data. Synchronizes via gradient AllReduce.
Device Mesh: N-dimensional abstraction for organizing devices with different parallelism strategies along each axis.
DiLoCo: Distributed Low-Communication training. An approach where workers perform local SGD steps with infrequent global synchronization, reducing inter-node communication.
DualPipe: Advanced pipeline parallelism technique (DeepSeek-V3) that interleaves two pipelines to approximately halve the bubble fraction compared to standard 1F1B.
E
Expert Parallelism (EP): Parallelism strategy distributing MoE experts across devices.
F
FP8: 8-bit floating-point format for training. Two variants: E4M3 (4-bit exponent, 3-bit mantissa) for forward pass activations, and E5M2 (5-bit exponent, 2-bit mantissa) for gradients. Requires per-tensor or per-block scaling factors.
FlashAttention: IO-aware exact attention algorithm that avoids materializing the full \(O(S^2)\) attention matrix by tiling the computation to exploit GPU SRAM, reducing memory from \(O(S^2)\) to \(O(S)\).
FSDP: Fully Sharded Data Parallel. PyTorch's ZeRO-style sharding (supports multiple sharding modes, including ZeRO-3-like).
G
GQA (Grouped-Query Attention): Attention variant where multiple query heads share a single key-value head, reducing KV cache memory and parameter count while preserving quality. Used in LLaMA 2/3, Mistral, etc.
Gradient Accumulation: Computing gradients over multiple micro-batches before synchronization, effectively increasing batch size.
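The equivalence to full-batch gradients can be sketched with a toy scalar model y ≈ w·x under MSE loss (averaging equal-size micro-batch gradients reproduces the full-batch mean gradient):

```python
# Sketch: gradient accumulation over micro-batches matches the
# full-batch gradient for a toy scalar model y ≈ w·x with MSE loss.

def grad(w, xs, ys):
    """Mean gradient of (w·x − y)² over a batch."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.1, 5.9, 8.2]
w = 0.5

full = grad(w, xs, ys)
# Accumulate over two equal-size micro-batches, then average.
micro = (grad(w, xs[:2], ys[:2]) + grad(w, xs[2:], ys[2:])) / 2
assert abs(full - micro) < 1e-12
```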
Gradient Checkpointing: See Activation Recomputation.
H
HBM: High Bandwidth Memory. Stacked DRAM used as GPU device memory; recent datacenter GPUs reach roughly 2–3.35 TB/s (e.g., A100 HBM2e ~2 TB/s, H100 HBM3 ~3.35 TB/s).
L
LARS (Layer-wise Adaptive Rate Scaling) and LAMB (Layer-wise Adaptive Moments for Batch training): Optimizers that normalize the update magnitude per layer, enabling stable large-batch training.
Loss Scaling: Technique in mixed-precision training where the loss is multiplied by a large factor before the backward pass to prevent small gradients from underflowing to zero in FP16, then unscaled before the optimizer step.
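A numeric sketch of why this matters: fp16's smallest subnormal is ~6e-8, so a smaller gradient underflows to zero unless scaled first (the scale of 1024 below is illustrative):

```python
import numpy as np

# Sketch: a gradient below fp16's smallest subnormal (~6e-8) underflows
# to zero; scaling the loss (and hence gradients) keeps it representable,
# after which we unscale in fp32 before the optimizer step.

grad_true = 1e-8
scale = 1024.0

unscaled = np.float16(grad_true)            # underflows
scaled = np.float16(grad_true * scale)      # representable as fp16 subnormal
recovered = np.float32(scaled) / scale      # unscale in fp32

print(unscaled)           # 0.0
print(float(recovered))   # ≈ 1e-8
assert unscaled == 0.0
assert abs(float(recovered) - grad_true) < 1e-9
```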
M
MFU: Model FLOP Utilization. Ratio of achieved model FLOP/s (useful computation only) to theoretical peak FLOP/s.
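A sketch of the computation, using the common 6·N·D approximation for transformer training FLOPs (the throughput number below is hypothetical):

```python
# Sketch: MFU from the 6·N·D FLOP approximation (forward + backward),
# against a peak of ~989 TFLOP/s (H100 BF16, dense).

def mfu(params, tokens_per_sec, peak_flops_per_sec):
    achieved = 6 * params * tokens_per_sec   # useful model FLOP/s
    return achieved / peak_flops_per_sec

# 7B model at a hypothetical 8,000 tokens/s per GPU:
print(f"{mfu(7e9, 8_000, 989e12):.1%}")  # ≈ 34.0%
```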
Micro-batch: Subdivision of the global batch processed in a single forward/backward pass; the unit of scheduling in pipeline parallelism and gradient accumulation.
MLA (Multi-head Latent Attention): Attention variant (DeepSeek-V2/V3) that compresses key-value projections through a learned low-rank latent space, dramatically reducing KV cache memory.
MoE (Mixture of Experts): Architecture with multiple parallel FFN "experts" per layer, where a routing mechanism selects a sparse subset (top-k) of experts per token, enabling larger models without proportionally increasing compute.
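The routing step can be sketched as top-k selection over router logits with gate weights renormalized over the selected experts (shapes and names below are illustrative, not any particular implementation):

```python
import numpy as np

# Sketch: top-k softmax routing for an MoE layer. Each token picks its
# k highest-scoring experts; gates are a softmax over the selected logits.

def top_k_route(logits, k):
    """logits: (tokens, n_experts) router scores."""
    idx = np.argsort(logits, axis=-1)[:, -k:]            # top-k expert ids
    top = np.take_along_axis(logits, idx, axis=-1)
    gates = np.exp(top - top.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)           # normalize over top-k
    return idx, gates

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))          # 4 tokens, 8 experts
idx, gates = top_k_route(logits, k=2)
assert idx.shape == (4, 2)
assert np.allclose(gates.sum(axis=-1), 1.0)
```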
N
NVLink: High-speed interconnect within a node. NVLink 4.0 (H100) provides ~900 GB/s bidirectional per GPU with NVSwitch; NVLink 3.0 (A100) provides ~600 GB/s.
P
Pipeline Parallelism (PP): Parallelism strategy splitting model layers across devices.
R
ReduceScatter: Collective operation reducing values and scattering shards to each process.
Ring Attention: Context parallelism technique that distributes key-value blocks across GPUs arranged in a logical ring. Each GPU computes attention against its local KV block while simultaneously sending/receiving blocks from neighbors, overlapping communication with computation.
Ring Algorithm: Bandwidth-optimal collective algorithm organizing processes in a logical ring.
RoPE (Rotary Position Embedding): Position encoding method applying rotation matrices to query and key vectors, enabling relative position awareness and length extrapolation.
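A minimal sketch of the rotation, using the rotate-half layout and the angles θ_i = 10000^(−2i/d) from the original formulation; it demonstrates that norms are preserved and that query-key dot products depend only on relative position:

```python
import numpy as np

# Sketch: minimal RoPE for a vector of head_dim d at position pos.
# Dimension pairs are rotated by pos·θ_i with θ_i = 10000^(−2i/d).

def rope(x, pos, base=10000.0):
    d = x.shape[-1]
    half = d // 2
    theta = base ** (-np.arange(half) * 2.0 / d)
    ang = pos * theta
    x1, x2 = x[..., :half], x[..., half:]      # rotate-half layout
    return np.concatenate(
        [x1 * np.cos(ang) - x2 * np.sin(ang),
         x1 * np.sin(ang) + x2 * np.cos(ang)], axis=-1)

q = np.arange(8, dtype=np.float64)
k = np.ones(8)
# Rotation preserves the vector norm...
assert np.isclose(np.linalg.norm(rope(q, 5)), np.linalg.norm(q))
# ...and ⟨rope(q, m), rope(k, n)⟩ depends only on m − n.
assert np.isclose(rope(q, 7) @ rope(k, 3), rope(q, 5) @ rope(k, 1))
```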
Roofline Model: Performance analysis framework bounding throughput by compute ceiling, memory ceiling, or (extended) network ceiling.
S
Sequence Parallelism (SP): Parallelism strategy splitting activations along the sequence dimension; in Megatron-style training it shards the LayerNorm and dropout regions that tensor parallelism would otherwise replicate.
Speculative Decoding: Inference optimization using a smaller "draft" model to generate candidate tokens that are verified in parallel by the larger model, improving latency without changing outputs.
SwiGLU: Activation function combining Swish and Gated Linear Unit: \(\text{SwiGLU}(x) = \text{Swish}(xW_1) \odot (xW_2)\). Widely adopted in modern LLMs (LLaMA, Mistral, DeepSeek) for improved quality over standard GELU, at the cost of an extra linear projection.
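The formula can be sketched directly (random weights are illustrative; the FFN's output projection is omitted):

```python
import numpy as np

# Sketch: SwiGLU(x) = Swish(x·W1) ⊙ (x·W2), with Swish(z) = z·σ(z).
# W1 and W2 are the two input projections of a gated FFN.

def swiglu(x, W1, W2):
    z = x @ W1
    swish = z / (1.0 + np.exp(-z))    # Swish / SiLU: z·sigmoid(z)
    return swish * (x @ W2)           # elementwise gate

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 16))          # (tokens, d_model)
W1 = rng.normal(size=(16, 32))        # d_model → d_ff
W2 = rng.normal(size=(16, 32))
assert swiglu(x, W1, W2).shape == (4, 32)
```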
T
Tensor Parallelism (TP): Parallelism strategy splitting individual tensor operations (matrix multiplications) across devices.
Z
ZeRO: Zero Redundancy Optimizer. Memory optimization sharding optimizer states (ZeRO-1), gradients (ZeRO-2), and parameters (ZeRO-3) across data parallel ranks.
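The per-GPU memory accounting can be sketched with the ZeRO paper's mixed-precision Adam byte counts (2Ψ fp16 params + 2Ψ fp16 grads + 12Ψ bytes of fp32 optimizer state; model size and rank count below are illustrative):

```python
# Sketch: per-GPU bytes for Ψ parameters under mixed-precision Adam,
# sharded over N data-parallel ranks by ZeRO stage.

def zero_bytes_per_gpu(psi, n, stage):
    params, grads, opt = 2 * psi, 2 * psi, 12 * psi
    if stage >= 1: opt /= n      # ZeRO-1: shard optimizer states
    if stage >= 2: grads /= n    # ZeRO-2: also shard gradients
    if stage >= 3: params /= n   # ZeRO-3: also shard parameters
    return params + grads + opt

psi, n = 7e9, 64                 # 7B model over 64 ranks
for s in range(4):
    print(f"ZeRO-{s}: {zero_bytes_per_gpu(psi, n, s) / 1e9:.1f} GB")
# ZeRO-0: 112.0, ZeRO-1: 29.3, ZeRO-2: 15.5, ZeRO-3: 1.8
```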