Notation and Minimal Formalism
This appendix defines the core symbols used throughout the book. When in doubt, refer back here.
Terminology and local aliases
We use All-to-All as the canonical prose term for the many-to-many transpose collective (you may also see API-style names such as AlltoAll or all_to_all in code).
Chapter-level notation banners may introduce local aliases such as \(B_{\text{tok}} = B \cdot S\) for batch tokens while preserving the global symbols defined below.
Minimal Formalism¶
We model distributed training as the interaction of three resources:
- Memory (what must fit on each device)
- Compute (FLOPs per step per device)
- Communication (bytes transferred across links)
The basic cost model is:

\[
T_{\text{step}} = T_{\text{compute}} + T_{\text{comm}}
\]

With overlap, the effective step time is:

\[
T_{\text{step}} = \max\!\left(T_{\text{compute}},\, T_{\text{comm}}\right)
\]

For communication, we use the α-β model: sending a message of \(n\) bytes costs

\[
T(n) = \alpha + \frac{n}{\beta}
\]

where \(\alpha\) is the per-message latency and \(\beta\) is the link bandwidth.
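As a quick illustration, here is a minimal Python sketch of the α-β model; the latency and bandwidth values are illustrative placeholders, not measurements:

```python
# Minimal sketch of the alpha-beta communication model.
# alpha: per-message latency (s); beta: bandwidth (bytes/s).
# Values are illustrative placeholders, not measurements.

alpha = 5e-6   # ~5 microseconds per message
beta = 50e9    # ~50 GB/s effective bandwidth (~400 Gb/s link)

def comm_time(n_bytes: float) -> float:
    """Time to send one message of n_bytes under the alpha-beta model."""
    return alpha + n_bytes / beta

# Crossover point n* = alpha * beta: below it latency dominates,
# above it bandwidth dominates.
n_star = alpha * beta
print(f"n* = {n_star:.0f} bytes")                   # 250000 bytes
print(f"T(1 KB)   = {comm_time(1e3) * 1e6:.1f} us") # latency-bound
print(f"T(100 MB) = {comm_time(1e8) * 1e3:.1f} ms") # bandwidth-bound
```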
Core Symbols (All Parts)¶
| Symbol | Meaning | Typical Units |
|---|---|---|
| \(\Psi\) | Number of model parameters | count |
| \(D\) | Training tokens (dataset size) | tokens |
| \(C\) | Total compute budget | FLOPs |
| \(B\) | Global batch size (sequences) | sequences/step |
| \(b\) | Per-GPU batch size (sequences) | sequences/step |
| \(S\) | Sequence length | tokens |
| \(H\) | Hidden dimension | dimension |
| \(L\) | Number of transformer layers | count |
| \(A\) | Number of attention heads | count |
| \(V\) | Vocabulary size | count |
| \(P\) | Total number of GPUs / processes | count |
| \(F\) | Peak throughput per GPU | FLOP/s |
| MFU | Model FLOP Utilization | ratio |
| HFU | Hardware FLOP Utilization (includes recompute) | ratio |
| \(T\) | Time (context-dependent: step, total, etc.) | seconds |
| \(R\) | Hourly rate per GPU | USD/hr |
Total tokens per step are \(B \cdot S\) unless stated otherwise.
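To show how these symbols compose, the following sketch estimates training time and cost from the standard \(C \approx 6\Psi D\) dense-transformer FLOPs approximation; every numeric input is an illustrative assumption:

```python
# Sketch: composing the core symbols into a time/cost estimate,
# using the standard C ~= 6 * Psi * D approximation for dense
# transformer training FLOPs. All inputs are illustrative.

psi = 70e9         # Psi: parameters
d_tokens = 1.4e12  # D: training tokens
p_gpus = 1024      # P: GPUs
f_peak = 989e12    # F: peak FLOP/s per GPU (H100 SXM, BF16 dense)
mfu = 0.40         # MFU: model FLOP utilization (assumed)
r_hourly = 2.50    # R: USD per GPU-hour (assumed)

c_total = 6 * psi * d_tokens                   # C: total FLOPs
t_seconds = c_total / (p_gpus * f_peak * mfu)  # T: wall-clock time
gpu_hours = p_gpus * t_seconds / 3600

print(f"C = {c_total:.2e} FLOPs")
print(f"T = {t_seconds / 86400:.1f} days")         # ~16.8 days
print(f"Cost ~ ${gpu_hours * r_hourly:,.0f}")      # ~$1.0M
```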
Scaling Laws (Part II)¶
Symbol reuse
In Part II, \(\alpha\) and \(\beta\) denote scaling-law exponents, and \(A\), \(B\) denote scaling-law constants (not attention heads and batch size). From Part III onward, \(\alpha\) and \(\beta\) denote network latency and bandwidth. The meaning should be clear from context.
| Symbol | Meaning | Typical Values |
|---|---|---|
| \(\alpha\) | Parameter scaling exponent | 0.34 (Chinchilla) |
| \(\beta\) | Data scaling exponent | 0.28 (Chinchilla) |
| \(A, B\) | Scaling law constants | ~400 |
| \(L_\infty\) | Irreducible loss | ~1.69 nats |
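These symbols combine into the Chinchilla-style parametric loss. A minimal sketch in this book's notation, with the published Chinchilla fits (rounded) as constants:

```python
# Sketch: Chinchilla-style parametric loss,
# L(Psi, D) = L_inf + A / Psi**alpha + B / D**beta.
# Constants are the published Chinchilla fits (rounded).

L_INF = 1.69          # irreducible loss (nats)
A, B = 406.4, 410.7   # scaling law constants
ALPHA, BETA = 0.34, 0.28

def loss(psi: float, d: float) -> float:
    """Predicted loss for psi parameters trained on d tokens."""
    return L_INF + A / psi**ALPHA + B / d**BETA

print(f"{loss(70e9, 1.4e12):.3f} nats")  # ~1.94 at Chinchilla's scale
```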
Communication (Parts III–VIII)¶
| Symbol | Meaning | Typical Units |
|---|---|---|
| \(\alpha\) | Network latency (per message) | seconds (~μs) |
| \(\beta\) | Network bandwidth | bytes/s |
| \(n\) | Message size | bytes |
| \(n^*\) | Crossover point (\(\alpha \cdot \beta\)) | bytes |
| \(I_{\text{net}}\) | Communication intensity (FLOPs/byte communicated) | FLOPs/byte |
| \(I_{\text{mem}}\) | Memory intensity (FLOPs/byte from HBM) | FLOPs/byte |
| \(I_{\text{io}}\) | Data I/O intensity | FLOPs/byte |
| \(b_{\text{tok}}\) | Bytes per token (from storage) | bytes/token |
| \(\rho\) | Read amplification | ratio |
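These intensity ratios are used by comparing them to a machine balance point. A hedged sketch of that comparison for \(I_{\text{net}}\), with illustrative hardware numbers:

```python
# Sketch: classifying a phase as communication-bound. A phase is
# communication-bound when its intensity I_net (FLOPs per byte sent)
# falls below the machine balance F / beta. Numbers are illustrative.

f_peak = 989e12  # F: peak FLOP/s per GPU
beta = 50e9      # network bandwidth, bytes/s

machine_balance = f_peak / beta  # FLOPs the GPU can do per byte sent

def is_comm_bound(flops: float, bytes_sent: float) -> bool:
    i_net = flops / bytes_sent
    return i_net < machine_balance

# Example phase: 1e15 FLOPs of compute against 1e11 bytes of traffic.
print(f"{machine_balance:.0f} FLOPs/byte")  # ~19780
print(is_comm_bound(1e15, 1e11))            # True: I_net = 1e4 < balance
```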
Parallelism Dimensions (Parts IV–VI)¶
| Symbol | Meaning | Typical Range |
|---|---|---|
| DP | Data parallelism degree | 1–512 |
| TP | Tensor parallelism degree | 1–8 (within node) |
| PP | Pipeline parallelism degree | 1–64 |
| CP | Context (sequence) parallelism degree | 1–32 |
| EP | Expert parallelism degree | 1–64 |
| \(G\) | GPUs per node | 4 or 8 |
| \(N\) | Number of nodes (in scaling-law literature, \(N\) may instead denote parameter count) | 1–1000s |
| \(p\) | Pipeline stages | count |
| \(m\) | Micro-batches per pipeline batch | count |
| \(v\) | Virtual pipeline stages (interleaved schedule) | count |
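The pipeline symbols \(p\), \(m\), and \(v\) combine into the standard bubble-fraction formula. A minimal sketch, expressing the bubble as a fraction of total step time under a GPipe/1F1B-style schedule (the \(v\) adjustment follows the interleaved-schedule analysis):

```python
# Sketch: pipeline bubble as a fraction of total step time for a
# GPipe/1F1B-style schedule with p stages, m micro-batches, and
# v interleaved virtual stages per worker (v=1: no interleaving).

def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Fraction of the step spent idle in pipeline bubbles."""
    return (p - 1) / (v * m + p - 1)

print(f"{bubble_fraction(p=8, m=32):.3f}")       # 0.179
print(f"{bubble_fraction(p=8, m=32, v=4):.3f}")  # 0.052
```

More micro-batches (larger \(m\)) or more virtual stages (larger \(v\)) shrink the bubble, which is why both knobs appear throughout the pipeline-parallelism chapters.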
Memory (Part V)¶
| Symbol | Meaning | Notes |
|---|---|---|
| \(M_{\text{params}}\) | Memory for parameters | \(2\Psi\) bytes (FP16) |
| \(M_{\text{grad}}\) | Memory for gradients | \(2\Psi\) bytes (FP16) |
| \(M_{\text{opt}}\) | Memory for optimizer states | \(12\Psi\) bytes (AdamW FP32) |
| \(M_{\text{act}}\) | Activation memory | Depends on \(B, S, H, L\) |
| \(k\) | Checkpoint interval (layers between checkpoints) | \(\sqrt{L}\) optimal |
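A minimal sketch composing these terms for unsharded mixed-precision AdamW training (2 + 2 + 12 = 16 bytes per parameter of model states); activation memory is omitted since it depends on \(B, S, H, L\) and the checkpointing scheme:

```python
import math

# Sketch: per-GPU memory for unsharded mixed-precision AdamW:
# 2 (FP16 params) + 2 (FP16 grads) + 12 (FP32 master weights
# plus two FP32 moments) = 16 bytes per parameter.

def model_state_bytes(psi: float) -> float:
    return (2 + 2 + 12) * psi

def optimal_checkpoint_interval(num_layers: int) -> int:
    """k ~= sqrt(L): checkpoint every k layers, trading O(L/k)
    stored activations for one extra forward recomputation."""
    return max(1, round(math.sqrt(num_layers)))

psi = 7e9  # 7B parameters (illustrative)
print(f"{model_state_bytes(psi) / 1e9:.0f} GB of model states")  # 112 GB
print(f"k = {optimal_checkpoint_interval(80)}")                  # k = 9
```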
Expert Parallelism (Part IV, Ch. 18)¶
| Symbol | Meaning | Notes |
|---|---|---|
| \(E\) | Total number of experts | 8–256 |
| \(k\) | Number of active experts per token | 1–4 |
| \(C_f\) | Capacity factor | 1.0–1.5 |
| \(f_i\) | Fraction of tokens routed to expert \(i\) | ratio |
| \(p_i\) | Average router probability for expert \(i\) | ratio |
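The \(f_i\) and \(p_i\) terms are the ingredients of the Switch-Transformer-style auxiliary load-balancing loss, \(\mathcal{L}_{\text{aux}} = E \sum_i f_i \, p_i\) (the small scaling coefficient usually multiplied in front is omitted here). A minimal sketch:

```python
# Sketch: Switch-Transformer-style auxiliary load-balancing loss,
# L_aux = E * sum_i(f_i * p_i). It attains its minimum (1.0) when
# routing is perfectly uniform: f_i = p_i = 1/E for every expert.

def aux_loss(f: list[float], p: list[float]) -> float:
    """f[i]: fraction of tokens routed to expert i (sums to 1).
    p[i]: mean router probability for expert i (sums to 1)."""
    E = len(f)
    return E * sum(fi * pi for fi, pi in zip(f, p))

uniform = [1 / 4] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(aux_loss(uniform, uniform))  # 1.0  (balanced)
print(aux_loss(skewed, skewed))    # 2.08 (imbalanced -> penalized)
```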
Efficiency (Part VII)¶
| Symbol | Meaning | Notes |
|---|---|---|
| \(w\) | Sliding window size | tokens |
| \(g\) | Number of KV heads (GQA) | \(g \leq A\) |
| \(r\) | Rank (PowerSGD) or repetition ratio | context-dependent |
| \(\tau\) | Staleness (async SGD) | steps |
| \(H\) | Sync interval (Local SGD) | steps |
| \(s\) | Quantization levels | count |
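As one example of how these symbols are used, here is a sketch of per-token KV-cache size under grouped-query attention, assuming head dimension \(H/A\) and FP16 storage (the model shapes are illustrative, Llama-2-70B-like):

```python
# Sketch: KV-cache bytes per token with grouped-query attention.
# Using g KV heads instead of A query heads shrinks the cache by A/g.
# Assumes head_dim = H / A and FP16 (2 bytes per element).

def kv_cache_bytes_per_token(L: int, H: int, A: int, g: int,
                             dtype_bytes: int = 2) -> int:
    head_dim = H // A
    return 2 * L * g * head_dim * dtype_bytes  # factor 2 = K and V

# Illustrative Llama-2-70B-like shapes: L=80, H=8192, A=64, g=8.
full = kv_cache_bytes_per_token(80, 8192, 64, g=64)
gqa = kv_cache_bytes_per_token(80, 8192, 64, g=8)
print(full // 1024, "KiB/token ->", gqa // 1024, "KiB/token")  # 2560 -> 320
```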
Interpretation Tags¶
When you see labeled callouts, read them as follows:
- Theory: Algorithmic lower bounds and idealized models.
- Practice: Empirical ranges, framework behavior, and overheads.
Conventions¶
- GB/TB are decimal (\(10^9\)/\(10^{12}\) bytes)
- GiB/TiB are binary (\(2^{30}\)/\(2^{40}\) bytes)
- Unless stated otherwise, FLOP/s assume dense FP16/BF16 on H100 SXM (~989 TFLOP/s)
- See Hardware Assumptions for full accelerator specs
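A one-line check of the decimal/binary distinction, for a device advertised with "80 GB" of memory:

```python
# Sketch: decimal GB vs binary GiB for an 80 GB device.
hbm_bytes = 80e9
print(f"{hbm_bytes / 2**30:.1f} GiB")  # 74.5 GiB
```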