# Quick Reference
A consolidated reference of the most-used formulas, decision tables, and environment variables from the book. Keep this page bookmarked.
## Core Equations
### Training Compute
\[C = 6\Psi D \qquad T_{\text{train}} = \frac{C}{P \cdot F \cdot \text{MFU}}\]
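A quick worked example as a sketch; the model size, token count, GPU count, peak FLOP/s, and MFU below are illustrative assumptions, not recommendations:

```python
# Worked example of C = 6*Psi*D, T_train = C / (P*F*MFU), and tokens/s.
psi = 70e9        # parameters (assumed)
d   = 1.4e12      # training tokens (assumed)
p   = 1024        # GPUs (assumed)
f   = 989e12      # peak FLOP/s per GPU (H100 SXM, dense BF16)
mfu = 0.40        # assumed model FLOPs utilization

c = 6 * psi * d                        # total training FLOPs
t_train = c / (p * f * mfu)            # wall-clock seconds
tokens_per_s = p * f * mfu / (6 * psi)

print(f"C = {c:.2e} FLOPs, T_train ≈ {t_train / 86400:.1f} days, "
      f"throughput ≈ {tokens_per_s:.2e} tokens/s")
```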
### Memory (Mixed Precision, AdamW)
| Component | Formula | 70B Model |
|---|---|---|
| Parameters (FP16) | \(2\Psi\) | 140 GB |
| Gradients (FP16) | \(2\Psi\) | 140 GB |
| Optimizer (FP32) | \(12\Psi\) | 840 GB |
| Total static | \(16\Psi\) | 1.12 TB |
### Activation Memory (Per Layer, No Checkpointing)
\[M_{\text{act}}^{\text{layer}} \approx BSH \cdot \left(34 + 5\frac{AS}{H}\right) \text{ bytes}\]
With checkpointing every \(k\) layers (\(k^* = \sqrt{L}\) optimal):
\[M_{\text{act}}^{\text{ckpt}} \approx \frac{L}{k} \cdot 2BSH + k \cdot M_{\text{act}}^{\text{layer}}\]
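A sketch that evaluates the static and activation formulas above; the dimensions (\(L\), \(H\), \(A\), \(B\), \(S\)) are illustrative and not tied to any particular model:

```python
import math

# Hypothetical transformer dimensions, for illustration only.
psi = 70e9               # parameters
L, H, A = 80, 8192, 64   # layers, hidden size, attention heads
B, S = 1, 4096           # micro-batch size, sequence length

static = 16 * psi                                # params + grads + AdamW states (bytes)
per_layer = B * S * H * (34 + 5 * A * S / H)     # activation bytes per layer, no checkpointing
k = round(math.sqrt(L))                          # checkpoint interval k* = sqrt(L)
ckpt = (L / k) * 2 * B * S * H + k * per_layer   # activations with checkpointing

gib = 1024 ** 3
print(f"static ≈ {static / gib:.0f} GiB; activations: "
      f"{L * per_layer / gib:.0f} GiB full, {ckpt / gib:.0f} GiB checkpointed (k={k})")
```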
### Chinchilla Optimal Allocation
\[\Psi^* \approx \sqrt{\frac{C}{120}} \qquad D^* \approx 20\,\Psi^* \qquad \frac{D^*}{\Psi^*} \approx 20\]
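For example, given a compute budget (the value below is illustrative), the split follows directly:

```python
c = 1e24                       # FLOPs budget (illustrative)
psi_star = (c / 120) ** 0.5    # optimal parameter count, ≈ 9.1e10
d_star = 20 * psi_star         # optimal token count, ≈ 1.8e12
print(f"Psi* ≈ {psi_star:.2e} params, D* ≈ {d_star:.2e} tokens")
```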
### Tokens per Second
\[\text{tokens/s} = \frac{P \cdot F \cdot \text{MFU}}{6\Psi}\]
## Communication Cost Formulas (α-β Model)
All formulas use: \(P\) = processes, \(n\) = message size (bytes), \(\alpha\) = latency, \(\beta\) = bandwidth (bytes/s).
| Collective | Latency Term | Bandwidth Term | Best For |
|---|---|---|---|
| AllReduce (ring) | \(2(P-1)\alpha\) | \(2\frac{P-1}{P}\frac{n}{\beta}\) | Large messages |
| AllReduce (tree) | \(2\log_2 P \cdot \alpha\) | \(2\log_2 P \cdot \frac{n}{\beta}\) | Small messages |
| AllReduce (RHD) | \(2\log_2 P \cdot \alpha\) | \(2\frac{P-1}{P}\frac{n}{\beta}\) | Power-of-2 \(P\) |
| ReduceScatter | \((P-1)\alpha\) | \(\frac{P-1}{P}\frac{n}{\beta}\) | ZeRO, FSDP |
| AllGather | \((P-1)\alpha\) | \(\frac{P-1}{P}\frac{n}{\beta}\) | ZeRO-3 forward |
| All-to-All | \((P-1)\alpha\) | \(\frac{P-1}{P}\frac{n}{\beta}\) | MoE routing |
| Broadcast (tree) | \(\log_2 P \cdot \alpha\) | \(\log_2 P \cdot \frac{n}{\beta}\) | Weight init |
### Hierarchical AllReduce (\(G\) GPUs/node, \(N\) nodes)
\[T_{\text{hier}} = \underbrace{2(G{-}1)\alpha_{\text{NV}} + 2\tfrac{G{-}1}{G}\tfrac{n}{\beta_{\text{NV}}}}_{\text{intra-node}} + \underbrace{2(N{-}1)\alpha_{\text{net}} + 2\tfrac{N{-}1}{N}\tfrac{n/G}{\beta_{\text{net}}}}_{\text{inter-node (1/G data)}}\]
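The formulas above translate directly into code. In the sketch below the \(\alpha\) and \(\beta\) values are placeholders, to be replaced with measured numbers for your fabric:

```python
def ring_allreduce(n, p, alpha, beta):
    """Ring AllReduce time (s) for n bytes across p ranks."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n / beta

def reduce_scatter(n, p, alpha, beta):
    """ReduceScatter time (s); AllGather has the same cost."""
    return (p - 1) * alpha + (p - 1) / p * n / beta

def hierarchical_allreduce(n, g, nodes, a_nv, b_nv, a_net, b_net):
    """Intra-node ring AllReduce, then inter-node AllReduce on n/g bytes."""
    intra = 2 * (g - 1) * a_nv + 2 * (g - 1) / g * n / b_nv
    inter = 2 * (nodes - 1) * a_net + 2 * (nodes - 1) / nodes * (n / g) / b_net
    return intra + inter

# Illustrative: 1 GiB bucket, 16 nodes x 8 GPUs, placeholder latencies/bandwidths.
n = 1 << 30
flat = ring_allreduce(n, 128, alpha=5e-6, beta=50e9)
hier = hierarchical_allreduce(n, 8, 16, a_nv=2e-6, b_nv=450e9, a_net=5e-6, b_net=50e9)
print(f"flat ring: {flat * 1e3:.1f} ms, hierarchical: {hier * 1e3:.1f} ms")
```

With these placeholder numbers the hierarchical variant comes out several times faster, because only \(1/G\) of the data crosses the slower inter-node link.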
### Bus Bandwidth Correction Factors
| Collective | Correction Factor |
|---|---|
| AllReduce | \(2(P-1)/P\) |
| AllGather / ReduceScatter / All-to-All (AlltoAll) | \((P-1)/P\) |
| Broadcast / Reduce | 1 |
\[\text{busbw} = \text{algbw} \times \text{correction factor} \qquad \text{algbw} = n / T\]
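As a sketch, converting a measured collective time into algorithm and bus bandwidth (the nccl-tests convention); the message size, rank count, and time below are made up:

```python
CORRECTION = {
    "allreduce":     lambda p: 2 * (p - 1) / p,
    "allgather":     lambda p: (p - 1) / p,
    "reducescatter": lambda p: (p - 1) / p,
    "alltoall":      lambda p: (p - 1) / p,
    "broadcast":     lambda p: 1.0,
    "reduce":        lambda p: 1.0,
}

def bandwidths(n_bytes, time_s, p, collective):
    """Return (algbw, busbw) in bytes/s for a measured collective."""
    algbw = n_bytes / time_s
    return algbw, algbw * CORRECTION[collective](p)

# Illustrative measurement: 1 GiB AllReduce over 64 ranks in 12 ms.
algbw, busbw = bandwidths(1 << 30, 0.012, 64, "allreduce")
print(f"algbw ≈ {algbw / 1e9:.0f} GB/s, busbw ≈ {busbw / 1e9:.0f} GB/s")
```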
## Ridge Points (H100 SXM)
| Link | Bandwidth | Ridge Point (\(F/\beta\)) |
|---|---|---|
| NVLink 4.0 | 900 GB/s | ~1,099 FLOPs/byte |
| InfiniBand NDR 400 | 50 GB/s | ~19,780 FLOPs/byte |

(\(F\) = 989 TFLOP/s, H100 SXM dense BF16 peak.)
## Parallelism Communication Per Step
| Strategy | Volume Per GPU | Collective | Link |
|---|---|---|---|
| Data Parallel | \(\approx 2\Psi \cdot s\) bytes | AllReduce | Network |
| Tensor Parallel | \(4L \times BSH \cdot 2\) bytes | AllReduce | NVLink |
| Pipeline Parallel | \(\text{micro-batches} \times BSH \cdot 2\) bytes | Send/Recv | Network |
| ZeRO-3 (per layer) | \(2\,\Psi_{\text{layer}}\) bytes | AG + RS | Network |
(\(s\) = bytes per element, e.g. 2 for FP16; \(\Psi_{\text{layer}}\) = parameters in one layer)
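A sketch plugging illustrative dimensions into the volume expressions in the table (the model and batch sizes below are assumptions):

```python
# Illustrative per-step, per-GPU communication volumes (bytes).
psi, s = 70e9, 2                 # parameters, bytes per element (FP16)
L, B, S, H = 80, 1, 4096, 8192   # layers, micro-batch, sequence, hidden
micro_batches = 32

dp = 2 * psi * s                     # gradient AllReduce
tp = 4 * L * B * S * H * 2           # TP AllReduces over FP16 activations
pp = micro_batches * B * S * H * 2   # PP boundary Send/Recv

print(f"DP ≈ {dp / 1e9:.0f} GB, TP ≈ {tp / 1e9:.0f} GB, PP ≈ {pp / 1e9:.1f} GB")
```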
## Pipeline Bubble Fractions
| Schedule | Bubble Fraction | Memory (per GPU) |
|---|---|---|
| GPipe | \(\frac{p-1}{m+p-1}\) | \(O(m)\) activations |
| 1F1B | \(\frac{p-1}{m+p-1}\) | \(O(p)\) activations |
| Interleaved (\(v\) virtual stages) | \(\frac{p-1}{m \cdot v + p - 1}\) | \(O(p)\) activations |
| Zero-Bubble (ZB-H1) | \(\approx \frac{1}{3} \cdot \frac{p-1}{m+p-1}\) | \(O(p)\) activations |
\(p\) = pipeline stages, \(m\) = micro-batches per batch.
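The bubble fractions above in one small helper (a sketch; the stage and micro-batch counts in the example are arbitrary):

```python
def bubble_fraction(p, m, v=1):
    """GPipe/1F1B bubble for v=1; interleaved schedule with v virtual stages per GPU."""
    return (p - 1) / (m * v + p - 1)

print(f"{bubble_fraction(8, 32):.1%}")        # ≈ 18% with m = 4p
print(f"{bubble_fraction(8, 32, v=4):.1%}")   # ≈ 5% interleaved
```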
## ZeRO Memory Savings
| Stage | What's Sharded | Memory Per GPU | Extra Communication |
|---|---|---|---|
| Baseline (DDP) | Nothing | \(16\Psi\) | \(2\Psi s\) (AllReduce) |
| ZeRO-1 | Optimizer states | \(4\Psi + 12\Psi/P\) | Same as DDP |
| ZeRO-2 | + Gradients | \(2\Psi + 14\Psi/P\) | Same as DDP |
| ZeRO-3 | + Parameters | \(16\Psi/P\) | \(3\Psi s\) (AG + RS, 1.5× DDP) |
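The table as a small helper, a sketch assuming mixed-precision AdamW and the \(16\Psi\) baseline above:

```python
def zero_static_memory(psi, p, stage):
    """Static bytes per GPU under ZeRO stage 0-3 (mixed-precision AdamW)."""
    return {
        0: 16 * psi,                  # plain DDP: everything replicated
        1: 4 * psi + 12 * psi / p,    # shard optimizer states
        2: 2 * psi + 14 * psi / p,    # + shard gradients
        3: 16 * psi / p,              # + shard parameters
    }[stage]

for stage in range(4):
    print(f"ZeRO-{stage}: {zero_static_memory(70e9, 64, stage) / 1024**3:.0f} GiB/GPU")
```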
## Parallelism Placement Decision Procedure
1. MEMORY FIT?
    - Total static memory = 16Ψ/P_shard + activations
    - If no → add ZeRO stages, activation checkpointing, or more PP/TP
2. TP DEGREE (within node, NVLink):
    - TP = min(8, GPUs_per_node) — typically 4 or 8
    - Only increase if per-GPU params don't fit
3. PP DEGREE (across nodes, tolerates latency):
    - PP = ceil(layers / max_layers_per_device)
    - Use m ≥ 4×PP micro-batches to keep bubbles < 20%
4. DP DEGREE (outer dimension):
    - DP = P / (TP × PP)
    - Use gradient accumulation if DP × local_batch < B_crit
5. CP DEGREE (if sequence > 8K):
    - CP = S / S_max_per_GPU
    - Ring Attention or Ulysses depending on S and H
6. EP DEGREE (if MoE):
    - EP = E / experts_per_GPU
    - Ensure AlltoAll volume fits in network budget
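The first four steps as a rough sketch in code; the inputs, the helper name, and the simple heuristics are illustrative, not a prescriptive tool:

```python
import math

def place_parallelism(total_gpus, gpus_per_node, n_layers, max_layers_per_device):
    """Illustrative TP -> PP -> DP ordering from the checklist above."""
    tp = min(8, gpus_per_node)                         # step 2: TP stays inside the node
    pp = math.ceil(n_layers / max_layers_per_device)   # step 3: PP across nodes
    dp = total_gpus // (tp * pp)                       # step 4: DP is the outer dimension
    micro_batches = 4 * pp                             # keep the bubble fraction under ~20%
    return {"TP": tp, "PP": pp, "DP": dp, "micro_batches": micro_batches}

print(place_parallelism(total_gpus=1024, gpus_per_node=8,
                        n_layers=80, max_layers_per_device=10))
```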
## Key NCCL Environment Variables
| Variable | Purpose | Common Values |
|---|---|---|
| `NCCL_ALGO` | Force collective algorithm | Ring, Tree, CollnetDirect |
| `NCCL_PROTO` | Communication protocol | Simple, LL, LL128 |
| `NCCL_TOPO_FILE` | Custom topology file | Path to XML |
| `NCCL_DEBUG` | Debug logging level | INFO, WARN, TRACE |
| `NCCL_IB_DISABLE` | Disable InfiniBand | 0 (default), 1 |
| `NCCL_P2P_LEVEL` | P2P communication level | NVL, PIX, PHB, SYS |
| `NCCL_NET_GDR_LEVEL` | GPUDirect RDMA level | SYS, PHB, PIX, LOC |
| `NCCL_SOCKET_IFNAME` | Network interface | eth0, ib0, etc. |
| `NCCL_BUFFSIZE` | Per-channel buffer size | Bytes (default 4 MB) |
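A minimal sketch of setting a few of these from Python before the process group is created; the values are cluster-specific examples, and in practice launchers usually export them instead:

```python
import os

# These must be set before the first NCCL communicator is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")           # verbose NCCL logging
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")   # pick the right NIC for your cluster
os.environ.setdefault("NCCL_ALGO", "Ring")            # force ring collectives for an A/B test

import torch.distributed as dist

# Assumes the script is launched with torchrun so RANK/WORLD_SIZE/MASTER_ADDR are set.
dist.init_process_group(backend="nccl")
```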
## Common Sanity Checks
| What to Check | Formula | Red Flag |
|---|---|---|
| MFU | \(\frac{6\Psi \cdot \text{tokens/s}}{P \cdot F}\) | < 30% |
| Comm fraction | \(T_{\text{comm}} / T_{\text{step}}\) | > 40% exposed |
| Pipeline bubble | \((p-1)/(m+p-1)\) | > 25% |
| Memory utilization | \(M_{\text{used}} / M_{\text{HBM}}\) | > 95% (fragmentation risk) |
| Data starvation | tokens/s vs \(PF \cdot \text{MFU} / (6\Psi)\) | Data loader < required |
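The MFU check from the table as a sketch, with illustrative throughput numbers:

```python
def mfu(psi, tokens_per_s, p, f):
    """Model FLOPs utilization: achieved FLOP/s over aggregate peak FLOP/s."""
    return 6 * psi * tokens_per_s / (p * f)

# Illustrative: 70B params at 9e5 tokens/s on 1,024 GPUs with 989 TFLOP/s peak each.
value = mfu(70e9, 9e5, 1024, 989e12)
print(f"MFU ≈ {value:.1%}" + ("  <-- below 30%, investigate" if value < 0.30 else ""))
```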