# Quick Reference
A consolidated reference of the most-used formulas, decision tables, and environment variables from the book. Keep this page bookmarked.
## Core Equations
### Training Compute
\[C = 6\Psi D \qquad T_{\text{train}} = \frac{C}{P \cdot F \cdot \text{MFU}}\]
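A quick worked example as a sketch; the model size, token count, GPU count, peak FLOP/s, and MFU below are illustrative assumptions, not recommendations:

```python
# Worked example of C = 6*Psi*D, T_train = C / (P*F*MFU), and tokens/s.
psi = 70e9        # parameters (assumed)
d   = 1.4e12      # training tokens (assumed)
p   = 1024        # GPUs (assumed)
f   = 989e12      # peak FLOP/s per GPU (H100 SXM, dense BF16)
mfu = 0.40        # assumed model FLOPs utilization

c = 6 * psi * d                        # total training FLOPs
t_train = c / (p * f * mfu)            # wall-clock seconds
tokens_per_s = p * f * mfu / (6 * psi)

print(f"C = {c:.2e} FLOPs, T_train ≈ {t_train / 86400:.1f} days, "
      f"throughput ≈ {tokens_per_s:.2e} tokens/s")
```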
### Memory (Mixed Precision, AdamW)
| Component | Formula | 70B Model |
|---|---|---|
| Parameters (FP16) | \(2\Psi\) | 140 GB |
| Gradients (FP16) | \(2\Psi\) | 140 GB |
| Optimizer (FP32) | \(12\Psi\) | 840 GB |
| Total static | \(16\Psi\) | 1.12 TB |
### Activation Memory (Per Layer, No Checkpointing)
\[M_{\text{act}}^{\text{layer}} \approx BSH \cdot \left(34 + 5\frac{AS}{H}\right) \text{ bytes}\]
With checkpointing every \(k\) layers (\(k^* = \sqrt{L}\) optimal):
\[M_{\text{act}}^{\text{ckpt}} \approx \frac{L}{k} \cdot 2BSH + k \cdot M_{\text{act}}^{\text{layer}}\]
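A sketch that evaluates the static and activation formulas above; the dimensions (\(L\), \(H\), \(A\), \(B\), \(S\)) are illustrative and not tied to any particular model:

```python
import math

# Hypothetical transformer dimensions, for illustration only.
psi = 70e9               # parameters
L, H, A = 80, 8192, 64   # layers, hidden size, attention heads
B, S = 1, 4096           # micro-batch size, sequence length

static = 16 * psi                                # params + grads + AdamW states (bytes)
per_layer = B * S * H * (34 + 5 * A * S / H)     # activation bytes per layer, no checkpointing
k = round(math.sqrt(L))                          # checkpoint interval k* = sqrt(L)
ckpt = (L / k) * 2 * B * S * H + k * per_layer   # activations with checkpointing

gib = 1024 ** 3
print(f"static ≈ {static / gib:.0f} GiB; activations: "
      f"{L * per_layer / gib:.0f} GiB full, {ckpt / gib:.0f} GiB checkpointed (k={k})")
```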
### Chinchilla Optimal Allocation
\[\Psi^* \approx \sqrt{\frac{C}{120}} \qquad D^* \approx 20\,\Psi^* \qquad \frac{D^*}{\Psi^*} \approx 20\]
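For example, given a compute budget (the value below is illustrative), the split follows directly:

```python
c = 1e24                       # FLOPs budget (illustrative)
psi_star = (c / 120) ** 0.5    # optimal parameter count, ≈ 9.1e10
d_star = 20 * psi_star         # optimal token count, ≈ 1.8e12
print(f"Psi* ≈ {psi_star:.2e} params, D* ≈ {d_star:.2e} tokens")
```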
### Tokens per Second
\[\text{tokens/s} = \frac{P \cdot F \cdot \text{MFU}}{6\Psi}\]
## Communication Cost Formulas (α-β Model)
All formulas use: \(P\) = processes, \(n\) = message size (bytes), \(\alpha\) = latency, \(\beta\) = bandwidth (bytes/s).
| Collective | Latency Term | Bandwidth Term | Best For |
|---|---|---|---|
| AllReduce (ring) | \(2(P-1)\alpha\) | \(2\frac{P-1}{P}\frac{n}{\beta}\) | Large messages |
| AllReduce (tree) | \(2\log_2 P \cdot \alpha\) | \(2\log_2 P \cdot \frac{n}{\beta}\) | Small messages |
| AllReduce (RHD) | \(2\log_2 P \cdot \alpha\) | \(2\frac{P-1}{P}\frac{n}{\beta}\) | Power-of-2 \(P\) |
| ReduceScatter | \((P-1)\alpha\) | \(\frac{P-1}{P}\frac{n}{\beta}\) | ZeRO, FSDP |
| AllGather | \((P-1)\alpha\) | \(\frac{P-1}{P}\frac{n}{\beta}\) | ZeRO-3 forward |
| All-to-All | \((P-1)\alpha\) | \(\frac{P-1}{P}\frac{n}{\beta}\) | MoE routing |
| Broadcast (tree) | \(\log_2 P \cdot \alpha\) | \(\log_2 P \cdot \frac{n}{\beta}\) | Weight init |
### Hierarchical AllReduce (\(G\) GPUs/node, \(N\) nodes)
\[T_{\text{hier}} = \underbrace{2(G{-}1)\alpha_{\text{NV}} + 2\tfrac{G{-}1}{G}\tfrac{n}{\beta_{\text{NV}}}}_{\text{intra-node}} + \underbrace{2(N{-}1)\alpha_{\text{net}} + 2\tfrac{N{-}1}{N}\tfrac{n/G}{\beta_{\text{net}}}}_{\text{inter-node (1/G data)}}\]
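The formulas above translate directly into code. In the sketch below the \(\alpha\) and \(\beta\) values are placeholders, to be replaced with measured numbers for your fabric:

```python
def ring_allreduce(n, p, alpha, beta):
    """Ring AllReduce time (s) for n bytes across p ranks."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n / beta

def reduce_scatter(n, p, alpha, beta):
    """ReduceScatter time (s); AllGather has the same cost."""
    return (p - 1) * alpha + (p - 1) / p * n / beta

def hierarchical_allreduce(n, g, nodes, a_nv, b_nv, a_net, b_net):
    """Intra-node ring AllReduce, then inter-node AllReduce on n/g bytes."""
    intra = 2 * (g - 1) * a_nv + 2 * (g - 1) / g * n / b_nv
    inter = 2 * (nodes - 1) * a_net + 2 * (nodes - 1) / nodes * (n / g) / b_net
    return intra + inter

# Illustrative: 1 GiB bucket, 16 nodes x 8 GPUs, placeholder latencies/bandwidths.
n = 1 << 30
flat = ring_allreduce(n, 128, alpha=5e-6, beta=50e9)
hier = hierarchical_allreduce(n, 8, 16, a_nv=2e-6, b_nv=450e9, a_net=5e-6, b_net=50e9)
print(f"flat ring: {flat * 1e3:.1f} ms, hierarchical: {hier * 1e3:.1f} ms")
```

With these placeholder numbers the hierarchical variant comes out several times faster, because only \(1/G\) of the data crosses the slower inter-node link.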
### Bus Bandwidth Correction Factors
| Collective | Correction Factor |
|---|---|
| AllReduce | \(2(P-1)/P\) |
| AllGather / ReduceScatter / All-to-All (AlltoAll) | \((P-1)/P\) |
| Broadcast / Reduce | 1 |
\[\text{busbw} = \text{algbw} \times \text{correction factor} \qquad \text{algbw} = n / T\]
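As a sketch, converting a measured collective time into algorithm and bus bandwidth (the nccl-tests convention); the message size, rank count, and time below are made up:

```python
CORRECTION = {
    "allreduce":     lambda p: 2 * (p - 1) / p,
    "allgather":     lambda p: (p - 1) / p,
    "reducescatter": lambda p: (p - 1) / p,
    "alltoall":      lambda p: (p - 1) / p,
    "broadcast":     lambda p: 1.0,
    "reduce":        lambda p: 1.0,
}

def bandwidths(n_bytes, time_s, p, collective):
    """Return (algbw, busbw) in bytes/s for a measured collective."""
    algbw = n_bytes / time_s
    return algbw, algbw * CORRECTION[collective](p)

# Illustrative measurement: 1 GiB AllReduce over 64 ranks in 12 ms.
algbw, busbw = bandwidths(1 << 30, 0.012, 64, "allreduce")
print(f"algbw ≈ {algbw / 1e9:.0f} GB/s, busbw ≈ {busbw / 1e9:.0f} GB/s")
```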
## Ridge Points (H100 SXM)
| Link | Bandwidth | Ridge Point (\(F/\beta\)) |
|---|---|---|
| NVLink 4.0 | 900 GB/s | ~1,099 FLOPs/byte |
| InfiniBand NDR 400 | 50 GB/s | ~19,780 FLOPs/byte |

(\(F\) = 989 TFLOP/s, H100 SXM dense BF16 peak.)
## Parallelism Communication Per Step
| Strategy | Volume Per GPU | Collective | Link |
|---|---|---|---|
| Data Parallel | \(\approx 2\Psi \cdot s\) bytes | AllReduce | Network |
| Tensor Parallel | \(4L \times BSH \cdot 2\) bytes | AllReduce | NVLink |
| Pipeline Parallel | \(\text{micro-batches} \times BSH \cdot 2\) bytes | Send/Recv | Network |
| ZeRO-3 (per layer) | \(2\,\Psi_{\text{layer}}\) bytes | AG + RS | Network |
(\(s\) = bytes per element, e.g. 2 for FP16; \(\Psi_{\text{layer}}\) = parameters in one layer)
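A sketch plugging illustrative dimensions into the volume expressions in the table (the model and batch sizes below are assumptions):

```python
# Illustrative per-step, per-GPU communication volumes (bytes).
psi, s = 70e9, 2                 # parameters, bytes per element (FP16)
L, B, S, H = 80, 1, 4096, 8192   # layers, micro-batch, sequence, hidden
micro_batches = 32

dp = 2 * psi * s                     # gradient AllReduce
tp = 4 * L * B * S * H * 2           # TP AllReduces over FP16 activations
pp = micro_batches * B * S * H * 2   # PP boundary Send/Recv

print(f"DP ≈ {dp / 1e9:.0f} GB, TP ≈ {tp / 1e9:.0f} GB, PP ≈ {pp / 1e9:.1f} GB")
```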
## Pipeline Bubble Fractions
| Schedule | Bubble Fraction | Memory (per GPU) |
|---|---|---|
| GPipe | \(\frac{p-1}{m+p-1}\) | \(O(m)\) activations |
| 1F1B | \(\frac{p-1}{m+p-1}\) | \(O(p)\) activations |
| Interleaved (\(v\) virtual stages) | \(\frac{p-1}{m \cdot v + p - 1}\) | \(O(p)\) activations |
| Zero-Bubble (ZB-H1) | \(\approx \frac{1}{3} \cdot \frac{p-1}{m+p-1}\) | \(O(p)\) activations |
\(p\) = pipeline stages, \(m\) = micro-batches per batch.
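The bubble fractions above in one small helper (a sketch; the stage and micro-batch counts in the example are arbitrary):

```python
def bubble_fraction(p, m, v=1):
    """GPipe/1F1B bubble for v=1; interleaved schedule with v virtual stages per GPU."""
    return (p - 1) / (m * v + p - 1)

print(f"{bubble_fraction(8, 32):.1%}")        # ≈ 18% with m = 4p
print(f"{bubble_fraction(8, 32, v=4):.1%}")   # ≈ 5% interleaved
```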
## ZeRO Memory Savings
| Stage | What's Sharded | Memory Per GPU | Extra Communication |
|---|---|---|---|
| Baseline (DDP) | Nothing | \(16\Psi\) | \(2\Psi s\) (AllReduce) |
| ZeRO-1 | Optimizer states | \(4\Psi + 12\Psi/P\) | Same as DDP |
| ZeRO-2 | + Gradients | \(2\Psi + 14\Psi/P\) | Same as DDP |
| ZeRO-3 | + Parameters | \(16\Psi/P\) | \(3\Psi s\) (AG + RS, 1.5× DDP) |
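The table as a small helper, a sketch assuming mixed-precision AdamW and the \(16\Psi\) baseline above:

```python
def zero_static_memory(psi, p, stage):
    """Static bytes per GPU under ZeRO stage 0-3 (mixed-precision AdamW)."""
    return {
        0: 16 * psi,                  # plain DDP: everything replicated
        1: 4 * psi + 12 * psi / p,    # shard optimizer states
        2: 2 * psi + 14 * psi / p,    # + shard gradients
        3: 16 * psi / p,              # + shard parameters
    }[stage]

for stage in range(4):
    print(f"ZeRO-{stage}: {zero_static_memory(70e9, 64, stage) / 1024**3:.0f} GiB/GPU")
```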
## Parallelism Placement Decision Procedure
1. MEMORY FIT?
    - Total static memory = 16Ψ/P_shard + activations
    - If no → add ZeRO stages, activation checkpointing, or more PP/TP
2. TP DEGREE (within node, NVLink):
    - TP = min(8, GPUs_per_node) — typically 4 or 8
    - Only increase if per-GPU params don't fit
3. PP DEGREE (across nodes, tolerates latency):
    - PP = ceil(layers / max_layers_per_device)
    - Use m ≥ 4×PP micro-batches to keep bubbles < 20%
4. DP DEGREE (outer dimension):
    - DP = P / (TP × PP)
    - Use gradient accumulation if DP × local_batch < B_crit
5. CP DEGREE (if sequence > 8K):
    - CP = S / S_max_per_GPU
    - Ring Attention or Ulysses depending on S and H
6. EP DEGREE (if MoE):
    - EP = E / experts_per_GPU
    - Ensure AlltoAll volume fits in network budget
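The first four steps as a rough sketch in code; the inputs, the helper name, and the simple heuristics are illustrative, not a prescriptive tool:

```python
import math

def place_parallelism(total_gpus, gpus_per_node, n_layers, max_layers_per_device):
    """Illustrative TP -> PP -> DP ordering from the checklist above."""
    tp = min(8, gpus_per_node)                         # step 2: TP stays inside the node
    pp = math.ceil(n_layers / max_layers_per_device)   # step 3: PP across nodes
    dp = total_gpus // (tp * pp)                       # step 4: DP is the outer dimension
    micro_batches = 4 * pp                             # keep the bubble fraction under ~20%
    return {"TP": tp, "PP": pp, "DP": dp, "micro_batches": micro_batches}

print(place_parallelism(total_gpus=1024, gpus_per_node=8,
                        n_layers=80, max_layers_per_device=10))
```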
## Key NCCL Environment Variables
| Variable | Purpose | Common Values |
|---|---|---|
| `NCCL_ALGO` | Force collective algorithm | Ring, Tree, CollnetDirect |
| `NCCL_PROTO` | Communication protocol | Simple, LL, LL128 |
| `NCCL_TOPO_FILE` | Custom topology file | Path to XML |
| `NCCL_DEBUG` | Debug logging level | INFO, WARN, TRACE |
| `NCCL_IB_DISABLE` | Disable InfiniBand | 0 (default), 1 |
| `NCCL_P2P_LEVEL` | P2P communication level | NVL, PIX, PHB, SYS |
| `NCCL_NET_GDR_LEVEL` | GPUDirect RDMA level | SYS, PHB, PIX, LOC |
| `NCCL_SOCKET_IFNAME` | Network interface | eth0, ib0, etc. |
| `NCCL_BUFFSIZE` | Per-channel buffer size | Bytes (default 4 MB) |
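A minimal sketch of setting a few of these from Python before the process group is created; the values are cluster-specific examples, and in practice launchers usually export them instead:

```python
import os

# These must be set before the first NCCL communicator is created.
os.environ.setdefault("NCCL_DEBUG", "INFO")           # verbose NCCL logging
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")   # pick the right NIC for your cluster
os.environ.setdefault("NCCL_ALGO", "Ring")            # force ring collectives for an A/B test

import torch.distributed as dist

# Assumes the script is launched with torchrun so RANK/WORLD_SIZE/MASTER_ADDR are set.
dist.init_process_group(backend="nccl")
```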
## Common Sanity Checks
| What to Check | Formula | Red Flag |
|---|---|---|
| MFU | \(\frac{6\Psi \cdot \text{tokens/s}}{P \cdot F}\) | < 30% |
| Comm fraction | \(T_{\text{comm}} / T_{\text{step}}\) | > 40% exposed |
| Pipeline bubble | \((p-1)/(m+p-1)\) | > 25% |
| Memory utilization | \(M_{\text{used}} / M_{\text{HBM}}\) | > 95% (fragmentation risk) |
| Data starvation | tokens/s vs \(PF \cdot \text{MFU} / (6\Psi)\) | Data loader < required |
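The MFU check from the table as a sketch, with illustrative throughput numbers:

```python
def mfu(psi, tokens_per_s, p, f):
    """Model FLOPs utilization: achieved FLOP/s over aggregate peak FLOP/s."""
    return 6 * psi * tokens_per_s / (p * f)

# Illustrative: 70B params at 9e5 tokens/s on 1,024 GPUs with 989 TFLOP/s peak each.
value = mfu(70e9, 9e5, 1024, 989e12)
print(f"MFU ≈ {value:.1%}" + ("  <-- below 30%, investigate" if value < 0.30 else ""))
```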