Notation and Minimal Formalism
This appendix defines the core symbols used throughout the book. When in doubt, refer back here.
Terminology and local aliases
We use All-to-All as the canonical prose term for the many-to-many transpose collective (you may also see API-style names such as AlltoAll or all_to_all in code).
Chapter-level notation banners may introduce local aliases such as \(B_{\text{tok}} = B \cdot S\) for batch tokens while preserving the global symbols defined below.
Minimal Formalism¶
We model distributed training as the interaction of three resources:
- Memory (what must fit on each device)
- Compute (FLOPs per step per device)
- Communication (bytes transferred across links)
The basic cost model is:

\[
T_{\text{step}} = T_{\text{compute}} + T_{\text{comm}}
\]

With overlap, the effective step time is:

\[
T_{\text{step}} = \max\!\left(T_{\text{compute}},\, T_{\text{comm}}\right)
\]

For communication, we use the α-β model: sending a message of \(n\) bytes costs

\[
T(n) = \alpha + \frac{n}{\beta}
\]

where \(\alpha\) is the per-message latency and \(\beta\) is the link bandwidth.
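As a quick illustration, here is a minimal Python sketch of the α-β model; the latency and bandwidth values are illustrative placeholders, not measurements:

```python
# Minimal sketch of the alpha-beta communication model.
# alpha: per-message latency (s); beta: bandwidth (bytes/s).
# Values are illustrative placeholders, not measurements.

alpha = 5e-6   # ~5 microseconds per message
beta = 50e9    # ~50 GB/s effective bandwidth (~400 Gb/s link)

def comm_time(n_bytes: float) -> float:
    """Time to send one message of n_bytes under the alpha-beta model."""
    return alpha + n_bytes / beta

# Crossover point n* = alpha * beta: below it latency dominates,
# above it bandwidth dominates.
n_star = alpha * beta
print(f"n* = {n_star:.0f} bytes")                   # 250000 bytes
print(f"T(1 KB)   = {comm_time(1e3) * 1e6:.1f} us") # latency-bound
print(f"T(100 MB) = {comm_time(1e8) * 1e3:.1f} ms") # bandwidth-bound
```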
Core Symbols (All Parts)¶
| Symbol | Meaning | Typical Units |
|---|---|---|
| \(\Psi\) | Number of model parameters | count |
| \(D\) | Training tokens (dataset size) | tokens |
| \(C\) | Total compute budget | FLOPs |
| \(B\) | Global batch size (sequences) | sequences/step |
| \(b\) | Per-GPU batch size (sequences) | sequences/step |
| \(S\) | Sequence length | tokens |
| \(H\) | Hidden dimension | dimension |
| \(L\) | Number of transformer layers | count |
| \(A\) | Number of attention heads | count |
| \(V\) | Vocabulary size | count |
| \(P\) | Total number of GPUs / processes | count |
| \(F\) | Peak throughput per GPU | FLOP/s |
| MFU | Model FLOP Utilization | ratio |
| HFU | Hardware FLOP Utilization (includes recompute) | ratio |
| \(T\) | Time (context-dependent: step, total, etc.) | seconds |
| \(R\) | Hourly rate per GPU | USD/hr |
Total tokens per step are \(B \cdot S\) unless stated otherwise.
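To show how these symbols compose, the following sketch estimates training time and cost from the standard \(C \approx 6\Psi D\) dense-transformer FLOPs approximation; every numeric input is an illustrative assumption:

```python
# Sketch: composing the core symbols into a time/cost estimate,
# using the standard C ~= 6 * Psi * D approximation for dense
# transformer training FLOPs. All inputs are illustrative.

psi = 70e9         # Psi: parameters
d_tokens = 1.4e12  # D: training tokens
p_gpus = 1024      # P: GPUs
f_peak = 989e12    # F: peak FLOP/s per GPU (H100 SXM, BF16 dense)
mfu = 0.40         # MFU: model FLOP utilization (assumed)
r_hourly = 2.50    # R: USD per GPU-hour (assumed)

c_total = 6 * psi * d_tokens                   # C: total FLOPs
t_seconds = c_total / (p_gpus * f_peak * mfu)  # T: wall-clock time
gpu_hours = p_gpus * t_seconds / 3600

print(f"C = {c_total:.2e} FLOPs")
print(f"T = {t_seconds / 86400:.1f} days")         # ~16.8 days
print(f"Cost ~ ${gpu_hours * r_hourly:,.0f}")      # ~$1.0M
```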
Scaling Laws (Part II)¶
Symbol reuse
In Part II, \(\alpha\) and \(\beta\) denote scaling-law exponents, and \(A\), \(B\) denote scaling-law constants (not attention heads and batch size). From Part III onward, \(\alpha\) and \(\beta\) denote network latency and bandwidth. The meaning should be clear from context.
| Symbol | Meaning | Typical Values |
|---|---|---|
| \(\alpha\) | Parameter scaling exponent | 0.34 (Chinchilla) |
| \(\beta\) | Data scaling exponent | 0.28 (Chinchilla) |
| \(A, B\) | Scaling law constants | ~400 |
| \(L_\infty\) | Irreducible loss | ~1.69 nats |
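These symbols combine into the Chinchilla-style parametric loss. A minimal sketch in this book's notation, with the published Chinchilla fits (rounded) as constants:

```python
# Sketch: Chinchilla-style parametric loss,
# L(Psi, D) = L_inf + A / Psi**alpha + B / D**beta.
# Constants are the published Chinchilla fits (rounded).

L_INF = 1.69          # irreducible loss (nats)
A, B = 406.4, 410.7   # scaling law constants
ALPHA, BETA = 0.34, 0.28

def loss(psi: float, d: float) -> float:
    """Predicted loss for psi parameters trained on d tokens."""
    return L_INF + A / psi**ALPHA + B / d**BETA

print(f"{loss(70e9, 1.4e12):.3f} nats")  # ~1.94 at Chinchilla's scale
```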
Communication (Parts III–VIII)¶
| Symbol | Meaning | Typical Units |
|---|---|---|
| \(\alpha\) | Network latency (per message) | seconds (~μs) |
| \(\beta\) | Network bandwidth | bytes/s |
| \(n\) | Message size | bytes |
| \(n^*\) | Crossover point (\(\alpha \cdot \beta\)) | bytes |
| \(I_{\text{net}}\) | Communication intensity (FLOPs/byte communicated) | FLOPs/byte |
| \(I_{\text{mem}}\) | Memory intensity (FLOPs/byte from HBM) | FLOPs/byte |
| \(I_{\text{io}}\) | Data I/O intensity | FLOPs/byte |
| \(b_{\text{tok}}\) | Bytes per token (from storage) | bytes/token |
| \(\rho\) | Read amplification | ratio |
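These intensity ratios are used by comparing them to a machine balance point. A hedged sketch of that comparison for \(I_{\text{net}}\), with illustrative hardware numbers:

```python
# Sketch: classifying a phase as communication-bound. A phase is
# communication-bound when its intensity I_net (FLOPs per byte sent)
# falls below the machine balance F / beta. Numbers are illustrative.

f_peak = 989e12  # F: peak FLOP/s per GPU
beta = 50e9      # network bandwidth, bytes/s

machine_balance = f_peak / beta  # FLOPs the GPU can do per byte sent

def is_comm_bound(flops: float, bytes_sent: float) -> bool:
    i_net = flops / bytes_sent
    return i_net < machine_balance

# Example phase: 1e15 FLOPs of compute against 1e11 bytes of traffic.
print(f"{machine_balance:.0f} FLOPs/byte")  # ~19780
print(is_comm_bound(1e15, 1e11))            # True: I_net = 1e4 < balance
```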
Parallelism Dimensions (Parts IV–VI)¶
| Symbol | Meaning | Typical Range |
|---|---|---|
| DP | Data parallelism degree | 1–512 |
| TP | Tensor parallelism degree | 1–8 (within node) |
| PP | Pipeline parallelism degree | 1–64 |
| CP | Context (sequence) parallelism degree | 1–32 |
| EP | Expert parallelism degree | 1–64 |
| \(G\) | GPUs per node | 4 or 8 |
| \(N\) | Number of nodes (in scaling-law literature, \(N\) may instead denote parameter count) | 1–1000s |
| \(p\) | Pipeline stages | count |
| \(m\) | Micro-batches per pipeline batch | count |
| \(v\) | Virtual pipeline stages (interleaved schedule) | count |
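The pipeline symbols \(p\), \(m\), and \(v\) combine into the standard bubble-fraction formula. A minimal sketch, expressing the bubble as a fraction of total step time under a GPipe/1F1B-style schedule (the \(v\) adjustment follows the interleaved-schedule analysis):

```python
# Sketch: pipeline bubble as a fraction of total step time for a
# GPipe/1F1B-style schedule with p stages, m micro-batches, and
# v interleaved virtual stages per worker (v=1: no interleaving).

def bubble_fraction(p: int, m: int, v: int = 1) -> float:
    """Fraction of the step spent idle in pipeline bubbles."""
    return (p - 1) / (v * m + p - 1)

print(f"{bubble_fraction(p=8, m=32):.3f}")       # 0.179
print(f"{bubble_fraction(p=8, m=32, v=4):.3f}")  # 0.052
```

More micro-batches (larger \(m\)) or more virtual stages (larger \(v\)) shrink the bubble, which is why both knobs appear throughout the pipeline-parallelism chapters.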
Memory (Part V)¶
| Symbol | Meaning | Notes |
|---|---|---|
| \(M_{\text{params}}\) | Memory for parameters | \(2\Psi\) bytes (FP16) |
| \(M_{\text{grad}}\) | Memory for gradients | \(2\Psi\) bytes (FP16) |
| \(M_{\text{opt}}\) | Memory for optimizer states | \(12\Psi\) bytes (AdamW FP32) |
| \(M_{\text{act}}\) | Activation memory | Depends on \(B, S, H, L\) |
| \(k\) | Checkpoint interval (layers between checkpoints) | \(\sqrt{L}\) optimal |
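A minimal sketch composing these terms for unsharded mixed-precision AdamW training (2 + 2 + 12 = 16 bytes per parameter of model states); activation memory is omitted since it depends on \(B, S, H, L\) and the checkpointing scheme:

```python
import math

# Sketch: per-GPU memory for unsharded mixed-precision AdamW:
# 2 (FP16 params) + 2 (FP16 grads) + 12 (FP32 master weights
# plus two FP32 moments) = 16 bytes per parameter.

def model_state_bytes(psi: float) -> float:
    return (2 + 2 + 12) * psi

def optimal_checkpoint_interval(num_layers: int) -> int:
    """k ~= sqrt(L): checkpoint every k layers, trading O(L/k)
    stored activations for one extra forward recomputation."""
    return max(1, round(math.sqrt(num_layers)))

psi = 7e9  # 7B parameters (illustrative)
print(f"{model_state_bytes(psi) / 1e9:.0f} GB of model states")  # 112 GB
print(f"k = {optimal_checkpoint_interval(80)}")                  # k = 9
```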
Expert Parallelism (Part IV, Ch. 18)¶
| Symbol | Meaning | Notes |
|---|---|---|
| \(E\) | Total number of experts | 8–256 |
| \(k\) | Number of active experts per token | 1–4 |
| \(C_f\) | Capacity factor | 1.0–1.5 |
| \(f_i\) | Fraction of tokens routed to expert \(i\) | ratio |
| \(p_i\) | Average router probability for expert \(i\) | ratio |
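The \(f_i\) and \(p_i\) terms are the ingredients of the Switch-Transformer-style auxiliary load-balancing loss, \(\mathcal{L}_{\text{aux}} = E \sum_i f_i \, p_i\) (the small scaling coefficient usually multiplied in front is omitted here). A minimal sketch:

```python
# Sketch: Switch-Transformer-style auxiliary load-balancing loss,
# L_aux = E * sum_i(f_i * p_i). It attains its minimum (1.0) when
# routing is perfectly uniform: f_i = p_i = 1/E for every expert.

def aux_loss(f: list[float], p: list[float]) -> float:
    """f[i]: fraction of tokens routed to expert i (sums to 1).
    p[i]: mean router probability for expert i (sums to 1)."""
    E = len(f)
    return E * sum(fi * pi for fi, pi in zip(f, p))

uniform = [1 / 4] * 4
skewed = [0.7, 0.1, 0.1, 0.1]
print(aux_loss(uniform, uniform))  # 1.0  (balanced)
print(aux_loss(skewed, skewed))    # 2.08 (imbalanced -> penalized)
```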
Efficiency (Part VII)¶
| Symbol | Meaning | Notes |
|---|---|---|
| \(w\) | Sliding window size | tokens |
| \(g\) | Number of KV heads (GQA) | \(g \leq A\) |
| \(r\) | Rank (PowerSGD) or repetition ratio | context-dependent |
| \(\tau\) | Staleness (async SGD) | steps |
| \(H\) | Sync interval (Local SGD) | steps |
| \(s\) | Quantization levels | count |
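As one example of how these symbols are used, here is a sketch of per-token KV-cache size under grouped-query attention, assuming head dimension \(H/A\) and FP16 storage (the model shapes are illustrative, Llama-2-70B-like):

```python
# Sketch: KV-cache bytes per token with grouped-query attention.
# Using g KV heads instead of A query heads shrinks the cache by A/g.
# Assumes head_dim = H / A and FP16 (2 bytes per element).

def kv_cache_bytes_per_token(L: int, H: int, A: int, g: int,
                             dtype_bytes: int = 2) -> int:
    head_dim = H // A
    return 2 * L * g * head_dim * dtype_bytes  # factor 2 = K and V

# Illustrative Llama-2-70B-like shapes: L=80, H=8192, A=64, g=8.
full = kv_cache_bytes_per_token(80, 8192, 64, g=64)
gqa = kv_cache_bytes_per_token(80, 8192, 64, g=8)
print(full // 1024, "KiB/token ->", gqa // 1024, "KiB/token")  # 2560 -> 320
```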
Interpretation Tags¶
When you see labeled callouts, read them as follows:
- Theory: Algorithmic lower bounds and idealized models.
- Practice: Empirical ranges, framework behavior, and overheads.
Conventions¶
- GB/TB are decimal (\(10^9\)/\(10^{12}\) bytes)
- GiB/TiB are binary (\(2^{30}\)/\(2^{40}\) bytes)
- Unless stated otherwise, FLOP/s assume dense FP16/BF16 on H100 SXM (~989 TFLOP/s)
- See Hardware Assumptions for full accelerator specs
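A one-line check of the decimal/binary distinction, for a device advertised with "80 GB" of memory:

```python
# Sketch: decimal GB vs binary GiB for an 80 GB device.
hbm_bytes = 80e9
print(f"{hbm_bytes / 2**30:.1f} GiB")  # 74.5 GiB
```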