The Scale Imperative

In 2017, the largest language models could be trained on a single node or small cluster. By 2024, the largest models require thousands of GPUs running for months. This is not a temporary inconvenience—it is the new reality of machine learning.

The Question: A model has 70 billion parameters. Each parameter is 2 bytes (FP16). That's 140GB just for weights—before gradients, optimizer states, or activations. A single H100 has 80GB of memory. How do we train this model at all?

The Three Walls¶

Training large models hits three fundamental limits:

The Memory Wall¶

A model with $\Psi$ parameters in mixed precision requires approximately:

\[\text{Memory} = 2\Psi + 2\Psi + 12\Psi = 16\Psi \text{ bytes}\]

Where:

$2\Psi$: FP16 parameters
$2\Psi$: FP16 gradients
$12\Psi$: FP32 optimizer states (Adam: master weights, first moment, second moment)

For a 70B model: $16 \times 70 \times 10^9 = 1.12\text{TB}$

No single accelerator holds this. We must distribute.

The Time Wall¶

Even if memory weren't a constraint, training time would be:

\[T = \frac{6 \cdot \Psi \cdot D}{F}\]

Where:

$\Psi$: parameters
$D$: training tokens
$F$: FLOP/s of the device
Factor of 6: forward (2) + backward (4) FLOPs per parameter per token

For GPT-3 (175B parameters, 300B tokens) on a single H100 (~989 TFLOP/s dense FP16/BF16):

\[T = \frac{6 \times 175 \times 10^9 \times 300 \times 10^9}{989 \times 10^{12}} = 318 \times 10^6 \text{ seconds} \approx 10 \text{ years}\]

We need parallelism to finish in reasonable time.

The Cost Wall¶

GPU-hours are expensive. Inefficiency is waste. If we achieve only 40% of peak FLOP/s due to poor parallelization, we're burning 60% of our compute budget.

The economic imperative is clear: understand the mathematics of distributed training, or pay for ignorance.

The Thesis¶

This book's central claim:

Every parallelism strategy exploits a mathematical property of the computation. The optimal distribution follows from understanding which operations can be decomposed and which must be synchronized.

We will derive—not just describe—each parallelism strategy from the mathematical property that enables it:

Strategy	Mathematical Property	Key Operation
Data Parallelism	Associativity	Gradient accumulation
Tensor Parallelism	Linearity	Matrix multiplication
Pipeline Parallelism	Separability	Layer composition
Sequence Parallelism	Decomposability	Attention computation
Expert Parallelism	Sparsity	Conditional routing

Concepts at a Glance¶

This book introduces many specialized terms. Here's a preview of the core vocabulary—each concept receives full treatment in later chapters, but early familiarity helps:

Concept	Intuition	Detailed Coverage
Data Parallelism	Replicate the model across GPUs; each processes different data; gradients are averaged	Chapter 14
Tensor Parallelism	Split individual matrix operations across GPUs; requires fast interconnects	Chapter 15
Pipeline Parallelism	Divide layers across GPUs; data flows through stages	Chapter 16
AllReduce	A collective operation that sums values across all GPUs and distributes the result back to everyone	Chapter 11
ZeRO	Memory optimization that shards optimizer states, gradients, or parameters across data-parallel replicas	Chapter 20
Activation Checkpointing	Trade compute for memory by discarding intermediate activations and recomputing them during backpropagation	Chapter 21
MFU (Model FLOP Utilization)	Fraction of theoretical peak FLOP/s actually achieved; the key efficiency metric	Chapter 13
Mixed Precision	Use FP16/BF16 for speed while maintaining FP32 master weights for numerical stability	Chapter 30 (Part VII)

Don't worry if these don't fully click yet—each will become concrete through derivations and examples.

The Extended Roofline¶

The original roofline model shows performance bounded by compute or memory bandwidth. For distributed training, we add a third ceiling: network bandwidth.

Performance (FLOP/s)
       ^
       |     _______________  Compute ceiling (peak FLOP/s)
       |    /
       |   /________________  Memory ceiling (bytes/s × arithmetic intensity)
       |  /
       | /_________________  Network ceiling (bytes/s × communication intensity)
       |/
       +-----------------------> Arithmetic Intensity (FLOPs/byte)

Here, "intensity" refers to FLOPs per byte moved for the relevant resource (memory or network).

In practice, distributed training is often communication-bound at scale, though memory- and compute-bound regimes still appear depending on batch size and kernel mix. Understanding this changes everything about how we design systems.

What We'll Build¶

By the end of this book, you'll have:

Mental models for reasoning about any distributed configuration
Estimation skills to predict throughput from first principles
Derivation ability to understand new techniques as they emerge
Debugging intuition to identify bottlenecks quickly

Let's begin with the foundations.

Exercises¶

Calculate the total memory required to train a 13B parameter model with mixed precision training using Adam optimizer. How many H100 GPUs (80GB each) would you need at minimum just to hold the model state?

Solution

Memory breakdown for mixed precision with Adam:

\[\text{Memory} = 2\Psi + 2\Psi + 12\Psi = 16\Psi \text{ bytes}\]

For 13B parameters:

\[\text{Memory} = 16 \times 13 \times 10^9 = 208 \text{ GB}\]

Minimum GPUs needed:

\[\text{GPUs} = \lceil \frac{208 \text{ GB}}{80 \text{ GB}} \rceil = 3 \text{ GPUs}\]

In practice, you'd need more to account for activations, batch data, and framework overhead. A typical choice would be 4-8 GPUs.

You want to train a 7B parameter model on 2 trillion tokens. Using a single H100 (~989 TFLOP/s dense FP16/BF16), how long would training take assuming 50% of peak utilization? Express your answer in days.

Solution

Training time formula:

\[T = \frac{6 \cdot \Psi \cdot D}{F \cdot \eta}\]

Where $\eta$ is the utilization factor (0.5).

Calculation:

\[T = \frac{6 \times 7 \times 10^9 \times 2 \times 10^{12}}{989 \times 10^{12} \times 0.5}\]

\[T = \frac{84 \times 10^{21}}{494.5 \times 10^{12}} = 169.8 \times 10^6 \text{ seconds}\]

Converting to days:

\[T = \frac{169.8 \times 10^6}{86400} \approx 1965 \text{ days} \approx 5.4 \text{ years}\]

This is why we need hundreds of GPUs—to reduce this to weeks or months.

A training run achieves 35% Model FLOP Utilization (MFU). If you're paying $2 per GPU-hour, what fraction of your compute budget is being "wasted" on inefficiency? If the total training cost is $10 million, how much money is lost to this inefficiency?

Solution

Efficiency analysis:

At 35% MFU, 65% of theoretical compute capacity is unused.

However, "waste" depends on what's achievable. State-of-the-art distributed training typically achieves 40-50% MFU due to fundamental overheads (communication, memory bandwidth limits, pipeline bubbles).

If we assume 50% MFU is achievable:

Current efficiency: 35%
Achievable efficiency: 50%
Relative waste: $\frac{50\% - 35\%}{50\%} = 30\%$

Cost of inefficiency:

\[\text{Wasted cost} = \$10M \times 0.30 = \$3M\]

Key insight: Improving from 35% to 50% MFU would either save $3M or equivalently allow 43% more training for the same budget.

Key Takeaways¶

Training compute is dominated by $6\Psi D$: Parameter count and token count set the irreducible FLOP budget.
Wall-clock time is a utilization problem: MFU and parallelism are the levers that turn years into weeks.
Budget waste is often performance, not hardware: Small MFU gains translate to millions of dollars at scale.