The Scale Imperative
In 2017, the largest language models could be trained on a single node or small cluster. By 2024, the largest models require thousands of GPUs running for months. This is not a temporary inconvenience—it is the new reality of machine learning.
The Question: A model has 70 billion parameters. Each parameter is 2 bytes (FP16). That's 140GB just for weights—before gradients, optimizer states, or activations. A single H100 has 80GB of memory. How do we train this model at all?
The Three Walls¶
Training large models hits three fundamental limits:
The Memory Wall¶
A model with \(\Psi\) parameters in mixed precision requires approximately:

\[ M_{\text{states}} \approx (2 + 2 + 12)\Psi = 16\Psi \text{ bytes} \]

Where:
- \(2\Psi\): FP16 parameters
- \(2\Psi\): FP16 gradients
- \(12\Psi\): FP32 optimizer states (Adam: master weights, first moment, second moment)
For a 70B model: \(16 \times 70 \times 10^9 \text{ bytes} = 1.12\,\text{TB}\)
No single accelerator holds this. We must distribute.
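The memory breakdown above is worth internalizing as arithmetic. A minimal sketch in Python (the function name is illustrative, not from any framework):

```python
def training_memory_bytes(num_params: int) -> dict:
    """Approximate per-component memory for mixed-precision Adam training."""
    return {
        "fp16_params": 2 * num_params,      # 2 bytes per parameter
        "fp16_grads": 2 * num_params,       # 2 bytes per gradient
        "fp32_optimizer": 12 * num_params,  # master weights + two Adam moments, 4 bytes each
    }

total = sum(training_memory_bytes(70 * 10**9).values())
print(f"{total / 1e12:.2f} TB")  # 1.12 TB for a 70B model
```

Against an 80GB accelerator, the 14x gap between model state and device memory is the starting point for everything that follows.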
The Time Wall¶
Even if memory weren't a constraint, training time would be:

\[ T = \frac{6\Psi D}{F} \]

Where:
- \(\Psi\): parameters
- \(D\): training tokens
- \(F\): FLOP/s of the device
- Factor of 6: forward (2) + backward (4) FLOPs per parameter per token
For GPT-3 (175B parameters, 300B tokens) on a single H100 (~989 TFLOP/s dense FP16/BF16):

\[ T = \frac{6 \times 175 \times 10^9 \times 300 \times 10^9}{989 \times 10^{12}} \approx 3.2 \times 10^8 \text{ s} \approx 10 \text{ years} \]
We need parallelism to finish in reasonable time.
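The \(6\Psi D\) estimate is easy to turn into a reusable calculator. A sketch (function name is illustrative):

```python
def training_seconds(params: float, tokens: float, flops_per_s: float, mfu: float = 1.0) -> float:
    """Estimated wall-clock time: 6 * params * tokens / (mfu * peak FLOP/s)."""
    return 6 * params * tokens / (mfu * flops_per_s)

# GPT-3 scale (175B params, 300B tokens) on one H100 at peak:
secs = training_seconds(175e9, 300e9, 989e12)
print(f"{secs / 86400 / 365:.1f} years")  # ~10.1 years
```

The `mfu` argument anticipates a later refinement: real runs achieve only a fraction of peak, stretching these estimates further.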
The Cost Wall¶
GPU-hours are expensive. Inefficiency is waste. If we achieve only 40% of peak FLOP/s due to poor parallelization, we're burning 60% of our compute budget.
The economic imperative is clear: understand the mathematics of distributed training, or pay for ignorance.
The Thesis¶
This book's central claim:
Every parallelism strategy exploits a mathematical property of the computation. The optimal distribution follows from understanding which operations can be decomposed and which must be synchronized.
We will derive—not just describe—each parallelism strategy from the mathematical property that enables it:
| Strategy | Mathematical Property | Key Operation |
|---|---|---|
| Data Parallelism | Associativity | Gradient accumulation |
| Tensor Parallelism | Linearity | Matrix multiplication |
| Pipeline Parallelism | Separability | Layer composition |
| Sequence Parallelism | Decomposability | Attention computation |
| Expert Parallelism | Sparsity | Conditional routing |
Concepts at a Glance¶
This book introduces many specialized terms. Here's a preview of the core vocabulary—each concept receives full treatment in later chapters, but early familiarity helps:
| Concept | Intuition | Detailed Coverage |
|---|---|---|
| Data Parallelism | Replicate the model across GPUs; each processes different data; gradients are averaged | Chapter 14 |
| Tensor Parallelism | Split individual matrix operations across GPUs; requires fast interconnects | Chapter 15 |
| Pipeline Parallelism | Divide layers across GPUs; data flows through stages | Chapter 16 |
| AllReduce | A collective operation that sums values across all GPUs and distributes the result back to everyone | Chapter 11 |
| ZeRO | Memory optimization that shards optimizer states, gradients, or parameters across data-parallel replicas | Chapter 20 |
| Activation Checkpointing | Trade compute for memory by discarding intermediate activations and recomputing them during backpropagation | Chapter 21 |
| MFU (Model FLOP Utilization) | Fraction of theoretical peak FLOP/s actually achieved; the key efficiency metric | Chapter 13 |
| Mixed Precision | Use FP16/BF16 for speed while maintaining FP32 master weights for numerical stability | Chapter 30 (Part VII) |
Don't worry if these don't fully click yet—each will become concrete through derivations and examples.
The Extended Roofline¶
The original roofline model shows performance bounded by compute or memory bandwidth. For distributed training, we add a third ceiling: network bandwidth.
```
Performance (FLOP/s)
 ^
 |          ________________ Compute ceiling (peak FLOP/s)
 |         /
 |        /_________________ Memory ceiling (bytes/s × arithmetic intensity)
 |       /
 |      /___________________ Network ceiling (bytes/s × communication intensity)
 |     /
 |    /
 |   /
 |  /
 | /
 |/
 +------------------------> Arithmetic Intensity (FLOPs/byte)
```
Here, "intensity" refers to FLOPs per byte moved for the relevant resource (memory or network).
In practice, distributed training is often communication-bound at scale, though memory- and compute-bound regimes still appear depending on batch size and kernel mix. Understanding this changes everything about how we design systems.
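The extended roofline reduces to a single `min` over the three ceilings. A sketch with illustrative numbers (not vendor specifications):

```python
def attainable_flops(peak_flops: float, mem_bw: float, net_bw: float,
                     arith_intensity: float, comm_intensity: float) -> float:
    """Extended roofline: performance is the minimum of three ceilings."""
    return min(
        peak_flops,                 # compute ceiling
        mem_bw * arith_intensity,   # memory ceiling
        net_bw * comm_intensity,    # network ceiling
    )

# Hypothetical device: 1e15 FLOP/s peak, 3e12 B/s memory bandwidth,
# 5e10 B/s network bandwidth.
bound = attainable_flops(1e15, 3e12, 5e10, arith_intensity=200, comm_intensity=1000)
# min(1e15, 6e14, 5e13) -> network-bound at 5e13 FLOP/s
```

With these numbers the network ceiling is lowest by an order of magnitude, illustrating why the communication-bound regime dominates at scale.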
What We'll Build¶
By the end of this book, you'll have:
- Mental models for reasoning about any distributed configuration
- Estimation skills to predict throughput from first principles
- Derivation ability to understand new techniques as they emerge
- Debugging intuition to identify bottlenecks quickly
Let's begin with the foundations.
Exercises¶
- Calculate the total memory required to train a 13B parameter model with mixed precision training using Adam optimizer. How many H100 GPUs (80GB each) would you need at minimum just to hold the model state?
Solution
Memory breakdown for mixed precision with Adam: \(2\Psi\) (FP16 weights) \(+\,2\Psi\) (FP16 gradients) \(+\,12\Psi\) (FP32 optimizer states) \(= 16\Psi\) bytes.

For 13B parameters: \(16 \times 13 \times 10^9 = 208\,\text{GB}\)

Minimum GPUs needed: \(\lceil 208\,\text{GB} / 80\,\text{GB} \rceil = 3\)
In practice, you'd need more to account for activations, batch data, and framework overhead. A typical choice would be 4-8 GPUs.
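The solution can be checked in a few lines of Python:

```python
import math

PARAMS = 13e9
BYTES_PER_PARAM = 16   # 2 (FP16 weights) + 2 (FP16 grads) + 12 (FP32 Adam states)
GPU_MEM = 80e9         # H100: 80 GB

state_bytes = BYTES_PER_PARAM * PARAMS        # 208 GB of model state
min_gpus = math.ceil(state_bytes / GPU_MEM)   # 3 GPUs at minimum
print(f"{state_bytes / 1e9:.0f} GB, minimum {min_gpus} GPUs")
```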
- You want to train a 7B parameter model on 2 trillion tokens. Using a single H100 (~989 TFLOP/s dense FP16/BF16), how long would training take assuming 50% of peak utilization? Express your answer in days.
Solution
Training time formula:

\[ T = \frac{6\Psi D}{\eta F} \]

Where \(\eta\) is the utilization factor (0.5).

Calculation:

\[ T = \frac{6 \times 7 \times 10^9 \times 2 \times 10^{12}}{0.5 \times 989 \times 10^{12}} \approx 1.7 \times 10^8 \text{ s} \]

Converting to days:

\[ \frac{1.7 \times 10^8}{86{,}400} \approx 1{,}966 \text{ days} \approx 5.4 \text{ years} \]
This is why we need hundreds of GPUs—to reduce this to weeks or months.
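The same arithmetic in Python:

```python
params, tokens = 7e9, 2e12
peak, mfu = 989e12, 0.5   # H100 dense BF16 peak, 50% utilization

seconds = 6 * params * tokens / (mfu * peak)
days = seconds / 86400
print(f"{days:,.0f} days (~{days / 365:.1f} years)")  # ~1,966 days, ~5.4 years
```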
- A training run achieves 35% Model FLOP Utilization (MFU). If you're paying $2 per GPU-hour, what fraction of your compute budget is being "wasted" on inefficiency? If the total training cost is $10 million, how much money is lost to this inefficiency?
Solution
Efficiency analysis:
At 35% MFU, 65% of theoretical compute capacity is unused.
However, "waste" depends on what's achievable. State-of-the-art distributed training typically achieves 40-50% MFU due to fundamental overheads (communication, memory bandwidth limits, pipeline bubbles).
If we assume 50% MFU is achievable:
- Current efficiency: 35%
- Achievable efficiency: 50%
- Relative waste: \(\frac{50\% - 35\%}{50\%} = 30\%\)
Cost of inefficiency: \(30\% \times \$10\text{M} = \$3\text{M}\)
Key insight: Improving from 35% to 50% MFU would either save $3M or equivalently allow 43% more training for the same budget.
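The waste calculation, as code:

```python
total_cost = 10e6                       # $10M training run
achieved_mfu, achievable_mfu = 0.35, 0.50

# Waste relative to what is realistically achievable, not theoretical peak.
relative_waste = (achievable_mfu - achieved_mfu) / achievable_mfu  # 0.30
wasted_dollars = relative_waste * total_cost                       # $3M
extra_training = achievable_mfu / achieved_mfu - 1                 # ~0.43 (43% more)
```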
Key Takeaways¶
- Training compute is dominated by \(6\Psi D\): Parameter count and token count set the irreducible FLOP budget.
- Wall-clock time is a utilization problem: MFU and parallelism are the levers that turn years into weeks.
- Budget waste is often performance, not hardware: Small MFU gains translate to millions of dollars at scale.