Skip to content

The Scale Imperative

In 2017, the largest language models could be trained on a single node or small cluster. By 2024, the largest models require thousands of GPUs running for months. This is not a temporary inconvenience—it is the new reality of machine learning.

The Question: A model has 70 billion parameters. Each parameter is 2 bytes (FP16). That's 140GB just for weights—before gradients, optimizer states, or activations. A single H100 has 80GB of memory. How do we train this model at all?

Notation in this chapter: \(\Psi\) = parameters, \(D\) = training tokens, \(B\) = global batch size (sequences), \(B_{\text{tok}} = B \cdot S\) = batch tokens, \(P\) = GPUs. See Notation.

The Three Walls

Training large models hits three fundamental limits:

The Memory Wall

A model with \(\Psi\) parameters in mixed precision requires approximately:

\[\text{Memory} = 2\Psi + 2\Psi + 12\Psi = 16\Psi \text{ bytes}\]

Where:

  • \(2\Psi\): FP16 parameters
  • \(2\Psi\): FP16 gradients
  • \(12\Psi\): FP32 optimizer states (Adam: master weights, first moment, second moment)

For a 70B model: \(16 \times 70 \times 10^9 = 1.12\text{TB}\)

No single accelerator holds this. We must distribute.

The Time Wall

Even if memory weren't a constraint, training time would be:

\[T = \frac{6 \cdot \Psi \cdot D}{F}\]

Where:

  • \(\Psi\): parameters
  • \(D\): training tokens
  • \(F\): FLOP/s of the device
  • Factor of 6: forward (2) + backward (4) FLOPs per parameter per token

For GPT-3 (175B parameters, 300B tokens) on a single H100 (~989 TFLOP/s dense FP16/BF16):

\[T = \frac{6 \times 175 \times 10^9 \times 300 \times 10^9}{989 \times 10^{12}} = 318 \times 10^6 \text{ seconds} \approx 10 \text{ years}\]

We need parallelism to finish in reasonable time.

The Cost Wall

GPU-hours are expensive. Inefficiency is waste. If we achieve only 40% of peak FLOP/s due to poor parallelization, we're burning 60% of our compute budget.

The economic imperative is clear: understand the mathematics of distributed training, or pay for ignorance.

The Thesis

This book's central claim:

Every parallelism strategy exploits a mathematical property of the computation. The optimal distribution follows from understanding which operations can be decomposed and which must be synchronized.

We will derive—not just describe—each parallelism strategy from the mathematical property that enables it:

Strategy Mathematical Property Key Operation
Data Parallelism Associativity Gradient accumulation
Tensor Parallelism Linearity Matrix multiplication
Pipeline Parallelism Separability Layer composition
Sequence Parallelism Decomposability Attention computation
Expert Parallelism Sparsity Conditional routing

Concepts at a Glance

This book introduces many specialized terms. Here's a preview of the core vocabulary—each concept receives full treatment in later chapters, but early familiarity helps:

Concept Intuition Detailed Coverage
Data Parallelism Replicate the model across GPUs; each processes different data; gradients are averaged Chapter 14
Tensor Parallelism Split individual matrix operations across GPUs; requires fast interconnects Chapter 15
Pipeline Parallelism Divide layers across GPUs; data flows through stages Chapter 16
AllReduce A collective operation that sums values across all GPUs and distributes the result back to everyone Chapter 11
ZeRO Memory optimization that shards optimizer states, gradients, or parameters across data-parallel replicas Chapter 20
Activation Checkpointing Trade compute for memory by discarding intermediate activations and recomputing them during backpropagation Chapter 21
MFU (Model FLOP Utilization) Fraction of theoretical peak FLOP/s actually achieved; the key efficiency metric Chapter 13
Mixed Precision Use FP16/BF16 for speed while maintaining FP32 master weights for numerical stability Chapter 30 (Part VII)

Don't worry if these don't fully click yet—each will become concrete through derivations and examples.

The Extended Roofline

The original roofline model shows performance bounded by compute or memory bandwidth. For distributed training, we add a third ceiling: network bandwidth.

Performance (FLOP/s)
       ^
       |     _______________  Compute ceiling (peak FLOP/s)
       |    /
       |   /________________  Memory ceiling (bytes/s × arithmetic intensity)
       |  /
       | /_________________  Network ceiling (bytes/s × communication intensity)
       |/
       +-----------------------> Arithmetic Intensity (FLOPs/byte)

Here, "intensity" refers to FLOPs per byte moved for the relevant resource (memory or network).

In practice, distributed training is often communication-bound at scale, though memory- and compute-bound regimes still appear depending on batch size and kernel mix. Understanding this changes everything about how we design systems.

What We'll Build

By the end of this book, you'll have:

  1. Mental models for reasoning about any distributed configuration
  2. Estimation skills to predict throughput from first principles
  3. Derivation ability to understand new techniques as they emerge
  4. Debugging intuition to identify bottlenecks quickly

Let's begin with the foundations.

Exercises

  1. Calculate the total memory required to train a 13B parameter model with mixed precision training using Adam optimizer. How many H100 GPUs (80GB each) would you need at minimum just to hold the model state?
Solution

Memory breakdown for mixed precision with Adam:

\[\text{Memory} = 2\Psi + 2\Psi + 12\Psi = 16\Psi \text{ bytes}\]

For 13B parameters:

\[\text{Memory} = 16 \times 13 \times 10^9 = 208 \text{ GB}\]

Minimum GPUs needed:

\[\text{GPUs} = \lceil \frac{208 \text{ GB}}{80 \text{ GB}} \rceil = 3 \text{ GPUs}\]

In practice, you'd need more to account for activations, batch data, and framework overhead. A typical choice would be 4-8 GPUs.

  1. You want to train a 7B parameter model on 2 trillion tokens. Using a single H100 (~989 TFLOP/s dense FP16/BF16), how long would training take assuming 50% of peak utilization? Express your answer in days.
Solution

Training time formula:

\[T = \frac{6 \cdot \Psi \cdot D}{F \cdot \eta}\]

Where \(\eta\) is the utilization factor (0.5).

Calculation:

\[T = \frac{6 \times 7 \times 10^9 \times 2 \times 10^{12}}{989 \times 10^{12} \times 0.5}\]
\[T = \frac{84 \times 10^{21}}{494.5 \times 10^{12}} = 169.8 \times 10^6 \text{ seconds}\]

Converting to days:

\[T = \frac{169.8 \times 10^6}{86400} \approx 1965 \text{ days} \approx 5.4 \text{ years}\]

This is why we need hundreds of GPUs—to reduce this to weeks or months.

  1. A training run achieves 35% Model FLOP Utilization (MFU). If you're paying $2 per GPU-hour, what fraction of your compute budget is being "wasted" on inefficiency? If the total training cost is $10 million, how much money is lost to this inefficiency?
Solution

Efficiency analysis:

At 35% MFU, 65% of theoretical compute capacity is unused.

However, "waste" depends on what's achievable. State-of-the-art distributed training typically achieves 40-50% MFU due to fundamental overheads (communication, memory bandwidth limits, pipeline bubbles).

If we assume 50% MFU is achievable:

  • Current efficiency: 35%
  • Achievable efficiency: 50%
  • Relative waste: \(\frac{50\% - 35\%}{50\%} = 30\%\)

Cost of inefficiency:

\[\text{Wasted cost} = \$10M \times 0.30 = \$3M\]

Key insight: Improving from 35% to 50% MFU would either save $3M or equivalently allow 43% more training for the same budget.

Key Takeaways

  1. Training compute is dominated by \(6\Psi D\): Parameter count and token count set the irreducible FLOP budget.
  2. Wall-clock time is a utilization problem: MFU and parallelism are the levers that turn years into weeks.
  3. Budget waste is often performance, not hardware: Small MFU gains translate to millions of dollars at scale.