The Compute-Loss Surface
Loss is a function of two investments: model size and training data. Understanding this surface is the first step to allocating compute efficiently.
The Question: You have a fixed compute budget C. Should you train a larger model on less data or a smaller model on more data? The loss surface tells us there's an optimal allocation—and most models before 2022 got it wrong.
Building On: Part I Foundations
This part assumes familiarity with the three walls (memory, time, cost) from Chapter 1, the extended roofline model from Chapter 2, and the estimation mindset from Chapter 6. We'll now ask: given that we must distribute training, how do we allocate our compute budget optimally between model size and training data?
The Empirical Discovery¶
In 2020, researchers at OpenAI made a remarkable observation: language model loss follows smooth, predictable power laws. Plot log-loss against log-parameters or log-tokens, and you get straight lines.
This isn't obvious. Complex systems often exhibit chaotic behavior. But neural language models, across many orders of magnitude, follow:

\[L(\Psi) = \left(\frac{\Psi_c}{\Psi}\right)^{\alpha_{\Psi}}, \qquad L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}\]

Where \(\Psi_c\), \(D_c\) are critical scales and \(\alpha_{\Psi}\), \(\alpha_D\) are power-law exponents.
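The "straight lines in log-log space" claim follows directly from taking logarithms: \(\log L = \alpha_{\Psi}(\log \Psi_c - \log \Psi)\) is linear in \(\log \Psi\). A minimal sketch verifying this numerically (the constants are illustrative, of the order reported by Kaplan et al., not fitted values):

```python
import math

# Illustrative power law L(N) = (N_c / N) ** alpha (constants are illustrative)
N_C = 8.8e13
ALPHA = 0.076

def loss(n_params: float) -> float:
    return (N_C / n_params) ** ALPHA

# In log-log coordinates consecutive slopes are constant => a straight line.
sizes = [10**e for e in range(6, 13)]
pts = [(math.log10(n), math.log10(loss(n))) for n in sizes]
slopes = [(y2 - y1) / (x2 - x1) for (x1, y1), (x2, y2) in zip(pts, pts[1:])]
print(slopes)  # every slope equals -alpha
```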
The Loss Surface¶
Combining both dependencies gives the joint Kaplan form:

\[L(\Psi, D) = \left[\left(\frac{\Psi_c}{\Psi}\right)^{\alpha_{\Psi}/\alpha_D} + \frac{D_c}{D}\right]^{\alpha_D}\]

Or in the simpler additive form often used:

\[L(\Psi, D) = \frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} + L_\infty\]

Where:
| Symbol | Meaning | Typical Value |
|---|---|---|
| \(\Psi\) | Number of parameters | \(10^6\) to \(10^{12}\) |
| \(D\) | Training tokens | \(10^9\) to \(10^{13}\) |
| \(A\) | Parameter scaling constant | ~400 |
| \(B\) | Data scaling constant | ~400 |
| \(\alpha\) | Parameter exponent | 0.076 (Kaplan) / 0.34 (Chinchilla) |
| \(\beta\) | Data exponent | 0.095 (Kaplan) / 0.28 (Chinchilla) |
| \(L_\infty\) | Irreducible loss | ~1.69 nats |
The irreducible loss \(L_\infty\) represents the entropy of natural language—even a perfect model can't predict the unpredictable.
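The additive form is easy to evaluate directly. A minimal sketch using the table's rounded constants (\(A \approx B \approx 400\) are approximations, not the exact published fit):

```python
def chinchilla_loss(n_params: float, n_tokens: float,
                    a: float = 400.0, b: float = 400.0,
                    alpha: float = 0.34, beta: float = 0.28,
                    l_inf: float = 1.69) -> float:
    """Additive loss surface L(N, D) = A/N^alpha + B/D^beta + L_inf (nats)."""
    return a / n_params**alpha + b / n_tokens**beta + l_inf

# A 70B-parameter model on 1.4T tokens (Chinchilla's training run):
print(round(chinchilla_loss(70e9, 1.4e12), 3))
```

With these rounded constants the value lands near 1.9 nats; both terms shrink toward \(L_\infty\) as either investment grows.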
The Compute Constraint¶
Training compute is dominated by matrix multiplications. For a transformer:

\[C \approx 6\Psi D\]
Derivation: Each token passes through layers where:
- Forward pass: ~\(2\Psi\) FLOPs (matrix multiply is \(2 \times\) parameters)
- Backward pass: ~\(4\Psi\) FLOPs (activation gradients + weight gradients)
- Total per token: ~\(6\Psi\) FLOPs
For \(D\) tokens:

\[C \approx 6\Psi D\]
Practice
If your budget is fixed, compute \((\Psi, D)\) pairs from \(C = 6\Psi D\) first, then decide whether you are training for loss (Chinchilla) or for inference cost (overtrain smaller models).
This creates a constraint surface in \((\Psi, D, C)\) space. For fixed \(C\), we get a hyperbola:

\[D = \frac{C}{6\Psi}\]
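Enumerating a few points on this iso-compute hyperbola makes the trade-off concrete. A minimal sketch (the budget value is arbitrary):

```python
C = 1e24  # fixed compute budget in FLOPs (arbitrary example)

def tokens_for(n_params: float, budget: float = C) -> float:
    """On the iso-compute hyperbola, D = C / (6 * N)."""
    return budget / (6 * n_params)

# Same budget, three very different allocations:
for n in [7e9, 70e9, 700e9]:
    d = tokens_for(n)
    print(f"N={n:.0e}  D={d:.2e}  tokens/param={d / n:.1f}")
```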
Minimizing Loss Under Compute Constraint¶
Problem: Minimize \(L(\Psi, D)\) subject to \(C = 6\Psi D\)
Intuition check
Before we derive the optimum: if you double your compute budget, should you spend most of it on a bigger model or more data? Make a prediction — we'll see Chinchilla's surprising answer below.
Method: Lagrange multipliers
The Lagrangian:

\[\mathcal{L}(\Psi, D, \lambda) = \frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} + L_\infty + \lambda(6\Psi D - C)\]

Taking partial derivatives and setting them to zero:

\[\frac{\partial \mathcal{L}}{\partial \Psi} = -\frac{\alpha A}{\Psi^{\alpha+1}} + 6\lambda D = 0, \qquad \frac{\partial \mathcal{L}}{\partial D} = -\frac{\beta B}{D^{\beta+1}} + 6\lambda \Psi = 0\]

From the first equation:

\[\lambda = \frac{\alpha A}{6 D\, \Psi^{\alpha+1}}\]

Substituting into the second:

\[\frac{\beta B}{D^{\beta+1}} = 6\Psi \cdot \frac{\alpha A}{6 D\, \Psi^{\alpha+1}} = \frac{\alpha A}{D\, \Psi^{\alpha}}\]

Therefore:

\[\frac{\alpha A}{\Psi^\alpha} = \frac{\beta B}{D^\beta}\]

This says: at the optimum, the marginal contribution to loss reduction from parameters equals that from data.

Rearranging:

\[\frac{\Psi^\alpha}{D^\beta} = \frac{\alpha A}{\beta B}\]
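The marginal-balance condition can be checked numerically: sweep \(\Psi\) along a fixed iso-compute line and confirm the loss minimum sits where the two marginal terms agree. A minimal sketch (table's rounded constants; the grid is coarse):

```python
A, B, ALPHA, BETA, L_INF = 400.0, 400.0, 0.34, 0.28, 1.69
C = 1e24

def loss(n: float, d: float) -> float:
    return A / n**ALPHA + B / d**BETA + L_INF

# Sweep N along the iso-compute line D = C/(6N) and find the minimum.
best = min((loss(n, C / (6 * n)), n) for n in [x * 1e9 for x in range(10, 500)])
n_star = best[1]
d_star = C / (6 * n_star)

# At the optimum the two marginal terms should (approximately) balance.
lhs = ALPHA * A / n_star**ALPHA   # marginal term from parameters
rhs = BETA * B / d_star**BETA     # marginal term from data
print(n_star, lhs, rhs)
```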
The Optimal Allocation¶
Using \(C = 6\Psi D\) to eliminate one variable. Substituting \(D = C/(6\Psi)\) into the optimality condition \(\Psi^\alpha / D^\beta = \alpha A/(\beta B)\):

\[\Psi^* = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}} \left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}} \propto C^{\beta/(\alpha+\beta)}\]

Similarly, substituting \(\Psi = C/(6D)\):

\[D^* = \left(\frac{\beta B}{\alpha A}\right)^{\frac{1}{\alpha+\beta}} \left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}} \propto C^{\alpha/(\alpha+\beta)}\]
Key insight: Both \(\Psi^*\) and \(D^*\) are power laws in \(C\).
If \(\alpha \approx \beta\) (Chinchilla: \(\alpha \approx 0.34\), \(\beta \approx 0.28\)):

\[\Psi^* \propto C^{0.5}, \qquad D^* \propto C^{0.5}\]
The optimal ratio:

\[\frac{D^*}{\Psi^*} = \left(\frac{\beta B}{\alpha A}\right)^{\frac{2}{\alpha+\beta}} \left(\frac{C}{6}\right)^{\frac{\alpha-\beta}{\alpha+\beta}}\]
Note that the ratio depends (weakly) on \(C\) unless \(\alpha = \beta\). For Chinchilla parameters, the exponent \((\alpha - \beta)/(\alpha + \beta) \approx 0.10\) is small, so the ratio varies slowly with compute budget. Empirically, Chinchilla found that \(D^*/\Psi^* \approx 20\) for the compute budgets they explored.
The 20:1 Rule: Optimal training uses ~20 tokens per parameter. This is an empirically observed ratio from Hoffmann et al. (2022), not a universal constant—it depends weakly on the compute budget \(C\). See Chapter 8 for a thorough reconciliation.
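The compute-optimal allocation can be computed in closed form from the optimality condition and \(C = 6\Psi D\). A minimal sketch (hypothetical helper, table's rounded constants; note that with these rounded constants the predicted tokens-per-parameter ratio at \(C = 10^{24}\) comes out well above 20, illustrating that the ratio depends on the fitted constants and, weakly, on \(C\)):

```python
A, B, ALPHA, BETA = 400.0, 400.0, 0.34, 0.28

def optimal_allocation(budget: float) -> tuple[float, float]:
    """Compute-optimal (N*, D*) from C = 6*N*D and N^alpha/D^beta = alpha*A/(beta*B)."""
    k = (ALPHA * A) / (BETA * B)
    n_star = k ** (1 / (ALPHA + BETA)) * (budget / 6) ** (BETA / (ALPHA + BETA))
    d_star = budget / (6 * n_star)
    return n_star, d_star

n, d = optimal_allocation(1e24)
print(f"N* = {n:.2e} params, D* = {d:.2e} tokens, ratio = {d / n:.1f}")
```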
Kaplan vs Chinchilla¶
Two influential papers reached different conclusions:
| Paper | \(\alpha\) | \(\beta\) | Optimal \(\Psi^* \propto\) | Optimal \(D^* \propto\) | Tokens/Param |
|---|---|---|---|---|---|
| Kaplan (2020) | 0.076 | 0.095 | \(C^{0.73}\) | \(C^{0.27}\) | ~1.7 |
| Chinchilla (2022) | 0.34 | 0.28 | \(C^{0.50}\) | \(C^{0.50}\) | ~20 |
Note: The Kaplan optimal scaling exponents (0.73, 0.27) were empirically fit rather than derived from α and β. The theoretical derivation \(\Psi^* \propto C^{\beta/(\alpha+\beta)}\) gives different values, suggesting different fitting methodologies.
Why the difference?
Kaplan trained each model for a fixed number of steps, not to convergence. This systematically undertrained larger models, biasing the fit toward "make models bigger."
Chinchilla trained models to convergence at each size, revealing the true scaling relationship.
Visualizing the Surface¶
The loss surface can be visualized as contour lines:
```
 L (loss)
 │
 2.5 ├─────────────────────────────
     │    ╲
 2.3 ├────╲────────────────────────
     │     ╲  Iso-loss curves
 2.1 ├──────╲──────────────────────
     │       ╲
 1.9 ├────────●────────────────────
     │      ↗  ╲   Optimal path
 1.7 ├─────●────╲──────────────────
     │   ↗       ╲
     └────┴───────┴────────────────→
         10⁹    10¹⁰   10¹¹    Ψ (params)
```
The optimal path traces the ridge where iso-compute lines are tangent to iso-loss curves.
Implications for Distributed Training¶
1. Training Efficiency Matters More Than Model Size¶
A 7B model trained on 2T tokens often outperforms a 70B model trained on 200B tokens, despite having 10× fewer parameters.
For distributed systems: Efficient data pipelines (high throughput) can be more valuable than scaling to more GPUs for larger models.
2. Compute-Optimal Models Are Memory-Hungry¶
Chinchilla-optimal training means more data passes through the model. This increases:
- Activation memory during forward pass
- Gradient accumulation requirements
- Data loading bandwidth needs
3. The Inference-Training Trade-off¶
Chinchilla-optimal models are expensive to serve: matching a given quality with the compute-optimal recipe means a bigger model, and therefore more inference FLOPs per token.
Many practitioners deliberately overtrain smaller models:
LLaMA models train on 1-2T tokens, far exceeding Chinchilla ratios, to reduce serving costs.
The Frontier Model Equation¶
Combining scaling laws with hardware:

\[T_{\text{train}} = \frac{C}{N_{\text{GPU}} \times \text{FLOP/s}_{\text{peak}} \times \text{MFU}}\]

Where MFU (Model FLOP Utilization) is typically 30-50%.
Example: Chinchilla (70B params, 1.4T tokens)

\[C = 6 \times 70 \times 10^9 \times 1.4 \times 10^{12} \approx 5.9 \times 10^{23} \text{ FLOPs}\]

On 1000 H100s (~989 TFLOP/s dense FP16/BF16 each) at 40% MFU:

\[T = \frac{5.9 \times 10^{23}}{1000 \times 9.89 \times 10^{14} \times 0.4} \approx 1.5 \times 10^{6} \text{ s} \approx 17 \text{ days}\]
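The wall-clock estimate is a single division; a minimal sketch (hypothetical helper name, GPU peak and MFU exposed as parameters):

```python
def train_days(flops: float, n_gpus: int, peak_flops: float = 989e12,
               mfu: float = 0.40) -> float:
    """Wall-clock training days: T = C / (GPUs * peak FLOP/s * MFU)."""
    seconds = flops / (n_gpus * peak_flops * mfu)
    return seconds / 86_400

# Chinchilla: C = 6 * 70e9 * 1.4e12 FLOPs on 1000 H100s at 40% MFU.
c_chinchilla = 6 * 70e9 * 1.4e12
print(round(train_days(c_chinchilla, 1000), 1))  # roughly 17 days
```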
Beyond Simple Scaling¶
Recent research suggests the surface is more complex:
Data Quality¶
One proposed extension replaces the raw token count \(D\) with an effective count scaled by a data quality multiplier \(q\). High-quality data can shift the optimal ratio.
Architecture Efficiency¶
Different architectures have different \(A\) values. MoE models achieve lower loss at the same \(\Psi_{\text{active}}\).
Emergent Capabilities¶
Some capabilities emerge suddenly at scale, not following smooth power laws. The surface has discontinuities.
Exercises¶
- Optimal allocation: Given \(C = 10^{24}\) FLOPs and the empirical Chinchilla 20:1 rule (\(D^*/\Psi^* \approx 20\)), calculate the optimal model size and token count.
Solution
Using the 20:1 rule for Chinchilla-optimal allocation:
The optimal ratio is \(D^*/\Psi^* \approx 20\).
Substituting into the compute constraint \(C = 6\Psi D\):

\[C = 6\Psi^* (20\Psi^*) = 120 (\Psi^*)^2\]

Solving for optimal model size:

\[\Psi^* = \sqrt{\frac{C}{120}} = \sqrt{\frac{10^{24}}{120}} = \sqrt{8.33 \times 10^{21}} \approx 91 \times 10^9 = \boxed{91\text{B parameters}}\]

Optimal token count:

\[D^* = 20 \times \Psi^* = 20 \times 91 \times 10^9 \approx \boxed{1.82\text{T tokens}}\]

Verification:

\[C = 6 \times 91 \times 10^9 \times 1.82 \times 10^{12} = 9.94 \times 10^{23} \approx 10^{24} \checkmark\]
| Parameter | Value |
|---|---|
| Optimal \(\Psi^*\) | 91B |
| Optimal \(D^*\) | 1.82T |
| Tokens/parameter | 20 |
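The arithmetic can be double-checked in a few lines; a sketch of the 20:1 shortcut:

```python
import math

C = 1e24
RATIO = 20  # Chinchilla tokens-per-parameter rule of thumb

# C = 6 * N * (20 * N) = 120 * N^2  =>  N = sqrt(C / 120)
n_star = math.sqrt(C / 120)
d_star = RATIO * n_star
print(f"N* = {n_star / 1e9:.0f}B params, D* = {d_star / 1e12:.2f}T tokens")
assert abs(6 * n_star * d_star - C) / C < 1e-9  # budget exactly recovered
```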
- Training time: You have 512 H100 GPUs at 45% MFU. How long to train the optimal model from Exercise 1?
Solution
Using the training time formula:

\[T = \frac{C}{N_{\text{GPU}} \times \text{FLOP/s}_{\text{peak}} \times \text{MFU}}\]
Given:
- \(C = 10^{24}\) FLOPs
- GPUs = 512
- H100 peak = 989 TFLOP/s = \(9.89 \times 10^{14}\) FLOP/s
- MFU = 45% = 0.45
Calculation:

\[T = \frac{10^{24}}{512 \times 9.89 \times 10^{14} \times 0.45} \approx 4.39 \times 10^{6} \text{ s}\]

Converting to days:

\[T = \frac{4.39 \times 10^6}{3600 \times 24} \approx \boxed{50.8 \text{ days}}\]

Practical considerations:

| Factor | Impact |
|---|---|
| Checkpointing overhead | Add ~5-10% |
| Hardware failures | Plan for ~10% downtime |
| Realistic timeline | ~60-65 days |
- Overtraining analysis: LLaMA-2 7B was trained on 2T tokens.
- What's the Chinchilla-optimal token count for 7B parameters?
- By what factor is it overtrained?
- Estimate the loss difference between this and training 7B on optimal tokens.
Solution
Part 1: Chinchilla-optimal token count

Using the 20:1 rule:

\[D_{\text{opt}} = 20 \times 7 \times 10^9 = 140\text{B tokens}\]

Part 2: Overtraining factor

\[\frac{2\text{T}}{140\text{B}} \approx 14.3\times\]

Part 3: Loss difference estimation

Using \(L(\Psi, D) = \frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} + L_\infty\) with Chinchilla exponents (\(\alpha = 0.34\), \(\beta = 0.28\)), the ratio of data-dependent terms is:

\[\frac{B/D_{\text{actual}}^\beta}{B/D_{\text{opt}}^\beta} = \left(\frac{D_{\text{opt}}}{D_{\text{actual}}}\right)^\beta = \left(\frac{140 \times 10^9}{2 \times 10^{12}}\right)^{0.28} \approx 0.47\]

The overtrained model achieves ~53% reduction in the data-dependent loss term compared to Chinchilla-optimal (the ratio 0.47 means paying only 47% of the data penalty).
Estimated improvement: ~0.1-0.2 nats lower loss from overtraining.
Key insight: Overtraining trades training compute for inference efficiency. LLaMA-2 7B performs comparably to a ~15-20B Chinchilla-optimal model while being 2-3× cheaper to serve.
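A numeric check of the overtraining factor and the data-term ratio (\(\beta\) from the Chinchilla fit):

```python
BETA = 0.28
d_opt = 20 * 7e9   # Chinchilla-optimal tokens for a 7B model
d_actual = 2e12    # LLaMA-2 7B training tokens

overtrain_factor = d_actual / d_opt
# Ratio of data terms: (B / D_actual^beta) / (B / D_opt^beta) = (D_opt / D_actual)^beta
data_term_ratio = (d_opt / d_actual) ** BETA
print(f"overtrained {overtrain_factor:.1f}x, data term ratio {data_term_ratio:.2f}")
```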
- Iso-loss curve: Derive the equation for an iso-loss curve \(L(\Psi, D) = L_0\) in the \((\Psi, D)\) plane. What is its shape?
Solution
Starting from the loss equation:

\[\frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} + L_\infty = L_0\]

Let \(L' = L_0 - L_\infty\) (the reducible loss):

\[\frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} = L'\]

Solving for \(D\) as a function of \(\Psi\):

\[D(\Psi) = \left(\frac{B}{L' - A/\Psi^\alpha}\right)^{1/\beta}, \qquad \Psi > \left(\frac{A}{L'}\right)^{1/\alpha}\]
Shape analysis:
| Property | Value |
|---|---|
| Vertical asymptote | \(\Psi = (A/L')^{1/\alpha}\) |
| Horizontal asymptote | \(D = (B/L')^{1/\beta}\) |
| Shape | Hyperbola-like curve in first quadrant |
| Curvature | Convex toward origin |
Geometric interpretation:
- For fixed loss \(L_0\), there's a family of \((\Psi, D)\) pairs that achieve it
- Larger models need less data to reach the same loss (and vice versa)
- The curve asymptotes show the minimum resources needed even with infinite investment in the other dimension
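The iso-loss curve can be traced numerically. A minimal sketch (table's rounded constants; the target loss \(L_0\) is chosen arbitrarily):

```python
A, B, ALPHA, BETA, L_INF = 400.0, 400.0, 0.34, 0.28, 1.69
L0 = 2.0
L_RED = L0 - L_INF  # reducible loss L'

def iso_loss_tokens(n_params: float) -> float:
    """D(N) on the curve L(N, D) = L0; defined only above the vertical asymptote."""
    reducible = L_RED - A / n_params**ALPHA
    if reducible <= 0:
        raise ValueError("N at or below the vertical asymptote (A/L')^(1/alpha)")
    return (B / reducible) ** (1 / BETA)

n_min = (A / L_RED) ** (1 / ALPHA)  # below this, no amount of data reaches L0
print(f"vertical asymptote at N = {n_min:.2e}")
for n in [2 * n_min, 10 * n_min, 100 * n_min]:
    print(f"N = {n:.2e} -> D = {iso_loss_tokens(n):.2e}")
```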
- Marginal returns: At the current training point \((\Psi_0, D_0)\), you can either double parameters or double data. Which reduces loss more? Derive the condition for indifference.
Solution
Loss reduction from doubling \(\Psi\):

\[\Delta L_\Psi = \frac{A}{\Psi_0^\alpha} - \frac{A}{(2\Psi_0)^\alpha} = \frac{A(1 - 2^{-\alpha})}{\Psi_0^\alpha}\]

Loss reduction from doubling \(D\):

\[\Delta L_D = \frac{B}{D_0^\beta} - \frac{B}{(2D_0)^\beta} = \frac{B(1 - 2^{-\beta})}{D_0^\beta}\]

Indifference condition (\(\Delta L_\Psi = \Delta L_D\)):

\[\frac{A(1 - 2^{-\alpha})}{\Psi_0^\alpha} = \frac{B(1 - 2^{-\beta})}{D_0^\beta}\]

In the special case of equal constants and exponents (\(A = B\), \(\alpha = \beta\)), the condition simplifies to:

\[\Psi_0 = D_0\]

But the 20:1 rule (\(D^*/\Psi^* \approx 20\)) shows the fitted constants and exponents are not equal in practice. The actual indifference point is at the Chinchilla optimum, where marginal returns are equal.
Decision rule:
| Condition | Action |
|---|---|
| \(\frac{A(1-2^{-\alpha})}{\Psi_0^\alpha} > \frac{B(1-2^{-\beta})}{D_0^\beta}\) | Double parameters |
| \(\frac{A(1-2^{-\alpha})}{\Psi_0^\alpha} < \frac{B(1-2^{-\beta})}{D_0^\beta}\) | Double data |
| Equal | Either choice equivalent |
Practical implication: If you're at a Chinchilla-optimal point, doubling either has equal marginal benefit. Most pre-2022 models were undertrained on data, making doubling \(D\) more valuable.
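The decision rule in the table is a one-line comparison. A minimal sketch (hypothetical helper, table's rounded constants):

```python
A, B, ALPHA, BETA = 400.0, 400.0, 0.34, 0.28

def better_doubling(n0: float, d0: float) -> str:
    """Compare the loss drop from doubling parameters vs doubling tokens."""
    gain_params = A * (1 - 2**-ALPHA) / n0**ALPHA
    gain_data = B * (1 - 2**-BETA) / d0**BETA
    if abs(gain_params - gain_data) / max(gain_params, gain_data) < 0.01:
        return "either"
    return "double parameters" if gain_params > gain_data else "double data"

# A pre-2022-style model: many parameters, relatively few training tokens.
print(better_doubling(175e9, 300e9))
```

For a large model trained on few tokens the rule recommends more data, matching the observation that most pre-2022 models were undertrained.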
Key Takeaways¶
- Loss follows power laws in both parameters and data, enabling principled compute allocation.
- The 20:1 rule: Chinchilla-optimal training uses ~20 tokens per parameter.
- Optimal allocation: Both \(\Psi^*\) and \(D^*\) scale as \(C^{0.5}\): double compute, \(\sqrt{2}\times\) each.
- Marginal balance: At optimum, the last FLOP spent on parameters vs data yields equal loss reduction.
- Practical trade-offs: Inference costs often push toward overtraining smaller models.