Chinchilla Optimality
For years, the field scaled models without enough data. Chinchilla revealed the optimal balance: approximately 20 tokens per parameter.
The Question: GPT-3 has 175B parameters but trained on only 300B tokens, a ratio of 1.7 tokens per parameter. Chinchilla, with 70B parameters and 1.4T tokens (20:1), achieved substantially better loss with a training budget of the same order of magnitude (\(5.9 \times 10^{23}\) vs \(3.15 \times 10^{23}\) FLOPs). A compute-optimal model at GPT-3's budget could have matched its loss at a fraction of the size. What went wrong, and how do we compute "optimal"?
The Pre-Chinchilla Era¶
Before 2022, the dominant scaling strategy was: make the model bigger.
This belief came from Kaplan et al. (2020), which suggested:
- Model size should scale as \(\Psi \propto C^{0.73}\)
- Data should scale as \(D \propto C^{0.27}\)
Under this prescription, doubling compute meant:
- 1.66× more parameters (\(2^{0.73}\))
- 1.21× more tokens (\(2^{0.27}\))
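A quick numerical check of these factors (a minimal sketch; the exponents are the Kaplan values quoted above, the variable names are ours):

```python
# Kaplan et al. (2020) allocation exponents quoted above
PARAM_EXPONENT = 0.73   # model size: Psi ∝ C^0.73
DATA_EXPONENT = 0.27    # data:       D   ∝ C^0.27

# What happens to the prescribed allocation when compute doubles
param_growth = 2 ** PARAM_EXPONENT   # ≈ 1.66
data_growth = 2 ** DATA_EXPONENT     # ≈ 1.21

print(f"2x compute -> {param_growth:.2f}x parameters, {data_growth:.2f}x tokens")
```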
The result: increasingly undertrained models.
| Model | Parameters | Tokens | Tokens/Param |
|---|---|---|---|
| GPT-2 | 1.5B | 40B | 26.7 |
| GPT-3 | 175B | 300B | 1.7 |
| Gopher | 280B | 300B | 1.1 |
| Megatron-Turing | 530B | 270B | 0.5 |
Each model was larger than its predecessor but trained on roughly the same amount of data. By Megatron-Turing, the ratio had fallen to half a token per parameter.
The Chinchilla Methodology¶
Hoffmann et al. (2022) approached the problem differently. They trained over 400 models, spanning 70M to over 16B parameters and roughly 5B to 500B tokens, to map how loss depends on both axes.
Three independent methods converged on the same answer:
Method 1: Fixed Model, Varying Data¶
For each model size \(\Psi\), fit:

$$L(D \mid \Psi) = L_\infty(\Psi) + \frac{B(\Psi)}{D^\beta}$$

Extract \(L_\infty(\Psi)\) for each size, then fit:

$$L_\infty(\Psi) = L_\infty + \frac{A}{\Psi^\alpha}$$
Method 2: IsoFLOP Curves¶
Fix compute budget \(C\). Train many \((\Psi, D)\) pairs satisfying \(C = 6\Psi D\). Plot loss vs \(\Psi\), find minimum.
For each \(C\), there's an optimal \(\Psi^*(C)\):
Loss
│  ●                         ●
│   ●                      ●
│     ●                  ●        IsoFLOP curve at fixed C = 10²¹
│       ●             ●
│          ●       ●
│             ● ●
│              ↑
│         Optimal Ψ*(C)
└──────────────────────────────→ Ψ
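A minimal numeric sketch of this procedure, standing in for real training runs by evaluating the parametric loss form from Method 3 with the fitted coefficients quoted later in this section (the irreducible-loss term is assumed here and does not affect the location of the minimum):

```python
import numpy as np

# Parametric loss L(Psi, D) = A/Psi^alpha + B/D^beta + L_inf,
# with the fitted coefficients quoted in "The 20:1 Ratio Derivation" below.
A, ALPHA, B, BETA = 406.4, 0.34, 410.7, 0.28
L_INF = 1.69  # assumed irreducible loss; it shifts the curve but not the argmin

def loss(params, tokens):
    return A / params**ALPHA + B / tokens**BETA + L_INF

C = 1e21                                # fixed compute budget (FLOPs)
params = np.logspace(8.5, 10.5, 400)    # sweep ~300M .. ~30B parameters
tokens = C / (6 * params)               # tokens implied by C = 6 * Psi * D

best = np.argmin(loss(params, tokens))
print(f"Psi* ≈ {params[best]:.2e}, D* ≈ {tokens[best]:.2e}, "
      f"D/Psi ≈ {tokens[best] / params[best]:.0f}")
```

With these coefficients the minimum falls near 1.8B parameters and roughly 90B tokens, about 50 tokens per parameter rather than 20; the "20:1 Ratio Derivation" subsection below discusses why the parametric fit and the simpler \(C^{0.50}\) fits disagree on this ratio.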
Method 3: Parametric Fitting¶
Fit all data simultaneously to:

$$L(\Psi, D) = \frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} + L_\infty$$

where \(L_\infty\) is the irreducible loss (the minimum achievable loss with infinite compute). Minimizing this subject to the constraint \(C = 6\Psi D\) (e.g., with a Lagrange multiplier) yields the optimal allocation.
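A minimal sketch of such a joint fit using `scipy.optimize.curve_fit` on synthetic observations (the "true" coefficients below are illustrative stand-ins; the paper itself uses a more robust Huber-style objective on log-loss):

```python
import numpy as np
from scipy.optimize import curve_fit

def parametric_loss(x, A, alpha, B, beta, L_inf):
    """L(Psi, D) = A / Psi^alpha + B / D^beta + L_inf."""
    params, tokens = x
    return A / params**alpha + B / tokens**beta + L_inf

rng = np.random.default_rng(0)
params = 10 ** rng.uniform(7.8, 10.2, 400)       # ~70M .. ~16B parameters
tokens = 10 ** rng.uniform(9.7, 11.7, 400)       # ~5B .. ~500B tokens
true_coeffs = (406.4, 0.34, 410.7, 0.28, 1.69)   # illustrative "ground truth"
observed = parametric_loss((params, tokens), *true_coeffs)
observed *= rng.normal(1.0, 0.005, 400)          # small multiplicative noise

fit, _ = curve_fit(parametric_loss, (params, tokens), observed,
                   p0=(300.0, 0.3, 300.0, 0.3, 1.5), maxfev=50000)
print(dict(zip(["A", "alpha", "B", "beta", "L_inf"], np.round(fit, 3))))
```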
All three methods agreed to good approximation: \(\Psi^* \propto C^{0.50}\), \(D^* \propto C^{0.50}\) (the parametric fit's exponents come out slightly asymmetric, near 0.46 and 0.54).
The 20:1 Ratio Derivation¶
From the optimal allocation (Chapter 7), minimizing \(L\) subject to \(C = 6\Psi D\) gives

$$\Psi^* = G\left(\frac{C}{6}\right)^{\frac{\beta}{\alpha+\beta}}, \qquad D^* = G^{-1}\left(\frac{C}{6}\right)^{\frac{\alpha}{\alpha+\beta}}, \qquad G = \left(\frac{\alpha A}{\beta B}\right)^{\frac{1}{\alpha+\beta}}$$

Substituting Chinchilla's fitted values:

- \(A = 406.4\), \(\alpha = 0.34\)
- \(B = 410.7\), \(\beta = 0.28\)

gives \(G \approx 1.34\), so the constant prefactor of the tokens-per-parameter ratio is \(G^{-2} \approx 0.55\).

Wait: that's less than 1, not 20. What gives?
The resolution: the prefactor is not the whole ratio. These coefficients come from the parametric fit, whose tokens-per-parameter ratio also carries a weakly compute-dependent factor \((C/6)^{(\alpha-\beta)/(\alpha+\beta)}\) and whose allocation exponents differ slightly from 0.5. The commonly quoted "20:1" comes from the paper's other two approaches, which fit \(\Psi^* \propto C^{0.50}\) and \(D^* \propto C^{0.50}\) directly, making the ratio roughly constant.

For example, at \(C \approx 4.6 \times 10^{22}\) FLOPs:

- \(\Psi_{\text{opt}} \approx 1.9 \times 10^{10}\) (19B)
- \(D_{\text{opt}} \approx 4.0 \times 10^{11}\) (400B)
- Ratio: \(D/\Psi \approx 21\)
The 20:1 rule is a useful approximation:

$$D^* \approx 20\,\Psi^*$$
Computing Optimal Allocations¶
Given compute budget \(C\):
Step 1: Estimate optimal parameters: \(\Psi^* \approx \sqrt{\frac{C}{6 \times 20}} = \sqrt{\frac{C}{120}}\)

Step 2: Estimate optimal tokens: \(D^* \approx 20 \cdot \Psi^*\)
Worked Example: \(C = 10^{24}\) FLOPs (≈Chinchilla-scale training budget)
Compute-optimal: ~91B parameters, ~1.82T tokens.
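These two steps translate directly into a small helper (a sketch; `chinchilla_optimal` is our own name, and the 20:1 ratio and \(C = 6\Psi D\) approximation are as stated above):

```python
def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (optimal parameters, optimal tokens) under C = 6 * Psi * D
    and the D ≈ 20 * Psi rule of thumb."""
    params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    tokens = tokens_per_param * params
    return params, tokens

# Worked example from above: C = 1e24 FLOPs
psi, d = chinchilla_optimal(1e24)
print(f"Psi* ≈ {psi:.2e} ({psi / 1e9:.0f}B params), "
      f"D* ≈ {d:.2e} ({d / 1e12:.1f}T tokens)")
# -> Psi* ≈ 9.13e10 (91B params), D* ≈ 1.83e12 (1.8T tokens)
```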
Reported figures for the largest production models vary; one commonly cited rumor describes 1.8T parameters (mixture-of-experts) trained on roughly 13T tokens. Treat this as speculative. If true, the model would sit far from the Chinchilla-optimal line, and deliberately so: measured per active rather than total parameter, it would be overtrained, which favors inference efficiency.
The Undertrained Models¶
How "wrong" were pre-Chinchilla models?
GPT-3 Analysis¶
- Parameters: \(\Psi = 175 \times 10^9\)
- Tokens: \(D = 300 \times 10^9\)
- Compute: \(C = 6\Psi D = 3.15 \times 10^{23}\)
Chinchilla-optimal for this compute (\(\Psi^* = \sqrt{C/120}\)):

- \(\Psi^* \approx 51\text{B}\) parameters
- \(D^* = 20\Psi^* \approx 1.0\text{T}\) tokens
GPT-3 used 3.4× too many parameters and 3.4× too few tokens.
The loss penalty is substantial: Chinchilla-style analysis indicates that a compute-optimal model could have reached GPT-3's loss with roughly 4× less compute.
Gopher Analysis¶
- Parameters: \(\Psi = 280 \times 10^9\)
- Tokens: \(D = 300 \times 10^9\)
- Compute: \(C = 5.04 \times 10^{23}\)
Chinchilla-optimal:

- \(\Psi^* \approx 65\text{B}\) parameters
- \(D^* \approx 1.3\text{T}\) tokens
Gopher was 4.3× overparameterized.
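The same arithmetic for both models, as a small script (reusing the 20:1 rule; the model figures are the ones listed above):

```python
def analyze(name: str, params: float, tokens: float) -> None:
    """Compare a model's actual allocation to the Chinchilla-optimal one
    for the same training compute (C = 6 * Psi * D, D* = 20 * Psi*)."""
    compute = 6.0 * params * tokens
    opt_params = (compute / 120.0) ** 0.5
    opt_tokens = 20.0 * opt_params
    print(f"{name}: C = {compute:.2e} FLOPs | "
          f"optimal ≈ {opt_params / 1e9:.0f}B params, {opt_tokens / 1e12:.2f}T tokens | "
          f"{params / opt_params:.1f}x overparameterized")

analyze("GPT-3", 175e9, 300e9)    # ~51B / ~1.0T optimal -> 3.4x over
analyze("Gopher", 280e9, 300e9)   # ~65B / ~1.3T optimal -> 4.3x over
```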
The Chinchilla Trap¶
Chinchilla optimality minimizes loss per FLOP. But this isn't always the right objective.
The Inference Cost Problem¶
A Chinchilla-optimal 70B model with loss \(L\) requires 70B multiply-adds per token at inference.
An overtrained 7B model (trained on 10× more data than Chinchilla-optimal) might achieve the same loss \(L\) with 10× fewer inference FLOPs.
Total cost comparison:
| Approach | Training Cost | Inference Cost (per token) |
|---|---|---|
| Chinchilla 70B | \(C\) | \(70B\) MACs |
| Overtrained 7B | \(1.5C\) | \(7B\) MACs |
If you serve >\(10^{13}\) tokens, overtraining pays off.
The LLaMA Philosophy¶
LLaMA 1 and 2 deliberately overtrained:
| Model | Parameters | Tokens | Tokens/Param | vs Chinchilla |
|---|---|---|---|---|
| LLaMA-7B | 7B | 1T | 143 | 7× overtrained |
| LLaMA-13B | 13B | 1T | 77 | 3.8× overtrained |
| LLaMA-65B | 65B | 1.4T | 22 | ~Chinchilla |
| LLaMA-2 7B | 7B | 2T | 286 | 14× overtrained |
The 7B model trains on 1-2T tokens—7-14× more than Chinchilla optimal—to minimize serving costs.
Data Quality Considerations¶
The 20:1 rule assumes infinite homogeneous data. In practice:
- High-quality data is limited: Once you've exhausted quality sources, more tokens ≠ better
- Repetition hurts: Training on repeated data has diminishing returns
- Synthetic data: Can extend effective dataset size but may have different scaling properties
When to Deviate from Chinchilla¶
Scenario 1: Inference-Dominated Workloads¶
If inference cost >> training cost, overtrain smaller models.
Break-even analysis: Let \(r = \frac{\text{inference tokens}}{\text{training tokens}}\)
Overtraining by a factor \(k\) (shrinking to roughly \(\Psi/k\) parameters and training on enough extra tokens to reach comparable loss) is profitable when the extra training compute is repaid by inference savings over the serving lifetime:

$$\Delta C_{\text{train}} \;<\; 2\,\Psi\left(1 - \frac{1}{k}\right) \cdot r D$$

where \(2\Psi\) FLOPs per token approximates the forward pass of the original model and \(rD\) is the number of tokens served.
For typical ratios, overtraining pays off when serving >\(10^{13}\) tokens.
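A rough FLOP-based break-even sketch under stated assumptions (forward pass ≈ \(2\Psi\) FLOPs per token; the 1.5× training premium for the overtrained model mirrors the table in "The Inference Cost Problem"):

```python
# Chinchilla-optimal 70B baseline vs. a 7B model overtrained to similar loss.
# All figures are illustrative assumptions, not measurements.
PSI_BIG, PSI_SMALL = 70e9, 7e9
TRAIN_BIG = 6 * PSI_BIG * 20 * PSI_BIG         # ≈ 5.9e23 FLOPs (Chinchilla-optimal)
TRAIN_SMALL = 1.5 * TRAIN_BIG                  # assumed premium to match loss

extra_training = TRAIN_SMALL - TRAIN_BIG       # FLOPs paid up front
savings_per_token = 2 * (PSI_BIG - PSI_SMALL)  # inference FLOPs saved per token

break_even_tokens = extra_training / savings_per_token
print(f"break-even after ≈ {break_even_tokens:.1e} served tokens")
# ≈ 2e12 tokens under these particular assumptions; the threshold grows
# with the training premium, so the 1e13 figure above is a conservative bound
```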
Scenario 2: Data Scarcity¶
If high-quality data is exhausted at \(D_{\text{max}}\), cap the model size at roughly \(\Psi \approx D_{\text{max}} / 20\).
Training a larger model wastes compute on undertrained parameters.
Scenario 3: Capability Thresholds¶
Some capabilities emerge at specific model sizes, regardless of training tokens.
Chain-of-thought reasoning, for example, has been reported to emerge at roughly the 60B+ parameter scale. If you need this capability, a 70B model trained on 200B tokens (undertrained) may outperform a 7B model trained on 2T tokens (overtrained) on reasoning tasks.
Scenario 4: Time Constraints¶
Training time scales with the number of tokens processed. If wall-clock time is the binding constraint, fewer tokens means faster training, so an undertrained large model may be the pragmatic choice for getting something working quickly.
Post-Chinchilla Scaling Laws¶
Recent work has refined and extended Chinchilla:
Compute-Optimal vs Downstream-Optimal¶
Chinchilla optimizes pretraining loss, but downstream task performance can scale differently: some tasks benefit more from model size, others more from data.
Mixture-of-Experts Scaling¶
MoE models have different scaling:
- \(\Psi_{\text{total}}\) vs \(\Psi_{\text{active}}\)
- Only active parameters contribute to FLOPs per token
- More total parameters can improve loss at fixed compute
Scaling law for MoE (a simple parameterization splits the parameter term into total and active contributions):

$$L(\Psi_{\text{total}}, \Psi_{\text{active}}, D) = \frac{A_1}{\Psi_{\text{total}}^{\alpha_1}} + \frac{A_2}{\Psi_{\text{active}}^{\alpha_2}} + \frac{B}{D^\beta} + L_\infty$$

where \(\alpha_1\) and \(\alpha_2\) are fitted exponents for total and active parameters respectively.
Repeat Tokens¶
What if \(D > D_{\text{unique}}\)? Muennighoff et al. (2023) showed that repeated tokens are worth less than fresh ones: with repetition ratio \(r = D / D_{\text{unique}}\), the effective dataset size grows sublinearly in \(r\), governed by an empirically fitted decay coefficient \(\gamma\) that captures the diminishing value of repeated data.
4 epochs ≈ 4× the unique data is often acceptable; beyond 16 epochs, returns diminish rapidly.
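A toy saturation model of this effect (an illustration only, not Muennighoff et al.'s fitted functional form): each additional epoch contributes a fixed fraction of the value of the previous one, so effective data grows sublinearly and eventually saturates.

```python
def effective_epochs(epochs: int, decay: float = 0.93) -> float:
    """Toy model: epoch i contributes decay**i of a fresh epoch's value.
    `decay` is an illustrative constant, not a fitted coefficient;
    the total saturates near 1 / (1 - decay)."""
    return sum(decay**i for i in range(epochs))

for n in (1, 4, 16, 64):
    print(f"{n:>3} epochs ≈ {effective_epochs(n):.1f} epochs of unique data")
```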
Exercises¶
- Compute-optimal calculation: You have \(C = 10^{22}\) FLOPs. Calculate the Chinchilla-optimal model size and token count.
Solution
Using the Chinchilla 20:1 rule:
From \(C = 6\Psi D\) and \(D^* = 20\Psi^*\):
Solving for optimal model size: $$\Psi^* = \sqrt{\frac{C}{120}} = \sqrt{\frac{10^{22}}{120}} = \sqrt{8.33 \times 10^{19}} \approx 9.1 \times 10^{9} = \boxed{9.1\text{B parameters}}$$

Optimal token count: $$D^* = 20 \times \Psi^* = 20 \times 9.1 \times 10^9 = \boxed{182\text{B tokens}}$$

Verification: $$C = 6 \times 9.1 \times 10^9 \times 182 \times 10^9 = 9.94 \times 10^{21} \approx 10^{22} \checkmark$$
| Parameter | Value |
|---|---|
| Compute budget | \(10^{22}\) FLOPs |
| Optimal \(\Psi^*\) | 9.1B |
| Optimal \(D^*\) | 182B |
| Tokens/parameter | 20 |
- Undertrained analysis: A model has 30B parameters and was trained on 150B tokens.
- What's the compute used?
- What's the Chinchilla-optimal allocation for that compute?
- By what factor is the model over/underparameterized?
Solution
Part 1: Compute used

$$C = 6\Psi D = 6 \times 30 \times 10^9 \times 150 \times 10^9 = 2.7 \times 10^{22} \text{ FLOPs}$$

Part 2: Chinchilla-optimal allocation

$$\Psi^* = \sqrt{\frac{C}{120}} = \sqrt{2.25 \times 10^{20}} = 1.5 \times 10^{10} = 15\text{B}, \qquad D^* = 20\Psi^* = 300\text{B}$$

Part 3: Over/underparameterization factor

$$\frac{\Psi}{\Psi^*} = \frac{30\text{B}}{15\text{B}} = 2\times \text{ overparameterized}, \qquad \frac{D^*}{D} = \frac{300\text{B}}{150\text{B}} = 2\times \text{ undertrained}$$
Summary:
| Metric | Actual | Optimal | Ratio |
|---|---|---|---|
| Parameters | 30B | 15B | 2× over |
| Tokens | 150B | 300B | 2× under |
| Tokens/param | 5 | 20 | 4× below optimal |
The model is severely undertrained: it has 4× fewer tokens per parameter than Chinchilla-optimal.
- Inference break-even: A 70B Chinchilla-optimal model costs $10M to train. A 7B overtrained model achieving similar loss costs $15M to train. Inference cost is $0.001 per 1M tokens for 70B, $0.0001 per 1M tokens for 7B. How many tokens must you serve before overtraining is profitable?
Solution
Total cost model: $$C_{\text{total}} = C_{\text{train}} + c \times T$$

where \(T\) is tokens served (in millions) and \(c\) is the inference price per 1M tokens.
For the 70B Chinchilla model: $$C_{70B} = \$10\text{M} + \$0.001 \times T$$

For the 7B overtrained model: $$C_{7B} = \$15\text{M} + \$0.0001 \times T$$

Break-even condition: $$\$10\text{M} + \$0.001 \times T = \$15\text{M} + \$0.0001 \times T \;\Rightarrow\; \$0.0009 \times T = \$5\text{M} \;\Rightarrow\; T \approx 5.6 \times 10^{9} \text{ million tokens} \approx 5.6 \times 10^{15} \text{ tokens}$$
Interpretation:
| Tokens Served | Cheaper Option |
|---|---|
| < 5.6 quadrillion | 70B Chinchilla |
| > 5.6 quadrillion | 7B overtrained |
For context, a high-volume inference service might serve ~1–10T tokens/day. At 5T tokens/day, the break-even point of \(5.6 \times 10^{15}\) tokens arrives after roughly 1,100 days, or about 3 years.
For very high-volume inference, overtraining can pay off—but the break-even timeline depends heavily on actual serving volume.
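A two-line check of the break-even arithmetic (the prices and training costs are the exercise's givens):

```python
train_70b, train_7b = 10e6, 15e6         # training cost in dollars
price_70b, price_7b = 0.001, 0.0001      # dollars per 1M served tokens

break_even_millions = (train_7b - train_70b) / (price_70b - price_7b)
break_even_tokens = break_even_millions * 1e6
print(f"break-even ≈ {break_even_tokens:.2e} tokens "
      f"(≈ {break_even_tokens / 5e12:.0f} days at 5T tokens/day)")
```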
- Data budget: You have exactly 500B high-quality tokens. What's the largest model you should train?
Solution
Using the Chinchilla 20:1 rule in reverse:

If \(D_{\text{max}} = 500\text{B}\) tokens and the optimal ratio is \(D^*/\Psi^* = 20\): $$\Psi^* = \frac{D_{\text{max}}}{20} = \frac{500\text{B}}{20} = 25\text{B parameters}$$

Compute required: $$C = 6\Psi D = 6 \times 25 \times 10^9 \times 500 \times 10^9 = 7.5 \times 10^{22} \text{ FLOPs}$$
Why not larger?
| Model Size | Issue |
|---|---|
| > 25B | Undertrained (tokens/param < 20) |
| < 25B | Wasted data capacity |
| = 25B | Chinchilla-optimal for data budget |
Alternative: Accept undertraining
If you train a 50B model on the same 500B tokens:

- Tokens/param = 10 (half of optimal)
- Loss is somewhat lower than the 25B model's (more parameters, same data), but it costs 2× the training compute and the model is undertrained for its size
- The larger model may also cross capability thresholds that the 25B misses

Recommendation: 25B is the compute-efficient choice for this data budget; going larger buys a modest loss improvement and possible capability gains at the cost of extra compute and an off-optimal allocation.
- MoE analysis: A dense 70B model and a MoE with 70B active / 1T total parameters both train on 1.4T tokens. Which achieves lower loss? (Assume MoE gets ~1.5× the loss reduction per parameter from total vs active)
Solution
Dense 70B model:

Using \(L(\Psi, D) = \frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} + L_\infty\), the parameter-dependent term scales as \(\Psi^{-\alpha} = (70\text{B})^{-0.34}\); we normalize this term to 1.00 as the baseline in the table below.
MoE model analysis:
The MoE has:

- Active parameters: \(\Psi_{\text{active}} = 70\text{B}\)
- Total parameters: \(\Psi_{\text{total}} = 1\text{T}\)

Granting total parameters the assumed extra benefit, a simple way to model it (the simplification used in this solution) is to take the effective parameter count for loss scaling as the geometric mean of active and total parameters:

$$\Psi_{\text{eff}} = \sqrt{\Psi_{\text{active}} \cdot \Psi_{\text{total}}} = \sqrt{70\text{B} \times 1\text{T}} \approx 265\text{B}$$
Loss comparison:
| Model | Effective \(\Psi\) | Relative Loss Term |
|---|---|---|
| Dense 70B | 70B | \((70\text{B})^{-0.34} = 1.00\) |
| MoE 70B/1T | 265B | \((265\text{B})^{-0.34} \approx 0.64\) |

The MoE achieves a ~36% lower parameter-dependent loss term with the same compute per forward pass.
Why MoE wins:
- Same inference cost (70B active params)
- More knowledge stored in experts (1T total params)
- Each token routes to specialists
- Effective capacity >> active capacity
Caveat: Training MoE requires ~1T parameters in memory/communication, increasing infrastructure complexity. The 1.5× factor is empirical and varies by architecture.
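The exercise's simplified comparison in code (the geometric-mean effective-parameter rule is the simplification used in this solution, not a general law):

```python
ALPHA = 0.34                      # parameter exponent from the fitted loss above

dense_params = 70e9
moe_active, moe_total = 70e9, 1e12
moe_effective = (moe_active * moe_total) ** 0.5   # geometric-mean simplification

relative_term = (moe_effective / dense_params) ** (-ALPHA)
print(f"effective params ≈ {moe_effective / 1e9:.0f}B, "
      f"relative parameter loss term ≈ {relative_term:.2f}")
# -> effective params ≈ 265B, relative term ≈ 0.64 (≈36% lower than dense)
```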
Key Takeaways¶
- Pre-Chinchilla models were undertrained: GPT-3, Gopher, etc. used 3-4× too many parameters relative to data.
- The 20:1 rule: Compute-optimal training uses ~20 tokens per parameter.
- Chinchilla optimizes loss/FLOP: This isn't the same as minimizing inference cost or maximizing capability.
- Overtraining is often rational: When inference costs dominate, smaller overtrained models win.
- Know your constraints: Data limits, inference budget, time pressure, and capability requirements all shift the optimal allocation.