Phase Transitions in Scaling

Scaling is not smooth. There are regimes where model capacity dominates, others where optimizer noise dominates, and transitions between them. Understanding these phases lets us predict when capabilities will emerge—and when throwing more compute won't help.

The Question: A 10B parameter model can't do multi-step arithmetic. A 100B model can. Where did this capability come from? The loss curves are smooth, but the capability appeared suddenly. How do we explain—and predict—these phase transitions?

The Puzzle of Emergence¶

Plot loss against compute: you get a smooth power law. Plot capability against compute: you often get a step function.

Loss                          Capability
  │                              │
  │╲                             │        ┌────
  │ ╲                            │        │
  │  ╲                           │        │
  │   ╲                          │────────┘
  │    ╲___                      │
  └────────────→ Compute         └────────────→ Compute
    Smooth decay                   Sharp transition

This is the emergence puzzle: smooth loss improvement hides discrete capability acquisition.

Types of Phase Transitions¶

Type 1: Capability Emergence¶

Some capabilities appear suddenly at scale:

Capability	Approximate Threshold	Notes
In-context learning	~1B parameters	Observed in GPT-2 (1.5B); limited forms seen in smaller models
Chain-of-thought reasoning	~60B parameters	Wei et al. (2022); threshold depends on evaluation methodology
Multi-step arithmetic	~100B parameters	Strongly prompt-dependent
Theory of mind	~100B+ parameters	Highly contested—many researchers dispute LLMs possess this

Caveat on Emergence

Schaeffer et al. (2023) showed that many "emergent" abilities may be artifacts of discrete evaluation metrics rather than true phase transitions. When using continuous metrics (e.g., token-level accuracy instead of exact-match), many apparent discontinuities become smooth improvements. The thresholds above are approximate, model-dependent, and should be treated with appropriate skepticism.

Mathematical model: Let capability $c$ depend on loss via:

\[P(c | L) = \text{sigmoid}\left(\frac{L_c - L}{\tau}\right) = \frac{1}{1 + e^{-(L_c - L)/\tau}}\]

Where:

$L_c$: critical loss threshold for capability $c$
$\tau$: transition sharpness

As $\tau \to 0$, the transition becomes a step function.

Type 2: Training Dynamics Phases¶

The optimization process itself has phases:

Phase 1: Random Guessing - Loss ≈ log(vocab_size) - Model outputs uniform distribution - Duration: first few hundred steps

Phase 2: Unigram Learning - Model learns token frequencies - Rapid initial loss drop - Duration: ~1% of training

Phase 3: Bigram/N-gram - Local correlations learned - Slower improvement - Duration: ~5% of training

Phase 4: Semantic Learning - Long-range dependencies - Power law regime - Duration: bulk of training

Phase 5: Memorization - Training loss continues dropping - Validation loss plateaus - Overfitting begins

Type 3: Grokking¶

A phenomenon where: 1. Model memorizes training data (training loss → 0) 2. Validation loss stays high (no generalization) 3. Suddenly, after extended training, validation loss drops 4. Model has "grokked" the underlying pattern

Loss
  │
  │ Training     Validation
  │    ↓            ↓
  │    ●───────────────────
  │     ╲
  │      ╲    ●●●●●●╲
  │       ╲          ╲
  │        ╲____      ╲____
  │                 ↑
  └─────────────────┼────→ Steps
              Grokking point

Grokking suggests generalization happens via a distinct phase transition, not gradual improvement.

The 4+3 Phase Model¶

Caballero et al. (2023) identified a richer phase structure in the $(D_{\text{data}}, D_{\text{task}})$ plane, where $D$ measures complexity.

The Four Main Phases¶

Phase I: Model-Capacity Limited

The model is too small to represent the target function.

\[L \approx L_{\infty}(\Psi) = \frac{A}{\Psi^\alpha}\]

Loss is determined by model size, independent of data or optimization.

Symptoms:

Adding more data doesn't help
Longer training doesn't help
Need a bigger model

Phase II: Optimizer-Noise Limited

Gradient noise prevents convergence to the true minimum.

\[L \approx L_{\text{opt}}(B, \eta) = \frac{\sigma^2}{B \cdot f(\eta)}\]

Where $\sigma^2$ is the variance of per-sample gradients, $B$ is batch size, $\eta$ is learning rate, and $f(\eta)$ captures learning rate effects.

Symptoms:

Larger batch sizes help
Learning rate tuning matters a lot
Loss fluctuates around a floor

Phase III: Data Limited

Not enough data to learn the task.

\[L \approx \frac{B}{D^\beta}\]

Symptoms:

More data directly reduces loss
Model may be overfitting
Validation loss >> training loss

Phase IV: Compute-Optimal

The balanced regime where Chinchilla optimality holds.

\[L = \frac{A}{\Psi^\alpha} + \frac{B}{D^\beta} + L_\infty\]

Symptoms:

Both model size and data matter
Smooth power law scaling
Optimal allocation is $D \approx 20\Psi$

The Three Transition Zones¶

Between phases, behavior is complex:

I→IV Transition: As $\Psi$ increases, you move from capacity-limited to balanced.

II→IV Transition: As $B$ increases, optimizer noise decreases until data/compute limits dominate.

III→IV Transition: As $D$ increases, data limitation relaxes.

Double Descent¶

A striking non-monotonic phenomenon:

Test Loss
    │
    │  ●
    │   ●        ●
    │    ●      ● ●
    │     ●    ●   ●
    │      ●  ●     ●
    │       ●●       ●────
    │         ↑
    └─────────┼──────────→ Model Size
         Interpolation
           threshold

Classical regime (small models): Bigger = better generalization

Interpolation threshold (model just fits training data): Overfitting peak

Modern regime (overparameterized): Bigger = better again

This occurs because: 1. Small models underfit → high bias 2. Medium models memorize → high variance 3. Large models find smooth interpolations → low variance

For distributed training: Don't stop at the interpolation threshold. Push through to the modern regime.

Emergent Capabilities: A Deeper Look¶

The Metric Matters¶

Wei et al. (2022) showed that emergence depends on how you measure:

Metric	Appears Emergent?
Accuracy (exact match)	Yes
Token-level log-likelihood	No
Partial credit scoring	Sometimes

With log-likelihood, capabilities improve smoothly. With exact-match accuracy, they appear suddenly.

Implication: The "emergence" may be an artifact of discrete evaluation metrics, not the model's internal representations.

The Circuit Formation Hypothesis¶

Capabilities may emerge when internal "circuits" complete:

Individual components develop gradually
Circuit requires all components
Capability appears when last component forms

Like building a bridge: progress on foundations is invisible until the span connects.

Predicting Emergence¶

Given current trends, when will capability $c$ appear?

Method 1: Extrapolate loss curve, estimate $L_c$

\[\Psi_c \approx \left(\frac{A}{L_c - L_\infty}\right)^{1/\alpha}\]

Method 2: Use linear probes

Train a linear classifier on intermediate representations. When linear probe accuracy exceeds random:

The representation contains the capability
Full capability may emerge soon

Method 3: Partial capability metrics

Design graded evaluations. Look for smooth improvement that predicts discrete threshold.

Implications for Distributed Training¶

1. Scale Planning¶

Know which phase you're in:

Phase	Optimization Strategy
Capacity-limited	Scale model (more GPUs for TP/PP)
Data-limited	Improve data pipeline
Optimizer-limited	Tune hyperparameters, increase batch
Compute-optimal	Balanced scaling

2. Checkpoint Strategy¶

Near phase transitions:

Checkpoint more frequently
Capabilities may appear between evaluations
Don't stop just before a transition

3. Batch Size Dynamics¶

In optimizer-noise-limited phase:

Larger batches help
Can scale batch size with training

In data-limited phase:

Batch size doesn't help loss
But larger batch = faster iteration

4. Curriculum Effects¶

Some evidence that training order affects phase transitions:

Easy examples first may accelerate Phase I→IV transition
Hard examples early may delay grokking

5. Compute Allocation¶

If targeting a specific capability:

\[C_{\text{needed}} = 6 \cdot \Psi_c \cdot D_{\text{opt}}(\Psi_c)\]

Where $\Psi_c$ is the capability threshold model size.

Underprovisioning compute below $C_{\text{needed}}$ wastes everything—you'll never reach the threshold.

The Scaling Hypothesis¶

A strong form of the scaling hypothesis:

Given enough compute, any capability will emerge.

Evidence for:

Larger models consistently gain capabilities
No capability has been found that doesn't eventually appear

Evidence against:

Some capabilities may require architectural changes
Data quality limits may be fundamental
Compute/time may be practically infeasible

Practical stance: Assume capabilities will emerge, but plan for uncertainty in thresholds.

Case Study: Arithmetic Capability¶

Tracking multi-digit addition across scales (illustrative—exact numbers are model- and prompt-dependent):

Model Size	2-digit	3-digit	4-digit	5-digit
1B	~95%	~60%	~10%	~0%
10B	~99%	~90%	~50%	~5%
100B	~99%	~99%	~85%	~40%
500B	~99%	~99%	~95%	~80%

Note: These figures are approximate and illustrative of the general pattern. Actual accuracy depends heavily on the specific model, training data, prompting format, and evaluation methodology.

Each digit complexity has its own phase transition. More digits = higher threshold.

This suggests capabilities have nested phase structures: easier variants emerge first.

Exercises¶

Phase identification: A 10B model trained on 200B tokens has training loss 2.1 and validation loss 2.3. Adding more data to 500B tokens reduces training loss to 2.0 but validation loss stays at 2.3. What phase is this model in?

Solution

Observations:

Condition	Training Loss	Validation Loss
200B tokens	2.1	2.3
500B tokens	2.0	2.3

Key indicators:

Validation loss doesn't improve when adding more data
Training loss continues to decrease (2.1 → 2.0)
Gap between train/val loss exists and persists

Diagnosis: Phase I — Capacity-Limited

The model has insufficient capacity to generalize better. Adding data improves training loss (more optimization steps) but not validation loss (representation bottleneck).

\[L_{\text{val}} \approx L_\infty(\Psi) = \frac{A}{\Psi^\alpha} + \text{irreducible}\]

Evidence against other phases:

Phase	Expected Behavior	Observed?
Data-limited	More data → lower val loss	No ✗
Optimizer-limited	Large batch → lower loss	Not tested
Compute-optimal	Both \Psi and D matter	Only \Psi matters

Recommendation: Scale model size (increase \Psi) rather than data or training time.

Emergence prediction: A capability appears at 50% accuracy for 50B parameters. At 10B parameters, accuracy is 5%. Assuming the transition follows a sigmoid with $\tau = 0.1$ nats, estimate the loss threshold $L_c$ for this capability.

Solution

Using the emergence model:

\[P(c | L) = \text{sigmoid}\left(\frac{L_c - L}{\tau}\right) = \frac{1}{1 + e^{-(L_c - L)/\tau}}\]

From the 50B data point (50% accuracy):

At 50% accuracy, the sigmoid argument is 0:

\[\text{sigmoid}(0) = 0.5 \implies \frac{L_c - L_{50B}}{\tau} = 0\]

\[\boxed{L_c = L_{50B}}\]

From the 10B data point (5% accuracy):

\[0.05 = \frac{1}{1 + e^{-(L_c - L_{10B})/\tau}}\]

Solving for the exponent:

\[1 + e^{-(L_c - L_{10B})/0.1} = 20\]

\[e^{-(L_c - L_{10B})/0.1} = 19\]

\[-(L_c - L_{10B})/0.1 = \ln(19) = 2.94\]

\[L_c - L_{10B} = -0.294 \text{ nats}\]

Estimating the losses:

Using scaling law $L(\Psi) \propto \Psi^{-\alpha}$ with $\alpha \approx 0.34$:

\[\frac{L_{10B} - L_\infty}{L_{50B} - L_\infty} = \left(\frac{50}{10}\right)^{0.34} = 5^{0.34} = 1.70\]

If $L_{50B} = L_c$ and $L_{10B} = L_c + 0.294$:

The capability threshold $L_c$ is approximately the loss achieved by a 50B model at the given training stage.

Interpretation: To predict when capability emerges, extrapolate the loss curve and find when $L = L_c$.

Double descent: You're training a model and observe test loss increasing. You have compute to either (a) train longer or (b) scale up the model 2×. Which is more likely to help, and why?

Solution

Analyzing the situation:

Test loss increasing indicates we're near the interpolation threshold (double descent peak).

Test Loss
    │
    │   ●  ← You are here
    │  ● ●
    │ ●   ●
    │●     ●
    │       ●────
    └────────────→ Model Size / Training Time

Option (a): Train longer

May worsen overfitting initially
Could eventually lead to grokking (if the task allows)
Risky: may never recover

Option (b): Scale model 2×

Moves into the overparameterized regime
Large models find smoother solutions
More likely to reach "modern regime" of double descent

Recommendation: (b) Scale up the model 2×

Option	Expected Outcome	Risk
Train longer	Grokking possible but uncertain	May never recover
Scale 2×	Likely enters modern regime	Compute cost

Mathematical justification:

At interpolation threshold:

\[\Psi \approx D_{\text{train}}\]

Scaling to $2\Psi$ while keeping $D$ fixed:

\[\frac{\Psi_{\text{new}}}{D} = 2 \gg 1\]

This moves firmly into the overparameterized regime where implicit regularization helps generalization.

Capability targeting: You need chain-of-thought reasoning, which emerges around 60B parameters. Your compute budget is $C = 10^{23}$ FLOPs. Is this sufficient? If not, what compute is needed?

Solution

Compute required for 60B Chinchilla-optimal:

Using $D^* = 20\Psi$ and $C = 6\Psi D$:

\[C_{\text{needed}} = 6 \times \Psi \times 20\Psi = 120\Psi^2\]

\[C_{\text{needed}} = 120 \times (60 \times 10^9)^2 = 120 \times 3.6 \times 10^{21}\]

\[C_{\text{needed}} = 4.32 \times 10^{23} \text{ FLOPs}\]

Comparison:

Budget	Required	Sufficient?
$10^{23}$	$4.32 \times 10^{23}$	No (4.3× short)

Options with $C = 10^{23}$:

Option 1: Smaller Chinchilla-optimal model $$\Psi^* = \sqrt{\frac{10^{23}}{120}} = 2.89 \times 10^{10} \approx 29\text{B}$$

This is below the 60B threshold—chain-of-thought won't emerge.

Option 2: Undertrained 60B model $$D = \frac{C}{6\Psi} = \frac{10^{23}}{6 \times 60 \times 10^9} = 278\text{B tokens}$$

Tokens/param = $278/60 = 4.6$ (severely undertrained)

This might work if capability depends primarily on model size, not training.

Recommendation:

\[\boxed{C_{\text{needed}} = 4.32 \times 10^{23} \text{ FLOPs}}\]

To guarantee chain-of-thought, budget ~4-5× more compute, or accept risk with undertrained 60B.

Grokking detection: How would you modify your training monitoring to detect grokking early? What metrics would you track?

Solution

Grokking signature:

Training loss → 0 (memorization)
Validation loss stays high
Extended plateau
Sudden validation loss drop (grokking)

Metrics to track:

Metric	Purpose	Grokking Signal
Train/val loss gap	Generalization	Large and stable before grokking
Weight norm $\\|W\\|$	Regularization progress	Decreasing during grokking
Gradient norm $\\|\nabla L\\|$	Optimization dynamics	Spikes before transition
Hessian eigenvalues	Loss landscape	Sharpness decreases
Linear probe accuracy	Internal representations	Improves before grokking

Implementation:

# Track these during training
metrics = {
    'train_loss': [],
    'val_loss': [],
    'generalization_gap': [],  # val - train
    'weight_norm': [],
    'gradient_norm': [],
    'probe_accuracy': [],  # linear probe on val set
}

# Grokking detection heuristics
def detect_grokking_potential(metrics, window=1000):
    # Condition 1: Train loss very low, val loss high
    train_converged = metrics['train_loss'][-1] < 0.1
    val_high = metrics['val_loss'][-1] > 0.5

    # Condition 2: Gap stable for many steps
    gap_stable = std(metrics['generalization_gap'][-window:]) < 0.01

    # Condition 3: Weight norm still decreasing
    weight_decreasing = metrics['weight_norm'][-1] < metrics['weight_norm'][-window]

    return train_converged and val_high and weight_decreasing

Actionable recommendations:

Don't stop early: If train loss is low but val loss high, grokking may be imminent
Increase weight decay: Can accelerate grokking
Checkpoint frequently: Save models around potential transition
Use linear probes: Early warning of representation learning

Key insight: Grokking suggests memorization → generalization transition. Track weight dynamics, not just loss.

Key Takeaways¶

Smooth loss hides discrete capabilities: Loss improves gradually; capabilities appear suddenly.
Four phases exist: Capacity-limited, optimizer-limited, data-limited, and compute-optimal.
Double descent is real: Don't stop at the interpolation threshold—push through.
Emergence may be metric-dependent: Use appropriate metrics to track progress toward capabilities.
Know your phase: Different phases require different optimization strategies.
Plan for thresholds: If targeting a capability, ensure sufficient compute to reach its threshold.