Phase Transitions in Scaling
Scaling is not smooth. There are regimes where model capacity dominates, others where optimizer noise dominates, and transitions between them. Understanding these phases lets us predict when capabilities will emerge—and when throwing more compute won't help.
The Question: A 10B parameter model can't do multi-step arithmetic. A 100B model can. Where did this capability come from? The loss curves are smooth, but the capability appeared suddenly. How do we explain—and predict—these phase transitions?
The Puzzle of Emergence
Plot loss against compute: you get a smooth power law. Plot capability against compute: you often get a step function.
[Figure: two side-by-side sketches against compute. Left panel, Loss: smooth decay. Right panel, Capability: flat, then a sharp transition to a higher plateau.]
This is the emergence puzzle: smooth loss improvement hides discrete capability acquisition.
Types of Phase Transitions
Type 1: Capability Emergence
Some capabilities appear suddenly at scale:
| Capability | Approximate Threshold | Notes |
|---|---|---|
| In-context learning | ~1B parameters | Observed in GPT-2 (1.5B); limited forms seen in smaller models |
| Chain-of-thought reasoning | ~60B parameters | Wei et al. (2022); threshold depends on evaluation methodology |
| Multi-step arithmetic | ~100B parameters | Strongly prompt-dependent |
| Theory of mind | ~100B+ parameters | Highly contested—many researchers dispute LLMs possess this |
Caveat on Emergence
Schaeffer et al. (2023) showed that many "emergent" abilities may be artifacts of discrete evaluation metrics rather than true phase transitions. When using continuous metrics (e.g., token-level accuracy instead of exact-match), many apparent discontinuities become smooth improvements. The thresholds above are approximate, model-dependent, and should be treated with appropriate skepticism.
Mathematical model: Let the probability of exhibiting capability \(c\) depend on loss \(L\) via a sigmoid:
\[P(c \mid L) = \frac{1}{1 + e^{(L - L_c)/\tau}}\]
Where:
- \(L_c\): critical loss threshold for capability \(c\)
- \(\tau\): transition sharpness
As \(\tau \to 0\), the transition becomes a step function.
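A minimal sketch of this model, assuming the logistic form \(P(c \mid L) = 1 / (1 + e^{(L - L_c)/\tau})\), which is the form consistent with the exercises at the end of this chapter. The function name and the sample values are illustrative.

```python
import math

def capability_probability(loss: float, critical_loss: float, tau: float) -> float:
    """Logistic emergence model: probability of success at a given loss."""
    # Below the critical loss the argument is negative and the probability approaches 1.
    z = (loss - critical_loss) / tau
    return 1.0 / (1.0 + math.exp(z))

# Same loss sweep, two sharpness values: a small tau looks like a step function.
for tau in (0.5, 0.05):
    row = [round(capability_probability(L, 2.0, tau), 3) for L in (2.4, 2.2, 2.0, 1.8, 1.6)]
    print(f"tau={tau}: {row}")
```

With the larger \(\tau\) the probabilities change gradually across the sweep; with the smaller \(\tau\) they jump from near 0 to near 1 around \(L_c\).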
Type 2: Training Dynamics Phases
The optimization process itself has phases:
Phase 1: Random Guessing
- Loss ≈ log(vocab_size)
- Model outputs uniform distribution
- Duration: first few hundred steps
Phase 2: Unigram Learning
- Model learns token frequencies
- Rapid initial loss drop
- Duration: ~1% of training
Phase 3: Bigram/N-gram
- Local correlations learned
- Slower improvement
- Duration: ~5% of training
Phase 4: Semantic Learning
- Long-range dependencies
- Power law regime
- Duration: bulk of training
Phase 5: Memorization
- Training loss continues dropping
- Validation loss plateaus
- Overfitting begins
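As a quick check on Phase 1, the cross-entropy of a uniform prediction over the vocabulary is \(\ln(\text{vocab\_size})\) nats. The sketch below computes that starting loss; the vocabulary size 50,257 is only an example value (GPT-2's tokenizer), not something this section specifies.

```python
import math

def uniform_loss_nats(vocab_size: int) -> float:
    """Cross-entropy (in nats) of a uniform distribution over the vocabulary."""
    return math.log(vocab_size)

vocab_size = 50_257  # example value; substitute your tokenizer's vocabulary size
print(f"expected initial loss: {uniform_loss_nats(vocab_size):.2f} nats")
```

If the first logged training loss is far from this value, that usually points at an initialization or data-pipeline problem rather than a scaling effect.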
Type 3: Grokking
A phenomenon where:
1. Model memorizes training data (training loss → 0)
2. Validation loss stays high (no generalization)
3. Suddenly, after extended training, validation loss drops
4. Model has "grokked" the underlying pattern
[Figure: loss vs. training steps. Training loss falls early and stays low; validation loss stays on a high plateau, then drops sharply at the grokking point.]
Grokking suggests generalization happens via a distinct phase transition, not gradual improvement.
The 4+3 Phase Model
Paquette et al. (2024) identified a richer phase structure in the \((D_{\text{data}}, D_{\text{task}})\) plane, where \(D\) measures complexity.
The Four Main Phases
Phase I: Model-Capacity Limited
The model is too small to represent the target function.
Loss is determined by model size, independent of data or optimization.
Symptoms:
- Adding more data doesn't help
- Longer training doesn't help
- Need a bigger model
Phase II: Optimizer-Noise Limited
Gradient noise prevents convergence to the true minimum, leaving a loss floor that scales roughly as
\[L_{\text{floor}} \propto \frac{\sigma^2}{B}\, f(\eta)\]
Where \(\sigma^2\) is the variance of per-sample gradients, \(B\) is batch size, \(\eta\) is learning rate, and \(f(\eta)\) captures learning rate effects.
Symptoms:
- Larger batch sizes help
- Learning rate tuning matters a lot
- Loss fluctuates around a floor
Phase III: Data Limited
Not enough data to learn the task.
Symptoms:
- More data directly reduces loss
- Model may be overfitting
- Validation loss >> training loss
Phase IV: Compute-Optimal
The balanced regime where Chinchilla optimality holds.
Symptoms:
- Both model size and data matter
- Smooth power law scaling
- Optimal allocation is \(D \approx 20\Psi\) (roughly 20 training tokens per parameter)
The Three Transition Zones
Between phases, behavior is complex:
I→IV Transition: As \(\Psi\) increases, you move from capacity-limited to balanced.
II→IV Transition: As \(B\) increases, optimizer noise decreases until data/compute limits dominate.
III→IV Transition: As \(D\) increases, data limitation relaxes.
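The symptom lists above can be turned into a rough triage rule. The sketch below is a hypothetical helper, not a method from the cited work: it assumes you have run three small ablations (more data, larger batch, larger model) and recorded the validation-loss improvement from each, and the `min_gain` threshold is an arbitrary placeholder.

```python
def diagnose_phase(gain_more_data: float,
                   gain_larger_batch: float,
                   gain_larger_model: float,
                   min_gain: float = 0.01) -> str:
    """Map validation-loss improvements from three ablations to a likely phase.

    Each gain is (baseline val loss - ablation val loss); positive means it helped.
    The checks run in a fixed order, so mixed signals resolve to the first match.
    """
    data_helps = gain_more_data > min_gain
    batch_helps = gain_larger_batch > min_gain
    model_helps = gain_larger_model > min_gain

    if model_helps and data_helps:
        return "compute-optimal"      # both model size and data matter
    if model_helps:
        return "capacity-limited"     # a bigger model helps, more data does not
    if data_helps:
        return "data-limited"         # more data helps, a bigger model does not
    if batch_helps:
        return "optimizer-limited"    # only a larger batch helps
    return "undetermined"

print(diagnose_phase(gain_more_data=0.0, gain_larger_batch=0.0, gain_larger_model=0.08))
```

Near a transition zone the ablations will give mixed signals, so the returned label is best read as a starting hypothesis rather than a verdict.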
Double Descent
A striking non-monotonic phenomenon:
[Figure: test loss vs. model size. Test loss falls, rises to a peak at the interpolation threshold, then falls again and flattens.]
Classical regime (small models): Bigger = better generalization
Interpolation threshold (model just fits training data): Overfitting peak
Modern regime (overparameterized): Bigger = better again
This occurs because:
1. Small models underfit → high bias
2. Medium models memorize → high variance
3. Large models find smooth interpolations → low variance
For distributed training: Don't stop at the interpolation threshold. Push through to the modern regime.
Emergent Capabilities: A Deeper Look
The Metric Matters
Schaeffer et al. (2023) showed that emergence depends on how you measure:
| Metric | Appears Emergent? |
|---|---|
| Accuracy (exact match) | Yes |
| Token-level log-likelihood | No |
| Partial credit scoring | Sometimes |
With log-likelihood, capabilities improve smoothly. With exact-match accuracy, they appear suddenly.
Implication: The "emergence" may be an artifact of discrete evaluation metrics, not the model's internal representations.
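One way to see how a discrete metric can manufacture a sharp transition: if each of \(k\) answer tokens is correct independently with probability \(p\), exact-match accuracy is \(p^k\). The independence assumption and the sample values below are simplifications chosen for illustration.

```python
def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Exact-match accuracy when every token must be right (independence assumed)."""
    return per_token_accuracy ** answer_length

# Per-token accuracy improves steadily; exact match on a 20-token answer stays
# near zero for most of the sweep and then rises quickly.
for p in (0.80, 0.85, 0.90, 0.95, 0.99):
    print(f"per-token={p:.2f}  exact-match(k=20)={exact_match_accuracy(p, 20):.3f}")
```

The longer the required answer, the later and steeper the apparent jump, even though the underlying per-token quantity never changes abruptly.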
The Circuit Formation Hypothesis
Capabilities may emerge when internal "circuits" complete:
- Individual components develop gradually
- Circuit requires all components
- Capability appears when last component forms
Like building a bridge: progress on foundations is invisible until the span connects.
Predicting Emergence
Given current trends, when will capability \(c\) appear?
Method 1: Extrapolate loss curve, estimate \(L_c\)
Method 2: Use linear probes
Train a linear classifier on intermediate representations. When linear probe accuracy exceeds chance:
- The representation already encodes information relevant to the capability
- Full capability may emerge soon
Method 3: Partial capability metrics
Design graded evaluations. Look for smooth improvement on the graded metric that predicts when the discrete threshold will be crossed.
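Method 1 can be sketched as a log-log linear fit followed by solving for the compute at which the fitted curve crosses \(L_c\). This assumes a pure power law \(L = aC^{-b}\) with no irreducible-loss term, and the data points below are synthetic; real curves usually need an offset term and more careful fitting.

```python
import math

def fit_power_law(compute, loss):
    """Fit loss = a * compute**(-b) by ordinary least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope   # (a, b)

def compute_to_reach(critical_loss, a, b):
    """Invert loss = a * C**(-b) for the compute C where loss equals critical_loss."""
    return (a / critical_loss) ** (1.0 / b)

# Synthetic measurements generated from a = 20, b = 0.05 (illustrative values only).
compute = [1e19, 1e20, 1e21, 1e22]
loss = [20.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)
print(f"a={a:.2f}, b={b:.3f}, compute for L_c=1.5: {compute_to_reach(1.5, a, b):.2e} FLOPs")
```

Because the exponent \(b\) is small, the extrapolated compute is very sensitive to errors in both \(b\) and \(L_c\), which is one reason to cross-check with Methods 2 and 3.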
Implications for Distributed Training
1. Scale Planning
Know which phase you're in:
| Phase | Optimization Strategy |
|---|---|
| Capacity-limited | Scale model (more GPUs for TP/PP) |
| Data-limited | Improve data pipeline |
| Optimizer-limited | Tune hyperparameters, increase batch |
| Compute-optimal | Balanced scaling |
2. Checkpoint Strategy
Near phase transitions:
- Checkpoint more frequently
- Capabilities may appear between evaluations
- Don't stop just before a transition
3. Batch Size Dynamics
In optimizer-noise-limited phase:
- Larger batches help
- Can scale batch size with training
In data-limited phase:
- Batch size doesn't help loss
- But larger batch = faster iteration
4. Curriculum Effects
Some evidence that training order affects phase transitions:
- Easy examples first may accelerate Phase I→IV transition
- Hard examples early may delay grokking
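5. Compute Allocation
If targeting a specific capability, combining \(D^* = 20\Psi\) with \(C = 6\Psi D\) gives:
\[C_{\text{needed}} = 6\,\Psi_c\,D^* = 120\,\Psi_c^2\]
Where \(\Psi_c\) is the capability threshold model size.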
Underprovisioning compute below \(C_{\text{needed}}\) wastes everything—you'll never reach the threshold.
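Under the Chinchilla-style assumptions used elsewhere in this chapter (\(D^* = 20\Psi\) tokens and \(C = 6\Psi D\) FLOPs), the compute needed to train a threshold-sized model works out to \(120\Psi_c^2\). A small sketch; the function name and default arguments are illustrative.

```python
def compute_needed(threshold_params: float,
                   tokens_per_param: float = 20.0,
                   flops_per_param_token: float = 6.0) -> float:
    """FLOPs to train a model of threshold size on its compute-optimal token count."""
    tokens = tokens_per_param * threshold_params
    return flops_per_param_token * threshold_params * tokens

print(f"{compute_needed(60e9):.2e} FLOPs")  # 60B-parameter threshold
```

Because the requirement grows with the square of the threshold size, doubling \(\Psi_c\) quadruples the compute needed.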
The Scaling Hypothesis
A strong form of the scaling hypothesis:
Given enough compute, any capability will emerge.
Evidence for:
- Larger models consistently gain capabilities
- No capability has been found that doesn't eventually appear
Evidence against:
- Some capabilities may require architectural changes
- Data quality limits may be fundamental
- Compute/time may be practically infeasible
Practical stance: Assume capabilities will emerge, but plan for uncertainty in thresholds.
Case Study: Arithmetic Capability
Tracking multi-digit addition across scales (illustrative—exact numbers are model- and prompt-dependent):
| Model Size | 2-digit | 3-digit | 4-digit | 5-digit |
|---|---|---|---|---|
| 1B | ~95% | ~60% | ~10% | ~0% |
| 10B | ~99% | ~90% | ~50% | ~5% |
| 100B | ~99% | ~99% | ~85% | ~40% |
| 500B | ~99% | ~99% | ~95% | ~80% |
Note: These figures are approximate and illustrative of the general pattern. Actual accuracy depends heavily on the specific model, training data, prompting format, and evaluation methodology.
Each digit complexity has its own phase transition. More digits = higher threshold.
This suggests capabilities have nested phase structures: easier variants emerge first.
Exercises
- Phase identification: A 10B model trained on 200B tokens has training loss 2.1 and validation loss 2.3. Increasing the data to 500B tokens reduces training loss to 2.0, but validation loss stays at 2.3. What phase is this model in?
Solution
Observations:
| Condition | Training Loss | Validation Loss |
|---|---|---|
| 200B tokens | 2.1 | 2.3 |
| 500B tokens | 2.0 | 2.3 |
Key indicators:
- Validation loss doesn't improve when adding more data
- Training loss continues to decrease (2.1 → 2.0)
- Gap between train/val loss exists and persists
Diagnosis: Phase I — Capacity-Limited
The model has insufficient capacity to generalize better. Adding data improves training loss (more optimization steps) but not validation loss (representation bottleneck).
Evidence against other phases:
| Phase | Expected Behavior | Observed? |
|---|---|---|
| Data-limited | More data → lower val loss | No ✗ |
| Optimizer-limited | Large batch → lower loss | Not tested |
| Compute-optimal | Both \(\Psi\) and \(D\) matter | Only \(\Psi\) matters |
Recommendation: Scale model size (increase \(\Psi\)) rather than data or training time.
- Emergence prediction: A capability appears at 50% accuracy for 50B parameters. At 10B parameters, accuracy is 5%. Assuming the transition follows a sigmoid with \(\tau = 0.1\) nats, estimate the loss threshold \(L_c\) for this capability.
Solution
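Using the emergence model:
\[P(c \mid L) = \frac{1}{1 + e^{(L - L_c)/\tau}}\]
From the 50B data point (50% accuracy):
\[0.5 = \frac{1}{1 + e^{(L_{50B} - L_c)/\tau}}\]
At 50% accuracy, the sigmoid argument is 0:
\[L_{50B} = L_c\]
From the 10B data point (5% accuracy):
\[0.05 = \frac{1}{1 + e^{(L_{10B} - L_c)/0.1}}\]
Solving for the exponent:
\[e^{(L_{10B} - L_c)/0.1} = 19 \quad\Rightarrow\quad L_{10B} - L_c = 0.1 \ln 19 \approx 0.294\]
Estimating the losses:
Using scaling law \(L(\Psi) \propto \Psi^{-\alpha}\) with \(\alpha \approx 0.34\):
\[\frac{L_{10B}}{L_{50B}} = \left(\frac{10}{50}\right)^{-0.34} = 5^{0.34} \approx 1.73\]
If \(L_{50B} = L_c\) and \(L_{10B} = L_c + 0.294\):
\[1.73\,L_c = L_c + 0.294 \quad\Rightarrow\quad L_c \approx 0.40 \text{ nats}\]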
The capability threshold \(L_c\) is approximately the loss achieved by a 50B model at the given training stage.
Interpretation: To predict when capability emerges, extrapolate the loss curve and find when \(L = L_c\).
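The arithmetic in this solution can be checked numerically. The sketch assumes the logistic emergence model with its 50% point at \(L_c\), and a pure power law \(L \propto \Psi^{-\alpha}\) with no irreducible-loss term; variable names are illustrative.

```python
import math

tau = 0.1             # transition sharpness in nats
alpha = 0.34          # assumed parameter-scaling exponent
acc_small = 0.05      # accuracy of the 10B model
size_ratio = 50 / 10  # 50B model relative to the 10B model

# Invert the sigmoid: accuracy = 1 / (1 + exp((L - L_c) / tau))
loss_gap = tau * math.log(1 / acc_small - 1)   # L_10B - L_c

# Power law: L_10B / L_50B = size_ratio ** alpha, with L_50B = L_c
loss_ratio = size_ratio ** alpha
critical_loss = loss_gap / (loss_ratio - 1)

print(f"L_10B - L_c = {loss_gap:.3f} nats, L_c = {critical_loss:.2f} nats")
```

Adding an irreducible-loss offset to the power law would shift the estimate of \(L_c\), so the number here depends on that modeling choice.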
- Double descent: You're training a model and observe test loss increasing. You have compute to either (a) train longer or (b) scale up the model 2×. Which is more likely to help, and why?
Solution
Analyzing the situation:
Test loss increasing suggests we may be near the interpolation threshold (the double descent peak), although ordinary overfitting can look the same.
Option (a): Train longer
- May worsen overfitting initially
- Could eventually lead to grokking (if the task allows)
- Risky: may never recover
Option (b): Scale model 2×
- Moves into the overparameterized regime
- Large models find smoother solutions
- More likely to reach "modern regime" of double descent
Recommendation: (b) Scale up the model 2×
| Option | Expected Outcome | Risk |
|---|---|---|
| Train longer | Grokking possible but uncertain | May never recover |
| Scale 2× | Likely enters modern regime | Compute cost |
Mathematical justification:
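At the interpolation threshold, the model has just enough capacity to fit the training set of size \(D\). Scaling to \(2\Psi\) while keeping \(D\) fixed doubles the parameter-to-data ratio \(\Psi / D\).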
This moves firmly into the overparameterized regime where implicit regularization helps generalization.
- Capability targeting: You need chain-of-thought reasoning, which emerges around 60B parameters. Your compute budget is \(C = 10^{23}\) FLOPs. Is this sufficient? If not, what compute is needed?
Solution
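Compute required for 60B Chinchilla-optimal:
Using \(D^* = 20\Psi\) and \(C = 6\Psi D\):
\[C = 6\Psi \cdot 20\Psi = 120\Psi^2 = 120 \times (60 \times 10^9)^2 = 4.32 \times 10^{23} \text{ FLOPs}\]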
Comparison:
| Budget | Required | Sufficient? |
|---|---|---|
| \(10^{23}\) | \(4.32 \times 10^{23}\) | No (4.3× short) |
Options with \(C = 10^{23}\):
Option 1: Smaller Chinchilla-optimal model
\[\Psi^* = \sqrt{\frac{10^{23}}{120}} = 2.89 \times 10^{10} \approx 29\text{B}\]
This is below the 60B threshold—chain-of-thought won't emerge.
Option 2: Undertrained 60B model
\[D = \frac{C}{6\Psi} = \frac{10^{23}}{6 \times 60 \times 10^9} \approx 278\text{B tokens}\]
Tokens/param = \(278/60 = 4.6\) (severely undertrained)
This might work if capability depends primarily on model size, not training.
Recommendation:
To guarantee chain-of-thought, budget ~4-5× more compute, or accept risk with undertrained 60B.
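The two options can be computed directly from the same relations (\(D^* = 20\Psi\), \(C = 6\Psi D\)). A short sketch with illustrative function names:

```python
import math

def optimal_params_for_budget(budget_flops: float) -> float:
    """Largest compute-optimal model: solve 120 * params**2 = budget."""
    return math.sqrt(budget_flops / 120.0)

def tokens_for_fixed_model(budget_flops: float, params: float) -> float:
    """Tokens affordable for a fixed model size: solve 6 * params * tokens = budget."""
    return budget_flops / (6.0 * params)

budget = 1e23      # available compute in FLOPs
threshold = 60e9   # assumed capability threshold in parameters

opt = optimal_params_for_budget(budget)
tokens = tokens_for_fixed_model(budget, threshold)
print(f"Option 1: {opt / 1e9:.1f}B params (threshold {threshold / 1e9:.0f}B)")
print(f"Option 2: {tokens / 1e9:.0f}B tokens, {tokens / threshold:.1f} tokens/param")
```

Swapping in a different budget or threshold shows how quickly the tokens-per-parameter ratio collapses when the model size is fixed above what the budget supports.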
- Grokking detection: How would you modify your training monitoring to detect grokking early? What metrics would you track?
Solution
Grokking signature:
- Training loss → 0 (memorization)
- Validation loss stays high
- Extended plateau
- Sudden validation loss drop (grokking)
Metrics to track:
| Metric | Purpose | Grokking Signal |
|---|---|---|
| Train/val loss gap | Generalization | Large and stable before grokking |
| Weight norm \(\|W\|\) | Regularization progress | Decreasing during grokking |
| Gradient norm \(\|\nabla L\|\) | Optimization dynamics | Spikes before transition |
| Hessian eigenvalues | Loss landscape | Sharpness decreases |
| Linear probe accuracy | Internal representations | Improves before grokking |
Implementation:
```python
import statistics

# Per-step histories, appended to during training
metrics = {
    'train_loss': [],
    'val_loss': [],
    'generalization_gap': [],  # val - train
    'weight_norm': [],
    'gradient_norm': [],
    'probe_accuracy': [],  # linear probe on val set
}

# Grokking detection heuristic; the thresholds are illustrative and task-dependent
def detect_grokking_potential(metrics, window=1000):
    # Need a full window of history before the checks are meaningful
    needed = ('train_loss', 'val_loss', 'generalization_gap', 'weight_norm')
    if min(len(metrics[key]) for key in needed) < window:
        return False
    # Condition 1: train loss very low, val loss high
    train_converged = metrics['train_loss'][-1] < 0.1
    val_high = metrics['val_loss'][-1] > 0.5
    # Condition 2: gap stable for many steps
    gap_stable = statistics.pstdev(metrics['generalization_gap'][-window:]) < 0.01
    # Condition 3: weight norm lower than it was one window ago
    weight_decreasing = metrics['weight_norm'][-1] < metrics['weight_norm'][-window]
    return train_converged and val_high and gap_stable and weight_decreasing
```
Actionable recommendations:
- Don't stop early: If train loss is low but val loss high, grokking may be imminent
- Increase weight decay: Can accelerate grokking
- Checkpoint frequently: Save models around potential transition
- Use linear probes: Early warning of representation learning
Key insight: Grokking suggests memorization → generalization transition. Track weight dynamics, not just loss.
Key Takeaways
- Smooth loss hides discrete capabilities: Loss improves gradually; capabilities appear suddenly.
- Four phases exist: Capacity-limited, optimizer-limited, data-limited, and compute-optimal.
- Double descent is real: Don't stop at the interpolation threshold; push through.
- Emergence may be metric-dependent: Use appropriate metrics to track progress toward capabilities.
- Know your phase: Different phases require different optimization strategies.
- Plan for thresholds: If targeting a capability, ensure sufficient compute to reach its threshold.