Section 8.1: Why Training Fails¶
Reading time: 10 minutes
The Reality of Training¶
Here's what ML courses don't tell you: most training runs fail. They fail for all sorts of reasons:
- Wrong hyperparameters
- Bug in data pipeline
- Numerical instability
- Architecture mismatch
- Just bad luck with initialization
The difference between beginners and experts isn't that experts' runs always work—it's that experts can diagnose and fix failures quickly.
The Five Failure Modes¶
1. Gradient Explosion¶
What happens: Gradients grow exponentially, loss becomes NaN.
```text
Step 1:  loss = 2.34, grad_norm = 1.2
Step 10: loss = 3.56, grad_norm = 15.4
Step 20: loss = 45.2, grad_norm = 234.5
Step 30: loss = NaN,  grad_norm = Inf
```
Why it happens:
In a deep network, the gradient at an early layer is a product of per-layer factors (chain rule): \(\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial h_L} \prod_{l=1}^{L-1} \frac{\partial h_{l+1}}{\partial h_l}\).
If each factor is > 1, the product explodes: \(1.1^{100} \approx 10^{4}\).
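You can check the arithmetic directly. This is a minimal sketch; the factors 1.1 and 0.9 are illustrative, not measured from any real network (the 0.9 case previews the next failure mode).

```python
# Minimal sketch: the end-to-end gradient scale is a product of
# per-layer factors, so depth amplifies anything away from 1.0.
depth = 100
for factor in (1.1, 0.9):
    scale = 1.0
    for _ in range(depth):
        scale *= factor  # one chain-rule factor per layer
    print(f"factor {factor}: gradient scale after {depth} layers = {scale:.2e}")
# factor 1.1 -> ~1.4e+04 (explosion), factor 0.9 -> ~2.7e-05 (vanishing)
```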
How to fix:
- Reduce learning rate
- Add gradient clipping: rescale the gradient when its norm exceeds a threshold, \(g \leftarrow g \cdot \min(1, c / \lVert g \rVert)\) where \(c\) is the clip threshold (see the sketch after this list)
- Use LayerNorm/BatchNorm
- Better initialization
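A minimal sketch of where clipping sits in a training step, using PyTorch's `torch.nn.utils.clip_grad_norm_`. The model, data, and `max_norm=1.0` here are placeholders, not recommendations:

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)                      # stand-in model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), y)
loss.backward()
# Rescale all gradients so their combined norm is at most max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```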
2. Gradient Vanishing¶
What happens: Gradients shrink to zero, learning stops.
```text
Step 1:    grad_norm = 0.1
Step 100:  grad_norm = 0.001
Step 1000: grad_norm = 0.0000001

Loss: 2.34 → 2.33 → 2.33 → 2.33 → 2.33 (stuck!)
```
Why it happens:
Same product, but factors < 1: \(0.9^{100} \approx 10^{-5}\)
Also: sigmoid/tanh saturate at extremes, giving gradients near zero.
How to fix:
- Use ReLU-family activations
- Add residual connections: \(h_{l+1} = h_l + f(h_l)\) (see the sketch after this list)
- Use LSTM/GRU for sequences
- Careful initialization (Xavier/He)
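A minimal sketch of a residual block in PyTorch (the `ResidualBlock` name and layer sizes are illustrative). Because the identity path adds \(h_l\) straight through, the gradient always has a direct route back, even when \(f\) contributes very little:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """h_{l+1} = h_l + f(h_l): the skip path keeps gradients flowing."""
    def __init__(self, dim: int):
        super().__init__()
        self.f = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.f(h)   # identity path + learned residual

block = ResidualBlock(64)
h = torch.randn(8, 64)
print(block(h).shape)  # torch.Size([8, 64])
```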
3. Loss Plateau¶
What happens: Loss stops decreasing but hasn't converged.
```text
Step 1000: loss = 1.45
Step 2000: loss = 1.44
Step 3000: loss = 1.44
Step 4000: loss = 1.44 (stuck!)
```
Why it happens:
- Learning rate too low
- Stuck in local minimum
- Model capacity reached
- Data not shuffled (seeing same patterns)
How to fix:
- Increase learning rate
- Use learning rate warmup + decay (see the sketch after this list)
- Try different optimizer (Adam often escapes plateaus)
- Verify data shuffling
- Increase model capacity
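A minimal sketch of linear warmup followed by cosine decay using `torch.optim.lr_scheduler.LambdaLR`. The step counts and base learning rate are placeholders to adapt to your own run:

```python
import math
import torch

model = torch.nn.Linear(10, 1)                       # stand-in model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

warmup_steps, total_steps = 500, 10_000              # illustrative values

def lr_lambda(step: int) -> float:
    if step < warmup_steps:
        return step / max(1, warmup_steps)            # linear warmup: 0 -> 1
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay: 1 -> 0

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Inside the training loop:
#   loss.backward(); optimizer.step(); scheduler.step()
```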
4. Overfitting¶
What happens: Training loss decreases, validation loss increases.
Why it happens:
Model memorizes training data instead of learning patterns.
How to fix:
- Add dropout (see the sketch after this list)
- Weight decay (L2 regularization)
- Data augmentation
- Early stopping
- Reduce model size
- Get more data
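A minimal sketch of the first two fixes together: `nn.Dropout` layers in the model and weight decay via the optimizer. The dropout probability 0.5 and decay 0.01 are illustrative defaults, not tuned values:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # randomly zero activations during training
    nn.Linear(256, 10),
)

# weight_decay penalizes large weights at every update
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

model.train()   # dropout active during training
# ... training loop ...
model.eval()    # dropout disabled for validation
```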
5. Underfitting¶
What happens: Both training and validation loss remain high.
Why it happens:
- Model too small
- Features not informative
- Bug in model or data
How to fix:
- Increase model capacity
- Check for bugs (very common!)
- Verify data pipeline
- Train longer
The Diagnostic Hierarchy¶
When training fails, check in this order:
1. Is the data correct?
- Shapes, types, values
- Labels match inputs
- No data leakage
2. Does the model run at all?
- Forward pass produces output
- Loss is computed
- Backward pass runs
3. Can the model overfit one batch?
- If not, there's a bug
- This is the most useful test!
4. Are gradients healthy?
- Not zero, not infinity
- Flowing to all layers (see the sketch after this list)
5. Is the learning rate right?
- Use LR range test
- Compare to similar models
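A minimal sketch of a gradient health check to run right after `loss.backward()` (the wording of the flags is illustrative):

```python
import torch

def check_gradients(model: torch.nn.Module) -> None:
    """Print each parameter's gradient norm and flag obvious problems."""
    for name, param in model.named_parameters():
        if param.grad is None:
            print(f"{name}: no gradient at all (detached or unused?)")
            continue
        norm = param.grad.norm().item()
        if not torch.isfinite(param.grad).all():
            status = "  <-- NaN/Inf (exploding?)"
        elif norm == 0.0:
            status = "  <-- exactly zero (vanishing or frozen?)"
        else:
            status = ""
        print(f"{name}: grad_norm = {norm:.3e}{status}")

# Run right after loss.backward():
#   check_gradients(model)
```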
The One-Batch Overfit Test¶
The most powerful debugging technique:
```python
# Take one batch and hold it fixed
batch = next(iter(dataloader))

# Train on just this batch for many steps
for step in range(1000):
    loss = train_step(model, batch)
    print(f"Step {step}: loss = {loss:.4f}")
```
If loss doesn't go to ~0, there's a bug. A neural network should be able to memorize a single batch perfectly.
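The snippet above assumes a `train_step` helper; here is a minimal sketch of one, where the loss function, optimizer, and `(inputs, labels)` batch layout are assumptions to replace with your own:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()                          # assumed classification loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # assumed optimizer

def train_step(model: nn.Module, batch) -> float:
    """One optimization step on a single (inputs, labels) batch."""
    inputs, labels = batch                                 # assumed batch layout
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```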
Common bugs caught:
- Data/label mismatch
- Loss function wrong
- Gradient not flowing
- Architecture bug
Failure Signatures¶
| Gradient Norm | Loss Trend | Diagnosis |
|---|---|---|
| → 0 | Stuck | Vanishing gradients |
| → ∞ | → NaN | Exploding gradients |
| Stable | ↓ slowly | LR too low |
| Oscillating | Oscillating | LR too high |
| Stable | Train ↓, val ↑ | Overfitting |
What's Next¶
Now that we understand failure modes, we'll learn to read loss curves systematically to diagnose problems before they become catastrophic.