Section 8.1: Why Training Fails

Reading time: 10 minutes

The Reality of Training

Here's what ML courses don't tell you: most training runs fail. The usual culprits:

  • Wrong hyperparameters
  • Bug in data pipeline
  • Numerical instability
  • Architecture mismatch
  • Just bad luck with initialization

The difference between beginners and experts isn't that experts' runs always work—it's that experts can diagnose and fix failures quickly.

The Five Failure Modes

1. Gradient Explosion

What happens: Gradients grow exponentially, loss becomes NaN.

Step 1:    loss = 2.34,  grad_norm = 1.2
Step 10:   loss = 3.56,  grad_norm = 15.4
Step 20:   loss = 45.2,  grad_norm = 234.5
Step 30:   loss = NaN,   grad_norm = Inf

Why it happens:

In a deep network, gradients are products:

\[\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_L} \cdot \frac{\partial h_L}{\partial h_{L-1}} \cdots \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}\]

If each factor is > 1, the product explodes: \(1.1^{100} \approx 10^{4}\)
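
You can watch this happen numerically. Below is a minimal sketch (not from the text): a chain of 50 random linear maps whose per-layer scale sits just above the stable value, so the gradient at the input grows by roughly a factor of 1.2 per layer.

import torch

# std = 0.15 is slightly above the norm-preserving scale 1/sqrt(64) = 0.125,
# so each layer multiplies the gradient norm by roughly 1.2 on average.
torch.manual_seed(0)
x = torch.randn(64, requires_grad=True)
h = x
for _ in range(50):
    h = (torch.randn(64, 64) * 0.15) @ h
h.sum().backward()
print(f"grad norm at input: {x.grad.norm().item():.3e}")  # ~1e4, vs ~1 for a stable chain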

How to fix:

  • Reduce learning rate
  • Add gradient clipping: grad = grad * min(1, max_norm / ||grad||) (see the sketch after this list)
  • Use LayerNorm/BatchNorm
  • Better initialization
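
In PyTorch, norm-based clipping is a one-liner via torch.nn.utils.clip_grad_norm_. Here's a sketch of where it goes in a training step; the helper name and max_norm=1.0 are illustrative, not from the text.

import torch

def clipped_train_step(model, batch, optimizer, loss_fn, max_norm=1.0):
    inputs, labels = batch
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    # Rescales all gradients in place so their global norm is at most max_norm
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()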

2. Gradient Vanishing

What happens: Gradients shrink to zero, learning stops.

Step 1:     grad_norm = 0.1
Step 100:   grad_norm = 0.001
Step 1000:  grad_norm = 0.0000001
Loss: 2.34 → 2.33 → 2.33 → 2.33 → 2.33 (stuck!)

Why it happens:

Same product, but factors < 1: \(0.9^{100} \approx 10^{-5}\)

Also: sigmoid/tanh saturate at extremes, giving gradients near zero.
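
You can check this directly: the sigmoid's derivative is \(\sigma(x)(1 - \sigma(x))\), which peaks at 0.25 and collapses once |x| is large. A minimal sketch:

import torch

for x in [0.0, 2.0, 5.0, 10.0]:
    s = torch.sigmoid(torch.tensor(x)).item()
    # derivative of sigmoid: s * (1 - s)
    print(f"x = {x:5.1f}   d(sigmoid)/dx = {s * (1 - s):.2e}")

# x =   0.0   d(sigmoid)/dx = 2.50e-01
# x =  10.0   d(sigmoid)/dx = 4.54e-05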

How to fix:

  • Use ReLU-family activations
  • Add residual connections: \(h_{l+1} = h_l + f(h_l)\) (sketched after this list)
  • Use LSTM/GRU for sequences
  • Careful initialization (Xavier/He)
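
As a sketch of the residual fix (the module is illustrative): because \(h_{l+1} = h_l + f(h_l)\), the gradient of \(h_{l+1}\) with respect to \(h_l\) is \(I + f'(h_l)\), so even when \(f\)'s gradient shrinks toward zero, the identity term keeps gradient flowing.

import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.f = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, h):
        # The "+ h" identity path carries gradient straight through this block
        return h + self.f(h)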

3. Loss Plateau

What happens: Loss stops decreasing but hasn't converged.

Step 1000:  loss = 1.45
Step 2000:  loss = 1.44
Step 3000:  loss = 1.44
Step 4000:  loss = 1.44  (stuck!)

Why it happens:

  • Learning rate too low
  • Stuck in a local minimum or saddle point
  • Model capacity reached
  • Data not shuffled (seeing same patterns)

How to fix:

  • Increase learning rate
  • Use learning rate warmup + decay (see the schedule sketch after this list)
  • Try different optimizer (Adam often escapes plateaus)
  • Verify data shuffling
  • Increase model capacity
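
Here's a sketch of warmup + decay using PyTorch's LambdaLR scheduler. The step counts are illustrative values, and the stand-in optimizer exists only to make the snippet self-contained.

import math
import torch

warmup_steps, total_steps = 500, 10_000  # illustrative, not recommendations

def lr_lambda(step):
    if step < warmup_steps:
        return step / warmup_steps                    # linear warmup: 0 -> 1
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * (1 + math.cos(math.pi * progress))   # cosine decay: 1 -> 0

optimizer = torch.optim.SGD([torch.zeros(1, requires_grad=True)], lr=0.1)  # stand-in
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
# call scheduler.step() once after every optimizer.step()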

4. Overfitting

What happens: Training loss decreases, validation loss increases.

Step     Train Loss    Val Loss
1000     1.5           1.6
2000     1.2           1.5
3000     0.8           1.7    ← diverging!
4000     0.4           2.1

Why it happens:

Model memorizes training data instead of learning patterns.

How to fix:

  • Add dropout (see the sketch after this list)
  • Weight decay (L2 regularization)
  • Data augmentation
  • Early stopping
  • Reduce model size
  • Get more data
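
A sketch combining the first two fixes; p=0.1 and weight_decay=1e-4 are illustrative starting points, not tuned values.

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.1),  # randomly zeroes 10% of activations during training
    nn.Linear(128, 10),
)
# AdamW applies decoupled weight decay, the L2-style regularization above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)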

5. Underfitting

What happens: Both training and validation loss remain high.

Step     Train Loss    Val Loss
1000     2.1           2.2
5000     2.0           2.1
10000    2.0           2.1    (both stuck high)

Why it happens:

  • Model too small
  • Features not informative
  • Bug in model or data

How to fix:

  • Increase model capacity
  • Check for bugs (very common!)
  • Verify data pipeline (a quick check is sketched after this list)
  • Train longer
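
A minimal pipeline check, assuming batches come out of a dataloader as (inputs, labels) pairs. Wrong shapes, dtypes, or label ranges show up immediately:

inputs, labels = next(iter(dataloader))
print("inputs:", inputs.shape, inputs.dtype, f"mean = {inputs.float().mean().item():.3f}")
print("labels:", labels.shape, labels.dtype,
      f"range = [{labels.min().item()}, {labels.max().item()}]")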

The Diagnostic Hierarchy

When training fails, check in this order:

1. Is the data correct?
   - Shapes, types, values
   - Labels match inputs
   - No data leakage

2. Does the model run at all?
   - Forward pass produces output
   - Loss is computed
   - Backward pass runs

3. Can the model overfit one batch?
   - If not, there's a bug
   - This is the most useful test!

4. Are gradients healthy?
   - Not zero, not infinity
   - Flowing to all layers (see the sketch after this list)

5. Is the learning rate right?
   - Use LR range test
   - Compare to similar models
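
For step 4, a sketch that prints per-parameter gradient norms right after loss.backward(), assuming a model is in scope. Zeros or huge values point straight at the failing layers:

for name, p in model.named_parameters():
    if p.grad is None:
        print(f"{name}: NO GRADIENT (not connected to the loss?)")
    else:
        print(f"{name}: grad norm = {p.grad.norm().item():.3e}")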

The One-Batch Overfit Test

The most powerful debugging technique, sketched here assuming a model, optimizer, loss_fn, and dataloader already exist:

# Take one fixed batch (assuming batches are (inputs, labels) pairs)
inputs, labels = next(iter(dataloader))

# Train on just this batch for many steps
for step in range(1000):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)
    loss.backward()
    optimizer.step()
    if step % 100 == 0:
        print(f"Step {step}: loss = {loss.item():.4f}")

If loss doesn't go to ~0, there's a bug. A neural network should be able to memorize a single batch perfectly.

Common bugs caught:

  • Data/label mismatch
  • Loss function wrong
  • Gradient not flowing
  • Architecture bug

Failure Signatures

Gradient Norm    Loss Trend        Diagnosis
→ 0              Stuck             Vanishing gradients
→ ∞              → NaN             Exploding gradients
Stable           ↓ slowly          LR too low
Oscillating      Oscillating       LR too high
Stable           ↓ then ↑ (val)    Overfitting
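
To match a run against these signatures, log both signals every step. A sketch for inside the training loop, right after loss.backward():

import torch

grad_norm = torch.norm(torch.stack(
    [p.grad.norm() for p in model.parameters() if p.grad is not None]))
print(f"step {step}: loss = {loss.item():.4f}, grad_norm = {grad_norm.item():.3e}")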

What's Next

Now that we understand failure modes, we'll learn to read loss curves systematically to diagnose problems before they become catastrophic.