Section 8.6: Debugging Strategies¶
Reading time: 12 minutes
The Reality of ML Debugging¶
Most training runs fail. The difference between beginners and experts isn't success rate—it's debugging speed.
This section provides systematic approaches to diagnose and fix training problems.
The Debugging Hierarchy¶
When training fails, work through these levels in order:

1. Data
2. Model
3. One-batch overfit test
4. Training dynamics
5. Hyperparameters

Most bugs are in the data. Hyperparameters are almost never the first problem.
Level 1: Data Debugging¶
Check first. Always.
1.1 Data Shapes¶
```python
def debug_data(dataloader):
    batch = next(iter(dataloader))
    x, y = batch
    print(f"Input shape: {x.shape}")
    print(f"Target shape: {y.shape}")
    print(f"Input dtype: {x.dtype}")
    print(f"Target dtype: {y.dtype}")
    print(f"Input range: [{x.min():.2f}, {x.max():.2f}]")
    print(f"Target range: [{y.min():.2f}, {y.max():.2f}]")
```
Common issues:
- Shapes don't match expected dimensions
- Wrong dtype (int when should be float)
- Values not normalized (raw pixels 0-255 instead of 0-1)
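For the normalization issue in particular, here is a minimal sketch of a fix. The `> 1.0` heuristic is an assumption that only applies to 8-bit image data; adapt it to your inputs:

```python
import numpy as np

def normalize_batch(x):
    """Cast to float32 and scale raw 0-255 pixel values into [0, 1]."""
    x = x.astype(np.float32)
    if x.max() > 1.0:  # assumption: values above 1 mean raw 8-bit pixels
        x = x / 255.0
    return x
```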
1.2 Data/Label Alignment¶
```python
def verify_alignment(x, y, show_n=3):
    """Visually verify inputs match targets."""
    for i in range(min(show_n, len(x))):
        print(f"Sample {i}:")
        print(f"  Input: {x[i][:20]}...")  # First 20 values
        print(f"  Target: {y[i]}")
```
Common issue: Shuffled inputs but not targets, or vice versa.
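The standard fix is to draw one permutation and apply it to both arrays. A minimal sketch, assuming `x` and `y` are NumPy arrays of equal length:

```python
import numpy as np

def shuffle_together(x, y, seed=0):
    """Shuffle inputs and targets with a single shared permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    return x[perm], y[perm]  # same index order applied to both arrays
```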
1.3 Data Leakage¶
```python
def check_leakage(train_data, val_data):
    """Check if validation data appears in training set."""
    train_set = set(map(tuple, train_data))
    val_set = set(map(tuple, val_data))
    overlap = train_set & val_set
    if overlap:
        print(f"LEAKAGE: {len(overlap)} samples in both sets!")
```
Level 2: Model Debugging¶
2.1 Forward Pass Verification¶
```python
import numpy as np

def debug_forward(model, input_shape):
    """Verify forward pass works and produces expected output."""
    x = np.random.randn(*input_shape).astype(np.float32)
    y = model.forward(x)
    print(f"Input shape: {x.shape}")
    print(f"Output shape: {y.shape}")
    print(f"Output range: [{y.min():.4f}, {y.max():.4f}]")
    print(f"Output contains NaN: {np.any(np.isnan(y))}")
```
2.2 Parameter Count¶
```python
def count_parameters(model):
    """Count trainable parameters."""
    total = 0
    for name, param in model.parameters():
        n = np.prod(param.shape)
        print(f"{name}: {param.shape} = {n:,} params")
        total += n
    print(f"Total: {total:,} parameters")
```
Common issue: Model too small (can't learn) or too large (overfits instantly).
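One coarse sanity check is to compare the parameter count to the number of training samples. A minimal sketch; the 10x thresholds are rule-of-thumb assumptions, not hard limits:

```python
def capacity_warning(n_params, n_samples):
    """Rough capacity sanity check; thresholds are heuristic assumptions."""
    if n_params > 10 * n_samples:
        print("Warning: far more parameters than samples; expect rapid overfitting")
    elif 10 * n_params < n_samples:
        print("Note: model may be small relative to the dataset; watch for underfitting")
```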
2.3 Gradient Flow Check¶
```python
def check_gradient_flow(model, x, y):
    """Verify gradients flow to all layers."""
    loss = model.compute_loss(x, y)
    gradients = model.backward()
    for name, grad in zip(model.layer_names, gradients):
        if grad is None:
            print(f"{name}: NO GRADIENT!")
        elif np.allclose(grad, 0):
            print(f"{name}: ZERO GRADIENT!")
        else:
            print(f"{name}: grad_norm = {np.linalg.norm(grad):.4f}")
```
Level 3: The One-Batch Overfit Test¶
The single most useful debugging technique.
```python
def one_batch_overfit_test(model, dataloader, steps=1000):
    """
    Can the model memorize a single batch?
    If not, there's a bug. A neural network should be able
    to perfectly memorize one batch.
    """
    batch = next(iter(dataloader))
    for step in range(steps):
        loss = model.train_step(batch)
        if step % 100 == 0:
            print(f"Step {step}: loss = {loss:.6f}")
    if loss > 0.1:
        print("FAILED: Model cannot overfit single batch!")
        print("Likely bugs: loss function, gradient flow, architecture")
    else:
        print("PASSED: Model can memorize one batch")
```
What Failures Mean¶
| Final Loss | Meaning |
|---|---|
| > 1.0 | Major bug somewhere |
| 0.01 - 1.0 | Learning, but possibly too slowly |
| < 0.01 | Working correctly |
Level 4: Training Debugging¶
4.1 Monitor Everything¶
```python
def training_loop_debug(model, dataloader, steps):
    """Training loop with comprehensive monitoring."""
    grads = []  # keep the previous step's grads for the NaN report
    for step, (x, y) in enumerate(dataloader):
        if step >= steps:
            break
        # Forward
        loss = model.forward_loss(x, y)
        # Check for NaN (report gradient norms from the last good step)
        if np.isnan(loss):
            print(f"NaN at step {step}!")
            print("Last gradient norms:", [np.linalg.norm(g) for g in grads])
            break
        # Backward
        grads = model.backward()
        # Monitor gradients
        grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
        if grad_norm > 100:
            print(f"Warning: grad_norm = {grad_norm:.2f} at step {step}")
        # Update
        model.update(grads, lr=1e-3)
        if step % 100 == 0:
            print(f"Step {step}: loss={loss:.4f}, grad_norm={grad_norm:.4f}")
```
4.2 Gradient Checking¶
Verify gradients are computed correctly:
```python
def numerical_gradient_check(model, x, y, epsilon=1e-5):
    """Compare analytical gradients to a numerical approximation.

    Run this in float64: float32 rounding noise can exceed the 1% threshold.
    """
    # Get analytical gradients
    loss = model.forward_loss(x, y)
    analytical_grads = model.backward()
    # Compare each parameter's gradient to a central-difference estimate
    for (name, param), analytical in zip(model.parameters(), analytical_grads):
        param_flat = param.reshape(-1)  # a view, so edits reach the real weights
        numerical_grad = np.zeros_like(param_flat)
        for i in range(len(param_flat)):
            # f(x + epsilon)
            param_flat[i] += epsilon
            loss_plus = model.forward_loss(x, y)
            # f(x - epsilon)
            param_flat[i] -= 2 * epsilon
            loss_minus = model.forward_loss(x, y)
            # Restore the original value
            param_flat[i] += epsilon
            # Central-difference numerical gradient
            numerical_grad[i] = (loss_plus - loss_minus) / (2 * epsilon)
        # Compare
        analytical_flat = analytical.reshape(-1)
        diff = np.abs(analytical_flat - numerical_grad)
        relative_diff = diff / (np.abs(analytical_flat) + 1e-8)
        if np.max(relative_diff) > 0.01:
            print(f"{name}: gradient mismatch! Max relative diff: {np.max(relative_diff):.4e}")
```
Level 5: Hyperparameter Debugging¶
Only after data, model, and training are verified correct.
5.1 Learning Rate¶
```python
# Too high: oscillation or explosion
lr_high_symptoms = ['loss oscillates', 'loss → NaN', 'grad_norm spikes']

# Too low: very slow progress
lr_low_symptoms = ['loss barely moves', 'training takes forever']

# Just right: smooth decrease
lr_good_symptoms = ['loss decreases steadily', 'grad_norm stable']
```
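The checklist below calls for an LR range test. A minimal sketch: sweep learning rates on a log grid, train briefly at each, and keep the largest rate whose loss still decreases smoothly. The grid, step count, and the `lr` keyword on `train_step` are all assumptions about your training API:

```python
import numpy as np

def lr_range_test(make_model, dataloader, lrs=np.logspace(-5, -1, 9), steps=50):
    """Train a fresh model briefly at each candidate LR and report final loss."""
    batch = next(iter(dataloader))
    for lr in lrs:
        model = make_model()  # fresh weights for a fair comparison
        for _ in range(steps):
            loss = model.train_step(batch, lr=lr)  # assumes train_step takes lr
        print(f"lr={lr:.0e}: final loss={loss:.4f}")
```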
5.2 Batch Size¶
| Batch Too Small | Batch Too Large |
|---|---|
| High variance in gradients | Smoother gradients, fewer updates per epoch |
| Noise may help escape local minima | Might get stuck |
| Slower wall-clock training | Usually needs a higher LR |
5.3 Model Capacity¶
Underfitting: Both train and val loss high → increase capacity
Overfitting: Train low, val high → decrease capacity or add regularization
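These two cases can be turned into a quick automated check. A minimal sketch; the `high` and `gap` thresholds are assumptions to adapt to your loss scale:

```python
def diagnose_fit(train_loss, val_loss, high=1.0, gap=0.3):
    """Classify under/overfitting from final train and validation loss."""
    if train_loss > high and val_loss > high:
        return "underfitting: increase capacity or train longer"
    if val_loss - train_loss > gap:
        return "overfitting: decrease capacity or add regularization"
    return "healthy"
```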
The Complete Debugging Checklist¶
**Stage 1: Data**

- [ ] Input shapes correct
- [ ] Target shapes correct
- [ ] Data types correct
- [ ] Values normalized
- [ ] Data/label alignment verified
- [ ] No data leakage

**Stage 2: Model**

- [ ] Forward pass produces output
- [ ] Output shape matches targets
- [ ] No NaN in outputs
- [ ] Parameter count reasonable
- [ ] Gradients flow to all layers

**Stage 3: Single Batch Test**

- [ ] Model can overfit one batch to loss < 0.01

**Stage 4: Training**

- [ ] Loss decreases initially
- [ ] No NaN during training
- [ ] Gradient norms stable
- [ ] No gradient explosion/vanishing

**Stage 5: Hyperparameters (only if above passes)**

- [ ] LR range test performed
- [ ] Batch size appropriate
- [ ] Model capacity appropriate
Common Bugs and Fixes¶
| Bug | Symptom | Fix |
|---|---|---|
| Data not shuffled | Loss plateaus early | Shuffle each epoch |
| Wrong loss function | Loss doesn't match task | Match loss to task |
| Gradients not zeroed | Gradients accumulate | Zero grads before backward |
| Model in eval mode | No learning | Set train mode |
| Wrong axis in softmax | Probabilities wrong | Check axis parameter |
| Integer division | LR becomes 0 | Use float division |
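The integer-division bug in the last row is easy to reproduce. A minimal sketch of the failure and the fix, using a hypothetical step-decay schedule:

```python
# Buggy: integer division silently floors a decayed LR to zero.
base_lr, step, decay_every = 1, 500, 100
buggy_lr = base_lr // (1 + step // decay_every)  # 1 // 6 == 0: training stalls
fixed_lr = base_lr / (1 + step / decay_every)    # 0.1667: LR decays smoothly
print(buggy_lr, fixed_lr)
```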
Debug Output Example¶
========================================
Training Debug Report
========================================
Data Check:
Input shape: (32, 100) ✓
Target shape: (32,) ✓
Input range: [-1.2, 1.4] ✓
Labels unique: [0, 1, 2] ✓
Model Check:
Parameters: 15,234 ✓
Output shape: (32, 3) ✓
Gradient flow: all layers ✓
Single Batch Test:
Step 0: loss=1.098
Step 100: loss=0.245
Step 200: loss=0.012
PASSED ✓
Training (first 500 steps):
Loss: 1.098 → 0.456 ✓
Grad norm: stable (0.1-0.5) ✓
Val loss: 0.523 ✓
Status: HEALTHY
========================================
Summary¶
- Start with data: most bugs live there
- Run the one-batch test: it catches most model bugs
- Monitor everything: detect issues early
- Use systematic checklists: don't skip steps
- Tune hyperparameters last: only after everything else works
Key insight: Debugging is systematic, not random. Follow the hierarchy.
Next: We'll implement all these diagnostic tools in Python.