Stage 4: Common Mistakes¶
Mistake 1: Learning Rate Too High¶
Symptom: Loss oscillates wildly or explodes to NaN
Example:
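An illustrative trace (the numbers are made up, but the pattern is typical): the loss bounces around and then blows up.
Step 0: loss = 2.34
Step 100: loss = 5.87
Step 200: loss = 1.92
Step 300: loss = 48.13
Step 400: loss = NaN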
The fix: Reduce learning rate by 10x, use LR finder
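A minimal sketch of the LR-finder idea on a toy least-squares problem (the model, data, sweep range, and blow-up threshold here are placeholders; in practice you run the sweep on your real model for a few hundred steps):
import numpy as np

# Toy problem standing in for a real model.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 10))
y = X @ rng.normal(size=10)

def loss_and_grad(w):
    err = X @ w - y
    return 0.5 * np.mean(err ** 2), X.T @ err / len(y)

w = np.zeros(10)
lrs = np.geomspace(1e-6, 1.0, num=100)   # sweep the LR exponentially upward
losses = []
for lr in lrs:
    loss, g = loss_and_grad(w)
    if not np.isfinite(loss) or (losses and loss > 10 * losses[0]):
        break                            # loss blew up: end of the usable range
    losses.append(loss)
    w -= lr * g

# Common heuristic: pick an LR roughly 10x below the one with the lowest loss.
best_lr = lrs[int(np.argmin(losses))]
print(f"suggested starting lr ~ {best_lr / 10:.1e}")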
Mistake 2: Learning Rate Too Low¶
Symptom: Loss decreases very slowly, training takes forever
Example:
Step 0: loss = 2.34
Step 1000: loss = 2.33
Step 2000: loss = 2.32
# At this rate, convergence will take years
The fix: Increase learning rate, use warmup + higher peak LR
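A minimal sketch of linear warmup to a higher peak LR (the peak value and warmup length are placeholders, not recommendations):
def warmup_lr(step, warmup_steps=2000, peak_lr=3e-4):
    # Ramp linearly from ~0 to peak_lr over the first warmup_steps,
    # then hold at peak_lr (in practice you also decay it later; see Mistake 8).
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    return peak_lr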
Mistake 3: Forgetting Bias Correction in Adam¶
Wrong implementation:
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
param -= lr * m / (np.sqrt(v) + eps) # No bias correction!
Problem: m and v start at 0, so the moment estimates in the early steps are heavily biased toward zero.
The fix: Apply bias correction
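The standard bias-corrected update, reusing the same variables as the snippet above plus t, the 1-indexed step count:
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * g**2
m_hat = m / (1 - beta1**t)          # undo the bias toward 0 from zero-initialization
v_hat = v / (1 - beta2**t)
param -= lr * m_hat / (np.sqrt(v_hat) + eps)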
Mistake 4: Not Scaling Learning Rate with Batch Size¶
Symptom: Different batch sizes give very different results
The intuition: Larger batches = more accurate gradients = can take larger steps.
The fix: Linear scaling rule
# Base LR for batch_size 32
base_lr = 0.001
base_batch_size = 32
# Scaled LR for larger batch
actual_batch_size = 256
lr = base_lr * (actual_batch_size / base_batch_size)
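With these numbers, the scaled learning rate works out to 0.001 × (256 / 32) = 0.008. At large batch sizes the linear scaling rule is commonly paired with a warmup period, since the first steps at the scaled LR can otherwise be unstable.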
Mistake 5: Gradient Explosion Without Clipping¶
Symptom: Sudden loss spike, then NaN
Example:
Step 999: loss = 0.45, grad_norm = 2.3
Step 1000: loss = 0.43, grad_norm = 156.7 # Exploded!
Step 1001: loss = NaN
The fix: Always use gradient clipping
max_grad_norm = 1.0
grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
if grad_norm > max_grad_norm:
    scale = max_grad_norm / grad_norm
    grads = [g * scale for g in grads]
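If you are training with PyTorch, the same global-norm clipping is available as a built-in utility; call it after backward() and before the optimizer step (model here is whatever nn.Module you are training):
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)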
Mistake 6: Wrong Momentum Initialization¶
Wrong: re-initializing the velocity buffer inside the training loop (or starting it from something other than zeros), so the momentum term never accumulates across steps.
The fix: Initialize velocity to zeros
velocity = np.zeros_like(params)   # initialize once, before the training loop
# Then in the training loop:
velocity = velocity * beta + grad
params -= lr * velocity            # step with the accumulated velocity
Mistake 7: Warmup Too Short¶
Symptom: Training is unstable in the first few hundred steps
Wrong: warming up for only a handful of steps on a run that lasts hundreds of thousands of steps.
The fix: Warmup should be 1-10% of training
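A rule-of-thumb sketch (the 3% value is just one example within the 1-10% range):
total_steps = 100_000
warmup_steps = int(0.03 * total_steps)   # 3,000 steps, i.e. 3% of training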
Mistake 8: Not Decaying Learning Rate¶
Symptom: Loss plateaus, model oscillates around minimum
Wrong: keeping the learning rate fixed at its initial value for the entire run.
The fix: Use a schedule
from torch.optim.lr_scheduler import CosineAnnealingLR

scheduler = CosineAnnealingLR(optimizer, T_max=total_steps)
for step in range(total_steps):
    optimizer.step()
    scheduler.step()   # decay the LR over time
Mistake 9: Weight Decay on Wrong Parameters¶
Problem: Applying weight decay to bias terms or LayerNorm parameters, which are usually excluded from regularization.
Wrong: passing a single global weight_decay for all of model.parameters(), so biases and norm weights get decayed along with the weight matrices.
The fix: Exclude certain parameters
from torch.optim import AdamW

decay_params = [p for n, p in model.named_parameters()
                if 'bias' not in n and 'norm' not in n]
no_decay_params = [p for n, p in model.named_parameters()
                   if 'bias' in n or 'norm' in n]

# Only the weight matrices are regularized; biases and norm parameters are not.
optimizer = AdamW([
    {'params': decay_params, 'weight_decay': 0.01},
    {'params': no_decay_params, 'weight_decay': 0.0},
])
Mistake 10: Inconsistent Random Seeds¶
Symptom: Can't reproduce results
Wrong: seeding only one source of randomness (for example, NumPy) while Python's random module and the framework RNGs stay unseeded.
The fix: Set all seeds
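A typical helper for a NumPy + PyTorch stack (a sketch; depending on your setup you may also need to seed dataloader workers and pin cuDNN to deterministic kernels):
import random

import numpy as np
import torch

def set_seed(seed: int = 42):
    random.seed(seed)                  # Python's built-in RNG
    np.random.seed(seed)               # NumPy
    torch.manual_seed(seed)            # PyTorch (CPU and CUDA generators)
    torch.cuda.manual_seed_all(seed)   # all CUDA devices explicitly
    # Optional, for stricter determinism at some speed cost:
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False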