Troubleshooting Guide¶
A unified reference for diagnosing and fixing common problems
This guide consolidates the most common issues across all stages. When something goes wrong, start here.
Quick Diagnosis¶
What's your symptom?¶
| Symptom | Likely Stage | Jump to |
|---|---|---|
| Loss is NaN or Inf | Training | Numerical Issues |
| Loss not decreasing | Training | Loss Plateau |
| Loss oscillating wildly | Optimization | Learning Rate Issues |
| Model outputs garbage | Multiple | Output Quality |
| Out of memory | Architecture | Memory Issues |
| Gradients are zero | Backprop | Gradient Issues |
| Training is very slow | Multiple | Performance Issues |
| Test error much higher than train | Training | Overfitting |
Numerical Issues¶
Loss becomes NaN¶
Symptoms: Loss suddenly jumps to nan or inf
Common causes:
1. Learning rate too high → gradients explode
2. Division by zero (e.g., in normalization)
3. Log of zero or negative number
4. Overflow in softmax with large logits
Diagnosis steps:
```python
# Check for NaN in gradients
for name, param in model.named_parameters():  # or however your model exposes (name, param) pairs
    if np.isnan(param.grad).any():
        print(f"NaN gradient in {name}")

# Check for extreme values
print(f"Max logit: {logits.max()}, Min: {logits.min()}")
```
Fixes:
1. Reduce learning rate by 10x
2. Add gradient clipping: clip_grad_norm(params, max_norm=1.0)
3. Use numerically stable implementations:
x / (std + 1e-8)
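For reference, a minimal NumPy sketch of the stable patterns above: the subtract-max softmax and an epsilon-guarded normalization. The function names are illustrative, not part of any particular framework.
```python
import numpy as np

def stable_softmax(logits, axis=-1):
    # Subtract the row-wise max so exp() never overflows; the result is unchanged.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def safe_normalize(x, eps=1e-8, axis=-1):
    # Guard the denominator so zero-variance inputs don't produce inf/NaN.
    mean = x.mean(axis=axis, keepdims=True)
    std = x.std(axis=axis, keepdims=True)
    return (x - mean) / (std + eps)
```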
See also: Stage 8: Training Dynamics
Loss Plateau¶
Loss stops decreasing¶
Symptoms: Loss decreases initially, then flatlines
Common causes:
1. Learning rate too low
2. Stuck in local minimum
3. Vanishing gradients
4. Model capacity too small
5. Data exhausted (need more data)
Diagnosis steps:
```python
# Check gradient magnitudes
grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
print(f"Gradient norm: {grad_norm}")  # Should be ~0.1-10

# Check if gradients are vanishing
if grad_norm < 1e-7:
    print("Vanishing gradients!")
```
Fixes:
1. Try a higher learning rate (use the LR finder)
2. Add learning rate warmup (see the sketch below)
3. Check for missing residual connections
4. Ensure proper initialization
5. Add more model capacity (layers/dimensions)
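If you add warmup (fix 2), here is a minimal sketch of a schedule with linear warmup followed by cosine decay; the function name and default values are illustrative assumptions, not a prescribed recipe.
```python
import numpy as np

def lr_at_step(step, base_lr=1e-4, warmup_steps=1000, total_steps=100_000):
    # Linear warmup from 0 to base_lr, then cosine decay back to 0.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + np.cos(np.pi * progress))
```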
See also: Stage 8: Loss Curve Analysis
Learning Rate Issues¶
Loss oscillates wildly¶
Symptoms: Loss jumps up and down without trending downward
Cause: Learning rate too high
Fix: Reduce by 2-10x until stable
Loss decreases very slowly¶
Symptoms: Loss barely moves after thousands of steps
Cause: Learning rate too low
Fix: Increase by 2-10x, or use LR finder
Finding the right learning rate¶
```python
# LR Range Test
lrs = np.logspace(-7, 0, 100)  # 1e-7 to 1
losses = []
for lr in lrs:
    loss = train_one_step(lr)
    losses.append(loss)
    if loss > 4 * losses[0]:  # Explosion
        break

# Plot losses vs. lrs and find the steepest descent
# A good LR is ~10x lower than where loss starts rising
```
See also: Stage 8: Learning Rate Finding
Gradient Issues¶
Gradients are all zero¶
Symptoms: Parameters don't update, loss constant
Common causes:
1. Forgot zero_grad() (gradients accumulating incorrectly)
2. Wrong tensor (detached from graph)
3. Dead ReLU neurons
4. Disconnected computation graph
Diagnosis:
```python
# Check if any gradients are non-zero
has_grad = any(np.any(p.grad != 0) for p in params)
print(f"Has gradients: {has_grad}")

# Check for dead neurons
activations = layer.forward(x)
dead_fraction = (activations == 0).mean()
print(f"Dead neurons: {dead_fraction:.1%}")
```
Fixes:
1. Call zero_grad() before each backward pass
2. Check computation graph is connected
3. Use LeakyReLU or GELU instead of ReLU
4. Better initialization (He init for ReLU)
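For fixes 3 and 4, here is a minimal NumPy sketch of He initialization and a LeakyReLU activation; the helper names are illustrative and assume a plain weight-matrix layer.
```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    # He/Kaiming init: std sqrt(2/fan_in) keeps activation variance stable under ReLU.
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def leaky_relu(x, alpha=0.01):
    # Small negative slope so "dead" units still receive some gradient.
    return np.where(x > 0, x, alpha * x)
```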
Vanishing gradients¶
Symptoms: Early layers have tiny gradients, don't learn
Common causes:
1. Deep network without residual connections
2. Sigmoid/tanh saturation
3. Poor initialization
Fixes:
1. Add residual connections: output = input + layer(input)
2. Use pre-norm architecture
3. Use proper initialization (Xavier/He)
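A minimal sketch of the residual + pre-norm pattern; layer_norm and sublayer stand in for whatever normalization and sublayer (attention, MLP) your model uses.
```python
def pre_norm_block(x, sublayer, layer_norm):
    # Pre-norm residual: normalize the sublayer's input, then add the unmodified input back.
    # The identity path gives gradients a direct route to earlier layers.
    return x + sublayer(layer_norm(x))
```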
Exploding gradients¶
Symptoms: Gradient norms > 100, loss spikes
Fixes:
1. Gradient clipping (always use for transformers!); see the sketch below
2. Lower learning rate
3. Better initialization
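A minimal NumPy sketch of clipping by global norm; the name mirrors the clip_grad_norm call mentioned earlier, but this implementation (operating on a list of gradient arrays) is illustrative.
```python
import numpy as np

def clip_grad_norm(grads, max_norm=1.0):
    # Scale all gradients down together if their global L2 norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads, total_norm
```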
See also: Stage 2: Backpropagation, Stage 8: Gradient Statistics
Output Quality Issues¶
Model outputs repetitive text¶
Symptoms: Generated text repeats phrases or loops
Common causes:
1. Temperature too low
2. No sampling (pure argmax)
3. Training collapse
Fixes:
1. Increase temperature (try 0.7-1.0)
2. Add nucleus (top-p) sampling (see the sketch below)
3. Add a repetition penalty
4. Check training wasn't corrupted
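A minimal NumPy sketch of temperature plus nucleus (top-p) sampling over a single logits vector; the function is illustrative and omits the repetition penalty.
```python
import numpy as np

def sample_top_p(logits, temperature=0.8, top_p=0.9, rng=np.random.default_rng()):
    # Temperature: <1 sharpens the distribution, >1 flattens it.
    logits = logits / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Nucleus sampling: keep the smallest set of tokens whose cumulative mass reaches top_p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, top_p) + 1
    keep = order[:cutoff]
    kept_probs = probs[keep] / probs[keep].sum()
    return rng.choice(keep, p=kept_probs)
```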
Model outputs nonsense¶
Symptoms: Generated text is incoherent
Common causes:
1. Undertrained model
2. Tokenization mismatch
3. Wrong model weights loaded
4. Corrupted embeddings
Diagnosis:
```python
# Check loss is reasonable
# Random = log(vocab_size), e.g., ~10.8 for vocab=50000
print(f"Loss: {loss} (random would be {np.log(vocab_size):.1f})")

# Verify tokenization roundtrips
text = "Hello world"
assert tokenizer.decode(tokenizer.encode(text)) == text
```
See also: Stage 7: Tokenization
Memory Issues¶
Out of memory (OOM)¶
Symptoms: CUDA/system out of memory error
Common causes:
1. Batch size too large
2. Sequence length too long
3. Model too large
4. Accumulating tensors in memory
Fixes:
1. Reduce batch size
2. Use gradient accumulation for an effectively larger batch (see the sketch below)
3. Reduce sequence length
4. Use gradient checkpointing
5. Use mixed precision (fp16)
6. Use PEFT methods (LoRA) instead of full fine-tuning
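A minimal sketch of gradient accumulation (fix 2), written with PyTorch-style optimizer/backward calls as an illustrative assumption; the point is that the optimizer steps only every accum_steps micro-batches.
```python
accum_steps = 4  # effective batch size = micro-batch size * accum_steps

optimizer.zero_grad()
for step, batch in enumerate(data_loader):
    loss = model(batch) / accum_steps  # scale so the summed gradient matches one big batch
    loss.backward()                    # gradients accumulate across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```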
Estimation:
```python
# Rough memory estimate (training, fp32):
# 4 bytes/param each for weights, gradients, and the two Adam moment buffers
params = model.count_parameters()
memory_gb = params * 4 * 4 / 1e9
print(f"Estimated memory: {memory_gb:.1f} GB")
```
See also: Stage 9: PEFT
Overfitting¶
Validation loss increasing while training loss decreases¶
Symptoms: Model performs well on training data but poorly on validation
Diagnosis:
```python
# Track both losses
print(f"Train: {train_loss:.4f}, Val: {val_loss:.4f}")
# If gap > 0.3-0.5, likely overfitting
```
Fixes:
1. More training data
2. Data augmentation
3. Dropout (0.1-0.3)
4. Weight decay (0.01-0.1)
5. Early stopping (see the sketch below)
6. Reduce model capacity
7. Use PEFT instead of full fine-tuning
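A minimal early-stopping sketch (fix 5); train_one_epoch, evaluate, save_checkpoint, and max_epochs are illustrative placeholders for your own training loop.
```python
best_val = float('inf')
patience, bad_evals = 5, 0  # tolerate this many evaluations without improvement

for epoch in range(max_epochs):
    train_one_epoch(model)
    val_loss = evaluate(model)
    if val_loss < best_val:
        best_val, bad_evals = val_loss, 0
        save_checkpoint(model)  # keep the best model seen so far
    else:
        bad_evals += 1
        if bad_evals >= patience:
            print("Early stopping: validation loss stopped improving")
            break
```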
See also: Stage 8: Debugging Strategies
Performance Issues¶
Training is very slow¶
Common causes:
1. Not using GPU (if available)
2. Data loading bottleneck
3. Too much logging/checkpointing
4. Inefficient operations
Diagnosis:
```python
import time

data_time = fwd_time = bwd_time = 0.0
for i in range(100):
    t0 = time.time()
    batch = next(data_loader)           # Data loading time
    t1 = time.time()
    output = model(batch)               # Forward time
    loss = compute_loss(output, batch)  # whatever loss your training loop uses
    t2 = time.time()
    loss.backward()                     # Backward time
    t3 = time.time()
    data_time += t1 - t0
    fwd_time += t2 - t1
    bwd_time += t3 - t2

print(f"Data: {data_time:.3f}s, Forward: {fwd_time:.3f}s, Backward: {bwd_time:.3f}s")
```
Fixes:
1. Use GPU if available
2. Increase data loader workers
3. Prefetch data
4. Reduce logging frequency
5. Use compiled/fused operations
Tokenization Issues¶
Many <UNK> tokens¶
Symptoms: Input has many unknown tokens
Cause: Vocabulary doesn't cover input text
Fixes:
1. Retrain the tokenizer on representative data
2. Use byte-level tokenization, which handles any input (see the sketch below)
3. Increase vocabulary size
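As an illustration of why byte-level tokenization (fix 2) always roundtrips, here is a minimal sketch that uses raw UTF-8 bytes as the vocabulary, with no merges.
```python
def byte_encode(text):
    # Every possible input maps onto the 256 byte values, so there is no <UNK>.
    return list(text.encode("utf-8"))

def byte_decode(token_ids):
    return bytes(token_ids).decode("utf-8", errors="replace")

assert byte_decode(byte_encode("héllo 世界")) == "héllo 世界"
```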
Tokenization doesn't roundtrip¶
Symptom: decode(encode(text)) != text
Common causes:
1. Normalization differences (Unicode, case)
2. Whitespace handling
3. Special tokens
Fixes:
1. Normalize consistently
2. Check whitespace handling
3. Handle special tokens explicitly
See also: Stage 7: Common Mistakes
Attention Issues¶
Attention weights are uniform¶
Symptom: All attention weights ≈ 1/seq_len
Causes:
1. Missing scaling by √d_k
2. Poor initialization
3. Undertrained
Fix: Ensure scaling: scores = Q @ K.T / np.sqrt(d_k)
Attention looks at future (in causal model)¶
Symptom: Model "cheats" during training, poor generation
Cause: Missing or incorrect causal mask
Fix:
```python
def causal_mask(seq_len):
    # Upper-triangular (future) positions get -inf; softmax then gives them zero weight
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    return np.where(mask, float('-inf'), 0)
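```
A usage sketch, assuming a single head with Q and K of shape (seq_len, d_k) and any numerically stable softmax (e.g., the subtract-max version sketched under Numerical Issues).
```python
d_k = K.shape[-1]
scores = Q @ K.T / np.sqrt(d_k) + causal_mask(seq_len)  # -inf above the diagonal
weights = stable_softmax(scores)  # masked positions get exp(-inf) = 0 weight
```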
See also: Stage 5: Common Mistakes
Alignment Issues¶
Reward increases but quality decreases¶
Symptom: Reward hacking, i.e. the model learns to game the reward signal rather than genuinely improve
Fixes:
1. Increase KL penalty (β in DPO, KL coefficient in RLHF)
2. Use ensemble of reward models
3. Add output diversity constraints
4. Improve reward model with more data
DPO loss not decreasing¶
Common causes:
1. Learning rate too low (try 1e-6 to 1e-5)
2. β too high (try 0.1)
3. Reference model not frozen
4. Log probability computation wrong (compare with the sketch below)
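To check cause 4, here is a minimal NumPy sketch of the per-pair DPO loss; it assumes log-probabilities already summed over the response tokens, and the function name is illustrative.
```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Margin: how much more the policy (relative to the frozen reference)
    # prefers the chosen response over the rejected one.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), written via logaddexp for numerical stability.
    return np.logaddexp(0.0, -beta * margin)
```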
See also: Stage 10: Common Mistakes
Quick Checklist¶
Before asking for help, verify:
- [ ] Data is loaded correctly (print a sample)
- [ ] Shapes are as expected at each layer
- [ ] Loss is computed correctly (cross-entropy for LM)
- [ ] Gradients exist and are non-zero
- [ ] Learning rate is reasonable (try 1e-4 as default)
- [ ] Gradient clipping is enabled (max_norm=1.0)
- [ ] Model is in training mode
- [ ] Random seeds are set for reproducibility
Getting More Help¶
If this guide doesn't solve your problem:
- Check stage-specific common mistakes: Each stage has detailed debugging info
- Read the test files: Tests show expected behavior
- Simplify: Can you reproduce on a tiny example?
- Binary search: Bisect the code to find where it breaks
- Ask with details: Include loss values, shapes, and minimal code