
Troubleshooting Guide

A unified reference for diagnosing and fixing common problems

This guide consolidates the most common issues across all stages. When something goes wrong, start here.


Quick Diagnosis

What's your symptom?

Symptom | Likely Stage | Jump to
Loss is NaN or Inf | Training | Numerical Issues
Loss not decreasing | Training | Loss Plateau
Loss oscillating wildly | Optimization | Learning Rate Issues
Model outputs garbage | Multiple | Output Quality Issues
Out of memory | Architecture | Memory Issues
Gradients are zero | Backprop | Gradient Issues
Training is very slow | Multiple | Performance Issues
Test error much higher than train | Training | Overfitting

Numerical Issues

Loss becomes NaN

Symptoms: Loss suddenly jumps to nan or inf

Common causes:

  1. Learning rate too high → gradients explode
  2. Division by zero (e.g., in normalization)
  3. Log of zero or a negative number
  4. Overflow in softmax with large logits

Diagnosis steps:

# Check for NaN in gradients (assumes a PyTorch-style named_parameters())
for name, param in model.named_parameters():
    if np.isnan(param.grad).any():
        print(f"NaN gradient in {name}")

# Check for extreme values
print(f"Max logit: {logits.max()}, Min: {logits.min()}")

Fixes:

  1. Reduce the learning rate by 10x
  2. Add gradient clipping: clip_grad_norm(params, max_norm=1.0) (a sketch follows after this list)
  3. Use numerically stable implementations:

# Bad:  log(softmax(x)) can overflow/underflow with extreme logits
# Good: log_softmax via the log-sum-exp identity
from scipy.special import logsumexp
log_probs = x - logsumexp(x, axis=-1, keepdims=True)

  4. Add an epsilon to denominators: x / (std + 1e-8)
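
For fix 2, a minimal global-norm clipping sketch in the guide's numpy style (clip_grad_norm here is a hypothetical helper; grads is assumed to be a list of numpy gradient arrays):

def clip_grad_norm(grads, max_norm=1.0):
    # Global L2 norm across all gradient arrays
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Rescale all gradients together if the global norm exceeds the threshold
    if total_norm > max_norm:
        grads = [g * (max_norm / (total_norm + 1e-8)) for g in grads]
    return grads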

See also: Stage 8: Training Dynamics


Loss Plateau

Loss stops decreasing

Symptoms: Loss decreases initially, then flatlines

Common causes:

  1. Learning rate too low
  2. Stuck in a local minimum
  3. Vanishing gradients
  4. Model capacity too small
  5. Data exhausted (need more data)

Diagnosis steps:

# Check gradient magnitudes
grad_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
print(f"Gradient norm: {grad_norm}")  # Should be ~0.1-10

# Check if gradients are vanishing
if grad_norm < 1e-7:
    print("Vanishing gradients!")

Fixes:

  1. Try a higher learning rate (use the LR finder)
  2. Add learning rate warmup (see the sketch below)
  3. Check for missing residual connections
  4. Ensure proper initialization
  5. Add more model capacity (layers/dimensions)
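
For fix 2, a minimal warmup schedule sketch (base_lr and warmup_steps are illustrative values, not from this guide):

def lr_at_step(step, base_lr=1e-4, warmup_steps=1000):
    # Ramp linearly from ~0 to base_lr, then hold (swap in a decay schedule if desired)
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr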

See also: Stage 8: Loss Curve Analysis


Learning Rate Issues

Loss oscillates wildly

Symptoms: Loss jumps up and down without trending downward

Cause: Learning rate too high

Fix: Reduce by 2-10x until stable

Loss decreases very slowly

Symptoms: Loss barely moves after thousands of steps

Cause: Learning rate too low

Fix: Increase by 2-10x, or use LR finder

Finding the right learning rate

# LR Range Test: sweep learning rates over several orders of magnitude
lrs = np.logspace(-7, 0, 100)  # 1e-7 to 1
losses = []

for lr in lrs:
    loss = train_one_step(lr)  # one training step at this LR
    losses.append(loss)
    if loss > 4 * losses[0]:   # loss exploded; stop the sweep
        break

# Plot losses vs. lrs (log x-axis) and find the steepest descent.
# A good LR is ~10x lower than the point where loss starts rising.

See also: Stage 8: Learning Rate Finding


Gradient Issues

Gradients are all zero

Symptoms: Parameters don't update, loss constant

Common causes:

  1. Forgot zero_grad() (gradients accumulating incorrectly)
  2. Wrong tensor (detached from the computation graph)
  3. Dead ReLU neurons
  4. Disconnected computation graph

Diagnosis:

# Check if any gradients are non-zero
has_grad = any(np.any(p.grad != 0) for p in params)
print(f"Has gradients: {has_grad}")

# Check for dead neurons (assumes ReLU: dead units output exactly zero)
activations = layer.forward(x)
dead_fraction = (activations == 0).mean()
print(f"Dead neurons: {dead_fraction:.1%}")

Fixes:

  1. Call zero_grad() before each backward pass
  2. Check that the computation graph is connected
  3. Use LeakyReLU or GELU instead of ReLU
  4. Use better initialization (He init for ReLU; see the sketch below)
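
For fix 4, a sketch of He initialization for a ReLU layer (fan_in is the layer's input dimension, fan_out its output dimension):

def he_init(fan_in, fan_out):
    # Variance 2/fan_in compensates for ReLU zeroing half the activations
    return np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)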

Vanishing gradients

Symptoms: Early layers have tiny gradients, don't learn

Common causes:

  1. Deep network without residual connections
  2. Sigmoid/tanh saturation
  3. Poor initialization

Fixes:

  1. Add residual connections: output = input + layer(input)
  2. Use a pre-norm architecture (see the sketch below)
  3. Use proper initialization (Xavier/He)
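
Fixes 1 and 2 together, as a sketch (layer_norm and sublayer stand in for your own implementations):

def pre_norm_block(x, sublayer, layer_norm):
    # Normalize first, then add the residual: the identity path gives
    # gradients a direct route back to early layers
    return x + sublayer(layer_norm(x))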

Exploding gradients

Symptoms: Gradient norms > 100, loss spikes

Fixes:

  1. Gradient clipping (always use it for transformers; see the sketch under Numerical Issues)
  2. Lower the learning rate
  3. Better initialization

See also: Stage 2: Backpropagation, Stage 8: Gradient Statistics


Output Quality Issues

Model outputs repetitive text

Symptoms: Generated text repeats phrases or loops

Common causes:

  1. Temperature too low
  2. No sampling (pure argmax)
  3. Training collapse

Fixes:

  1. Increase temperature (try 0.7-1.0)
  2. Add nucleus (top-p) sampling (see the sketch below)
  3. Add a repetition penalty
  4. Check that training wasn't corrupted
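
For fixes 1 and 2, a sketch of temperature plus nucleus (top-p) sampling over a 1-D array of logits (sample_top_p is a hypothetical helper, not part of any stage's API):

def sample_top_p(logits, temperature=0.8, top_p=0.9):
    # Temperature: values below 1 sharpen the distribution, above 1 flatten it
    scaled = logits / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    # Keep the smallest set of tokens whose cumulative probability >= top_p
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    keep = order[:cutoff]
    return np.random.choice(keep, p=probs[keep] / probs[keep].sum())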

Model outputs nonsense

Symptoms: Generated text is incoherent

Common causes:

  1. Undertrained model
  2. Tokenization mismatch
  3. Wrong model weights loaded
  4. Corrupted embeddings

Diagnosis:

# Check loss is reasonable
# Random = log(vocab_size), e.g., ~10.8 for vocab=50000
print(f"Loss: {loss} (random would be {np.log(vocab_size):.1f})")

# Verify tokenization roundtrips
text = "Hello world"
assert tokenizer.decode(tokenizer.encode(text)) == text

See also: Stage 7: Tokenization


Memory Issues

Out of memory (OOM)

Symptoms: CUDA/system out of memory error

Common causes:

  1. Batch size too large
  2. Sequence length too long
  3. Model too large
  4. Accumulating tensors in memory

Fixes:

  1. Reduce batch size
  2. Use gradient accumulation for a larger effective batch (see the sketch below)
  3. Reduce sequence length
  4. Use gradient checkpointing
  5. Use mixed precision (fp16)
  6. Use PEFT methods (LoRA) instead of full fine-tuning
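
For fix 2, a gradient accumulation sketch in a PyTorch-style loop (accum_steps and loss_fn are assumptions; adapt to your training code):

accum_steps = 4  # effective batch size = batch_size * accum_steps
optimizer.zero_grad()
for i, batch in enumerate(data_loader):
    loss = loss_fn(model(batch), batch)
    (loss / accum_steps).backward()  # scale so the accumulated sum matches one big batch
    if (i + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()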

Estimation:

# Rough memory estimate (fp32 training with Adam):
# 4 bytes/param each for weights, gradients, and two optimizer moments
params = model.count_parameters()
memory_gb = params * 4 * 4 / 1e9
print(f"Estimated memory: {memory_gb:.1f} GB")

See also: Stage 9: PEFT


Overfitting

Validation loss increasing while training loss decreases

Symptoms: Model performs well on training data but poorly on validation

Diagnosis:

# Track both losses
print(f"Train: {train_loss:.4f}, Val: {val_loss:.4f}")
# If gap > 0.3-0.5, likely overfitting

Fixes:

  1. More training data
  2. Data augmentation
  3. Dropout (0.1-0.3)
  4. Weight decay (0.01-0.1)
  5. Early stopping (see the sketch below)
  6. Reduce model capacity
  7. Use PEFT instead of full fine-tuning
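
For fix 5, a minimal early-stopping sketch (train_one_epoch, evaluate, save_checkpoint, and max_epochs are stand-ins for your own training utilities):

best_val, patience, bad_evals = float('inf'), 3, 0
for epoch in range(max_epochs):
    train_one_epoch(model)
    val_loss = evaluate(model)
    if val_loss < best_val:
        best_val, bad_evals = val_loss, 0
        save_checkpoint(model)  # keep the best weights, not the last ones
    else:
        bad_evals += 1
        if bad_evals >= patience:
            break  # stop before the train/val gap widens further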

See also: Stage 8: Debugging Strategies


Performance Issues

Training is very slow

Common causes:

  1. Not using the GPU (if available)
  2. Data loading bottleneck
  3. Too much logging/checkpointing
  4. Inefficient operations

Diagnosis:

import time

t0 = time.time()
for i in range(100):
    batch = next(data_loader)      # data loading time (100 batches)
t1 = time.time()
output = model(batch)              # forward time (one batch)
loss = loss_fn(output, batch)      # loss_fn stands in for your loss computation
t2 = time.time()
loss.backward()                    # backward time
t3 = time.time()

print(f"Data (100 batches): {t1-t0:.3f}s, Forward: {t2-t1:.3f}s, Backward: {t3-t2:.3f}s")

Fixes:

  1. Use a GPU if available
  2. Increase data loader workers
  3. Prefetch data (see the sketch below)
  4. Reduce logging frequency
  5. Use compiled/fused operations
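
For fix 3, a minimal background-prefetch sketch using only the standard library (prefetch is a hypothetical helper; buffer_size controls how many batches are loaded ahead):

import queue
import threading

def prefetch(loader, buffer_size=4):
    # Load batches in a background thread so compute never waits on I/O
    q = queue.Queue(maxsize=buffer_size)
    def worker():
        for batch in loader:
            q.put(batch)
        q.put(None)  # sentinel: loader exhausted
    threading.Thread(target=worker, daemon=True).start()
    while True:
        batch = q.get()
        if batch is None:
            return
        yield batch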


Tokenization Issues

Many <UNK> tokens

Symptoms: Input has many unknown tokens

Cause: Vocabulary doesn't cover input text

Fixes:

  1. Retrain the tokenizer on representative data
  2. Use byte-level tokenization (handles any input; see the demo below)
  3. Increase the vocabulary size
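
Fix 2 works because any string decomposes into bytes 0-255, so no input can fall outside the vocabulary:

text = "héllo 🌍"
ids = list(text.encode("utf-8"))  # every id is in range(256)
assert bytes(ids).decode("utf-8") == text  # perfect roundtrip, no <UNK>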

Tokenization doesn't roundtrip

Symptom: decode(encode(text)) != text

Common causes:

  1. Normalization differences (Unicode, case)
  2. Whitespace handling
  3. Special tokens

Fixes:

  1. Normalize consistently
  2. Check whitespace handling
  3. Handle special tokens explicitly

See also: Stage 7: Common Mistakes


Attention Issues

Attention weights are uniform

Symptom: All attention weights ≈ 1/seq_len

Causes:

  1. Missing scaling by √d_k
  2. Poor initialization
  3. Undertrained model

Fix: Ensure scaling: scores = Q @ K.T / np.sqrt(d_k)

Attention looks at future (in causal model)

Symptom: Model "cheats" during training, poor generation

Cause: Missing or incorrect causal mask

Fix:

def causal_mask(seq_len):
    # 1s above the diagonal mark future positions (column > row)
    mask = np.triu(np.ones((seq_len, seq_len)), k=1)
    # Additive mask: -inf at future positions becomes zero weight after softmax
    return np.where(mask, float('-inf'), 0.0)
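
A usage sketch continuing the guide's numpy style (Q, K, d_k, and seq_len are assumed to be defined; softmax is written out inline):

scores = Q @ K.T / np.sqrt(d_k) + causal_mask(seq_len)
scores -= scores.max(axis=-1, keepdims=True)    # stable softmax; exp(-inf) = 0
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)  # future positions get weight 0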

See also: Stage 5: Common Mistakes


Alignment Issues

Reward increases but quality decreases

Symptom: Reward hacking: the model learns to game the reward signal instead of genuinely improving quality

Fixes:

  1. Increase the KL penalty (β in DPO, the KL coefficient in RLHF)
  2. Use an ensemble of reward models
  3. Add output diversity constraints
  4. Improve the reward model with more data

DPO loss not decreasing

Common causes:

  1. Learning rate too low (try 1e-6 to 1e-5)
  2. β too high (try 0.1)
  3. Reference model not frozen
  4. Log-probability computation wrong (see the reference sketch below)
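
For cause 4, a reference sketch of the DPO loss on a single preference pair (the logp_* values are summed token log-probabilities of the chosen/rejected responses under the policy and the frozen reference model; names are illustrative):

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # How much more the policy prefers chosen over rejected, relative to the reference
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    # -log(sigmoid(beta * margin)), computed stably via log(1 + exp(-x))
    return np.logaddexp(0.0, -beta * margin)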

See also: Stage 10: Common Mistakes


Quick Checklist

Before asking for help, verify:

  • [ ] Data is loaded correctly (print a sample)
  • [ ] Shapes are as expected at each layer
  • [ ] Loss is computed correctly (cross-entropy for LM)
  • [ ] Gradients exist and are non-zero
  • [ ] Learning rate is reasonable (try 1e-4 as default)
  • [ ] Gradient clipping is enabled (max_norm=1.0)
  • [ ] Model is in training mode
  • [ ] Random seeds are set for reproducibility

Getting More Help

If this guide doesn't solve your problem:

  1. Check stage-specific common mistakes: Each stage has detailed debugging info
  2. Read the test files: Tests show expected behavior
  3. Simplify: Can you reproduce on a tiny example?
  4. Binary search: Bisect the code to find where it breaks
  5. Ask with details: Include loss values, shapes, and minimal code