Stage 3: Common Mistakes
Mistake 1: Softmax on Wrong Axis
Symptom: Probabilities sum to 1 across the batch instead of across the vocabulary
Wrong code:
logits = model(x) # Shape: [batch, vocab]
probs = softmax(logits, axis=0) # Wrong! Sums across batch
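The fix: Apply softmax over the vocabulary axis (the last axis, axis=-1).
Verify: Each row of probs should sum to 1.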
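A minimal sketch of the fix and the check, assuming NumPy. The `softmax` helper here is a hypothetical stand-in for whatever softmax the model uses, and the random `logits` replace `model(x)`:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax along `axis` (hypothetical helper)."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 10))      # Shape: [batch, vocab]
probs = softmax(logits, axis=-1)       # Normalize over the vocabulary

# Each row (one example) should sum to 1
row_sums = probs.sum(axis=-1)          # Shape: [batch]
assert np.allclose(row_sums, 1.0)
```

With axis=0 the columns would sum to 1 instead, so the same assertion would fail.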
Mistake 2: Embedding Gradient Accumulation
Symptom: Embeddings don't update correctly for repeated tokens
Wrong code:
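Example: If "the" appears twice in the context, both gradients must be summed into the same embedding row.
The fix: Accumulate gradients so that repeated token indices are summed rather than overwritten.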
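One way to sketch this in NumPy, shown next to the buggy pattern it replaces. The array names and sizes are illustrative assumptions. The relevant NumPy behaviour is that `+=` with a repeated fancy index applies only one update per index, while `np.add.at` accumulates all of them:

```python
import numpy as np

vocab_size, dim = 5, 3
context = np.array([2, 4, 2])            # token 2 appears twice
grad_embedded = np.ones((3, dim))        # upstream gradient, one row per position

# Buggy: fancy-index += keeps only one update per repeated index
grad_wrong = np.zeros((vocab_size, dim))
grad_wrong[context] += grad_embedded

# Fixed: np.add.at performs unbuffered accumulation over repeated indices
grad_right = np.zeros((vocab_size, dim))
np.add.at(grad_right, context, grad_embedded)

print(grad_wrong[2])   # [1. 1. 1.]
print(grad_right[2])   # [2. 2. 2.]
```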
Mistake 3: Forgetting Bias Terms
Symptom: Model has trouble learning constant offsets
Wrong:
The fix: Add a learnable bias vector to the linear layer.
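A minimal sketch of a linear layer with and without a bias, assuming NumPy. The name `b1` and the layer sizes are illustrative and not taken from the original:

```python
import numpy as np

rng = np.random.default_rng(0)
in_dim, hidden_dim, batch = 192, 128, 32

W1 = rng.normal(size=(in_dim, hidden_dim)) * np.sqrt(1.0 / in_dim)
b1 = np.zeros(hidden_dim)                # Bias: one offset per output unit

x = rng.normal(size=(batch, in_dim))
hidden_no_bias = x @ W1                  # Output is always 0 when x is 0
hidden = x @ W1 + b1                     # b1 broadcasts over the batch dimension

# In backprop, the bias gradient is the upstream gradient summed over the batch
grad_hidden = np.ones_like(hidden)
grad_b1 = grad_hidden.sum(axis=0)        # Shape: [hidden_dim]
```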
Mistake 4: Not Flattening Context
Symptom: Shape mismatch between embedding and linear layer
Example: Context of 3 words with 64-dim embeddings should give 192-dim input.
Wrong:
embedded = embeddings[context] # Shape: [3, 64]
hidden = embedded @ W1 # Shape error: W1 is [192, hidden_dim] but embedded is [3, 64]
The fix:
embedded = embeddings[context] # Shape: [3, 64]
embedded_flat = embedded.flatten() # Shape: [192]
hidden = embedded_flat @ W1 # W1 is [192, hidden_dim]
Mistake 5: Log of Zero in Cross-Entropy
Symptom: Loss is NaN or Inf
Wrong code:
The fix: Add a small epsilon or use log-softmax
loss = -np.sum(targets * np.log(probs + 1e-10))
# Or better:
log_probs = log_softmax(logits)
loss = -np.sum(targets * log_probs)
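The snippet above calls `log_softmax` without defining it. A minimal sketch of such a helper, assuming NumPy and the usual log-sum-exp trick (the function body and the test logits are assumptions, not the original implementation):

```python
import numpy as np

def log_softmax(logits, axis=-1):
    """Stable log-softmax: subtract the max, then subtract the log-sum-exp."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

logits = np.array([[1000.0, 0.0, -1000.0]])   # Extreme but finite logits
targets = np.array([[0.0, 0.0, 1.0]])          # One-hot target on the unlikely class

log_probs = log_softmax(logits)
loss = -np.sum(targets * log_probs)
print(np.isfinite(loss))   # True
```

Because the logarithm is never taken of a probability that has underflowed to 0, the loss stays finite even for very confident wrong predictions.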
Mistake 6: Wrong Initialization Scale
Symptom: Gradients vanish or explode from the start
Too small:
Too large:
The fix: Use Xavier or He initialization
# Xavier, fan-in variant (for tanh/sigmoid)
W = np.random.randn(in_dim, out_dim) * np.sqrt(1.0 / in_dim)
# He (for ReLU)
W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
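To see why the scale matters, the sketch below pushes random input through a small stack of tanh layers with three weight scales. The layer width, depth, and the fixed scales 0.01 and 1.0 are illustrative assumptions chosen to stand in for "too small" and "too large":

```python
import numpy as np

def final_activations(scale_fn, dim=256, depth=5, batch=64, seed=0):
    """Push random input through `depth` tanh layers; return the last activations."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(batch, dim))
    for _ in range(depth):
        W = rng.normal(size=(dim, dim)) * scale_fn(dim)
        a = np.tanh(a @ W)
    return a

too_small = final_activations(lambda n: 0.01)             # Fixed tiny scale
too_large = final_activations(lambda n: 1.0)              # Unscaled randn
xavier = final_activations(lambda n: np.sqrt(1.0 / n))    # Fan-in scaling

# Too small: the signal shrinks toward 0 layer by layer
print(f"too small: std={too_small.std():.2e}")
# Too large: most units saturate near +/-1, where the tanh gradient is ~0
print(f"too large: saturated={(np.abs(too_large) > 0.99).mean():.0%}")
# Fan-in scaling: activations stay in a usable range
print(f"xavier:    std={xavier.std():.2f}")
```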
Mistake 7: ReLU Killing Gradients
Symptom: Many neurons output 0 and never recover
Problem: ReLU outputs 0 for negative inputs, and its gradient there is also 0, so a neuron stuck in that regime stops receiving updates.
Signs:
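activations = relu(hidden) # Shape: [batch, hidden_dim]
dead = (activations == 0).all(axis=0) # True where a neuron is 0 for every example in the batch
print(f"Dead neurons: {dead.mean():.1%}")
# If > 20%, something's wrong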
Fixes:
- Use LeakyReLU: max(0.01*x, x)
- Use better initialization
- Reduce learning rate
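The LeakyReLU option from the list above can be sketched as follows, assuming NumPy. The function names and the default `alpha=0.01` mirror the formula max(0.01*x, x), but they are illustrative rather than taken from the original code:

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    """LeakyReLU: x for x > 0, alpha * x otherwise."""
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    """Gradient is 1 for x > 0 and alpha otherwise, so it is never exactly 0."""
    return np.where(x > 0, 1.0, alpha)

x = np.array([-2.0, 0.0, 3.0])
out = leaky_relu(x)          # Negative inputs are scaled by alpha, not zeroed
grad = leaky_relu_grad(x)    # Negative inputs still pass a small gradient
```

Since the gradient for negative inputs is small but nonzero, a neuron pushed into the negative regime can still recover.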
Mistake 8: Not Normalizing Input
Symptom: Unstable training, sensitivity to input scale
Example: If some tokens are represented with values in 0-100 and others in 0-1, the large-scale values dominate and the model struggles.
The fix: Keep embeddings at roughly zero mean and unit variance
# Check embedding statistics
print(f"Embedding mean: {embeddings.mean():.3f}") # Should be ~0
print(f"Embedding std: {embeddings.std():.3f}") # Should be ~1
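If the check shows a badly scaled table, one possible remedy is to standardize it. The sketch below is an assumption about how that could be done in NumPy, not the original procedure. It uses a deliberately mis-scaled random table as a stand-in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 1000, 64

# Badly scaled table: values spread over roughly 0-100
embeddings = rng.uniform(0.0, 100.0, size=(vocab_size, embed_dim))

# Standardize over the whole table (per-dimension statistics are another option)
embeddings = (embeddings - embeddings.mean()) / embeddings.std()

print(f"Embedding mean: {embeddings.mean():.3f}")
print(f"Embedding std: {embeddings.std():.3f}")
```

For freshly initialized embeddings, drawing from a standard normal gives roughly these statistics without any rescaling.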
Mistake 9: Evaluating on Training Data
Symptom: Perplexity looks great but the model doesn't generalize
Wrong:
model.train(data)
perplexity = model.evaluate(data) # Same data!
print(f"Perplexity: {perplexity}") # Misleadingly low
The fix: Always evaluate on held-out test data
train_data, test_data = split(data, ratio=0.9)
model.train(train_data)
perplexity = model.evaluate(test_data) # Different data
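The `split` helper and the perplexity computation are not defined above. A minimal sketch, assuming a positional split and perplexity computed from the log-probabilities the model assigns to the true tokens (both function bodies are hypothetical):

```python
import numpy as np

def split(data, ratio=0.9):
    """Split a sequence into a training prefix and a held-out suffix."""
    cut = int(len(data) * ratio)
    return data[:cut], data[cut:]

def perplexity(log_probs):
    """Perplexity = exp of the mean negative log-probability of the true tokens."""
    return float(np.exp(-np.mean(log_probs)))

data = list(range(100))                          # Stand-in for a token sequence
train_data, test_data = split(data, ratio=0.9)
assert not set(train_data) & set(test_data)      # No overlap for this toy data

# A model that assigns probability 1/4 to every true token has perplexity ~4
ppl = perplexity(np.log(np.full(10, 0.25)))
```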
Mistake 10: Batch Size Confusion
Symptom: Shapes work for batch_size=1 but fail for larger batches
Wrong:
The fix: Always think in batches, giving every array an explicit leading batch dimension
Tip: Test with batch_size=1 AND batch_size=32 to catch shape bugs.
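The tip can be sketched as a batched forward pass that is checked at both batch sizes. The `forward` function, the tanh activation, and all sizes below are illustrative assumptions:

```python
import numpy as np

def forward(context, embeddings, W1, b1):
    """Batched forward pass. context: [batch, context_len] integer token ids."""
    batch = context.shape[0]
    embedded = embeddings[context]            # Shape: [batch, context_len, embed_dim]
    flat = embedded.reshape(batch, -1)        # Shape: [batch, context_len * embed_dim]
    return np.tanh(flat @ W1 + b1)            # Shape: [batch, hidden_dim]

rng = np.random.default_rng(0)
vocab_size, context_len, embed_dim, hidden_dim = 100, 3, 64, 128
in_dim = context_len * embed_dim
embeddings = rng.normal(size=(vocab_size, embed_dim))
W1 = rng.normal(size=(in_dim, hidden_dim)) * np.sqrt(1.0 / in_dim)
b1 = np.zeros(hidden_dim)

for batch_size in (1, 32):
    context = rng.integers(0, vocab_size, size=(batch_size, context_len))
    hidden = forward(context, embeddings, W1, b1)
    assert hidden.shape == (batch_size, hidden_dim)
```

Note the `reshape(batch, -1)` in place of `flatten()`: on a batched array, `flatten()` would merge the batch dimension into the features, which is the kind of bug that only shows up once batch_size is larger than 1.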