Stage 3: Common Mistakes

Mistake 1: Softmax on Wrong Axis

Symptom: Probabilities sum to 1 across the batch instead of across the vocabulary

Wrong code:

logits = model(x)  # Shape: [batch, vocab]
probs = softmax(logits, axis=0)  # Wrong! Sums across batch

The fix:

probs = softmax(logits, axis=-1)  # Correct: sums across vocabulary

Verify:

assert np.allclose(probs.sum(axis=-1), 1.0)  # Each sample sums to 1
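For reference, a minimal numerically stable softmax sketch; the function name and shapes are assumptions for illustration, not taken from the surrounding code. The axis and keepdims arguments are what keep the normalization over the vocabulary dimension:

import numpy as np

def softmax(logits, axis=-1):
    # Subtract the per-row max for numerical stability (doesn't change the result)
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    # keepdims=True so the division broadcasts along the chosen axis
    return exp / exp.sum(axis=axis, keepdims=True)

logits = np.random.randn(4, 10)              # [batch=4, vocab=10]
probs = softmax(logits, axis=-1)
assert np.allclose(probs.sum(axis=-1), 1.0)  # each sample sums to 1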


Mistake 2: Embedding Gradient Accumulation

Symptom: Embeddings don't update correctly for repeated tokens

Wrong code:

def backward(self, grad, indices):
    self.W_grad[indices] = grad  # Overwrites instead of accumulating!

Example: If "the" appears twice, we need to sum both gradients.

The fix:

def backward(self, grad, indices):
    np.add.at(self.W_grad, indices, grad)  # Accumulates correctly
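A quick sanity check of the difference, using a toy gradient table (the 5-token, 2-dim shapes here are illustrative):

import numpy as np

W_grad = np.zeros((5, 2))        # 5 tokens, 2-dim embeddings
indices = np.array([1, 3, 1])    # token 1 appears twice
grad = np.ones((3, 2))

W_grad[indices] = grad           # overwrite: the second occurrence clobbers the first
print(W_grad[1])                 # [1. 1.]

W_grad[:] = 0
np.add.at(W_grad, indices, grad) # accumulate: repeated rows are summed
print(W_grad[1])                 # [2. 2.]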


Mistake 3: Forgetting Bias Terms

Symptom: Model has trouble learning constant offsets

Wrong:

hidden = x @ W  # No bias!

The fix:

hidden = x @ W + b  # Include bias
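A minimal sketch of a linear layer that carries its own bias; the class and attribute names are illustrative, not from the original code:

import numpy as np

class Linear:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(1.0 / in_dim)
        self.b = np.zeros(out_dim)  # bias starts at zero and learns the constant offset

    def forward(self, x):
        # x: [batch, in_dim] -> [batch, out_dim]
        return x @ self.W + self.b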


Mistake 4: Not Flattening Context

Symptom: Shape mismatch between embedding and linear layer

Example: A context of 3 words with 64-dim embeddings should give a 192-dim input (3 × 64 = 192).

Wrong:

embedded = embeddings[context]  # Shape: [3, 64]
hidden = embedded @ W1  # Shape error: embedded's last axis is 64, but W1 is [192, hidden_dim]

The fix:

embedded = embeddings[context]  # Shape: [3, 64]
embedded_flat = embedded.flatten()  # Shape: [192]
hidden = embedded_flat @ W1  # W1 is [192, hidden_dim]
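For batched inputs the same flattening is done with reshape; a sketch under assumed sizes (batch of 32, vocab of 1000, hidden_dim of 128 are illustrative):

import numpy as np

batch, context_len, embed_dim, hidden_dim = 32, 3, 64, 128
embeddings = np.random.randn(1000, embed_dim)                     # toy vocabulary of 1000 tokens
context = np.random.randint(0, 1000, size=(batch, context_len))   # token ids

embedded = embeddings[context]                                    # [32, 3, 64]
embedded_flat = embedded.reshape(batch, context_len * embed_dim)  # [32, 192]
W1 = np.random.randn(context_len * embed_dim, hidden_dim)         # [192, 128]
hidden = embedded_flat @ W1                                       # [32, 128]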


Mistake 5: Log of Zero in Cross-Entropy

Symptom: Loss is NaN or Inf

Wrong code:

loss = -np.sum(targets * np.log(probs))  # log(0) = -inf!

The fix: Add a small epsilon or use log-softmax

loss = -np.sum(targets * np.log(probs + 1e-10))
# Or better:
log_probs = log_softmax(logits)
loss = -np.sum(targets * log_probs)
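log_softmax is used above without a definition; a minimal numerically stable sketch of what it could look like:

import numpy as np

def log_softmax(logits, axis=-1):
    # log softmax = logits - logsumexp(logits), computed stably via the max shift
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

logits = np.random.randn(4, 10)
log_probs = log_softmax(logits)
# exp(log_probs) rows sum to 1, and log_probs is finite for finite logits
assert np.allclose(np.exp(log_probs).sum(axis=-1), 1.0)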


Mistake 6: Wrong Initialization Scale

Symptom: Gradients vanish or explode from the start

Too small:

W = np.random.randn(in_dim, out_dim) * 0.0001  # Vanishing

Too large:

W = np.random.randn(in_dim, out_dim) * 10  # Exploding

The fix: Use Xavier or He initialization

# Xavier (for tanh/sigmoid)
W = np.random.randn(in_dim, out_dim) * np.sqrt(1.0 / in_dim)

# He (for ReLU)
W = np.random.randn(in_dim, out_dim) * np.sqrt(2.0 / in_dim)
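A quick way to see the effect is to compare the output scale of one layer under the three schemes; the dimensions below are illustrative:

import numpy as np

in_dim, out_dim = 192, 128
x = np.random.randn(32, in_dim)  # roughly unit-variance input

for name, scale in [("too small", 0.0001),
                    ("too large", 10.0),
                    ("Xavier", np.sqrt(1.0 / in_dim))]:
    W = np.random.randn(in_dim, out_dim) * scale
    h = x @ W
    print(f"{name:>10}: output std = {h.std():.4f}")
# "too small" collapses toward 0, "too large" blows up, Xavier stays near 1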


Mistake 7: ReLU Killing Gradients

Symptom: Many neurons output 0 and never recover

Problem: ReLU outputs 0 for negative inputs, so the gradient through those units is also 0; a neuron whose pre-activation stays negative stops learning entirely.

Signs:

activations = relu(hidden)
print(f"Dead neurons: {(activations == 0).mean():.1%}")
# If > 20%, something's wrong

Fixes:
- Use LeakyReLU: max(0.01*x, x) (see the sketch below)
- Use better initialization
- Reduce the learning rate
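
A minimal LeakyReLU sketch with slope 0.01 on the negative side, matching the max(0.01*x, x) form above:

import numpy as np

def leaky_relu(x, slope=0.01):
    # Negative inputs keep a small output instead of being zeroed out
    return np.where(x > 0, x, slope * x)

def leaky_relu_grad(x, slope=0.01):
    # Gradient is 1 for positive inputs and `slope` for negative ones (never 0)
    return np.where(x > 0, 1.0, slope)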


Mistake 8: Not Normalizing Input

Symptom: Unstable training, sensitivity to input scale

Example: If some inputs span 0-100 and others span 0-1, the model struggles to weight them consistently.

The fix: Embeddings should have roughly zero mean and unit variance

# Check embedding statistics
print(f"Embedding mean: {embeddings.mean():.3f}")  # Should be ~0
print(f"Embedding std: {embeddings.std():.3f}")   # Should be ~1


Mistake 9: Evaluating on Training Data

Symptom: Perplexity looks great but model doesn't generalize

Wrong:

model.train(data)
perplexity = model.evaluate(data)  # Same data!
print(f"Perplexity: {perplexity}")  # Misleadingly low

The fix: Always use separate test data

train_data, test_data = split(data, ratio=0.9)
model.train(train_data)
perplexity = model.evaluate(test_data)  # Different data
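split is used above without a definition; a minimal sketch of what it could look like, assuming data is an indexable sequence of samples (the seed parameter is an added convenience):

import numpy as np

def split(data, ratio=0.9, seed=0):
    # Shuffle indices once, then cut into train/test portions
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))
    cut = int(len(data) * ratio)
    train = [data[i] for i in idx[:cut]]
    test = [data[i] for i in idx[cut:]]
    return train, test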


Mistake 10: Batch Size Confusion

Symptom: Shapes work for batch_size=1 but fail for larger batches

Wrong:

# Only works for single samples
def forward(self, x):
    return x @ self.W  # Assumes x is 1D

The fix: Always think in batches

def forward(self, x):
    # x is [batch, features]
    return x @ self.W  # Works for any batch size

Tip: Test with batch_size=1 AND batch_size=32 to catch issues.
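A quick check in that spirit, assuming a forward that takes a [batch, features] array; the model class and dimensions are illustrative:

import numpy as np

class TinyModel:
    def __init__(self, in_dim, out_dim):
        self.W = np.random.randn(in_dim, out_dim) * np.sqrt(1.0 / in_dim)

    def forward(self, x):
        # x is [batch, in_dim]; the matrix multiply works for any batch size
        return x @ self.W

model = TinyModel(192, 128)
for batch_size in (1, 32):
    out = model.forward(np.random.randn(batch_size, 192))
    assert out.shape == (batch_size, 128)
print("forward handles batch_size 1 and 32")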