Stage 1: Common Mistakes¶
Mistake 1: Confusing Joint and Conditional Probability¶
Wrong thinking: "P(the, cat) is the probability of seeing 'the cat'"
Correct thinking: P(the, cat) is the joint probability of two events. For language modeling, we want P(cat | the) — the probability of "cat" given we've seen "the".
The fix:
# Wrong: this computes the joint P(the, cat)
p = count("the cat") / total_bigrams
# Correct: conditional P(cat | the), normalized by the count of bigrams starting with "the"
p = count("the cat") / count("the *")
Mistake 2: Zero Probability = Model Broken¶
Symptom: Model assigns P=0 to valid words, and taking log(0) then breaks the perplexity calculation (math.log raises a ValueError, np.log returns -inf)
Example:
>>> model.probability("elephant", context="the")
0.0                             # Never seen "the elephant" in training
>>> math.log(0.0)
ValueError: math domain error   # Perplexity calculation breaks!
The fix: Always use smoothing
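The simplest option is add-one (Laplace) smoothing. A minimal sketch, assuming you already have bigram_counts, unigram_counts, and a vocabulary size vocab_size (all names here are illustrative):

def smoothed_probability(word, context, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) smoothed estimate of P(word | context)."""
    # Every bigram gets a pseudo-count of 1, so unseen pairs never get probability 0
    numerator = bigram_counts.get((context, word), 0) + 1
    denominator = unigram_counts.get(context, 0) + vocab_size
    return numerator / denominator

Add-one is blunt; add-k with a small k, or interpolation and backoff schemes such as Kneser-Ney, usually give better perplexity, but any of them removes the zeros.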
Mistake 3: Not Handling Unknown Words¶
Symptom: KeyError when encountering words not in vocabulary
The fix: Add an <UNK> token
def probability(self, word, context):
    if word not in self.vocab:
        word = "<UNK>"
    if context not in self.vocab:
        context = "<UNK>"
    # ... rest of the calculation
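For <UNK> to have a nonzero probability, it also needs to appear in the training counts. A common trick, sketched below with illustrative names, is to replace rare training words with <UNK> before counting:

from collections import Counter

def build_vocab(tokens, min_count=2):
    """Words seen fewer than min_count times are folded into <UNK>."""
    counts = Counter(tokens)
    vocab = {w for w, c in counts.items() if c >= min_count}
    vocab.add("<UNK>")
    return vocab

def replace_rare(tokens, vocab):
    return [t if t in vocab else "<UNK>" for t in tokens]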
Mistake 4: Temperature of 0¶
Wrong code:
def sample_with_temperature(probs, temperature):
    scaled = probs ** (1 / temperature)  # Division by zero when T=0!
    return scaled / sum(scaled)
The fix: Handle T=0 as argmax
import numpy as np

def sample_with_temperature(probs, temperature):
    if temperature == 0:
        return np.argmax(probs)           # Greedy: index of the most likely token
    scaled = probs ** (1 / temperature)
    return scaled / sum(scaled)           # Rescaled and renormalized distribution
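A quick check of the three regimes (printed values rounded; note that the T=0 branch returns an index rather than a distribution, so callers that always expect a distribution may prefer to return a one-hot vector there):

probs = np.array([0.5, 0.3, 0.2])
print(sample_with_temperature(probs, 0.5))   # sharper than probs: ~[0.66, 0.24, 0.11]
print(sample_with_temperature(probs, 2.0))   # flatter than probs: ~[0.42, 0.32, 0.26]
print(sample_with_temperature(probs, 0))     # 0 (the argmax index, not a distribution)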
Mistake 5: Log Probability Underflow¶
Symptom: Probability of long sequences becomes 0
# Product of many small probabilities
p = 0.1 * 0.1 * 0.1 * ... * 0.1   # 400 terms: mathematically 1e-400
p = 0.0                           # float64 bottoms out near 1e-308, so it underflows to 0!
The fix: Work in log space
import math

# Sum of log probabilities instead of a product of probabilities
log_p = sum(math.log(p) for p in probabilities)

# Convert back only when needed (and only when log_p is large enough not to underflow)
p = math.exp(log_p)
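A standalone comparison of the two routes, using 400 tokens that each get probability 0.1 (numbers chosen purely for illustration):

import math

token_probs = [0.1] * 400

naive = 1.0
for p in token_probs:
    naive *= p
print(naive)                                   # 0.0 -- underflowed

log_p = sum(math.log(p) for p in token_probs)
print(log_p)                                   # about -921.0, still perfectly usable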
Mistake 6: Not Shuffling Data¶
Symptom: Model only learns patterns from the beginning of text
Wrong: training on examples in their original document order, epoch after epoch
The fix: Shuffle training examples each epoch (for SGD-based training), as in the sketch below
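A minimal sketch, assuming examples is a list of training pairs (the names are illustrative):

import random

def iterate_epochs(examples, num_epochs):
    """Yield training examples in a fresh random order each epoch."""
    for _ in range(num_epochs):
        order = list(examples)        # copy so the caller's list is left untouched
        random.shuffle(order)
        for example in order:
            yield example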
Mistake 7: Perplexity on Training Data¶
Symptom: Amazing perplexity that doesn't generalize
Wrong thinking: "My model has perplexity 5 on training data!"
Correct thinking: Perplexity on training data is meaningless. Always evaluate on held-out test data.
# Correct evaluation
train_text, test_text = split_data(corpus, ratio=0.9)
model.train(train_text)
perplexity = evaluate(model, test_text) # Use TEST data
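If you need the evaluate helper itself, a minimal bigram perplexity sketch (assuming the model exposes probability(word, context) and that the test text is already tokenized) could look like:

import math

def evaluate(model, test_tokens):
    """Perplexity = exp of the average negative log probability per predicted token."""
    log_prob = 0.0
    for prev, word in zip(test_tokens, test_tokens[1:]):
        log_prob += math.log(model.probability(word, context=prev))
    num_predictions = len(test_tokens) - 1
    return math.exp(-log_prob / num_predictions)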
Mistake 8: Context Boundary Handling¶
Symptom: Predictions at the start of a text are wrong or undefined
Example: For a bigram model, what is P(first_word | ?) when there is nothing before the first word to condition on?
The fix: Add special start/end tokens (e.g. <s> and </s>), as in the sketch below
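A minimal padding sketch for a bigram model, assuming <s> and </s> as the boundary tokens:

def pad_sentence(tokens, n=2):
    """Pad with n-1 start tokens and one end token (one of each for a bigram model)."""
    return ["<s>"] * (n - 1) + tokens + ["</s>"]

pad_sentence(["the", "cat", "sat"])
# ['<s>', 'the', 'cat', 'sat', '</s>']
# The first real word now has a context to condition on: P(the | <s>)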
Mistake 9: Case Sensitivity¶
Symptom: "The" and "the" treated as different words
This fragments your counts and makes the model worse.
The fix (usually): Lowercase everything
But be aware this loses information ("US" vs "us").
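If you do lowercase, apply the same normalization at tokenization time to training and evaluation text alike; a minimal sketch:

def tokenize(text):
    return text.lower().split()   # identical normalization everywhere

tokenize("The cat saw the US flag")
# ['the', 'cat', 'saw', 'the', 'us', 'flag']   -- "US" is now indistinguishable from "us"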
Mistake 10: Not Normalizing Probabilities¶
Symptom: Probabilities don't sum to 1
Common cause: Off-by-one errors in counting
The fix: Verify normalization with an explicit check after training
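For every context, the conditional distribution over the vocabulary should sum to 1 (up to floating-point error). A sketch assuming the hypothetical model.probability(word, context) and model.vocab from above:

import math

def check_normalization(model, tol=1e-6):
    for context in model.vocab:
        total = sum(model.probability(word, context) for word in model.vocab)
        assert math.isclose(total, 1.0, abs_tol=tol), (
            f"P(. | {context!r}) does not sum to 1 (got {total})"
        )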