Section 3.4: Cross-Entropy Loss and Maximum Likelihood¶
We have a neural network that outputs a probability distribution. But how do we train it?
Cross-entropy loss is the answer. In this section, we'll derive it from first principles and prove it's equivalent to maximum likelihood estimation—connecting back to Stage 1.
The Training Problem¶
What We Have¶
A neural language model that computes:

\[
\hat{P}(c_t \mid c_{t-k:t-1}; \theta)
\]

Where:
- \(c_{t-k:t-1}\) is the context (previous k characters)
- \(c_t\) is the next character
- θ are all the model parameters (embeddings, weights, biases)
- \(\hat{P}\) is the model's predicted probability distribution
What We Want¶
Parameters θ* that make the model's predictions match the true data distribution as closely as possible.
The Fundamental Question¶
How do we measure "how wrong" the model's predictions are?
Maximum Likelihood: The Principled Approach¶
From Stage 1¶
In Stage 1, we derived maximum likelihood estimation for n-gram models:
The optimal parameters are those that maximize the probability of the observed data.
For Neural Language Models¶
The principle is exactly the same!
Given training data \(D = \{(x_1, y_1), ..., (x_N, y_N)\}\) where:
- \(x_i\) is the i-th context
- \(y_i\) is the true next character
The likelihood is:

\[
\mathcal{L}(\theta) = \prod_{i=1}^{N} \hat{P}(y_i \mid x_i; \theta)
\]

(Assuming independence between examples.)
Log-Likelihood¶
Products of many small probabilities are numerically unstable and awkward to differentiate. Taking logarithms turns the product into a sum:

\[
\log \mathcal{L}(\theta) = \sum_{i=1}^{N} \log \hat{P}(y_i \mid x_i; \theta)
\]

This is the log-likelihood.
From Maximization to Minimization¶
By convention, ML practitioners minimize a loss rather than maximize an objective. Negating:

\[
\text{NLL}(\theta) = -\log \mathcal{L}(\theta) = -\sum_{i=1}^{N} \log \hat{P}(y_i \mid x_i; \theta)
\]

This is the negative log-likelihood.
Minimizing NLL = Maximizing likelihood.
Average Loss¶
To keep the loss on a consistent scale and comparable across dataset sizes, use the average:

\[
L(\theta) = \frac{1}{N} \text{NLL}(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{P}(y_i \mid x_i; \theta)
\]

This is our training objective.
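To make this concrete, here is a minimal plain-Python sketch (the helper name average_nll and the probabilities are made up for illustration) that evaluates this objective given the model's predicted probability of the true next character at each position:

import math

def average_nll(probs_of_true_char):
    """Average negative log-likelihood (in nats), given the model's
    predicted probability for the true next character at each position."""
    return -sum(math.log(p) for p in probs_of_true_char) / len(probs_of_true_char)

# Hypothetical predicted probabilities for a tiny 4-character training set
print(average_nll([0.5, 0.1, 0.9, 0.25]))  # ≈ 1.12 nats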
Cross-Entropy: The Information Theory View¶
Cross-Entropy Definition¶
Given true distribution p and model distribution q, the cross-entropy is:

\[
H(p, q) = -\sum_{c} p(c) \log q(c)
\]

It measures the expected number of bits (with log base 2; nats for the natural log) needed to encode samples from p using a code optimized for q.
For Language Modeling¶
At each position in our data:
- True distribution: p(c_t | context) = 1 if c_t is the actual next character, else 0 (one-hot)
- Model distribution: \(q(c) = \hat{P}(c | \text{context}; \theta)\)
The cross-entropy at this position:

\[
H(p, q) = -\sum_{c} p(c) \log q(c)
\]

Since p(c) = 1 for c = y (the true character) and 0 otherwise:

\[
H(p, q) = -\log q(y) = -\log \hat{P}(y \mid \text{context}; \theta)
\]

This is exactly the negative log-probability!
The Equivalence¶
Cross-entropy loss per example = Negative log-likelihood per example.
For the dataset:

\[
L = \frac{1}{N} \sum_{i=1}^{N} H(p_i, q_i) = -\frac{1}{N} \sum_{i=1}^{N} \log \hat{P}(y_i \mid x_i; \theta)
\]

This is the same as the average NLL!
Why Cross-Entropy Is the Right Loss¶
Reason 1: Maximum Likelihood¶
As shown, minimizing cross-entropy = maximizing likelihood.
MLE has strong theoretical justification (under standard regularity conditions):

- Consistent (converges to the true parameters as the amount of data grows)
- Asymptotically efficient (achieves the lowest possible asymptotic variance among consistent estimators)
- Invariant under reparametrization
Reason 2: Proper Scoring Rule¶
A scoring rule S(q, y) (treated here as a loss: lower is better) is proper if:

\[
\mathbb{E}_{y \sim p}\left[ S(p, y) \right] \leq \mathbb{E}_{y \sim p}\left[ S(q, y) \right]
\]

for any distribution q. That is, the best expected score is achieved when q = p.

Cross-entropy (the log loss \(S(q, y) = -\log q(y)\)) is not just proper but strictly proper: the expected score is minimized only at q = p.
Proof:
Expected cross-entropy when the true distribution is p:

\[
\mathbb{E}_{y \sim p}\left[ -\log q(y) \right] = -\sum_{c} p(c) \log q(c) = H(p, q)
\]

This is minimized when q = p, giving the entropy H(p):

\[
H(p, p) = -\sum_{c} p(c) \log p(c) = H(p)
\]

For any q ≠ p, we have H(p, q) > H(p, p) by Gibbs' inequality.
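A quick numerical sanity check of Gibbs' inequality, using a made-up three-symbol distribution (a sketch, not part of the model code):

import math

def cross_entropy(p, q):
    """H(p, q) = -sum_c p(c) * log q(c), in nats."""
    return -sum(pc * math.log(qc) for pc, qc in zip(p, q) if pc > 0)

p = [0.7, 0.2, 0.1]           # "true" distribution (made up)
for q in ([0.7, 0.2, 0.1],    # q = p: achieves the minimum, H(p)
          [0.5, 0.3, 0.2],    # q != p: strictly larger cross-entropy
          [0.1, 0.2, 0.7]):
    print(q, round(cross_entropy(p, q), 4))
# H(p, p) = H(p) ≈ 0.8018; every other q gives a larger value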
Reason 3: Information-Theoretic Interpretation¶
Cross-entropy H(p, q) measures the inefficiency of encoding data from the true distribution p with a code optimized for q.
Minimizing cross-entropy = finding the most efficient encoding of the data.
Reason 4: Gradient Properties¶
We'll see shortly that cross-entropy combined with softmax has remarkably clean gradients.
The Loss Function Explicitly¶
For a Single Example¶
Given context x and true next character y:

\[
L = -\log \hat{P}(y \mid x; \theta)
\]

Recall that the model computes:

\[
\hat{P}(c \mid x; \theta) = \text{softmax}(z)_c = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}}
\]

Where z are the logits (outputs of the final linear layer).

Substituting:

\[
L = -\log \frac{e^{z_y}}{\sum_{c'} e^{z_{c'}}}
\]

Simplifying¶

\[
L = -z_y + \log \sum_{c'} e^{z_{c'}}
\]

This is the softmax cross-entropy (negative log-likelihood) formula.

The second term is the log-sum-exp (LSE) function:

\[
\text{LSE}(z) = \log \sum_{c'} e^{z_{c'}}
\]
For a Batch¶
For N examples:

\[
L = \frac{1}{N} \sum_{i=1}^{N} \left( -z^{(i)}_{y_i} + \log \sum_{c} e^{z^{(i)}_c} \right)
\]

Where \(z^{(i)}\) are the logits for example i, and \(y_i\) is the true character index.
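As a concrete check of this formula, here is a short plain-Python sketch (the logits and targets are made up; example_loss is just an illustrative helper):

import math

def example_loss(z, y):
    """Per-example loss: -z_y + log(sum_c exp(z_c))."""
    return -z[y] + math.log(sum(math.exp(v) for v in z))

# Two made-up examples over a 3-character vocabulary
batch_logits = [[1.0, -1.0, 0.5], [0.2, 0.1, -0.3]]
batch_targets = [0, 2]

losses = [example_loss(z, y) for z, y in zip(batch_logits, batch_targets)]
print(sum(losses) / len(losses))  # average loss over the batch ≈ 0.99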
Computing the Gradient¶
Why We Need the Gradient¶
To train via gradient descent, we need:

\[
\frac{\partial L}{\partial \theta}
\]

for every parameter θ. Our Stage 2 autograd will compute this automatically, but understanding the gradient structure is valuable.
Gradient w.r.t. Logits¶
The most important gradient: \(\partial L / \partial z\), the gradient with respect to the logits.

For a single example with true class y:

\[
L = -z_y + \log \sum_{c} e^{z_c}
\]

For the logit of the true class:

\[
\frac{\partial L}{\partial z_y} = -1 + \frac{e^{z_y}}{\sum_{c} e^{z_c}} = \hat{P}(y) - 1
\]

For any other logit \(z_c\) where c ≠ y:

\[
\frac{\partial L}{\partial z_c} = \frac{e^{z_c}}{\sum_{c'} e^{z_{c'}}} = \hat{P}(c)
\]
The Beautiful Result¶
For all classes c:

\[
\frac{\partial L}{\partial z_c} = \hat{P}(c) - \delta_{cy}
\]

Where \(\delta_{cy}\) is 1 if c = y (the true class), else 0.

In vector form:

\[
\nabla_z L = \hat{p} - p
\]

The gradient is simply: predicted probability minus true probability!
Why This Is Beautiful¶
- The magnitude of the gradient reflects how wrong the prediction is
- If \(\hat{P}(y) = 1\): the true-class gradient \(\hat{P}(y) - 1\) is 0 (perfect prediction)
- If \(\hat{P}(y) = 0\): the true-class gradient is -1 (maximally wrong)
- Updates are automatically scaled by the model's confidence
This natural weighting makes gradient descent effective.
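To see this numerically, here is a small sketch (made-up logits, plain Python rather than our autograd) that forms \(\hat{p} - p\) directly:

import math

def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits over a 3-character vocabulary; true class y = 0
z, y = [1.0, -1.0, 0.5], 0
p_hat = softmax(z)
grad = [p - (1.0 if c == y else 0.0) for c, p in enumerate(p_hat)]
print([round(g, 3) for g in grad])  # ≈ [-0.426, 0.078, 0.348]; components sum to 0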
Deriving the Softmax-CrossEntropy Gradient¶
Let's prove the result rigorously.
Setup¶
Given logits \(z \in \mathbb{R}^{|V|}\) and true class y:

\[
\hat{P}(c) = \frac{e^{z_c}}{Z}
\]

Where \(Z = \sum_c e^{z_c}\) (the partition function).

Loss:

\[
L = -\log \hat{P}(y) = -z_y + \log Z
\]
Derivative of Partition Function¶

\[
\frac{\partial Z}{\partial z_c} = e^{z_c}
\quad\Rightarrow\quad
\frac{\partial \log Z}{\partial z_c} = \frac{e^{z_c}}{Z} = \hat{P}(c)
\]

Derivative of Loss w.r.t. z_y¶

\[
\frac{\partial L}{\partial z_y} = -1 + \frac{\partial \log Z}{\partial z_y} = \hat{P}(y) - 1
\]

Derivative w.r.t. Other Logits¶

For c ≠ y:

\[
\frac{\partial L}{\partial z_c} = \frac{\partial \log Z}{\partial z_c} = \hat{P}(c)
\]

Combined Result¶

\[
\frac{\partial L}{\partial z_c} = \hat{P}(c) - \delta_{cy}
\]

Which is exactly:

\[
\nabla_z L = \hat{p} - p
\]

QED.
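As an extra check on the algebra, this sketch (made-up logits, hypothetical helper functions) compares the analytic gradient \(\hat{p} - p\) with centered finite differences of the loss:

import math

def loss(z, y):
    """Softmax cross-entropy for logits z and true class y: -z_y + log(sum exp)."""
    return -z[y] + math.log(sum(math.exp(v) for v in z))

def softmax(z):
    total = sum(math.exp(v) for v in z)
    return [math.exp(v) / total for v in z]

z, y, eps = [1.0, -1.0, 0.5], 0, 1e-5
analytic = [p - (1.0 if c == y else 0.0) for c, p in enumerate(softmax(z))]

numeric = []
for c in range(len(z)):
    z_plus = list(z)
    z_plus[c] += eps
    z_minus = list(z)
    z_minus[c] -= eps
    numeric.append((loss(z_plus, y) - loss(z_minus, y)) / (2 * eps))

print([round(g, 6) for g in analytic])
print([round(g, 6) for g in numeric])  # agrees with the analytic gradient to ~6 decimals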
Connection to Perplexity (from Stage 1)¶
Recall Perplexity¶
In Stage 1, we defined perplexity as:

\[
\text{PPL} = P(\text{data})^{-1/N} = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log \hat{P}(y_i \mid x_i) \right)
\]

The Relationship¶

The exponent is exactly the average cross-entropy loss!

\[
\text{PPL} = \exp(L)
\]

Where L is the cross-entropy loss.
Why This Matters¶
Training: minimize L (cross-entropy)
Evaluation: report exp(L) (perplexity)
Same underlying metric, different presentations:
- L ∈ [0, ∞): additive, for optimization
- PPL ∈ [1, ∞): interpretable, for humans
Interpreting the Loss¶
If L = 2.3:
- PPL = exp(2.3) ≈ 10
- Model is "as uncertain as picking uniformly among 10 options"
- Lower is better
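In code, the loss-to-perplexity conversion is a one-liner; a minimal sketch:

import math

def perplexity(avg_cross_entropy):
    """Perplexity is the exponential of the average cross-entropy (in nats)."""
    return math.exp(avg_cross_entropy)

print(perplexity(2.3))  # ≈ 9.97, i.e. roughly 10 equally likely options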
Implementing Cross-Entropy Loss¶
Using our Stage 2 autograd:
def cross_entropy_loss(logits, target_idx):
    """
    logits: list of Value objects (unnormalized scores)
    target_idx: int (index of true class)
    Returns: Value (scalar loss)
    """
    # Log-sum-exp for numerical stability
    max_logit = max(v.data for v in logits)
    # Subtract max for stability (doesn't change result)
    shifted = [v - max_logit for v in logits]
    # exp of shifted logits
    exp_logits = [v.exp() for v in shifted]
    # Sum
    sum_exp = sum(exp_logits, Value(0.0))
    # Log-sum-exp
    lse = sum_exp.log() + max_logit
    # Loss = -logit[target] + lse
    loss = lse - logits[target_idx]
    return loss
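A usage sketch, assuming (as in our Stage 2 autograd) that Value wraps a float, supports the arithmetic used above, and provides backward() to populate each node's .grad:

# Hypothetical logits over a 3-character vocabulary
logits = [Value(1.0), Value(-1.0), Value(0.5)]

loss = cross_entropy_loss(logits, target_idx=0)
print(loss.data)                  # ≈ 0.555

loss.backward()                   # assuming Stage 2's Value provides backward()
print([v.grad for v in logits])   # ≈ [-0.426, 0.078, 0.348], i.e. p_hat minus one-hot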
Why LogSumExp?¶
Direct computation of \(\log\left(\sum_i e^{z_i}\right)\) can overflow or underflow: for example, \(e^{1000}\) already exceeds the largest representable float64.

Using the LSE trick:

\[
\log \sum_{i} e^{z_i} = m + \log \sum_{i} e^{z_i - m}, \qquad m = \max_i z_i
\]

After subtracting the max, all exponents are ≤ 0, preventing overflow.
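A quick demonstration with deliberately large (contrived) logits, using plain floats:

import math

def lse_stable(z):
    """log(sum(exp(z_i))) computed via the max-subtraction trick."""
    m = max(z)
    return m + math.log(sum(math.exp(v - m) for v in z))

z = [1000.0, 999.0, 998.0]

print(lse_stable(z))   # ≈ 1000.41, computed safely

try:
    print(math.log(sum(math.exp(v) for v in z)))  # naive version
except OverflowError as e:
    print("naive computation overflows:", e)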
Multiple Examples: Batch Loss¶
For a batch of N examples:
def batch_cross_entropy(batch_logits, batch_targets):
    """
    batch_logits: list of lists of Values
    batch_targets: list of target indices
    Returns: Value (average loss)
    """
    losses = [cross_entropy_loss(logits, target)
              for logits, target in zip(batch_logits, batch_targets)]
    # Average
    total = sum(losses, Value(0.0))
    return total / len(losses)
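And a usage sketch on the same tiny made-up batch as before (again assuming the Stage 2 Value class is in scope):

# Two hypothetical examples over a 3-character vocabulary
batch_logits = [[Value(1.0), Value(-1.0), Value(0.5)],
                [Value(0.2), Value(0.1), Value(-0.3)]]
batch_targets = [0, 2]

avg_loss = batch_cross_entropy(batch_logits, batch_targets)
print(avg_loss.data)  # ≈ 0.99, matching the hand computation earlier in the section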
Summary¶
| Concept | Formula | Interpretation |
|---|---|---|
| MLE objective | \(\max_\theta \log P(D \mid \theta)\) | Most likely parameters |
| NLL loss | \(-\log P(D \mid \theta)\) | Minimize to maximize likelihood |
| Cross-entropy | \(-\sum_c p(c) \log q(c)\) | Expected bits using the wrong code |
| Per-example loss | \(-\log \hat{P}(y \mid x)\) | Surprise at the true outcome |
| Softmax gradient | \(\hat{p} - p\) | Predicted minus true |
| Perplexity | \(\exp(L)\) | Interpretable uncertainty |
Key insights:
- Cross-entropy = MLE: Same theoretical foundation as Stage 1
- Proper scoring rule: Best achievable when q = p
- Beautiful gradient: \(\hat{p} - p\) is simple and effective
- Connection to perplexity: exp(loss) gives interpretable metric
Exercises¶
- Verify equivalence: Show that for a one-hot true distribution, H(p, q) = -log q(y).
- Compute loss: Given logits [2.0, 1.0, 3.0] and true class 2, compute the cross-entropy loss by hand.
- Gradient check: For the same logits and true class, compute ∂L/∂z for each logit. Verify the formula.
- Perplexity: If the cross-entropy loss is 1.5, what is the perplexity?
- Softmax temperature: What happens to the loss if we compute softmax(z/T) for T → 0? For T → ∞?
What's Next¶
We have:
- Embeddings (Section 3.2)
- Feed-forward networks (Section 3.3)
- Cross-entropy loss (Section 3.4)
Time to put it all together!
In Section 3.5, we'll implement a complete character-level neural language model using our Stage 2 autograd system.