Section 3.6: Training Dynamics¶
Having a model and a loss function isn't enough. Making neural networks actually learn requires understanding training dynamics—the interplay of learning rates, initialization, and optimization.
This section covers the practical art of training neural language models.
The Optimization Landscape¶
What Are We Optimizing?¶
The loss function L(θ) defines a surface over parameter space. For our language model it is the average cross-entropy (negative log-likelihood) over the N training examples:
\[ L(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \log p_\theta(x_i \mid \text{context}_i) \]
where θ includes all embeddings, weights, and biases.
Visualizing the Landscape¶
For a model with only 2 parameters, we could plot L as a 3D surface: the two parameters on the horizontal axes and the loss as the height.
Real networks have millions of parameters—the surface is in million-dimensional space!
Key Properties¶
Non-convex: Multiple local minima. No guarantee of finding the global minimum.
High-dimensional: In high dimensions, most critical points are saddle points (points where the gradient is zero but the surface curves up in some directions and down in others—neither a minimum nor maximum), not local minima. This is actually good—harder to get stuck.
Flat regions: Some directions have near-zero gradient. Training can plateau.
Gradient Descent: The Basic Algorithm¶
The Update Rule¶
Given current parameters θ and learning rate η, gradient descent updates
\[ \theta \leftarrow \theta - \eta \, \nabla_\theta L(\theta) \]
We move in the direction of steepest descent (the negative gradient).
Why It Works¶
A first-order Taylor expansion around the current point gives
\[ L(\theta + \Delta\theta) \approx L(\theta) + \nabla L \cdot \Delta\theta \]
To decrease L, we want \(\nabla L \cdot \Delta\theta < 0\). Choosing \(\Delta\theta = -\eta \nabla L\):
\[ L(\theta - \eta \nabla L) \approx L(\theta) - \eta \|\nabla L\|^2 \leq L(\theta) \]
The loss decreases (for small enough η).
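To make this concrete, here is a tiny sketch (a toy, not part of our model) that applies the update rule to f(x) = (x - 3)², whose gradient is 2(x - 3):

def f(x):
    return (x - 3) ** 2          # toy loss with minimum at x = 3

def grad_f(x):
    return 2 * (x - 3)           # its gradient

x, eta = 0.0, 0.1                # initial parameter and learning rate
for step in range(25):
    x = x - eta * grad_f(x)      # theta <- theta - eta * grad L(theta)

print(x, f(x))                   # x is close to 3, loss close to 0

If you instead set eta = 1.5, each step overshoots the minimum and the iterates diverge, which is exactly the "too large" failure mode described under The Learning Rate below.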
Stochastic Gradient Descent (SGD)¶
Computing the gradient over all N examples is expensive. Instead:
- Sample a single example (or mini-batch)
- Compute gradient on that sample
- Update parameters
- Repeat
The gradient estimate is noisy but unbiased: if example i is drawn uniformly at random,
\[ \mathbb{E}_i\left[\nabla L_i(\theta)\right] = \nabla L(\theta) \]
This noise can actually help escape local minima!
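As a sketch of one SGD step, assuming the model interface used throughout this chapter (`parameters()`, `loss(context, target)`, and `Value` objects with `backward()`):

import random

def sgd_step(model, examples, learning_rate):
    """One stochastic update on a single randomly sampled example."""
    context, target = random.choice(examples)   # sample one example
    loss = model.loss(context, target)          # forward pass
    for p in model.parameters():                # zero old gradients
        p.grad = 0.0
    loss.backward()                             # backward pass
    for p in model.parameters():                # noisy gradient step
        p.data -= learning_rate * p.grad
    return loss.data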
The Learning Rate¶
The learning rate η is the most important hyperparameter.
Too Small¶
- Very slow progress
- May never reach a good solution
- Training takes forever
Too Large¶
- Overshoots optimal values
- Loss oscillates or diverges
- Training becomes unstable
Just Right¶
- Steady decrease
- Converges to good minimum
- Can explore and then settle
Finding the Right Learning Rate¶
Rule of thumb: Start with 0.01, adjust by factors of 10.
Learning rate finder: gradually increase the LR over a short run while recording the loss, then pick a value just below the point where the loss starts to climb.
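A minimal sketch of such a finder, reusing the hypothetical sgd_step helper from above (the range and number of steps are arbitrary choices):

def lr_range_test(model, examples, lr_start=1e-5, lr_end=1.0, steps=100):
    """Sweep the learning rate geometrically and record the loss at each step."""
    factor = (lr_end / lr_start) ** (1.0 / steps)
    lr, history = lr_start, []
    for _ in range(steps):
        loss = sgd_step(model, examples, lr)
        history.append((lr, loss))
        lr *= factor
    return history   # plot loss vs. lr; pick a value just below where loss climbs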
Common values:
- 0.1: Often too high for deep nets
- 0.01: Good starting point
- 0.001: Common for fine-tuning
- 0.0001: Very conservative
Our Character LM¶
For our model, try:
- Start: η = 0.1 (aggressive)
- If unstable: reduce to 0.01
- If too slow: increase to 0.5
Learning Rate Schedules¶
A fixed learning rate isn't optimal. Better: change η during training.
Step Decay¶
Reduce the learning rate by a constant factor γ every K epochs:
\[ \eta_t = \eta_0 \cdot \gamma^{\lfloor t / K \rfloor} \]
The notation ⌊x⌋ (floor function) means "round down to the nearest integer." For example, ⌊7/3⌋ = ⌊2.33⌋ = 2.
Example: Start at 0.1, multiply by 0.5 every 10 epochs.
def step_decay(initial_lr, epoch, decay_rate=0.5, decay_every=10):
return initial_lr * (decay_rate ** (epoch // decay_every))
Linear Decay¶
Decrease the learning rate linearly to zero over the T total epochs:
\[ \eta_t = \eta_0 \left(1 - \frac{t}{T}\right) \]
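A sketch in the same style as step_decay:

def linear_decay(initial_lr, epoch, total_epochs):
    """Linearly anneal the learning rate from initial_lr down to zero."""
    return initial_lr * (1 - epoch / total_epochs)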
Cosine Annealing¶
Smoothly decrease the learning rate along a cosine curve:
\[ \eta_t = \frac{\eta_0}{2} \left(1 + \cos\frac{\pi t}{T}\right) \]
Popular in modern training.
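A sketch of the schedule:

import math

def cosine_annealing(initial_lr, epoch, total_epochs):
    """Cosine-shaped decay from initial_lr down to zero."""
    return initial_lr * 0.5 * (1 + math.cos(math.pi * epoch / total_epochs))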
Warmup¶
Start with a tiny learning rate and increase it linearly over the first \(T_{\text{warmup}}\) steps:
\[ \eta_t = \eta_{\text{target}} \cdot \frac{t}{T_{\text{warmup}}}, \quad t \leq T_{\text{warmup}} \]
Then decay as usual. Warmup helps stabilize early training, when gradients are large.
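A sketch combining linear warmup with the cosine decay above (step counts are per update rather than per epoch; the exact split into phases is a common but arbitrary choice):

import math

def warmup_then_cosine(initial_lr, step, warmup_steps, total_steps):
    """Linear warmup for warmup_steps updates, then cosine decay to zero."""
    if step < warmup_steps:
        return initial_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return initial_lr * 0.5 * (1 + math.cos(math.pi * progress))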
Initialization¶
How we initialize parameters affects training dramatically.
The Problem with Zeros¶
If all weights are zero:
- All neurons compute the same thing
- All gradients are the same
- Symmetry is never broken
Never initialize weights to zero! (Biases to zero are okay.)
Random Initialization¶
Simple approach: draw small random values,
\[ W_{ij} \sim \mathcal{N}(0, \sigma^2) \]
But what should σ be?
The Variance Problem¶
Consider a single layer: y = Wx where x ∈ ℝⁿ.
If each \(x_i\) has variance \(\text{Var}(x)\) and each \(W_{ij}\) has variance σ² (all independent with zero mean), then
\[ \text{Var}(y_i) = \text{Var}\!\left( \sum_{j=1}^{n} W_{ij} x_j \right) = n \, \sigma^2 \, \text{Var}(x) \]
With σ fixed, the variance grows by a factor of n per layer.
For deep networks, this compounds: variance explodes or vanishes.
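A quick numerical sketch of this effect (plain Python, no autograd; the layer width and depth are arbitrary): push a random vector through a stack of linear layers and watch the scale of the activations.

import random

def activation_std(sigma, n=100, depth=10):
    """Std of activations after `depth` linear layers with N(0, sigma^2) weights."""
    x = [random.gauss(0, 1) for _ in range(n)]
    for _ in range(depth):
        W = [[random.gauss(0, sigma) for _ in range(n)] for _ in range(n)]
        x = [sum(w * xj for w, xj in zip(row, x)) for row in W]
    mean = sum(x) / n
    return (sum((v - mean) ** 2 for v in x) / n) ** 0.5

print(activation_std(sigma=1.0))                # explodes: std on the order of 10^10
print(activation_std(sigma=(1 / 100) ** 0.5))   # scaled by 1/sqrt(n): stays near 1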
Xavier/Glorot Initialization¶
Solution: scale the weight variance by both the input and output dimensions:
\[ W_{ij} \sim \mathcal{N}\!\left(0, \; \frac{2}{n_{\text{in}} + n_{\text{out}}}\right) \]
Or, for a uniform distribution:
\[ W_{ij} \sim \mathcal{U}\!\left(-\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}, \; +\sqrt{\frac{6}{n_{\text{in}} + n_{\text{out}}}}\right) \]
This balances variance preservation in both forward and backward passes.
Simplified view (forward pass only): if we just want the forward variance preserved, set
\[ \sigma^2 = \frac{1}{n_{\text{in}}} \]
(The full Xavier formula approximately matches this when \(n_{\text{in}} \approx n_{\text{out}}\).)
He Initialization¶
For ReLU activations, Xavier underestimates because ReLU zeros out approximately half the neurons (those with negative input), effectively halving the variance of activations.
He initialization compensates for this by doubling the variance:
\[ W_{ij} \sim \mathcal{N}\!\left(0, \; \frac{2}{n_{\text{in}}}\right) \]
The factor of 2 (compared to 1/n_in) compensates for ReLU zeroing half the activations, maintaining proper variance flow through the network.
Our Implementation¶
import random

def init_weights(shape, activation='relu'):
    """Initialize a weight matrix with appropriate scaling.

    Value is the scalar autograd class used throughout this chapter.
    """
    n_in, n_out = shape
    if activation == 'relu':
        scale = (2.0 / n_in) ** 0.5              # He
    else:
        scale = (2.0 / (n_in + n_out)) ** 0.5    # Xavier
    return [[Value(random.gauss(0, scale))
             for _ in range(n_in)]
            for _ in range(n_out)]
Batching¶
Processing one example at a time is inefficient and noisy.
Key terminology:
- Mini-batch: A subset of training examples processed together before updating parameters
- Epoch: One complete pass through all training examples
- Iteration/Step: One parameter update (processing one mini-batch)
If you have 1000 training examples and batch size 100, then 1 epoch = 10 iterations.
Mini-Batch Gradient Descent¶
Process B examples together:
Advantages:
- More stable gradients (averaging reduces variance)
- Computational efficiency (parallelism)
- Better generalization (noise helps)
Common batch sizes: 32, 64, 128, 256
Trade-offs¶
| Batch Size | Gradient Variance | Computation | Generalization |
|---|---|---|---|
| 1 | Very high | Slow | Good |
| 32-128 | Medium | Fast | Good |
| 1000+ | Low | Very fast | Often worse |
Implementation¶
def create_batches(examples, batch_size):
"""Split examples into mini-batches."""
random.shuffle(examples)
batches = []
for i in range(0, len(examples), batch_size):
batches.append(examples[i:i + batch_size])
return batches
def train_batch(model, batch, learning_rate):
"""Train on a single mini-batch."""
params = model.parameters()
# Forward pass and accumulate loss
total_loss = Value(0.0)
for context, target in batch:
loss = model.loss(context, target)
total_loss = total_loss + loss
avg_loss = total_loss / len(batch)
# Zero gradients
for p in params:
p.grad = 0.0
# Backward pass
avg_loss.backward()
# Update
for p in params:
p.data -= learning_rate * p.grad
return avg_loss.data
Overfitting and Regularization¶
The Overfitting Problem¶
With enough parameters, the model can memorize training data perfectly—but fail on new data.
Signs of overfitting:
- Training loss keeps decreasing
- Validation loss starts increasing
- Large gap between train and validation loss
Loss
│\
│ \   training
│  \___________________________
│
│ ·.                 validation
│   ·.           ....····
│     ··.......··
└────────────────────────────── Epochs
Train/Validation Split¶
Always evaluate on held-out data:
def split_data(examples, val_fraction=0.1):
"""Split examples into train and validation."""
n_val = int(len(examples) * val_fraction)
random.shuffle(examples)
return examples[n_val:], examples[:n_val]
Early Stopping¶
Stop training when validation loss stops improving:
def train_with_early_stopping(model, train_examples, val_examples,
patience=5, max_epochs=100):
best_val_loss = float('inf')
epochs_without_improvement = 0
for epoch in range(max_epochs):
# Train one epoch
train_loss = train_epoch(model, train_examples)
# Evaluate
val_loss = evaluate(model, val_examples)
print(f"Epoch {epoch+1}: train={train_loss:.4f}, val={val_loss:.4f}")
# Check for improvement
if val_loss < best_val_loss:
best_val_loss = val_loss
epochs_without_improvement = 0
# Save best model weights here
else:
epochs_without_improvement += 1
# Early stop
if epochs_without_improvement >= patience:
print(f"Early stopping at epoch {epoch+1}")
break
Weight Decay (L2 Regularization)¶
Add a penalty for large weights to the loss:
\[ L_{\text{total}}(\theta) = L(\theta) + \frac{\lambda}{2} \sum_i \theta_i^2 \]
This encourages smaller weights, reducing overfitting.
def apply_weight_decay(params, learning_rate, weight_decay):
"""Apply L2 regularization."""
for p in params:
p.data -= learning_rate * weight_decay * p.data
In practice, fold the decay term into the gradient update:
\[ \theta \leftarrow \theta - \eta \left( \nabla L(\theta) + \lambda \theta \right) \]
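A sketch of the combined update (for plain SGD this is equivalent to adding the L2 penalty to the loss):

def sgd_update_with_decay(params, learning_rate, weight_decay):
    """Gradient step with the L2 penalty's gradient (weight_decay * theta) folded in."""
    for p in params:
        p.data -= learning_rate * (p.grad + weight_decay * p.data)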
Dropout¶
During training, randomly zero some activations:
\[ \tilde{h}_i = m_i \cdot h_i, \quad m_i \sim \text{Bernoulli}(1 - p) \]
where p is the dropout probability. At test time, either keep all activations and scale them by (1 - p), or use "inverted dropout" as in the code below: scale by 1/(1 - p) during training so no scaling is needed at test time.
def dropout(x, p=0.5, training=True):
"""Apply dropout to list of Values."""
if not training:
return x
mask = [1 if random.random() > p else 0 for _ in x]
scale = 1.0 / (1.0 - p) # Scale to maintain expected value
return [v * m * scale for v, m in zip(x, mask)]
Monitoring Training¶
What to Track¶
- Training loss: Should decrease
- Validation loss: Should decrease, watch for divergence from training
- Perplexity: exp(loss), more interpretable
- Gradient norms: Should be stable, not exploding/vanishing
- Parameter norms: Shouldn't grow unboundedly
Implementation¶
def compute_gradient_norm(params):
"""Compute L2 norm of all gradients."""
total = sum(p.grad ** 2 for p in params)
return total ** 0.5
def compute_param_norm(params):
"""Compute L2 norm of all parameters."""
total = sum(p.data ** 2 for p in params)
return total ** 0.5
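One way to use these helpers, sketched as a hypothetical log_training_stats called every few updates:

import math

def log_training_stats(step, loss, params, every=100):
    """Periodically print loss, perplexity, and gradient/parameter norms."""
    if step % every == 0:
        print(f"step {step}: loss={loss:.4f} ppl={math.exp(loss):.2f} "
              f"grad_norm={compute_gradient_norm(params):.3f} "
              f"param_norm={compute_param_norm(params):.3f}")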
What to Watch For¶
| Symptom | Likely Cause | Solution |
|---|---|---|
| Loss stays flat | LR too small, or stuck | Increase LR, reinitialize |
| Loss explodes | LR too large | Reduce LR, gradient clipping |
| Val > Train | Overfitting | Regularization, early stopping |
| Loss oscillates | LR too large | Reduce LR |
| Gradients → 0 | Vanishing gradients | Better init, skip connections |
| Gradients → ∞ | Exploding gradients | Gradient clipping, smaller LR |
Gradient Clipping¶
Prevent exploding gradients by capping the gradient norm:
def clip_gradients(params, max_norm):
"""Clip gradients to maximum norm."""
total_norm = compute_gradient_norm(params)
if total_norm > max_norm:
scale = max_norm / total_norm
for p in params:
p.grad *= scale
This is especially important for language models, where certain inputs can cause large gradients.
A Complete Training Function¶
Putting it all together:
def train_model(model, train_data, val_data, config):
"""
Complete training loop with all best practices.
config: dict with hyperparameters
- epochs: max training epochs
- batch_size: mini-batch size
- learning_rate: initial learning rate
- weight_decay: L2 regularization strength
- max_grad_norm: gradient clipping threshold
- patience: early stopping patience
"""
params = model.parameters()
best_val_loss = float('inf')
patience_counter = 0
for epoch in range(config['epochs']):
# Learning rate schedule (linear decay)
lr = config['learning_rate'] * (1 - epoch / config['epochs'])
# Training
model.train_mode = True
batches = create_batches(train_data, config['batch_size'])
train_loss = 0.0
        for batch in batches:
            # Forward pass: average loss over the mini-batch
            total = Value(0.0)
            for context, target in batch:
                total = total + model.loss(context, target)
            batch_loss = total / len(batch)
            # Zero gradients, then backward pass
            for p in params:
                p.grad = 0.0
            batch_loss.backward()
            # Clip gradients BEFORE the parameter update, so clipping takes effect
            clip_gradients(params, config['max_grad_norm'])
            # Update with weight decay folded into the gradient
            for p in params:
                p.data -= lr * (p.grad + config['weight_decay'] * p.data)
            train_loss += batch_loss.data
train_loss /= len(batches)
# Validation
model.train_mode = False
val_loss = evaluate(model, val_data)
# Logging
train_ppl = math.exp(train_loss)
val_ppl = math.exp(val_loss)
print(f"Epoch {epoch+1}: "
f"train_loss={train_loss:.4f} (PPL={train_ppl:.2f}), "
f"val_loss={val_loss:.4f} (PPL={val_ppl:.2f})")
# Early stopping check
if val_loss < best_val_loss:
best_val_loss = val_loss
patience_counter = 0
# Save best model
else:
patience_counter += 1
if patience_counter >= config['patience']:
print(f"Early stopping at epoch {epoch+1}")
break
return model
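A hypothetical call, assuming `model` and `all_examples` were built in the previous sections; the config values are starting points in the ranges suggested by the summary table below, not tuned results:

config = {
    'epochs': 50,           # max training epochs
    'batch_size': 64,       # mini-batch size
    'learning_rate': 0.1,   # initial learning rate
    'weight_decay': 1e-4,   # L2 regularization strength
    'max_grad_norm': 5.0,   # gradient clipping threshold
    'patience': 5,          # early stopping patience
}

train_data, val_data = split_data(all_examples, val_fraction=0.1)
model = train_model(model, train_data, val_data, config)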
Summary¶
| Concept | Description | Practical Tip |
|---|---|---|
| Learning rate | Step size for updates | Start at 0.01, adjust |
| LR schedule | Change LR over time | Decay helps convergence |
| Initialization | Starting parameter values | Use He for ReLU |
| Batch size | Examples per update | 32-128 typical |
| Weight decay | L2 regularization | 1e-4 to 1e-2 |
| Gradient clipping | Prevent explosion | Max norm 1-5 |
| Early stopping | Prevent overfitting | Patience 5-10 |
Key insight: Training neural networks is empirical. Start with defaults, monitor carefully, adjust based on what you observe. There's no substitute for running experiments.
Exercises¶
- Learning rate experiment: Train the model with learning rates 0.001, 0.01, 0.1, and 1.0. Plot the training curves. What do you observe?
- Initialization comparison: Compare training with Xavier init vs. random N(0, 1). How long until each converges?
- Batch size trade-off: Train with batch sizes 1, 16, 64, and 256. Compare wall-clock time to reach the same loss.
- Early stopping: Implement early stopping and compare final validation loss with and without it.
- Gradient analysis: Add logging for gradient norms. At what point in training are gradients largest?
What's Next¶
We can train our model. But how good is it really?
In Section 3.7, we'll evaluate our neural language model and compare it directly to the Markov models from Stage 1. We'll see concrete evidence of the neural advantage.