Section 4.5: Adaptive Learning Rates — Per-Parameter Tuning¶
Reading time: 22 minutes | Difficulty: ★★★★☆
Different parameters may need different learning rates. Adaptive methods automatically adjust the learning rate for each parameter based on the history of gradients it has seen.
The Motivation¶
Consider a neural network with:
- Embedding weights: Sparse updates (most gradients are zero)
- Output layer: Dense updates (every example affects it)
Using the same learning rate for both is suboptimal:
- Embeddings need larger steps (rare updates)
- Output layer needs smaller steps (frequent updates)
Idea: Scale the learning rate inversely with how much each parameter has been updated.
AdaGrad: The Foundation¶
Duchi, Hazan, Singer (2011)
AdaGrad tracks the sum of squared gradients for each parameter and divides the step size by its square root:
\(G_t = G_{t-1} + g_t^2\)
\(\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon}\, g_t\)
Where:
- \(g_t = \nabla L(\theta_t)\) is the gradient
- \(G_t\) accumulates squared gradients (element-wise)
- \(\eta\) is the global learning rate
- \(\epsilon \approx 10^{-8}\) prevents division by zero
Why AdaGrad Works¶
- Frequently updated parameters: Large \(G_t\) → small effective learning rate
- Rarely updated parameters: Small \(G_t\) → large effective learning rate
This is perfect for sparse problems like NLP where some words appear rarely!
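A back-of-the-envelope sketch (the step counts and gradient magnitudes below are invented for illustration): after 1,000 steps, a parameter updated on every step with gradients around 1.0 ends up with a far smaller effective learning rate than one updated only 10 times.
    import numpy as np

    lr, eps = 0.1, 1e-8
    G_frequent = 1000 * 1.0 ** 2   # hit on every one of 1000 steps
    G_rare     =   10 * 1.0 ** 2   # hit on only 10 steps

    print(lr / (np.sqrt(G_frequent) + eps))  # ~0.0032: frequent parameter, small steps
    print(lr / (np.sqrt(G_rare) + eps))      # ~0.0316: rare parameter, ~10x larger steps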
The Problem with AdaGrad¶
\(G_t\) only grows. Eventually the denominator becomes so large that learning stops:
\(\frac{\eta}{\sqrt{G_t} + \epsilon} \to 0 \quad \text{as} \quad G_t \to \infty\)
This is called learning rate decay to zero: fine for convex problems, but bad for neural networks that need to keep learning.
import numpy as np

def adagrad(loss_fn, grad_fn, theta_init, lr, num_steps, eps=1e-8):
    """AdaGrad optimizer."""
    theta = theta_init.copy()
    G = np.zeros_like(theta)  # Accumulated squared gradients
    for t in range(num_steps):
        grad = grad_fn(theta)
        # Accumulate squared gradients
        G += grad ** 2
        # Update with adaptive learning rate
        theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta
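A minimal usage sketch on a toy quadratic (the loss, gradient, and starting point are invented for illustration; `adagrad` is the function defined above):
    import numpy as np

    # Toy problem: L(theta) = 0.5 * ||theta||^2, so the gradient is just theta
    loss_fn = lambda theta: 0.5 * np.sum(theta ** 2)
    grad_fn = lambda theta: theta

    theta0 = np.array([5.0, -3.0])
    theta_final = adagrad(loss_fn, grad_fn, theta0, lr=1.0, num_steps=500)
    print(theta_final)  # approaches the minimum at [0, 0]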
RMSprop: Exponential Decay¶
Hinton (2012) — From Coursera lecture, never formally published!
RMSprop fixes AdaGrad's decay problem by replacing the running sum with an exponential moving average:
\(v_t = \beta v_{t-1} + (1-\beta) g_t^2\)
\(\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t\)
The key change: the \((1-\beta)\) weight on the new gradient keeps \(v_t\) bounded instead of growing without limit.
Typical value: β = 0.9, which averages over roughly the last 10 gradients.
def rmsprop(loss_fn, grad_fn, theta_init, lr, beta=0.9, num_steps=1000, eps=1e-8):
    """RMSprop optimizer."""
    theta = theta_init.copy()
    v = np.zeros_like(theta)  # Running average of squared gradients
    for t in range(num_steps):
        grad = grad_fn(theta)
        # Update running average of squared gradients
        v = beta * v + (1 - beta) * grad ** 2
        # Update parameters
        theta = theta - lr * grad / (np.sqrt(v) + eps)
    return theta
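To see the difference in the accumulators themselves, here is a tiny sketch with a constant gradient of 1 (an artificial setting, purely for illustration): AdaGrad's sum grows forever while RMSprop's moving average levels off.
    beta, g = 0.9, 1.0
    G, v = 0.0, 0.0
    for t in range(1000):
        G += g ** 2                        # AdaGrad: unbounded sum
        v = beta * v + (1 - beta) * g ** 2 # RMSprop: bounded moving average

    print(G)  # 1000.0 -> effective learning rate keeps shrinking
    print(v)  # ~1.0   -> effective learning rate stabilizes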
Adam: Combining Momentum + RMSprop¶
Kingma & Ba (2014)
Adam combines the best of both worlds:
- Momentum (first moment): Smooths gradient direction
- RMSprop (second moment): Adapts learning rate per parameter
The Adam Algorithm¶
First moment (momentum): \(m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t\)
Second moment (RMSprop): \(v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2\)
Bias correction (critical!): \(\hat{m}_t = \frac{m_t}{1-\beta_1^t}\), \(\hat{v}_t = \frac{v_t}{1-\beta_2^t}\)
Update: \(\theta_{t+1} = \theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
Default Hyperparameters¶
The original paper recommends:
- η = 0.001 (learning rate)
- β₁ = 0.9 (momentum decay)
- β₂ = 0.999 (RMSprop decay)
- ε = 10⁻⁸ (numerical stability)
These work well for most problems!
Why Bias Correction?¶
At t = 1:
- \(m_1 = (1-\beta_1) g_1\) ← biased toward 0
- \(v_1 = (1-\beta_2) g_1^2\) ← biased toward 0
The correction \(\frac{1}{1-\beta^t}\) accounts for the zero initialization:
| Step | 1-β₁ᵗ (β₁=0.9) | Correction factor |
|---|---|---|
| 1 | 0.1 | 10× |
| 5 | 0.41 | 2.4× |
| 10 | 0.65 | 1.5× |
| 100 | ~1.0 | ~1× |
Early steps get boosted; later steps are unaffected.
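The correction factors in the table are easy to reproduce; this short sketch just evaluates \(1/(1-\beta_1^t)\) at a few steps:
    beta1 = 0.9
    for t in (1, 5, 10, 100):
        denom = 1 - beta1 ** t
        print(f"t={t:3d}  1-beta1^t={denom:.2f}  correction={1/denom:.1f}x")
    # t=1 -> 10.0x, t=5 -> 2.4x, t=10 -> 1.5x, t=100 -> 1.0x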
Implementation¶
class Adam:
    """Adam optimizer from scratch."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.t = 0
        self.m = None  # First moment
        self.v = None  # Second moment

    def step(self, theta, grad):
        """Perform one optimization step."""
        # Initialize moments on first call
        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)
        self.t += 1
        # Update biased moments
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        # Bias correction
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        # Update parameters
        theta = theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return theta
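A minimal usage sketch of the class on a toy quadratic (the loss, gradient, and starting point are invented for illustration):
    import numpy as np

    grad_fn = lambda th: th            # gradient of L(theta) = 0.5 * ||theta||^2
    theta = np.array([5.0, -3.0])
    opt = Adam(lr=0.1)

    for _ in range(1000):
        theta = opt.step(theta, grad_fn(theta))

    print(0.5 * np.sum(theta ** 2))    # far below the initial loss of 17.0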
AdamW: Weight Decay Done Right¶
Loshchilov & Hutter (2017)
L2 regularization is typically implemented as: \(L_{reg}(\theta) = L(\theta) + \frac{\lambda}{2}\|\theta\|^2\)
This adds \(\lambda\theta\) to the gradient. But in Adam that term flows through the adaptive update, so it gets divided by \(\sqrt{\hat{v}_t}\): parameters with a large gradient history are regularized less than intended.
AdamW fixes this by applying weight decay directly to the parameters:
\(\theta_{t+1} = (1 - \eta\lambda)\,\theta_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}\)
The weight decay factor \((1 - \eta\lambda)\) multiplies the parameters before the adaptive update, so the decay never passes through the \(\sqrt{\hat{v}_t}\) scaling.
class AdamW:
    """AdamW: Adam with decoupled weight decay."""

    def __init__(self, lr=1e-3, betas=(0.9, 0.999), eps=1e-8, weight_decay=0.01):
        self.lr = lr
        self.beta1, self.beta2 = betas
        self.eps = eps
        self.weight_decay = weight_decay
        self.t = 0
        self.m = None
        self.v = None

    def step(self, theta, grad):
        if self.m is None:
            self.m = np.zeros_like(theta)
            self.v = np.zeros_like(theta)
        self.t += 1
        # Weight decay (decoupled from gradient)
        theta = theta * (1 - self.lr * self.weight_decay)
        # Standard Adam update
        self.m = self.beta1 * self.m + (1 - self.beta1) * grad
        self.v = self.beta2 * self.v + (1 - self.beta2) * grad ** 2
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        theta = theta - self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
        return theta
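A tiny sketch of the decoupling (all values invented): feed the `AdamW` class above a zero gradient, and the parameter still shrinks by exactly the \((1 - \eta\lambda)\) factor each step, independent of any adaptive scaling.
    import numpy as np

    opt = AdamW(lr=0.1, weight_decay=0.1)
    theta = np.array([1.0])
    for _ in range(3):
        theta = opt.step(theta, grad=np.array([0.0]))
        print(theta)  # 0.99, 0.9801, 0.970299: a clean (1 - 0.1 * 0.1) decay per step
With L2 regularization folded into Adam's gradient instead, the \(\lambda\theta\) term would be divided by \(\sqrt{\hat{v}_t} + \epsilon\), so the decay strength would depend on each parameter's gradient history.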
Connection to Modern LLMs
AdamW is the standard optimizer for LLM training.
Typical settings for models like LLaMA:
- Learning rate: 1e-4 to 3e-4
- β₁ = 0.9, β₂ = 0.95 (slightly lower β₂ than default)
- Weight decay: 0.1
- ε = 1e-8
The lower β₂ = 0.95 allows faster adaptation to changing gradient magnitudes during training.
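Expressed with the `AdamW` class from this section (purely illustrative; production LLM training uses framework optimizers plus a learning rate schedule on top):
    opt = AdamW(lr=3e-4, betas=(0.9, 0.95), eps=1e-8, weight_decay=0.1)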
Comparison of Methods¶
| Method | First Moment | Second Moment | Key Feature |
|---|---|---|---|
| SGD | ✗ | ✗ | Baseline |
| Momentum | ✓ (EMA) | ✗ | Velocity |
| AdaGrad | ✗ | ✓ (sum) | Sparse updates |
| RMSprop | ✗ | ✓ (EMA) | Non-decaying adaptive |
| Adam | ✓ (EMA) | ✓ (EMA) | Best of both |
| AdamW | ✓ (EMA) | ✓ (EMA) | Proper weight decay |
When to Use What¶
Use SGD + Momentum when:¶
- Training is stable
- You want maximum control
- Generalization is critical (SGD often generalizes better)
- You have time to tune learning rate schedule
Use Adam/AdamW when:¶
- You want fast, reliable convergence
- Training large models
- Limited time for hyperparameter tuning
- Training is sensitive to the learning rate choice
The SGD vs Adam Debate¶
Research shows:
- Adam converges faster in terms of training loss
- SGD often generalizes better on held-out data
- For LLMs, Adam wins due to scale and complexity
Training Loss
↑
│ SGD
│ ╲
│ ╲
│ ╲────────
│ Adam ╲
│ ╲
│ ╲─────
└──────────────────→ Steps
Adam reaches low loss faster,
but SGD may find flatter minima
Understanding Adaptive Methods Geometrically¶
Consider gradient descent in 2D with different scales:
Without adaptation: With adaptation:
(rescaled space)
y (large scale) y' = y/σy
↑ ↑
│ ↗↙↗↙ │ ↘
│ ↗↙ │ ↘
│ ↗↙ │ ↘
└────────→ x └────────→ x' = x/σx
(small scale)
Zigzag path Direct path
Adaptive methods effectively rescale the parameter space so all directions have similar curvature.
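A rough numeric illustration of the rescaling (toy quadratic invented for this sketch): on \(L(x, y) = \tfrac{1}{2}(100x^2 + y^2)\), a plain gradient step is dominated by the steep \(x\) direction, while at the very first Adam step the bias-corrected update \(\hat{m}_1/\sqrt{\hat{v}_1}\) reduces to \(\mathrm{sign}(g_1)\), so both coordinates move by the same amount.
    import numpy as np

    # Toy ill-conditioned quadratic: L(x, y) = 0.5 * (100 * x**2 + y**2)
    grad = lambda p: np.array([100.0 * p[0], p[1]])
    g = grad(np.array([1.0, 1.0]))

    print(g / np.linalg.norm(g))  # raw gradient direction: ~[1.00, 0.01], almost all x
    print(np.sign(g))             # first Adam step direction: [1., 1.], both axes equal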
The Warmup Mystery¶
Adam often benefits from learning rate warmup (starting with small η, gradually increasing).
Why? The second moment estimate v is unreliable early in training:
- Few samples seen
- Bias correction helps but isn't perfect
- Large initial steps can be destabilizing
Warmup gives time for v to stabilize before taking large steps.
def warmup_lr(step, warmup_steps, base_lr):
    """Linear warmup schedule."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr
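For example, with 1,000 warmup steps and a base rate of 1e-3, the schedule ramps up linearly and then stays flat:
    for step in (0, 250, 500, 1000, 5000):
        print(step, warmup_lr(step, warmup_steps=1000, base_lr=1e-3))
    # 0 -> 0.0, 250 -> 0.00025, 500 -> 0.0005, 1000 and beyond -> 0.001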
Numerical Stability¶
The ε Parameter¶
Without ε, we'd divide by zero when v ≈ 0. But ε also affects the effective learning rate:
- If √v >> ε: ε doesn't matter
- If √v ≈ ε: ε significantly affects the update
For parameters with tiny gradients, ε can dominate!
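A sketch with made-up numbers: if a parameter's gradients scale with \(\sqrt{v}\), the step stays near \(\eta\) while \(\sqrt{v} \gg \epsilon\), is roughly halved when \(\sqrt{v} \approx \epsilon\), and collapses toward zero once \(\sqrt{v} \ll \epsilon\).
    import numpy as np

    lr, eps = 1e-3, 1e-8
    for v in (1e-2, 1e-16, 1e-20):
        g = np.sqrt(v)                  # a gradient whose scale matches sqrt(v)
        print(f"sqrt(v)={np.sqrt(v):.0e}  step={lr * g / (np.sqrt(v) + eps):.1e}")
    # sqrt(v) >> eps: step ~ 1e-03 (the nominal lr)
    # sqrt(v) == eps: step ~ 5e-04 (halved)
    # sqrt(v) << eps: step ~ 1e-05 (eps dominates)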
Gradient Explosion¶
Even with adaptive rates, gradients can explode. Always use gradient clipping:
def clip_grad_norm(grads, max_norm):
    """Clip gradient norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    clip_coef = max_norm / (total_norm + 1e-6)
    if clip_coef < 1:
        return [g * clip_coef for g in grads]
    return grads
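A quick check with invented gradients: a list whose global norm exceeds the threshold gets rescaled, while one below it passes through unchanged.
    import numpy as np

    big   = [np.array([3.0, 4.0])]      # global norm 5.0 > max_norm, gets rescaled
    small = [np.array([0.3, 0.4])]      # global norm 0.5 < max_norm, passes through

    print(clip_grad_norm(big, max_norm=1.0))    # ~[0.6, 0.8], norm ~1.0
    print(clip_grad_norm(small, max_norm=1.0))  # [0.3, 0.4], unchanged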
Complete Training Loop¶
def train_with_adam(model, data, epochs, lr=1e-3, weight_decay=0.01):
    """Complete training loop with AdamW."""
    # One optimizer per parameter tensor, so each keeps its own m/v state
    optimizers = [AdamW(lr=lr, weight_decay=weight_decay) for _ in model.params]
    warmup_steps = 1000
    total_steps = 0
    for epoch in range(epochs):
        for batch in data:
            # Forward pass
            loss, grads = model.forward_backward(batch)
            # Warmup learning rate
            current_lr = warmup_lr(total_steps, warmup_steps, lr)
            # Clip gradients
            grads = clip_grad_norm(grads, max_norm=1.0)
            # Update parameters (write back so the model sees the new values)
            for i, (opt, grad) in enumerate(zip(optimizers, grads)):
                opt.lr = current_lr
                model.params[i] = opt.step(model.params[i], grad)
            total_steps += 1
        print(f"Epoch {epoch}: loss = {loss:.4f}")
Historical Note¶
The development of adaptive methods:
- 2011: AdaGrad (Duchi et al.) — First per-parameter adaptation
- 2012: RMSprop (Hinton) — Unpublished Coursera lecture!
- 2013: Adadelta (Zeiler) — No learning rate needed
- 2014: Adam (Kingma & Ba) — Momentum + RMSprop
- 2017: AdamW (Loshchilov & Hutter) — Fixed weight decay
Adam became the default optimizer, but research continues:
- RAdam (2019): Rectified Adam with variance reduction
- LAMB (2019): Layer-wise adaptive rates for large batches
- AdaFactor (2018): Memory-efficient Adam for huge models
- Lion (2023): Discovered by automated program search
Exercises¶
- Implement AdaGrad: Train on a sparse problem and observe learning rate decay.
- Compare optimizers: Train the same model with SGD, momentum, RMSprop, and Adam. Plot the training curves.
- Ablate Adam: What happens if you remove bias correction? Momentum? Adaptive rates?
- β₂ sensitivity: Train with β₂ = 0.9, 0.99, 0.999, 0.9999. How does it affect stability?
- Weight decay ablation: Compare Adam with L2 regularization vs AdamW. Is there a difference?
Summary¶
| Concept | Definition | Key Insight |
|---|---|---|
| AdaGrad | G += g² | Adapts to gradient history |
| RMSprop | v = βv + (1-β)g² | Exponential decay prevents stalling |
| Adam | m, v moments + bias correction | Momentum + adaptation |
| AdamW | Decoupled weight decay | Proper regularization |
Key takeaway: Adaptive methods automatically tune per-parameter learning rates based on gradient history. Adam combines the benefits of momentum (smooth direction) and RMSprop (adaptive magnitude), making it the go-to optimizer for modern deep learning. AdamW's proper treatment of weight decay makes it the standard for LLM training.