Section 4.6: Learning Rate Schedules — The Art of Annealing¶
Reading time: 16 minutes | Difficulty: ★★★☆☆
The learning rate is arguably the most important hyperparameter, but a single fixed value is rarely optimal. This section covers how to vary the learning rate during training for faster convergence and better final performance.
Why Schedules Matter¶
- Early training: we want large steps to make rapid progress.
- Late training: we want small steps to fine-tune and converge.
A fixed learning rate forces a compromise. Schedules let us have both.
Common Schedules¶
Step Decay¶
Reduce the learning rate by a factor at fixed intervals:

η_t = η₀ × γ^⌊t / s⌋

Where:
- η₀ is the initial learning rate
- γ is the decay factor (e.g., 0.1)
- s is the step size (e.g., every 30 epochs)
```python
def step_decay(step, init_lr, decay_rate=0.1, decay_every=30):
    """Step decay: reduce by factor every N epochs."""
    return init_lr * (decay_rate ** (step // decay_every))
```
Pros: Simple, interpretable, works well

Cons: Discontinuous, requires choosing when to decay
Exponential Decay¶
Smooth continuous decay:
```python
def exponential_decay(step, init_lr, decay_rate=0.99):
    """Exponential decay: multiply by rate each step."""
    return init_lr * (decay_rate ** step)
```
Pros: Smooth, no discontinuities

Cons: Decays too fast early, too slow late
Inverse Square Root Decay¶
Popular for transformers:

η_t = η₀ / √t

Or with warmup:
```python
import numpy as np

def inverse_sqrt_decay(step, init_lr, warmup_steps=4000):
    """Inverse square root with linear warmup (Transformer schedule)."""
    if step == 0:
        step = 1
    warmup_factor = min(1.0, step / warmup_steps)
    decay_factor = 1.0 / np.sqrt(max(step, warmup_steps))
    return init_lr * warmup_factor * decay_factor * np.sqrt(warmup_steps)
```
This is the original Transformer schedule from "Attention Is All You Need" (2017).
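A quick sanity check of the shape (illustrative values only, using the inverse_sqrt_decay function above with init_lr=1e-3 and the default 4,000 warmup steps): the learning rate ramps linearly to its peak at warmup_steps, then falls off as 1/√step.

```python
# Illustrative only; uses inverse_sqrt_decay as defined above.
for s in [1, 1_000, 4_000, 16_000, 64_000]:
    print(s, f"{inverse_sqrt_decay(s, init_lr=1e-3, warmup_steps=4000):.2e}")
# 1      -> 2.50e-07  (deep in warmup: lr grows linearly with step)
# 1000   -> 2.50e-04  (a quarter of the way through warmup)
# 4000   -> 1.00e-03  (peak, reached exactly at warmup_steps)
# 16000  -> 5.00e-04  (4x past warmup: halved, since 1/sqrt(4) = 0.5)
# 64000  -> 2.50e-04  (16x past warmup: quartered)
```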
Cosine Annealing¶
Smooth decay following a cosine curve:
```python
def cosine_annealing(step, total_steps, init_lr, min_lr=0):
    """Cosine annealing from init_lr to min_lr."""
    return min_lr + 0.5 * (init_lr - min_lr) * (1 + np.cos(np.pi * step / total_steps))
```
Pros: Smooth, no hyperparameters except endpoints

Cons: Need to know total training steps in advance
Connection to Modern LLMs
Cosine annealing is the standard for LLM training.
Typical settings:
- Warmup: 1-2% of total steps
- Peak learning rate: 1e-4 to 3e-4
- Decay to: 0.1 × peak (or 0)
- Total steps: 100K to 1M

Example from LLaMA training: a peak learning rate of 3e-4 (for the 7B and 13B models), 2,000 warmup steps, and cosine decay down to 10% of the peak.
Warmup: Starting Slow¶
Why Warmup?¶
At initialization:
- Weights are random
- Gradients are unreliable
- Large steps could be catastrophic
Warmup starts with a tiny learning rate and gradually increases it:
```python
def linear_warmup(step, warmup_steps, target_lr):
    """Linear warmup from 0 to target_lr."""
    if step < warmup_steps:
        return target_lr * step / warmup_steps
    return target_lr


def warmup_cosine_decay(step, warmup_steps, total_steps, init_lr, min_lr=0):
    """Linear warmup followed by cosine decay."""
    if step < warmup_steps:
        # Linear warmup
        return init_lr * step / warmup_steps
    else:
        # Cosine decay
        progress = (step - warmup_steps) / (total_steps - warmup_steps)
        return min_lr + 0.5 * (init_lr - min_lr) * (1 + np.cos(np.pi * progress))
```
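As a concrete sketch, here is what the LLM-style settings from the callout above look like when plugged into warmup_cosine_decay (the specific numbers, 2,000 warmup steps out of 100,000 total with a 3e-4 peak decaying to 10% of peak, are illustrative):

```python
# Sketch only; uses warmup_cosine_decay as defined above.
peak, floor = 3e-4, 3e-5          # peak LR and 10%-of-peak floor
warmup, total = 2_000, 100_000    # 2% of steps spent warming up

for s in [0, 1_000, 2_000, 51_000, 100_000]:
    lr = warmup_cosine_decay(s, warmup, total, init_lr=peak, min_lr=floor)
    print(s, f"{lr:.2e}")
# 0       -> 0.00e+00  (warmup starts at zero)
# 1000    -> 1.50e-04  (halfway through warmup)
# 2000    -> 3.00e-04  (peak)
# 51000   -> 1.65e-04  (cosine midpoint: (peak + floor) / 2)
# 100000  -> 3.00e-05  (floor)
```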
How Much Warmup?¶
| Model Size | Typical Warmup |
|---|---|
| Small (< 100M) | 100-1000 steps |
| Medium (100M-1B) | 1000-5000 steps |
| Large (> 1B) | 2000-10000 steps |
Rule of thumb: 1-5% of total training steps.
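A minimal helper encoding that rule of thumb (warmup_steps_from_total is a hypothetical convenience function, not a standard API; the 2% default is one reasonable choice within the 1-5% range):

```python
def warmup_steps_from_total(total_steps, frac=0.02):
    """Rule-of-thumb warmup length: a small fraction of total steps."""
    return max(1, int(total_steps * frac))

# A 100k-step run with the 2% default -> 2000 warmup steps,
# in line with the table above for a medium-to-large model.
print(warmup_steps_from_total(100_000))  # 2000
```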
Warmup for Large Batches¶
Large batch training requires careful warmup:
The linear scaling rule (Goyal et al., 2017):
- If batch size increases by k, multiply the learning rate by k
- But warm up for k× longer
```python
def large_batch_schedule(step, base_lr, base_batch, actual_batch, warmup_steps):
    """Learning rate schedule for large batch training."""
    # Scale learning rate with batch size
    scale = actual_batch / base_batch
    target_lr = base_lr * scale
    # Extended warmup
    scaled_warmup = warmup_steps * scale
    if step < scaled_warmup:
        return target_lr * step / scaled_warmup
    return target_lr
```
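For example, with hypothetical numbers (a base learning rate of 0.1 tuned at batch size 256, now training at batch size 2048, so k = 8), the function above gives:

```python
# Illustrative only; uses large_batch_schedule as defined above.
for step in [0, 400, 800, 1_600]:
    lr = large_batch_schedule(step, base_lr=0.1, base_batch=256,
                              actual_batch=2048, warmup_steps=100)
    print(step, round(lr, 3))
# Target LR is 0.1 * 8 = 0.8, reached after 100 * 8 = 800 warmup steps:
# 0 -> 0.0, 400 -> 0.4, 800 -> 0.8, 1600 -> 0.8
```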
Cosine Annealing with Restarts¶
Loshchilov & Hutter (2016): SGDR (SGD with Warm Restarts)
Instead of decaying once, reset to high learning rate periodically:
```python
def cosine_with_restarts(step, init_lr, restart_period, restart_mult=2):
    """Cosine annealing with warm restarts."""
    # Find which cycle we're in
    cycle = 0
    cycle_start = 0
    current_period = restart_period
    while step >= cycle_start + current_period:
        cycle_start += current_period
        current_period *= restart_mult
        cycle += 1
    # Position within current cycle
    progress = (step - cycle_start) / current_period
    return init_lr * 0.5 * (1 + np.cos(np.pi * progress))
```
Why restarts help:
- Escape local minima
- Explore different regions of loss landscape
- Create "snapshots" at each minimum for ensembling (see the sketch below)
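A small sketch of where those restart points land with the function above (restart_period=1000 and the default restart_mult=2 are arbitrary illustrative values; save_snapshot is a hypothetical checkpoint hook, not a real API):

```python
# Restarts show up as the LR snapping back to its peak. Each restart
# follows a minimum of the previous cycle, which is where a snapshot
# for ensembling would be saved.
prev_lr = None
for step in range(8_000):
    lr = cosine_with_restarts(step, init_lr=1e-3, restart_period=1000)
    if prev_lr is not None and lr > prev_lr * 10:  # LR jumped back up
        print(f"restart at step {step}")           # prints 1000, 3000, 7000
        # save_snapshot(model)  # hypothetical hook for snapshot ensembling
    prev_lr = lr
```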
One-Cycle Policy¶
Smith (2018): Train with one big cycle
- Warmup from low to high learning rate
- Anneal from high back to very low
```python
def one_cycle(step, total_steps, max_lr, div_factor=25, final_div=1e4):
    """One-cycle learning rate policy."""
    initial_lr = max_lr / div_factor
    final_lr = max_lr / final_div
    if step < total_steps * 0.3:
        # Warmup phase: 30% of training
        progress = step / (total_steps * 0.3)
        return initial_lr + (max_lr - initial_lr) * progress
    else:
        # Annealing phase: 70% of training
        progress = (step - total_steps * 0.3) / (total_steps * 0.7)
        return max_lr - (max_lr - final_lr) * progress
```
The one-cycle policy often allows training with much higher learning rates!
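To see the trajectory this produces, here is an illustrative run of the one_cycle function above with hypothetical settings (10,000 total steps, max_lr=1e-2):

```python
# Illustrative only; uses one_cycle as defined above.
total = 10_000
for s in [0, 1_500, 3_000, 6_000, 9_999]:
    print(s, f"{one_cycle(s, total, max_lr=1e-2):.2e}")
# 0    -> 4.00e-04  (max_lr / div_factor)
# 1500 -> 5.20e-03  (halfway up the ramp)
# 3000 -> 1.00e-02  (peak, at 30% of training)
# 6000 -> 5.71e-03  (linear anneal)
# 9999 -> 2.43e-06  (heading toward max_lr / final_div = 1e-06)
```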
Learning Rate Finding¶
Smith (2015): Learning Rate Range Test
Before training, find the optimal learning rate:
- Start with very small η (e.g., 1e-7)
- Train for a few iterations, gradually increasing η
- Plot loss vs learning rate
- Choose η where loss decreases fastest (not the minimum!)
```python
def lr_finder(model, train_fn, init_lr=1e-7, final_lr=10, num_steps=100):
    """Find optimal learning rate by exponential sweep."""
    mult = (final_lr / init_lr) ** (1 / num_steps)
    lr = init_lr
    lrs, losses = [], []
    for step in range(num_steps):
        loss = train_fn(model, lr)
        lrs.append(lr)
        losses.append(loss)
        # Exponentially increase
        lr *= mult
        # Stop if loss explodes
        if loss > 4 * min(losses):
            break
    return lrs, losses
```
Choose the learning rate at the point of steepest descent (before the minimum of the loss curve), typically about one tenth of the learning rate at which the loss is lowest.
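One way to automate that choice from the lr_finder output is sketched below. The smoothing window and the alignment of the smoothed curve are arbitrary choices for illustration, not part of the original range test:

```python
import numpy as np

def suggest_lr(lrs, losses, smooth=5):
    """Pick the LR where the (smoothed) loss falls fastest."""
    losses = np.convolve(losses, np.ones(smooth) / smooth, mode="valid")
    lrs = lrs[smooth - 1:]                     # align with smoothed losses
    slopes = np.gradient(losses, np.log(lrs))  # d(loss) / d(log lr)
    return lrs[int(np.argmin(slopes))]         # most negative slope

# lrs, losses = lr_finder(model, train_fn)
# print(suggest_lr(lrs, losses))  # candidate learning rate
```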
Practical Schedule for LLM Training¶
Here's a complete schedule matching modern practice:
```python
class LLMSchedule:
    """Learning rate schedule for LLM training."""

    def __init__(
        self,
        peak_lr=3e-4,
        min_lr=3e-5,
        warmup_steps=2000,
        total_steps=100000,
    ):
        self.peak_lr = peak_lr
        self.min_lr = min_lr
        self.warmup_steps = warmup_steps
        self.total_steps = total_steps

    def get_lr(self, step):
        # Phase 1: Linear warmup
        if step < self.warmup_steps:
            return self.peak_lr * step / self.warmup_steps
        # Phase 2: Cosine decay
        progress = (step - self.warmup_steps) / (self.total_steps - self.warmup_steps)
        progress = min(1.0, progress)  # Clamp to [0, 1]
        cosine_decay = 0.5 * (1 + np.cos(np.pi * progress))
        return self.min_lr + (self.peak_lr - self.min_lr) * cosine_decay


# Usage
schedule = LLMSchedule(
    peak_lr=3e-4,
    min_lr=3e-5,  # 10% of peak
    warmup_steps=2000,
    total_steps=100000,
)

for step in range(100000):
    current_lr = schedule.get_lr(step)
    optimizer.lr = current_lr
    train_step(...)
```
Schedule vs Optimizer Choice¶
The schedule and optimizer interact:
| Optimizer | Recommended Schedule |
|---|---|
| SGD | Step decay or cosine |
| Momentum | Cosine with longer warmup |
| Adam | Cosine, less sensitive to schedule |
| AdamW | Cosine with warmup (standard) |
Adam's adaptive rates provide some "built-in" schedule, making it less sensitive to the explicit schedule. But even Adam benefits from warmup and decay.
Historical Note¶
Learning rate schedules evolved with deep learning:
- 1990s: Fixed learning rate, manually tuned
- 2012: Step decay became standard (AlexNet)
- 2015: Learning rate range test (Smith)
- 2016: Cosine annealing, warm restarts (Loshchilov & Hutter)
- 2017: Transformer schedule (Vaswani et al.)
- 2018: One-cycle policy (Smith)
- 2020s: Cosine with warmup is the default for LLMs
Common Mistakes¶
Pitfalls
- No warmup: Especially with Adam, can cause early instability
- Too short warmup: May not be enough for large models
- Decaying too fast: Loss plateaus before convergence
- Decaying too slow: Final loss higher than necessary
- Not matching batch size: When scaling batch size, scale warmup too
Exercises¶
- Compare schedules: Train the same model with constant, step decay, and cosine. Plot training curves.
- Warmup ablation: Train with warmup = 0, 100, 1000, 10000 steps. What changes?
- LR finder: Implement the learning rate range test. Find optimal learning rate for a model.
- Restarts: Implement cosine with restarts. Compare to single cosine decay.
- One-cycle: Implement the one-cycle policy. Can you use higher learning rates?
Summary¶
| Schedule | Formula | Best For |
|---|---|---|
| Step decay | η₀ × γ^⌊t/s⌋ | Simple, interpretable |
| Exponential | η₀ × γᵗ | Smooth decay |
| Inverse sqrt | η₀ / √t | Transformers (original) |
| Cosine | η₀ × ½(1 + cos(πt/T)) | LLM training (standard) |
| Warmup + cosine | Linear → cosine | Modern best practice |
Key takeaway: The learning rate schedule is as important as the optimizer choice. Modern LLM training uses linear warmup followed by cosine decay to minimum. This simple schedule, combined with AdamW, forms the backbone of successful large-scale training.
→ Next: Section 4.7: Implementation