Section 4.4: Momentum — Learning with Velocity¶
Reading time: 18 minutes | Difficulty: ★★★☆☆
Gradient descent can be painfully slow on ill-conditioned problems. Momentum fixes this by adding "velocity" to our updates, allowing the optimizer to build up speed in consistent directions.
The Problem with Vanilla Gradient Descent¶
Recall the update rule: θ ← θ - η∇L(θ)
Consider minimizing L(x, y) = x² + 100y²:
The gradient in y is 100× larger than in x, causing:
- Oscillation in the y direction (overshooting)
- Slow progress in the x direction (undershooting)
This is the condition number problem: \(\kappa = \lambda_{\max}/\lambda_{\min} = 100\).
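To see this concretely, here is a minimal vanilla-GD sketch on this quadratic. The step size 0.0095 is an illustrative choice, just under the stability limit of \(2/\lambda_{\max} = 0.01\):

```python
import numpy as np

# Vanilla gradient descent on L(x, y) = x^2 + 100 y^2
# Gradient: (2x, 200y). Stability requires eta < 2/200 = 0.01.
eta = 0.0095
theta = np.array([1.0, 1.0])
for t in range(5):
    grad = np.array([2 * theta[0], 200 * theta[1]])
    theta = theta - eta * grad
    print(f"step {t}: x = {theta[0]:.4f}, y = {theta[1]:.4f}")
# y flips sign every step (oscillation); x shrinks by only ~2% per step
```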
The Physics Analogy¶
Imagine a ball rolling down a hill:
- Without friction, it builds up speed going downhill
- It carries momentum through flat regions
- It overshoots valleys but eventually settles
This is exactly what we want for optimization!
The Momentum Update Rule¶
Classical Momentum (Polyak, 1964):

\[ v_t = \beta v_{t-1} + \nabla L(\theta_t) \]
\[ \theta_{t+1} = \theta_t - \eta v_t \]

Or equivalently, folding the learning rate into the velocity:

\[ v_t = \beta v_{t-1} + \eta \nabla L(\theta_t), \qquad \theta_{t+1} = \theta_t - v_t \]
Where:
- v is the velocity (accumulated gradient)
- β is the momentum coefficient (typically 0.9)
- η is the learning rate
Why Momentum Works¶
Intuition 1: Averaging Gradients¶
Expand the velocity term:

\[ v_t = \nabla L(\theta_t) + \beta \nabla L(\theta_{t-1}) + \beta^2 \nabla L(\theta_{t-2}) + \cdots = \sum_{k=0}^{t} \beta^k \nabla L(\theta_{t-k}) \]

This is an exponentially weighted moving average of gradients!
- Recent gradients have weight close to 1
- A gradient from \(k\) steps ago has weight \(\beta^k \to 0\)
Intuition 2: Dampening Oscillations¶
In the oscillating y-direction:
- Gradients alternate signs: +, -, +, -, ...
- They cancel out: v_y ≈ 0
In the consistent x-direction:
- Gradients have the same sign: +, +, +, +, ...
- They accumulate: v_x grows large
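A quick numerical check of both cases, using synthetic gradients rather than a real training loop: the oscillating component stays small and bounded while the consistent component grows toward \(1/(1-\beta)\).

```python
beta = 0.9

# Oscillating direction: gradients alternate +1, -1, +1, -1, ...
v = 0.0
for t in range(100):
    v = beta * v + (1.0 if t % 2 == 0 else -1.0)
print(f"oscillating: |v| ≈ {abs(v):.2f}")   # stays bounded near 1/(1+beta) ≈ 0.53

# Consistent direction: gradient is always +1
v = 0.0
for t in range(100):
    v = beta * v + 1.0
print(f"consistent:   v ≈ {v:.2f}")         # approaches 1/(1-beta) = 10
```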
Intuition 3: Effective Learning Rate¶
After many steps with a consistent gradient g, the velocity converges to the geometric series sum:

\[ v_\infty = g + \beta g + \beta^2 g + \cdots = \frac{g}{1-\beta} \]

The effective learning rate becomes \(\frac{\eta}{1-\beta}\).
For β = 0.9: effective rate is 10× the base rate!
This is why we often reduce η when adding momentum.
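This is easy to verify under a constant gradient: after the initial transient, momentum SGD with rate η tracks vanilla SGD with rate η/(1-β) step for step. A sketch, with g = 1 for simplicity:

```python
eta, beta, g = 0.01, 0.9, 1.0
v, theta_momentum, theta_vanilla = 0.0, 0.0, 0.0
for t in range(200):
    v = beta * v + g
    theta_momentum -= eta * v                 # momentum step
    theta_vanilla -= (eta / (1 - beta)) * g   # vanilla step at 10x the rate
print(theta_momentum, theta_vanilla)  # ≈ -19.1 vs -20.0: same slope after warm-up
```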
Mathematical Analysis¶
Convergence for Quadratics¶
For a quadratic L(θ) = ½θᵀHθ with eigenvalues λ₁ ≤ ... ≤ λₙ:
Without momentum: Convergence rate = (κ - 1)/(κ + 1)
With optimal momentum: Convergence rate = (√κ - 1)/(√κ + 1)
For κ = 100:
- Without: rate = 0.98 (slow)
- With: rate = 0.82 (much faster!)
Key insight: Momentum reduces the effective condition number from κ to √κ.
Optimal Momentum Coefficient¶
For a quadratic with condition number κ, the optimal coefficient is:

\[ \beta^* = \left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^2 \]
In practice, β = 0.9 works well for most problems.
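Plugging in κ = 100 gives a small check of the numbers above:

```python
import math

kappa = 100.0
rate_gd  = (kappa - 1) / (kappa + 1)                        # vanilla GD
rate_mom = (math.sqrt(kappa) - 1) / (math.sqrt(kappa) + 1)  # optimal momentum
beta_opt = rate_mom ** 2                                    # optimal coefficient
print(f"GD rate:       {rate_gd:.3f}")    # 0.980
print(f"momentum rate: {rate_mom:.3f}")   # 0.818
print(f"optimal beta:  {beta_opt:.3f}")   # 0.669
```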
Nesterov Accelerated Gradient (NAG)¶
Nesterov (1983) proposed a subtle but important modification:
Standard momentum: look at the gradient where we ARE:

\[ v_t = \beta v_{t-1} + \nabla L(\theta_t) \]

Nesterov momentum: look at the gradient where we're GOING:

\[ v_t = \beta v_{t-1} + \nabla L(\theta_t - \eta \beta v_{t-1}) \]

Or equivalently, with the "lookahead" point \(\tilde{\theta}\):

\[ \tilde{\theta} = \theta_t - \eta \beta v_{t-1}, \qquad v_t = \beta v_{t-1} + \nabla L(\tilde{\theta}), \qquad \theta_{t+1} = \theta_t - \eta v_t \]
Why Nesterov is Better¶
```
Standard:                        Nesterov:

    θ                                θ
    │ compute gradient at θ          │ take momentum step first
    ↓                                ↓
    θ' (then momentum step)          θ̃ (lookahead)
                                     │ compute gradient at θ̃
                                     ↓
                                     θ' (corrected trajectory)
```
Nesterov "looks ahead" to where momentum will take us, then corrects. This provides:
- Faster convergence on convex problems
- Better handling of overshooting
- Provably optimal for smooth convex optimization
Convergence Guarantee¶
Theorem (Nesterov, 1983): For smooth convex functions, with appropriately chosen β:

\[ L(\theta_T) - L(\theta^*) \le O\!\left(\frac{1}{T^2}\right) \]

This O(1/T²) rate is optimal for first-order methods, compared to O(1/T) for vanilla gradient descent.
Implementation¶
```python
import numpy as np

def momentum_sgd(loss_fn, grad_fn, theta_init, lr, momentum, num_steps):
    """
    SGD with momentum.

    Args:
        loss_fn: Loss function
        grad_fn: Gradient function
        theta_init: Initial parameters
        lr: Learning rate
        momentum: Momentum coefficient (typically 0.9)
        num_steps: Number of iterations

    Returns:
        Final parameters and loss history
    """
    theta = theta_init.copy()
    velocity = np.zeros_like(theta)
    history = []
    for t in range(num_steps):
        loss = loss_fn(theta)
        grad = grad_fn(theta)
        history.append(loss)
        # Update velocity: exponentially weighted sum of gradients
        velocity = momentum * velocity + grad
        # Update parameters
        theta = theta - lr * velocity
    return theta, history

def nesterov_sgd(loss_fn, grad_fn, theta_init, lr, momentum, num_steps):
    """
    SGD with Nesterov momentum.

    The key difference: compute the gradient at the "lookahead" position.
    """
    theta = theta_init.copy()
    velocity = np.zeros_like(theta)
    history = []
    for t in range(num_steps):
        # Lookahead position: where momentum alone would carry us
        lookahead = theta - lr * momentum * velocity
        loss = loss_fn(theta)
        grad = grad_fn(lookahead)  # Gradient at the lookahead!
        history.append(loss)
        # Update velocity
        velocity = momentum * velocity + grad
        # Update parameters
        theta = theta - lr * velocity
    return theta, history
```
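A quick usage sketch on the ill-conditioned quadratic from earlier (the learning rate 0.001 is an illustrative, untuned choice):

```python
loss_fn = lambda th: th[0] ** 2 + 100 * th[1] ** 2
grad_fn = lambda th: np.array([2 * th[0], 200 * th[1]])

theta0 = np.array([1.0, 1.0])
_, hist_mom = momentum_sgd(loss_fn, grad_fn, theta0, lr=0.001, momentum=0.9, num_steps=200)
_, hist_nag = nesterov_sgd(loss_fn, grad_fn, theta0, lr=0.001, momentum=0.9, num_steps=200)
print(hist_mom[-1], hist_nag[-1])  # both losses drop by several orders of magnitude
```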
Choosing the Momentum Coefficient¶
| β Value | Behavior | Use Case |
|---|---|---|
| 0.0 | No momentum (vanilla SGD) | Debugging, baselines |
| 0.5 | Light momentum | Noisy gradients |
| 0.9 | Standard | Most applications |
| 0.99 | Heavy momentum | Very smooth landscapes |
Rule of thumb: Start with β = 0.9. If training is unstable, reduce it.
Common Mistake: Not Adjusting Learning Rate
When adding momentum, the effective learning rate increases by a factor of roughly 1/(1-β).
For β = 0.9, effective rate is 10× higher!
Fix: Reduce η by a factor of (1-β) when adding momentum, or tune from scratch.
Momentum and SGD Noise¶
An interesting interaction: momentum smooths out SGD noise!
Without momentum, each step is based on a single noisy gradient estimate.
With momentum, each step is based on an exponential average of many gradients.
Effect: Lower variance updates, but also less exploration.
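A small simulation with synthetic noisy gradients (mean 1, standard deviation 1) illustrates the variance reduction; the ≈0.23 figure matches the steady-state ratio \(\sqrt{(1-\beta)/(1+\beta)}\):

```python
import numpy as np

rng = np.random.default_rng(0)
beta = 0.9
g = 1.0 + rng.standard_normal(10_000)  # noisy gradients: mean 1, std 1

v, updates = 0.0, []
for gt in g:
    v = beta * v + gt
    updates.append(v)
updates = np.array(updates[100:])      # discard warm-up transient

# Relative noise: std/mean of the per-step update direction
print("SGD alone:     ", np.std(g) / np.mean(g))              # ≈ 1.00
print("SGD + momentum:", np.std(updates) / np.mean(updates))  # ≈ 0.23
```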
```
Noise Level
    ↑
    │ • SGD alone
    │
    │         • SGD + momentum
    │
    │                  • Large batch
    └──────────────────────────────→ Stability
```
Connection to Modern LLMs
Modern LLM training typically uses:
- β = 0.9 for the first moment (momentum) in Adam
- This provides the noise-smoothing benefits of momentum
- Combined with adaptive learning rates (next section)
The momentum component of Adam is critical for stable training on noisy gradients from massive datasets.
Visualizing Momentum¶
Consider the Rosenbrock function: L(x,y) = (1-x)² + 100(y-x²)²
This has a curved valley that's challenging for gradient descent:
```
Without momentum:          With momentum:

  Start                      Start
    ╲↗                         ╲
    ╲↙                          ╲
    ╲↗                           ╲
    ╲↙                            ╲
     ╲  (slow zigzag)              ╲  (smooth curve)
     ↓                              ↓
  Optimum                        Optimum
```
Momentum allows following the valley smoothly rather than bouncing off its walls.
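Reusing `momentum_sgd` from the implementation section, here is a quick comparison on the Rosenbrock function (β = 0 recovers vanilla GD; the hyperparameters are illustrative choices, not tuned):

```python
import numpy as np

def rosenbrock(th):
    # L(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1)
    x, y = th
    return (1 - x) ** 2 + 100 * (y - x ** 2) ** 2

def rosenbrock_grad(th):
    x, y = th
    return np.array([-2 * (1 - x) - 400 * x * (y - x ** 2),
                     200 * (y - x ** 2)])

theta0 = np.array([-1.0, 1.0])
_, hist_gd  = momentum_sgd(rosenbrock, rosenbrock_grad, theta0, lr=1e-3, momentum=0.0, num_steps=5000)
_, hist_mom = momentum_sgd(rosenbrock, rosenbrock_grad, theta0, lr=1e-3, momentum=0.9, num_steps=5000)
print(hist_gd[-1], hist_mom[-1])  # momentum typically reaches a much lower loss
```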
Historical Note¶
Boris Polyak introduced the momentum method in 1964, drawing on classical mechanics.
Yurii Nesterov developed his accelerated method in 1983 as part of his PhD thesis, proving it's optimal for smooth convex optimization.
The connection between momentum and acceleration in physics was made explicit by Sutskever et al. (2013), who showed that Nesterov momentum works well for deep learning.
Heavy Ball vs. Nesterov¶
The original momentum (Polyak) is sometimes called the "Heavy Ball" method:
| Method | Update | Convergence (convex) |
|---|---|---|
| Heavy Ball | v = βv + ∇L(θ) | O(1/T) |
| Nesterov | v = βv + ∇L(θ - ηβv) | O(1/T²) |
Nesterov's lookahead makes a theoretical difference, though in practice both work similarly for neural networks.
Exercises¶
1. **Visualize momentum:** On a 2D quadratic, plot trajectories with β = 0, 0.5, 0.9, 0.99.
2. **Optimal β:** For L(x, y) = x² + 100y², derive the optimal momentum coefficient.
3. **Compare methods:** Implement both classical and Nesterov momentum. Compare convergence on a test problem.
4. **Effective learning rate:** Verify empirically that momentum β = 0.9 gives ~10× the effective learning rate.
5. **Noise smoothing:** Compare gradient variance with and without momentum on a stochastic problem.
Summary¶
| Concept | Definition | Key Insight |
|---|---|---|
| Momentum | v = βv + ∇L | Exponential moving average of gradients |
| β coefficient | Decay rate for velocity | Higher = more "memory" |
| Nesterov | Gradient at lookahead position | Provably optimal acceleration |
| Effective rate | η/(1-β) | Momentum amplifies learning rate |
Key takeaway: Momentum transforms gradient descent from a memoryless process into a smooth trajectory with inertia. This simple addition of tracking velocity provides dramatic speedups on ill-conditioned problems and remains a core component of modern optimizers like Adam.