Section 4.2: Gradient Descent — The Foundation

Reading time: 18 minutes | Difficulty: ★★★☆☆

Gradient descent is the workhorse of machine learning optimization. This section derives the algorithm from first principles, proves when it converges, and builds intuition for why it works.

The Core Idea

We want to minimize L(θ). We can't solve ∇L(θ) = 0 analytically for neural networks. Instead, we iterate:

  1. Compute the gradient ∇L(θ) at current position
  2. Take a small step in the opposite direction
  3. Repeat until convergence

The update rule:

\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]

where η (eta) is the learning rate — how big a step we take.

Why Negative Gradient?

The gradient ∇L(θ) points in the direction of steepest increase. We want to decrease, so we go the opposite way.

Formal justification: Consider a Taylor expansion around θ:

\[L(\theta + \Delta\theta) = L(\theta) + \nabla L(\theta)^T \Delta\theta + O(\|\Delta\theta\|^2)\]

To decrease L, we want: \(\nabla L(\theta)^T \Delta\theta < 0\)

The most negative value for a fixed step size ‖Δθ‖ = η occurs when:

\[\Delta\theta = -\eta \frac{\nabla L(\theta)}{\|\nabla L(\theta)\|}\]

This is the negative gradient direction, normalized to have length η. Dropping the normalization and taking Δθ = -η∇L keeps the same direction and gives the standard steepest-descent update.
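
As a quick numeric sanity check (a sketch, not part of the text, with an arbitrary gradient vector), we can compare the first-order change \(\nabla L(\theta)^T \Delta\theta\) for the normalized negative gradient against many random unit directions:

import numpy as np

rng = np.random.default_rng(0)
g = np.array([3.0, -4.0])                 # an arbitrary gradient with norm 5
steepest = -g / np.linalg.norm(g)         # the direction from the formula above (eta = 1)

# First-order change grad^T * direction for 10,000 random unit directions
dirs = rng.normal(size=(10_000, 2))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

print((dirs @ g).min())    # close to -5, but never below it
print(steepest @ g)        # exactly -||grad|| = -5: the steepest possible decrease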

Derivation from Taylor Series

Let's be more rigorous. The second-order Taylor expansion is:

\[L(\theta + \Delta\theta) = L(\theta) + \nabla L(\theta)^T \Delta\theta + \frac{1}{2} \Delta\theta^T H \Delta\theta + O(\|\Delta\theta\|^3)\]

where H is the Hessian matrix.

Setting Δθ = -η∇L:

\[L(\theta - \eta\nabla L) = L(\theta) - \eta \|\nabla L\|^2 + \frac{\eta^2}{2} \nabla L^T H \nabla L + O(\eta^3)\]

For small η, the dominant term is \(-\eta \|\nabla L\|^2 < 0\).

Key insight: As long as ∇L ≠ 0 and η is small enough, we're guaranteed to decrease the loss!
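
A minimal numeric check of this guarantee (a sketch, not from the text), using the elliptical bowl L(x, y) = x² + 10y² that reappears later in this section:

import numpy as np

def loss(theta):
    # L(x, y) = x^2 + 10 y^2
    return theta[0]**2 + 10 * theta[1]**2

def grad(theta):
    # Analytic gradient: (2x, 20y)
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([5.0, 5.0])
for eta in [0.5, 0.1, 0.01, 0.001]:
    after = loss(theta - eta * grad(theta))
    print(f"eta = {eta}: loss {loss(theta):.1f} -> {after:.1f}")
# Small enough steps always lower the loss; the overly large eta = 0.5 increases it.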

Choosing the Learning Rate

The learning rate η is crucial:

η too small            η too large
-------------------    -------------------
Slow convergence       Overshooting
May get stuck          Divergence
Wastes computation     Oscillation

The Goldilocks Zone

For a quadratic function \(L(\theta) = \frac{1}{2}\theta^T H \theta\) with eigenvalues λ₁ ≤ ... ≤ λₙ:

  • Convergence requires: η < 2/λₘₐₓ (see the sketch after this list)
  • Optimal for quadratic: η = 2/(λₘᵢₙ + λₘₐₓ)
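
A minimal numpy sketch of the convergence threshold (the Hessian and step sizes below are arbitrary choices for illustration):

import numpy as np

H = np.diag([1.0, 10.0])                      # Hessian of L(theta) = 0.5 theta^T H theta
lam_max = np.linalg.eigvalsh(H).max()         # lambda_max = 10, so the threshold is 0.2

def final_loss(eta, steps=50):
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta = theta - eta * (H @ theta)     # the gradient of this quadratic is H theta
    return 0.5 * theta @ H @ theta

print(final_loss(0.19))   # just below 2/lambda_max: converges toward 0
print(final_loss(0.21))   # just above 2/lambda_max: diverges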

Common Mistake: Fixed Learning Rate

Using the same learning rate throughout training is usually suboptimal. Early in training, you can take larger steps. Later, you need smaller steps for fine-tuning. This motivates learning rate schedules (Section 4.6).

Practical Heuristics

Since computing the Hessian's eigenvalues is impractical for neural networks, practitioners rely on heuristics:

  1. Start small: η = 0.001 is a common default
  2. Learning rate finder: Sweep η from 10⁻⁷ to 10, plot loss vs η (a sketch follows this list)
  3. If loss explodes: Reduce η by 10x
  4. If loss plateaus: Try increasing η
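
A simplified sketch of the learning-rate finder from item 2: probe one update from the starting point at each candidate rate and record the resulting loss. Practical finders sweep the rate over a short training run instead, and the loss and gradient functions here are placeholders (the elliptical bowl used elsewhere in this section):

import numpy as np

# Placeholder objective: L(x, y) = x^2 + 10 y^2
loss_fn = lambda theta: theta[0]**2 + 10 * theta[1]**2
grad_fn = lambda theta: np.array([2.0 * theta[0], 20.0 * theta[1]])

def lr_finder(loss_fn, grad_fn, theta0, num=30):
    """Loss after a single step from theta0 at each candidate learning rate."""
    etas = np.logspace(-7, 1, num)            # sweep from 1e-7 to 10
    return [(eta, loss_fn(theta0 - eta * grad_fn(theta0))) for eta in etas]

for eta, l in lr_finder(loss_fn, grad_fn, np.array([5.0, 5.0])):
    print(f"eta = {eta:.1e}   loss after one step = {l:.3g}")
# Choose a rate near the bottom of this curve, before the loss blows up.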

Convergence Analysis

When does gradient descent converge? Under what conditions?

Lipschitz Continuous Gradients

A function has L-Lipschitz continuous gradients if:

\[\|\nabla L(\theta_1) - \nabla L(\theta_2)\| \leq L \|\theta_1 - \theta_2\|\]

This means the gradient cannot change too rapidly. The constant L is called the Lipschitz constant (note the overloaded symbol: this L is a scalar bound, not the loss function).
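
For the quadratic L(θ) = ½θᵀHθ the gradient is Hθ, and the smallest valid Lipschitz constant is the largest eigenvalue of H. A quick numeric check of the definition (a sketch with an arbitrary Hessian):

import numpy as np

rng = np.random.default_rng(0)
H = np.diag([1.0, 10.0])
lipschitz = np.linalg.eigvalsh(H).max()       # = 10 for this Hessian

ratios = []
for _ in range(1000):
    t1, t2 = rng.normal(size=2), rng.normal(size=2)
    # ||grad(t1) - grad(t2)|| / ||t1 - t2|| with grad(theta) = H theta
    ratios.append(np.linalg.norm(H @ t1 - H @ t2) / np.linalg.norm(t1 - t2))

print(max(ratios) <= lipschitz + 1e-12)       # True: the definition holds with L = lambda_max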

Convergence Theorem

Theorem: For a convex function with L-Lipschitz gradients and learning rate η ≤ 1/L:

\[L(\theta_T) - L(\theta^*) \leq \frac{\|\theta_0 - \theta^*\|^2}{2\eta T}\]

Proof sketch:

  1. Descent lemma (from the Lipschitz property, with η ≤ 1/L): \(L(\theta_{t+1}) \leq L(\theta_t) - \frac{\eta}{2}\|\nabla L(\theta_t)\|^2\)
  2. Convexity: \(L(\theta_t) - L(\theta^*) \leq \nabla L(\theta_t)^T (\theta_t - \theta^*)\)
  3. Combine the two to get \(L(\theta_{t+1}) - L(\theta^*) \leq \frac{1}{2\eta}\left(\|\theta_t - \theta^*\|^2 - \|\theta_{t+1} - \theta^*\|^2\right)\)
  4. Sum over t = 0 to T-1, telescope, and use that the loss is non-increasing to bound the last iterate

Interpretation: After T iterations, we're within O(1/T) of the optimum. To halve the error, we need to double the iterations.

For Strongly Convex Functions

If L is also μ-strongly convex (it curves upward with curvature at least μ in every direction), then with step size η = 1/L:

\[L(\theta_T) - L(\theta^*) \leq \left(1 - \frac{\mu}{L}\right)^T (L(\theta_0) - L(\theta^*))\]

This is linear convergence — exponentially fast! The rate depends on the condition number κ = L/μ.
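
A minimal check of the geometric rate on a strongly convex quadratic (a sketch; μ and L are taken to be the smallest and largest Hessian eigenvalues, and η = 1/L as in the bound above):

import numpy as np

H = np.diag([1.0, 10.0])
eigs = np.linalg.eigvalsh(H)
mu, L_smooth = eigs[0], eigs[-1]              # mu = 1, L = 10, kappa = 10
eta = 1.0 / L_smooth

loss = lambda th: 0.5 * th @ H @ th           # optimum value is 0 at theta* = 0
theta = np.array([1.0, 1.0])
bound = loss(theta)

for t in range(1, 21):
    theta = theta - eta * (H @ theta)
    bound *= 1 - mu / L_smooth
    assert loss(theta) <= bound + 1e-12       # measured gap stays below the (1 - mu/L)^t bound

print("geometric-rate bound holds for 20 steps")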

The Algorithm

def gradient_descent(loss_fn, grad_fn, theta_init, lr, num_steps):
    """
    Vanilla gradient descent.

    Args:
        loss_fn: Function computing L(θ)
        grad_fn: Function computing ∇L(θ)
        theta_init: Initial parameters
        lr: Learning rate η
        num_steps: Number of iterations

    Returns:
        Final parameters and loss history
    """
    theta = theta_init.copy()
    history = []

    for t in range(num_steps):
        loss = loss_fn(theta)
        grad = grad_fn(theta)

        history.append(loss)

        # The core update
        theta = theta - lr * grad

    return theta, history
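
A usage sketch on the elliptical bowl from the next subsection; the learning rate here is an assumption chosen for illustration (it must stay below 2/λₘₐₓ = 0.1), since the text does not fix one:

import numpy as np

loss_fn = lambda theta: theta[0]**2 + 10 * theta[1]**2
grad_fn = lambda theta: np.array([2.0 * theta[0], 20.0 * theta[1]])

theta, history = gradient_descent(loss_fn, grad_fn,
                                  theta_init=np.array([5.0, 5.0]),
                                  lr=0.09, num_steps=20)
print(theta)          # close to the optimum (0, 0)
print(history[:3])    # loss decreases from 275.0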

Visualizing Gradient Descent

Consider minimizing \(L(x, y) = x^2 + 10y^2\) (an elliptical bowl):

Step 0: (5.0, 5.0), Loss = 275.0
Step 1: (4.0, 3.0), Loss = 106.0
Step 2: (3.2, 1.8), Loss = 42.64
...
Step 20: (0.11, 0.00), Loss = 0.012

The path zigzags because:

  1. The gradient in y is 10x larger than in x
  2. We overshoot in y, undershoot in x
  3. This is the "condition number problem"

     y
     ↑     Start
     │      ╲
     │       ╲
     │        ↘
     │      ↙  ↘
     │    ↙     ↘
     │  ↙        →→→ Optimum
     └──────────────→ x

Limitations of Vanilla Gradient Descent

1. Sensitive to Condition Number

For ill-conditioned problems (elongated loss landscapes), convergence is slow due to zigzagging.

Solution: Momentum (Section 4.4) or adaptive learning rates (Section 4.5).

2. Same Learning Rate for All Parameters

Some parameters may need larger updates, others smaller.

Solution: Per-parameter adaptive rates (AdaGrad, Adam).

3. Stuck in Saddle Points

At a saddle point the gradient is exactly zero, and near one it is very small, so progress stalls even though we are not at a minimum.

Solution: Noise from stochastic gradients helps escape.

4. Can't Escape Local Minima

Pure gradient descent always goes downhill. It can't climb over barriers to find better minima.

Solution: Stochastic noise, learning rate schedules, or random restarts.

Full-Batch Gradient Descent and Its Cost

Full (batch) gradient descent: compute the gradient over ALL of the training data.

\[\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla \ell_i(\theta)\]
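
Written out for linear regression (a sketch; here ℓᵢ is the squared error on example i and the toy data is synthetic), the full-batch gradient is simply the average of the per-example gradients:

import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 3)), rng.normal(size=1000)   # N = 1000 examples, n = 3 parameters

def full_gradient(theta):
    # Average over all N per-example gradients of l_i(theta) = 0.5 * (x_i . theta - y_i)^2
    residuals = X @ theta - y
    return (residuals[:, None] * X).mean(axis=0)

print(full_gradient(np.zeros(3)))   # one full pass over the entire dataset per update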

Problems:

  • For N = 1 billion examples, this is very expensive
  • Must load all data into memory
  • One update per full pass through data

Solution: Stochastic Gradient Descent (next section).

Historical Note

Augustin-Louis Cauchy first described gradient descent in 1847 for solving systems of equations. He called it "méthode de descente" (descent method).

The method was rediscovered and popularized in the 1950s with the advent of computers. Haskell Curry (1944) and Kenneth Arrow & Leonid Hurwicz (1958) provided convergence analyses.

For neural networks, gradient descent combined with backpropagation was popularized by Rumelhart, Hinton, and Williams (1986).

Connection to Modern LLMs

Modern LLMs don't use vanilla gradient descent. They use:

  • Stochastic gradients (mini-batches of 1-4 million tokens)
  • Momentum (usually β = 0.9)
  • Adaptive learning rates (Adam or AdamW)
  • Learning rate warmup (prevents early instability)
  • Gradient clipping (prevents explosions)

But all of these are built on the foundation we derived here.

Complexity Analysis

Operation                Cost
----------------------   ----------
Compute full gradient    O(N × n)
Update parameters        O(n)
Total per iteration      O(N × n)

Where N = dataset size, n = number of parameters.

For GPT-3 (175B parameters) on 300B tokens: each full gradient would require ~10²³ FLOPs. This is why we use stochastic methods.
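
That figure follows from the common rule of thumb of roughly 6 FLOPs per parameter per token for a combined forward and backward pass (an estimate, not stated in the text):

\[6 \times 1.75 \times 10^{11} \times 3 \times 10^{11} \approx 3 \times 10^{23} \text{ FLOPs}\]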

Exercises

  1. Implement gradient descent: Write gradient descent for \(L(x) = x^4 - 3x^2 + 2\). Find all critical points.

  2. Learning rate exploration: For \(L(x,y) = x^2 + 100y^2\), run gradient descent with η = 0.001, 0.01, 0.1. What happens?

  3. Convergence rate: Empirically verify the O(1/T) convergence rate on a convex quadratic.

  4. Visualize trajectories: Plot the optimization path on a contour plot of the loss surface.

  5. Compare to closed-form: For linear regression, compare gradient descent solution to the closed-form solution \((X^TX)^{-1}X^Ty\).

Summary

Concept              Definition                               Key Insight
------------------   --------------------------------------   -----------------------------------
Gradient descent     θ ← θ - η∇L                              Move opposite to gradient
Learning rate        Step size η                              Too big = diverge, too small = slow
Convergence rate     O(1/T) convex, O(ρᵀ) strongly convex     Depends on problem conditioning
Lipschitz constant   Upper bound on gradient change           Determines max safe learning rate

Key takeaway: Gradient descent is simple but powerful. Its limitations (sensitivity to conditioning, same rate for all parameters) motivate the more sophisticated methods we'll study next.

Next: Section 4.3: Stochastic Gradient Descent