Section 4.2: Gradient Descent — The Foundation¶
Reading time: 18 minutes | Difficulty: ★★★☆☆
Gradient descent is the workhorse of machine learning optimization. This section derives the algorithm from first principles, proves when it converges, and builds intuition for why it works.
The Core Idea¶
We want to minimize L(θ). We can't solve ∇L(θ) = 0 analytically for neural networks. Instead, we iterate:
- Compute the gradient ∇L(θ) at current position
- Take a small step in the opposite direction
- Repeat until convergence
The update rule:

\[
\theta_{t+1} = \theta_t - \eta \, \nabla L(\theta_t)
\]

where η (eta) is the learning rate — how big a step we take.
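As a concrete one-dimensional example, take \(L(\theta) = \theta^2\): the gradient is \(2\theta\), so the update becomes \(\theta_{t+1} = \theta_t - 2\eta\theta_t = (1 - 2\eta)\,\theta_t\). For any \(0 < \eta < 1\) the iterates shrink geometrically toward the minimum at \(\theta = 0\).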
Why Negative Gradient?¶
The gradient ∇L(θ) points in the direction of steepest increase. We want to decrease, so we go the opposite way.
Formal justification: Consider a first-order Taylor expansion around θ:

\[
L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^T \Delta\theta
\]

To decrease L, we want: \(\nabla L(\theta)^T \Delta\theta < 0\)

The most negative value for a fixed step size ‖Δθ‖ = η occurs when:

\[
\Delta\theta = -\eta \, \frac{\nabla L(\theta)}{\|\nabla L(\theta)\|}
\]

This is exactly the negative gradient direction (normalized). Taking Δθ = -η∇L points the same way and gives the steepest-descent update.
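As a quick numerical sanity check (an illustrative sketch, not part of the derivation, using the quadratic \(x^2 + 10y^2\) that reappears later in this section), we can compare the first-order change \(\nabla L^T d\) along random unit directions with the change along the normalized negative gradient; the latter is always the most negative:

```python
import numpy as np

# Illustrative check (not part of the derivation): among unit-length directions d,
# the normalized negative gradient gives the most negative first-order change grad·d.
rng = np.random.default_rng(0)

def grad(theta):
    # Gradient of the quadratic L(theta) = theta_0^2 + 10 * theta_1^2
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([3.0, 1.0])
g = grad(theta)

steepest = -g / np.linalg.norm(g)                      # normalized negative gradient
random_dirs = rng.normal(size=(1000, 2))
random_dirs /= np.linalg.norm(random_dirs, axis=1, keepdims=True)

print("slope along -grad/|grad|:", g @ steepest)       # equals -|grad|, the minimum possible
print("best slope over 1000 random unit directions:", (random_dirs @ g).min())
```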
Derivation from Taylor Series¶
Let's be more rigorous. The second-order Taylor expansion is:

\[
L(\theta + \Delta\theta) \approx L(\theta) + \nabla L(\theta)^T \Delta\theta + \frac{1}{2}\, \Delta\theta^T H \, \Delta\theta
\]

where H is the Hessian matrix.

Setting Δθ = -η∇L:

\[
L(\theta - \eta \nabla L) \approx L(\theta) - \eta \,\|\nabla L\|^2 + \frac{\eta^2}{2}\, \nabla L^T H \, \nabla L
\]

For small η, the dominant term is \(-\eta \|\nabla L\|^2 < 0\).
Key insight: As long as ∇L ≠ 0 and η is small enough, we're guaranteed to decrease the loss!
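A small sketch makes the key insight tangible (illustrative only; the function and learning rates are chosen arbitrarily): a sufficiently small step along −∇L decreases the loss, while a too-large step overshoots and can increase it.

```python
import numpy as np

# Illustrative check of the key insight: a step theta - eta*grad decreases the loss
# for small eta, but a too-large eta overshoots and can increase it.
def loss(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2

def grad(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

theta = np.array([3.0, 1.0])
for eta in [1e-3, 1e-2, 5e-2, 2e-1]:
    new_loss = loss(theta - eta * grad(theta))
    print(f"eta={eta:6.3f}  loss {loss(theta):.2f} -> {new_loss:.2f}")
```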
Choosing the Learning Rate¶
The learning rate η is crucial:
| η too small | η too large |
|---|---|
| Slow convergence | Overshooting |
| May get stuck | Divergence |
| Wastes computation | Oscillation |
The Goldilocks Zone¶
For a quadratic function \(L(\theta) = \frac{1}{2}\theta^T H \theta\) with eigenvalues λ₁ ≤ ... ≤ λₙ:
- Convergence requires: η < 2/λₘₐₓ
- Optimal for quadratic: η = 2/(λₘᵢₙ + λₘₐₓ) (see the sketch below)
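To make these thresholds concrete, here is a small sketch (illustrative; the eigenvalues λ = 1 and λ = 10 are chosen arbitrarily) that runs gradient descent on a diagonal quadratic just below the stability threshold, just above it, and at the "optimal" rate:

```python
import numpy as np

# Illustrative sketch: gradient descent on a diagonal quadratic
# L(theta) = 0.5 * theta^T H theta with H = diag(1, 10),
# so lambda_min = 1, lambda_max = 10 and the stability threshold is eta < 2/10 = 0.2.
eigvals = np.array([1.0, 10.0])

def final_loss(eta, steps=50):
    theta = np.array([1.0, 1.0])
    for _ in range(steps):
        theta = theta - eta * eigvals * theta     # gradient of the quadratic is H @ theta
    return 0.5 * np.sum(eigvals * theta ** 2)

for eta in [0.19, 0.21, 2 / (1.0 + 10.0)]:        # just below, just above, "optimal"
    print(f"eta={eta:.3f}  loss after 50 steps = {final_loss(eta):.3e}")
```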
Common Mistake: Fixed Learning Rate
Using the same learning rate throughout training is usually suboptimal. Early in training, you can take larger steps. Later, you need smaller steps for fine-tuning. This motivates learning rate schedules (Section 4.6).
Practical Heuristics¶
Since we can't compute eigenvalues for neural networks:
- Start small: η = 0.001 is a common default
- Learning rate finder: Sweep η from 10⁻⁷ to 10, plot loss vs η (a minimal version is sketched after this list)
- If loss explodes: Reduce η by 10x
- If loss plateaus: Try increasing η
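Here is a minimal sketch of the learning-rate-finder idea on a toy quadratic (illustrative only; the function and the number of steps per candidate η are arbitrary assumptions, and practical finders sweep η across mini-batches during real training):

```python
import numpy as np

# Minimal sketch of the learning-rate-finder idea on a toy quadratic (illustrative
# only; practical LR finders sweep eta across mini-batches during real training).
def loss(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2

def grad(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

for eta in np.logspace(-7, 1, num=9):
    theta = np.array([5.0, 5.0])
    for _ in range(10):                            # a few steps per candidate eta
        theta = theta - eta * grad(theta)
    print(f"eta={eta:8.1e}  loss after 10 steps = {loss(theta):.3e}")
# Pick a value near the smallest loss, typically a bit below where the loss blows up.
```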
Convergence Analysis¶
Under what conditions does gradient descent converge, and how fast?
Lipschitz Continuous Gradients¶
A function has L-Lipschitz continuous gradients if:

\[
\|\nabla L(\theta_1) - \nabla L(\theta_2)\| \;\leq\; L\,\|\theta_1 - \theta_2\| \quad \text{for all } \theta_1, \theta_2
\]

This means the gradient doesn't change too rapidly. L is called the Lipschitz constant (a scalar bound, not to be confused with the loss function L(θ) itself).
Convergence Theorem¶
Theorem: For a function with L-Lipschitz gradients and learning rate η ≤ 1/L:

\[
\min_{0 \leq t < T} \|\nabla L(\theta_t)\|^2 \;\leq\; \frac{2\,\bigl(L(\theta_0) - L^*\bigr)}{\eta\, T}
\]

where \(L^*\) is the minimum value of the loss.
Proof sketch:
- From Lipschitz property: \(L(\theta_{t+1}) \leq L(\theta_t) - \frac{\eta}{2}\|\nabla L(\theta_t)\|^2\)
- Sum over t = 0 to T-1
- Use telescoping and rearrange
Interpretation: After T iterations, the smallest squared gradient norm encountered is O(1/T); for convex losses the gap to the optimum, \(L(\theta_T) - L^*\), shrinks at the same O(1/T) rate. To halve the error, we need to double the number of iterations.
For Strongly Convex Functions¶
If L is also μ-strongly convex (curves upward everywhere), gradient descent with η = 1/L satisfies:

\[
L(\theta_T) - L^* \;\leq\; \Bigl(1 - \frac{\mu}{L}\Bigr)^{T} \bigl(L(\theta_0) - L^*\bigr)
\]

This is linear convergence — exponentially fast! The rate depends on the condition number κ = L/μ.
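The geometric rate is easy to observe numerically. The sketch below is illustrative only: it uses a strongly convex quadratic with curvatures 1 and 10 (so μ = 1, Lipschitz constant 10) and compares the actual gap to the theoretical bound.

```python
import numpy as np

# Illustrative check of linear (geometric) convergence on a strongly convex quadratic
# L(theta) = 0.5 * (1*x^2 + 10*y^2): here mu = 1, the Lipschitz constant is 10,
# and we use eta = 1/10.
eigvals = np.array([1.0, 10.0])
mu, lip = eigvals.min(), eigvals.max()
eta = 1.0 / lip

theta = np.array([5.0, 5.0])
gap0 = 0.5 * np.sum(eigvals * theta ** 2)          # L(theta_0) - L*  (L* = 0 here)
for t in range(1, 21):
    theta = theta - eta * eigvals * theta
    gap = 0.5 * np.sum(eigvals * theta ** 2)
    bound = (1 - mu / lip) ** t * gap0             # theoretical bound from the theorem
    if t % 5 == 0:
        print(f"t={t:2d}  actual gap = {gap:.3e}   bound = {bound:.3e}")
```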
The Algorithm¶
```python
def gradient_descent(loss_fn, grad_fn, theta_init, lr, num_steps):
    """
    Vanilla gradient descent.

    Args:
        loss_fn: Function computing L(θ)
        grad_fn: Function computing ∇L(θ)
        theta_init: Initial parameters
        lr: Learning rate η
        num_steps: Number of iterations

    Returns:
        Final parameters and loss history
    """
    theta = theta_init.copy()
    history = []
    for t in range(num_steps):
        loss = loss_fn(theta)
        grad = grad_fn(theta)
        history.append(loss)
        # The core update
        theta = theta - lr * grad
    return theta, history
```
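As a quick usage sketch (assuming the parameters are NumPy arrays, and run after defining `gradient_descent` above), here is the function applied to the elliptical bowl used in the next subsection:

```python
import numpy as np

# Example usage on L(x, y) = x^2 + 10*y^2, starting from (5, 5).
loss_fn = lambda th: th[0] ** 2 + 10 * th[1] ** 2
grad_fn = lambda th: np.array([2 * th[0], 20 * th[1]])

theta, history = gradient_descent(loss_fn, grad_fn,
                                   theta_init=np.array([5.0, 5.0]),
                                   lr=0.05, num_steps=100)
print("final theta:", theta, "  final loss:", loss_fn(theta))
```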
Visualizing Gradient Descent¶
Consider minimizing \(L(x, y) = x^2 + 10y^2\) (an elliptical bowl):
```
Step 0:  (5.0, 5.0),  Loss = 275.0
Step 1:  (4.0, 3.0),  Loss = 106.0
Step 2:  (3.2, 1.8),  Loss = 42.64
...
Step 20: (0.11, 0.00), Loss = 0.012
```
The path zigzags because:
- The loss curves 10× more steeply in y than in x, so the y-gradient dominates
- We overshoot in y while undershooting in x
- This is the "condition number problem" (reproduced in the sketch below)
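The zigzag is easy to reproduce. The printed trajectory above does not state its learning rate, so the sketch below assumes η = 0.09 as an illustration; the qualitative behavior (y oscillating across the valley while x creeps along it) is the same.

```python
import numpy as np

# Reproduce the zigzag on L(x, y) = x^2 + 10*y^2. The learning rate 0.09 is an
# assumption for illustration; the printed trajectory above may use a different value.
def loss(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2

def grad(theta):
    return np.array([2 * theta[0], 20 * theta[1]])

eta = 0.09
theta = np.array([5.0, 5.0])
for step in range(21):
    if step % 5 == 0:
        print(f"Step {step:2d}: ({theta[0]:7.3f}, {theta[1]:7.3f}), loss = {loss(theta):.3f}")
    theta = theta - eta * grad(theta)
# y flips sign every step (overshooting across the valley) while x shrinks slowly.
```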
Limitations of Vanilla Gradient Descent¶
1. Sensitive to Condition Number¶
For ill-conditioned problems (elongated loss landscapes), convergence is slow due to zigzagging.
Solution: Momentum (Section 4.4) or adaptive learning rates (Section 4.5).
2. Same Learning Rate for All Parameters¶
Some parameters may need larger updates, others smaller.
Solution: Per-parameter adaptive rates (AdaGrad, Adam).
3. Stuck in Saddle Points¶
At a saddle point the gradient is exactly zero, so the update stops; near one, gradients are tiny and progress stalls.
Solution: Noise from stochastic gradients helps escape.
4. Can't Escape Local Minima¶
Pure gradient descent always goes downhill. It can't climb over barriers to find better minima.
Solution: Stochastic noise, learning rate schedules, or random restarts.
Full-Batch Gradient Descent¶
Full (batch) gradient descent: Compute gradient over ALL training data.
Problems:
- For N = 1 billion examples, this is very expensive
- Must load all data into memory
- One update per full pass through data
Solution: Stochastic Gradient Descent (next section).
Historical Note¶
Augustin-Louis Cauchy first described gradient descent in 1847 for solving systems of equations. He called it "méthode de descente" (descent method).
The method was rediscovered and popularized in the 1940s and 1950s with the advent of digital computers. Haskell Curry (1944) and Kenneth Arrow & Leonid Hurwicz (1958) provided early convergence analyses.
For neural networks, gradient descent combined with backpropagation was popularized by Rumelhart, Hinton, and Williams (1986).
Connection to Modern LLMs
Modern LLMs don't use vanilla gradient descent. They use:
- Stochastic gradients (mini-batches of 1-4 million tokens)
- Momentum (usually β = 0.9)
- Adaptive learning rates (Adam or AdamW)
- Learning rate warmup (prevents early instability)
- Gradient clipping (prevents explosions)
But all of these are built on the foundation we derived here.
Complexity Analysis¶
| Operation | Cost |
|---|---|
| Compute full gradient | O(N × n) |
| Update parameters | O(n) |
| Total per iteration | O(N × n) |
Where N = dataset size, n = number of parameters.
For GPT-3 (175B parameters) on 300B tokens: each full gradient would require ~10²³ FLOPs. This is why we use stochastic methods.
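This estimate follows from the common rule of thumb of roughly 6 FLOPs per parameter per token for a combined forward and backward pass: 6 × 1.75×10¹¹ parameters × 3×10¹¹ tokens ≈ 3×10²³ FLOPs for a single full-batch gradient.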
Exercises¶
1. Implement gradient descent: Write gradient descent for \(L(x) = x^4 - 3x^2 + 2\). Find all critical points.
2. Learning rate exploration: For \(L(x,y) = x^2 + 100y^2\), run gradient descent with η = 0.001, 0.01, 0.1. What happens?
3. Convergence rate: Empirically verify the O(1/T) convergence rate on a convex quadratic.
4. Visualize trajectories: Plot the optimization path on a contour plot of the loss surface.
5. Compare to closed-form: For linear regression, compare the gradient descent solution to the closed-form solution \((X^TX)^{-1}X^Ty\).
Summary¶
| Concept | Definition | Key Insight |
|---|---|---|
| Gradient descent | θ ← θ - η∇L | Move opposite to gradient |
| Learning rate | Step size η | Too big = diverge, too small = slow |
| Convergence rate | O(1/T) general, O(ρᵀ) strongly convex | Depends on problem conditioning |
| Lipschitz constant | Upper bound on gradient change | Determines max safe learning rate |
Key takeaway: Gradient descent is simple but powerful. Its limitations (sensitivity to conditioning, same rate for all parameters) motivate the more sophisticated methods we'll study next.