Section 4.1: The Optimization Problem¶
Reading time: 15 minutes | Difficulty: ★★☆☆☆
Before we can solve an optimization problem, we need to understand what we're optimizing. This section frames the problem precisely and builds intuition about loss landscapes.
The Setup¶
We have:
- A model with parameters θ (weights and biases)
- A loss function L(θ) that measures how wrong the model is
- Training data that defines what "wrong" means
Goal: Find θ* = argmin L(θ)
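To make the setup concrete, here is a minimal sketch in Python (the toy data and the squared-error loss are illustrative choices, not part of the text above): a one-parameter model, a loss L(θ) averaged over training data, and a brute-force scan for θ* that only works because the parameter space is one-dimensional.

```python
import numpy as np

# Toy training data: roughly y = 2x plus noise (illustrative values).
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.1])

def loss(theta):
    """L(theta): mean squared error of the 1-parameter model y_hat = theta * x."""
    return np.mean((theta * x - y) ** 2)

# With a single parameter we can simply scan for the minimizer theta*.
thetas = np.linspace(0.0, 4.0, 401)
losses = np.array([loss(t) for t in thetas])
theta_star = thetas[np.argmin(losses)]
print(f"theta* ≈ {theta_star:.2f}, L(theta*) ≈ {losses.min():.4f}")
```

With 100 million parameters, a grid scan like this is out of the question, which is exactly where the difficulty begins.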
This seems simple. Why is it hard?
Why Optimization is Hard¶
Problem 1: High Dimensionality¶
A small language model might have 100 million parameters. We're searching for a minimum in 100,000,000-dimensional space.
Intuition: In 2D, you can visualize the landscape. In 100M dimensions:
- There are exponentially many directions to explore
- Local structure doesn't tell you about global structure
- Random search is hopeless
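A quick numerical illustration of the last point above (the dimensions and sample count are arbitrary choices): a random unit direction is nearly orthogonal to the direction of steepest descent once the dimension is large, so randomly chosen steps make almost no progress.

```python
import numpy as np

rng = np.random.default_rng(0)

for dim in [2, 100, 10_000]:
    descent_dir = np.zeros(dim)
    descent_dir[0] = 1.0                            # pretend this is the (normalized) -∇L
    v = rng.standard_normal((500, dim))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # 500 random unit directions
    alignment = np.abs(v @ descent_dir)             # |cos angle| with the descent direction
    # The mean alignment shrinks roughly like 1/sqrt(dim).
    print(f"dim={dim:>6}: mean |cos| ≈ {alignment.mean():.4f}")
```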
Problem 2: Non-Convexity¶
A function is convex if the line segment between any two points on its graph lies on or above the graph:

\[ L\big(\lambda\,\theta_1 + (1-\lambda)\,\theta_2\big) \;\le\; \lambda\,L(\theta_1) + (1-\lambda)\,L(\theta_2), \qquad \lambda \in [0, 1] \]
For convex functions, any local minimum is also a global minimum. Life is easy.
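Here is a small numerical check of that definition (a sketch; the example functions x² and cos x are my choices, not from the text): for a convex function no point on the chord between two graph points lies below the function, while a non-convex function violates this for some pair of points.

```python
import numpy as np

def violates_convexity(f, a, b, num=101):
    """True if the function rises above the chord between (a, f(a)) and (b, f(b)),
    i.e. the convexity inequality fails somewhere on the segment."""
    lam = np.linspace(0.0, 1.0, num)
    chord = lam * f(a) + (1 - lam) * f(b)
    curve = f(lam * a + (1 - lam) * b)
    return bool(np.any(curve > chord + 1e-12))

print(violates_convexity(lambda t: t ** 2, -2.0, 3.0))  # False: t^2 is convex
print(violates_convexity(np.cos, 0.0, 3.0))             # True: cos is not convex
```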
Neural network loss functions are NOT convex. They have:
- Multiple local minima
- Saddle points (minima in some directions, maxima in others)
- Flat regions (plateaus)
Common Misconception
"We need to find the global minimum."
Actually, research shows that for overparameterized neural networks, most local minima are nearly as good as the global minimum. The real challenge is escaping saddle points and plateaus.
Problem 3: Stochasticity¶
We can't compute the true gradient over all training data (too expensive). Instead, we estimate it using mini-batches, introducing noise.
This noise is both:
- A problem: Makes optimization less stable
- A feature: Acts as regularization, helping generalization
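A sketch of what that noise looks like in practice (toy linear-regression data and a batch size of 32, both illustrative choices): the mini-batch gradient matches the full-data gradient on average, but individual estimates scatter around it.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta = 10_000, 1.5
x = rng.standard_normal(n)
y = 2.0 * x + 0.1 * rng.standard_normal(n)   # toy data with true slope 2.0

def grad(idx):
    """Gradient of the mean squared error (theta*x - y)^2 over the chosen samples."""
    return np.mean(2 * (theta * x[idx] - y[idx]) * x[idx])

full_grad = grad(np.arange(n))
batch_grads = [grad(rng.choice(n, size=32, replace=False)) for _ in range(1000)]
print(f"full gradient      : {full_grad: .3f}")
print(f"mini-batch mean    : {np.mean(batch_grads): .3f}   (close to the full gradient: unbiased)")
print(f"mini-batch std dev : {np.std(batch_grads): .3f}   (the noise)")
```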
Loss Landscapes¶
What Do They Look Like?¶
For neural networks, the loss landscape has interesting structure:
(Diagram: a one-dimensional slice of a loss landscape, marking a saddle point, a shallow local minimum, and the deeper global minimum.)
Key features:
- Many directions to explore (high-dimensional)
- Valleys that lead toward good solutions
- Saddle points that can trap gradient descent
- Flat regions where gradients vanish
Saddle Points Dominate¶
In high dimensions, saddle points are much more common than local minima.
Why? At a critical point (where ∇L = 0), the Hessian matrix H determines the type:
- All eigenvalues positive → local minimum
- All eigenvalues negative → local maximum
- Mixed signs → saddle point
With 100 million parameters, the probability that ALL eigenvalues have the same sign is essentially zero (heuristically, it is like 100 million coin flips all landing the same way). Almost all critical points are saddle points.
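The eigenvalue rule above is easy to turn into code. The sketch below builds a finite-difference Hessian for a toy function (my own example, not from the text) and classifies its critical points by eigenvalue signs.

```python
import numpy as np

def classify_critical_point(f, point, eps=1e-4):
    """Classify a critical point by the eigenvalue signs of a finite-difference Hessian."""
    point = np.asarray(point, dtype=float)
    dim = point.size
    H = np.zeros((dim, dim))
    for i in range(dim):
        for j in range(dim):
            ei, ej = np.eye(dim)[i] * eps, np.eye(dim)[j] * eps
            H[i, j] = (f(point + ei + ej) - f(point + ei - ej)
                       - f(point - ei + ej) + f(point - ei - ej)) / (4 * eps ** 2)
    eigvals = np.linalg.eigvalsh((H + H.T) / 2)
    if np.all(eigvals > 0):
        return "local minimum", eigvals
    if np.all(eigvals < 0):
        return "local maximum", eigvals
    return "saddle point", eigvals

# f(x, y) = x^2 + y^3 - 3y has critical points at (0, 1) and (0, -1):
# a local minimum and a saddle point, respectively.
f = lambda p: p[0] ** 2 + p[1] ** 3 - 3 * p[1]
for pt in ([0.0, 1.0], [0.0, -1.0]):
    kind, eigvals = classify_critical_point(f, pt)
    print(pt, "->", kind, np.round(eigvals, 3))
```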
Connection to Modern LLMs
Large language models like GPT-4 are estimated to have hundreds of billions of parameters. At this scale, the optimization landscape becomes even more interesting. Research suggests that:
- The loss landscape has many "valleys" that all lead to similarly good solutions
- Wider minima (flat regions) generalize better than sharp ones
- The path taken during optimization matters for final performance
Mathematical Framework¶
First-Order Information: Gradients¶
The gradient ∇L(θ) tells us the direction of steepest ascent. Move in the opposite direction to decrease loss:

\[ \theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t) \]

where η is the learning rate.
Limitation: The gradient only gives local information. A small gradient could mean:
- We're near a minimum (good!)
- We're on a flat plateau (bad)
- We're at a saddle point (tricky)
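That ambiguity is easy to see numerically. The sketch below (toy functions and evaluation points of my choosing) computes the gradient norm in three situations that look nearly identical from first-order information alone.

```python
import numpy as np

def num_grad(f, p, eps=1e-6):
    """Central-difference gradient of f at point p."""
    p = np.asarray(p, dtype=float)
    g = np.zeros_like(p)
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = eps
        g[i] = (f(p + e) - f(p - e)) / (2 * eps)
    return g

cases = [
    ("near a minimum   : x^2 + y^2  at (1e-4, 0)", lambda p: p[0] ** 2 + p[1] ** 2, [1e-4, 0.0]),
    ("on a flat plateau: tanh(x)^2  at (8, 0)   ", lambda p: np.tanh(p[0]) ** 2,    [8.0, 0.0]),
    ("at a saddle point: x^2 - y^2  at (0, 0)   ", lambda p: p[0] ** 2 - p[1] ** 2, [0.0, 0.0]),
]
for name, f, p in cases:
    print(f"{name}  |grad| = {np.linalg.norm(num_grad(f, p)):.1e}")
```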
Second-Order Information: Hessians¶
The Hessian matrix H contains all second derivatives:

\[ H_{ij} = \frac{\partial^2 L}{\partial \theta_i\,\partial \theta_j} \]
The Hessian tells us about curvature:
- Positive eigenvalues → curving up (bowl shape)
- Negative eigenvalues → curving down (dome shape)
- Mixed → saddle point
Second-order methods like Newton's method use the Hessian to take smarter steps:

\[ \theta_{t+1} = \theta_t - H^{-1}\,\nabla L(\theta_t) \]

Problem: For 100M parameters, H is a 100M × 100M matrix. Computing, storing, and inverting it is computationally infeasible.
This is why we use first-order methods with clever enhancements (momentum, adaptive learning rates).
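To see what the Hessian buys when it is affordable, here is a sketch comparing one Newton step with one gradient step on a badly conditioned two-dimensional quadratic (the Hessian and learning rate are arbitrary illustrative values).

```python
import numpy as np

# Quadratic loss L(θ) = 0.5 θᵀHθ with very different curvatures in the two directions.
H = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
grad = H @ theta                                   # ∇L(θ) = Hθ for this quadratic

newton_step   = theta - np.linalg.solve(H, grad)   # θ - H⁻¹∇L: solves the quadratic exactly
gradient_step = theta - 0.01 * grad                # θ - η∇L with η = 0.01

print("after one Newton step   :", newton_step)    # lands at the minimum [0, 0]
print("after one gradient step :", gradient_step)  # barely moved along the shallow direction
```

Newton's method effectively rescales every direction by its curvature; the catch, as noted above, is that forming and inverting H is out of reach at deep-learning scale.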
Condition Number and Convergence¶
The condition number κ of a matrix measures how "stretched" it is:

\[ \kappa = \frac{\lambda_{\max}}{\lambda_{\min}} \]

where λ_max and λ_min are the largest and smallest eigenvalues of the Hessian.

Why it matters: For smooth, strongly convex problems, gradient descent needs O(κ) iterations to reach a given accuracy. If the landscape is very elongated (high κ), convergence is slow.
(Diagram: a round, bowl-shaped landscape with low condition number (κ ≈ 1) next to a long, narrow, elongated valley with high condition number (κ ≫ 1).)
In the elongated case, the gradient points sideways, not toward the minimum. This is the "zig-zag" problem that momentum helps solve.
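The O(κ) behaviour is easy to reproduce. The sketch below (an illustrative diagonal quadratic, with the learning rate set just inside its stability limit of 2/λ_max) counts gradient-descent iterations for increasing condition numbers; the count grows roughly in proportion to κ.

```python
import numpy as np

def gd_iterations(kappa, tol=1e-6, max_iters=100_000):
    """Gradient descent on L(θ) = 0.5*(θ1² + κ·θ2²); returns iterations until |θ| < tol."""
    H = np.array([1.0, kappa])           # eigenvalues of the (diagonal) Hessian
    lr = 1.8 / kappa                     # 90% of the stability limit 2/λ_max
    theta = np.array([1.0, 1.0])
    for t in range(max_iters):
        if np.linalg.norm(theta) < tol:
            return t
        theta = theta - lr * H * theta   # θ ← θ - η∇L, with ∇L = H·θ elementwise
    return max_iters

for kappa in [1, 10, 100, 1000]:
    print(f"κ = {kappa:>5}: {gd_iterations(kappa):>6} iterations")
```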
Complexity Analysis¶
| Operation | Cost | Notes |
|---|---|---|
| Compute gradient | O(n) | One backward pass |
| Compute full Hessian | O(n²) | Prohibitive for large n |
| Store Hessian | O(n²) | 100M params ⇒ \(10^{16}\) entries (≈ 40 petabytes in float32) |
| Hessian-vector product | O(n) | Used in some methods |
This is why first-order methods dominate deep learning.
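The Hessian-vector product row in the table deserves a quick demonstration. This sketch uses the finite-difference identity Hv ≈ (∇L(θ + εv) − ∇L(θ − εv)) / 2ε on a toy loss of my choosing; automatic-differentiation frameworks provide exact versions of the same O(n) trick.

```python
import numpy as np

def loss_grad(theta):
    """Gradient of the toy loss L(θ) = Σ θ_i⁴ / 4, i.e. ∇L = θ³ elementwise."""
    return theta ** 3

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate H·v with two gradient evaluations: O(n) time and memory,
    never forming the n×n Hessian."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

theta = np.array([1.0, 2.0, 3.0])
v = np.array([1.0, 0.0, 1.0])
print(hessian_vector_product(loss_grad, theta, v))   # ≈ [3, 0, 27], since H = diag(3θ_i²)
```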
Historical Note¶
Cauchy (1847) first described gradient descent for solving systems of equations. The method was largely forgotten until the 1950s when computers made iterative optimization practical.
Robbins and Monro (1951) proved that stochastic approximation (what we now call SGD) converges under certain conditions. This theoretical foundation enabled modern deep learning.
Exercises¶
1. Visualize a loss landscape: For a 2-parameter model, plot the loss surface and gradient field.
2. Count saddle points: For a simple function like f(x,y) = x² - y², find all critical points and classify them.
3. Condition number experiment: Create loss functions with different condition numbers and observe how gradient descent behaves.
Summary¶
| Concept | Definition | Why It Matters |
|---|---|---|
| Loss landscape | L(θ) as a function of parameters | Determines optimization difficulty |
| Convexity | Line above function property | Convex = easy optimization |
| Saddle points | ∇L = 0 with mixed Hessian signs | Dominate in high dimensions |
| Condition number | κ = λ_max / λ_min | Determines convergence speed |
Key insight: Deep learning optimization is hard because we're searching in extremely high-dimensional, non-convex spaces where gradient information is noisy and incomplete. Yet somehow, first-order methods work remarkably well. Understanding why is the focus of this stage.
→ Next: Section 4.2: Gradient Descent