Section 4.1: The Optimization Problem

Reading time: 15 minutes | Difficulty: ★★☆☆☆

Before we can solve optimization, we need to understand what we're optimizing. This section frames the problem precisely and builds intuition about loss landscapes.

The Setup

We have:

  • A model with parameters θ (weights and biases)
  • A loss function L(θ) that measures how wrong the model is
  • Training data that defines what "wrong" means

Goal: Find

\[\theta^* = \arg\min_{\theta} L(\theta)\]
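
To make this concrete, here is a minimal sketch in Python, with linear regression standing in for "a model with parameters θ". The data, model, and loss below are illustrative choices for the demo, not from any particular library:

```python
import numpy as np

# Toy instance of min_theta L(theta): linear regression with squared error.
# All data and numbers here are illustrative.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # training data: 100 examples, 3 features
true_theta = np.array([2.0, -1.0, 0.5])       # "ground truth" parameters
y = X @ true_theta + 0.1 * rng.normal(size=100)

def loss(theta):
    """L(theta): mean squared error -- how wrong the model is on the data."""
    return np.mean((X @ theta - y) ** 2)

print(loss(np.zeros(3)))   # a bad theta gives high loss
print(loss(true_theta))    # near theta* the loss is small
```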

This seems simple. Why is it hard?

Why Optimization is Hard

Problem 1: High Dimensionality

A small language model might have 100 million parameters. We're searching for a minimum in 100,000,000-dimensional space.

Intuition: In 2D, you can visualize the landscape. In 100M dimensions:

  • There are exponentially many directions to explore
  • Local structure doesn't tell you about global structure
  • Random search is hopeless (see the sketch below)
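
A quick numerical sketch of why random search is hopeless: in n dimensions, a random direction is nearly orthogonal to the gradient, capturing only about 1/√n of the downhill signal. The code is an illustrative demo:

```python
import numpy as np

# A random search direction is almost orthogonal to the gradient in high
# dimensions: the cosine of the angle between them shrinks like 1/sqrt(n).
rng = np.random.default_rng(0)
for n in [2, 100, 10_000, 1_000_000]:
    grad = rng.normal(size=n)          # stand-in for some gradient
    d = rng.normal(size=n)             # a random search direction
    cos = abs(grad @ d) / (np.linalg.norm(grad) * np.linalg.norm(d))
    print(f"n = {n:>9}: |cos(angle)| ≈ {cos:.4f}  (~ 1/sqrt(n) = {1/np.sqrt(n):.4f})")
```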

Problem 2: Non-Convexity

A function is convex if the line segment between any two points on its graph lies on or above the graph:

\[f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha) f(y) \quad \text{for all } \alpha \in [0, 1]\]

For convex functions, any local minimum is also a global minimum. Life is easy.
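
Here is a small sketch that probes the convexity inequality numerically, with one convex function and one non-convex one. The function choices are mine, for illustration:

```python
import numpy as np

# Randomly test f(a*x + (1-a)*y) <= a*f(x) + (1-a)*f(y).
rng = np.random.default_rng(0)

def violates_convexity(f, trials=10_000):
    """Return True if some sampled (x, y, alpha) violates the inequality."""
    x, y = rng.uniform(-2, 2, size=(2, trials))
    a = rng.uniform(0, 1, size=trials)
    lhs = f(a * x + (1 - a) * y)
    rhs = a * f(x) + (1 - a) * f(y)
    return bool(np.any(lhs > rhs + 1e-12))

f1 = lambda x: x ** 2            # convex: inequality always holds
f2 = lambda x: x ** 4 - x ** 2   # non-convex double well: inequality fails

print("x^2 violates convexity:      ", violates_convexity(f1))  # False
print("x^4 - x^2 violates convexity:", violates_convexity(f2))  # True
```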

Neural network loss functions are NOT convex. They have:

  • Multiple local minima
  • Saddle points (minima in some directions, maxima in others)
  • Flat regions (plateaus)

Common Misconception

"We need to find the global minimum."

Actually, research shows that for overparameterized neural networks, most local minima are nearly as good as the global minimum. The real challenge is escaping saddle points and plateaus.

Problem 3: Stochasticity

We can't compute the true gradient over all training data (too expensive). Instead, we estimate it using mini-batches, introducing noise.

This noise is both:

  • A problem: Makes optimization less stable
  • A feature: Acts as regularization, helping generalization
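
To see the noise concretely, here is a sketch comparing the full-data gradient with mini-batch estimates on an illustrative least-squares problem (the sizes and data are made up for the demo):

```python
import numpy as np

# Mini-batch gradients are noisy estimates of the full gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=10_000)
theta = np.zeros(5)

def grad(Xb, yb, theta):
    """Gradient of mean squared error on the batch (Xb, yb)."""
    return 2 * Xb.T @ (Xb @ theta - yb) / len(yb)

full = grad(X, y, theta)                    # "true" gradient (expensive)
for batch_size in [10, 100, 1000]:
    idx = rng.choice(len(y), size=batch_size, replace=False)
    est = grad(X[idx], y[idx], theta)       # cheap, noisy estimate
    err = np.linalg.norm(est - full) / np.linalg.norm(full)
    print(f"batch {batch_size:>4}: relative gradient error ≈ {err:.3f}")
```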

Loss Landscapes

What Do They Look Like?

For neural networks, the loss landscape has interesting structure:

                    ╱╲
                   ╱  ╲         Local minimum
                  ╱    ╲            ↓
        ╱╲       ╱      ╲       ╱────╲
       ╱  ╲     ╱        ╲     ╱      ╲
──────╱    ╲───╱          ╲───╱        ╲────────
              ↑              ↑
         Saddle point    Global minimum

Key features:

  • Many directions to explore (high-dimensional)
  • Valleys that lead toward good solutions
  • Saddle points that can trap gradient descent
  • Flat regions where gradients vanish

Saddle Points Dominate

In high dimensions, saddle points are much more common than local minima.

Why? At a critical point (where ∇L = 0), the Hessian matrix H determines the type:

  • All eigenvalues positive → local minimum
  • All eigenvalues negative → local maximum
  • Mixed signs → saddle point

With 100 million parameters, the probability that ALL eigenvalues have the same sign is essentially zero: if each eigenvalue's sign were an independent coin flip, the chance that all n agree would be about \(2^{1-n}\). Almost all critical points are saddle points.
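
A rough empirical check of this claim, using random symmetric matrices as stand-ins for Hessians. The Gaussian model here is an assumption for illustration, not the Hessian of any actual network:

```python
import numpy as np

# How often do ALL eigenvalues of a random symmetric matrix share a sign?
rng = np.random.default_rng(0)
trials = 2000
for n in [2, 5, 10, 20]:
    all_same = 0
    for _ in range(trials):
        A = rng.normal(size=(n, n))
        H = (A + A.T) / 2                      # symmetrize, like a Hessian
        eig = np.linalg.eigvalsh(H)
        all_same += np.all(eig > 0) or np.all(eig < 0)
    print(f"n = {n:>2}: fraction with all-same-sign eigenvalues ≈ {all_same / trials:.4f}")
```

The fraction collapses toward zero as n grows, matching the intuition that high-dimensional critical points are almost always saddles.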

Connection to Modern LLMs

Large language models like GPT-4 are estimated to have hundreds of billions of parameters. At this scale, the optimization landscape becomes even more interesting. Research suggests that:

  1. The loss landscape has many "valleys" that all lead to similarly good solutions
  2. Wider minima (flat regions) generalize better than sharp ones
  3. The path taken during optimization matters for final performance

Mathematical Framework

First-Order Information: Gradients

The gradient ∇L(θ) tells us the direction of steepest ascent. Move in the opposite direction to decrease loss:

\[\theta_{new} = \theta_{old} - \eta \nabla L(\theta_{old})\]

where η is the learning rate.
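
As a minimal sketch, here is that update rule as a loop on an illustrative 2-D quadratic bowl:

```python
import numpy as np

# Gradient descent on L(theta) = theta_0^2 + 4*theta_1^2 (illustrative).
def L(theta):
    return theta[0] ** 2 + 4 * theta[1] ** 2

def grad_L(theta):
    return np.array([2 * theta[0], 8 * theta[1]])   # direction of steepest ascent

theta = np.array([3.0, 2.0])
eta = 0.1                                           # learning rate
for step in range(50):
    theta = theta - eta * grad_L(theta)             # theta_new = theta_old - eta * grad
print(theta, L(theta))   # approaches the minimum at (0, 0)
```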

Limitation: The gradient only gives local information. A small gradient could mean:

  • We're near a minimum (good!)
  • We're on a flat plateau (bad)
  • We're at a saddle point (tricky)

Second-Order Information: Hessians

The Hessian matrix H contains all second derivatives:

\[H_{ij} = \frac{\partial^2 L}{\partial \theta_i \partial \theta_j}\]

The Hessian tells us about curvature:

  • Positive eigenvalues → curving up (bowl shape)
  • Negative eigenvalues → curving down (dome shape)
  • Mixed → saddle point

Second-order methods like Newton's method use the Hessian to take smarter steps:

\[\theta_{new} = \theta_{old} - H^{-1} \nabla L\]

Problem: For 100M parameters, H is a 100M × 100M matrix with \(10^{16}\) entries. Computing, storing, and inverting it is infeasible.
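
In two dimensions, though, the Newton step is cheap to compute, and a sketch shows why it is attractive. The quadratic below is an illustrative choice, reusing the earlier bowl:

```python
import numpy as np

# Newton's method on L(theta) = theta_0^2 + 4*theta_1^2, whose Hessian is constant.
def grad_L(theta):
    return np.array([2 * theta[0], 8 * theta[1]])

H = np.array([[2.0, 0.0],
              [0.0, 8.0]])                           # Hessian of L

theta = np.array([3.0, 2.0])
theta = theta - np.linalg.solve(H, grad_L(theta))    # Newton step: theta - H^{-1} grad
print(theta)   # lands exactly at the minimum (0, 0): one step suffices for a quadratic
```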

This is why we use first-order methods with clever enhancements (momentum, adaptive learning rates).

Condition Number and Convergence

The condition number κ of a matrix measures how "stretched" it is:

\[\kappa = \frac{\lambda_{max}}{\lambda_{min}}\]

where λ are eigenvalues of the Hessian.

Why it matters: For well-behaved (strongly convex) problems, gradient descent needs on the order of κ iterations to reach a given accuracy. If the landscape is very elongated (high κ), convergence is slow.

Low condition number (κ ≈ 1):        High condition number (κ ≫ 1):
round bowl, circular contours        elongated valley, stretched contours

          ╭───╮                        ╭─────────────────────╮
         ( (•) )                      ( (──────── • ───────) )
          ╰───╯                        ╰─────────────────────╯

(contours seen from above; • marks the minimum)

In the elongated case, the gradient points sideways, not toward the minimum. This is the "zig-zag" problem that momentum helps solve.
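
A sketch of the zig-zag on an illustrative quadratic with κ = 100:

```python
import numpy as np

# L(theta) = theta_0^2 + 100*theta_1^2, so the Hessian is diag(2, 200) and
# kappa = 200/2 = 100. The step size must stay below 2/lambda_max = 0.01.
def grad_L(theta):
    return np.array([2 * theta[0], 200 * theta[1]])

theta = np.array([10.0, 1.0])
eta = 0.009
for step in range(5):
    theta = theta - eta * grad_L(theta)
    print(f"step {step}: theta = ({theta[0]:+.3f}, {theta[1]:+.3f})")
# theta_1 overshoots and flips sign every step (zig-zag) while theta_0 crawls.
```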

Complexity Analysis

Operation                Cost     Notes
Compute gradient         O(n)     One backward pass
Compute full Hessian     O(n²)    Prohibitive for large n
Store Hessian            O(n²)    100M params → \(10^{16}\) entries
Hessian-vector product   O(n)     Used in some methods (see the sketch below)

This is why first-order methods dominate deep learning.
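
The Hessian-vector product row is worth a sketch: Hv can be approximated without ever forming H, for example by a finite difference of gradients. The function and ε below are illustrative choices:

```python
import numpy as np

# Hessian-vector product without forming H:
#   Hv ≈ (grad(theta + eps*v) - grad(theta)) / eps   -- O(n) time and memory.
def grad_L(theta):
    # Gradient of the illustrative quadratic L = theta_0^2 + 4*theta_1^2
    return np.array([2 * theta[0], 8 * theta[1]])

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps

theta = np.array([3.0, 2.0])
v = np.array([1.0, 1.0])
print(hessian_vector_product(grad_L, theta, v))   # ≈ H @ v = [2, 8]
```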

Historical Note

Cauchy (1847) first described gradient descent for solving systems of equations. The method was largely forgotten until the 1950s when computers made iterative optimization practical.

Robbins and Monro (1951) proved that stochastic approximation (what we now call SGD) converges under certain conditions. This theoretical foundation enabled modern deep learning.

Exercises

  1. Visualize a loss landscape: For a 2-parameter model, plot the loss surface and gradient field.

  2. Count saddle points: For a simple function like f(x,y) = x² - y², find all critical points and classify them.

  3. Condition number experiment: Create loss functions with different condition numbers and observe how gradient descent behaves.

Summary

Concept            Definition                          Why It Matters
Loss landscape     L(θ) as a function of parameters    Determines optimization difficulty
Convexity          Chord lies on or above the graph    Convex = easy optimization
Saddle points      ∇L = 0 with mixed Hessian signs     Dominate in high dimensions
Condition number   κ = λ_max / λ_min                   Determines convergence speed

Key insight: Deep learning optimization is hard because we're searching in extremely high-dimensional, non-convex spaces where gradient information is noisy and incomplete. Yet somehow, first-order methods work remarkably well. Understanding why is the focus of this stage.

Next: Section 4.2: Gradient Descent