Stage 4 Exercises

Conceptual Questions

Exercise 4.1: Loss Landscape

Consider minimizing f(x) = x⁴ - 2x² + 1.

a) Find all critical points (where f'(x) = 0).
b) Classify each as a local min, local max, or saddle point.
c) If gradient descent starts at x = 0.1, where will it converge?
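If you want to sanity-check part (c), a few hundred steps of plain gradient descent can be run numerically. This is only a sketch; the step size of 0.01 is an arbitrary illustrative choice, not part of the exercise.

# Numerical check for 4.1(c): gradient descent on f(x) = x^4 - 2x^2 + 1.
def f_prime(x):
    return 4 * x**3 - 4 * x

x = 0.1
for _ in range(500):
    x -= 0.01 * f_prime(x)   # step size chosen arbitrarily for illustration
print(x)  # compare against the critical points you found in (a)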

Exercise 4.2: Learning Rate Effects

For f(x) = x², the gradient descent update is: x ← x - η * 2x

a) Starting at x = 10 with η = 0.1, compute x after 5 steps.
b) What happens with η = 0.6?
c) What happens with η = 1.0?
d) What is the maximum stable learning rate?
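A short loop makes it easy to check your hand computations for (a)-(c); it simply iterates the update above and prints the first few iterates for each η.

# Iterate x <- x - lr * 2x for f(x) = x^2, starting from x = 10.
def iterate(lr, x=10.0, steps=5):
    history = [x]
    for _ in range(steps):
        x = x - lr * 2 * x   # gradient of x^2 is 2x
        history.append(x)
    return history

for lr in (0.1, 0.6, 1.0):
    print(lr, iterate(lr))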

Exercise 4.3: Momentum Intuition

Think of momentum as a ball rolling down a hill and gathering speed:

a) How does momentum help with narrow valleys?
b) How does momentum help escape shallow local minima?
c) What's the downside of high momentum?

Exercise 4.4: Adam's Components

Adam combines momentum and adaptive learning rates.

a) What does the first moment (m) track?
b) What does the second moment (v) track?
c) Why does Adam divide by √v + ε?


Implementation Exercises

Exercise 4.5: Gradient Descent Variants

Implement and compare:

def sgd(params, grads, lr):
    """Basic SGD: θ ← θ - lr * g"""
    # TODO
    pass

def sgd_momentum(params, grads, velocity, lr, beta=0.9):
    """SGD with momentum: v ← βv + g; θ ← θ - lr * v"""
    # TODO
    pass

def sgd_nesterov(params, grads, velocity, lr, beta=0.9):
    """Nesterov momentum: look ahead before computing gradient"""
    # TODO
    pass
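If you want something to compare against after attempting these, here is a minimal sketch of the first two variants. It assumes params, grads, and velocity are dicts keyed by parameter name holding NumPy arrays or floats (that container choice is an assumption, not part of the exercise); the Nesterov variant is left to you, and the reference implementation remains the authority.

def sgd_sketch(params, grads, lr):
    # Plain SGD: theta <- theta - lr * g, applied per parameter.
    for name in params:
        params[name] -= lr * grads[name]

def sgd_momentum_sketch(params, grads, velocity, lr, beta=0.9):
    # Momentum: v <- beta * v + g, then theta <- theta - lr * v.
    for name in params:
        velocity[name] = beta * velocity[name] + grads[name]
        params[name] -= lr * velocity[name]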

Exercise 4.6: Implement Adam

Implement Adam optimizer:

class Adam:
    def __init__(self, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr = lr
        self.beta1 = beta1
        self.beta2 = beta2
        self.eps = eps
        self.t = 0
        self.m = {}  # First moment
        self.v = {}  # Second moment

    def step(self, params, grads):
        """
        m = β₁m + (1-β₁)g
        v = β₂v + (1-β₂)g²
        m̂ = m / (1 - β₁ᵗ)   # Bias correction
        v̂ = v / (1 - β₂ᵗ)
        θ = θ - lr * m̂ / (√v̂ + ε)
        """
        # TODO
        pass
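For comparison once you have attempted it yourself, here is one possible body for step, translating the docstring's equations directly. It is written as a free function taking self so the class stub above stays untouched, and it assumes params and grads are dicts of NumPy arrays.

import numpy as np

def adam_step_sketch(self, params, grads):
    # One possible Adam.step body, following the docstring's equations.
    self.t += 1
    for name in params:
        if name not in self.m:
            self.m[name] = np.zeros_like(grads[name])
            self.v[name] = np.zeros_like(grads[name])
        self.m[name] = self.beta1 * self.m[name] + (1 - self.beta1) * grads[name]
        self.v[name] = self.beta2 * self.v[name] + (1 - self.beta2) * grads[name] ** 2
        m_hat = self.m[name] / (1 - self.beta1 ** self.t)   # bias correction
        v_hat = self.v[name] / (1 - self.beta2 ** self.t)
        params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)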

Exercise 4.7: Learning Rate Schedules

Implement common schedules:

def constant_lr(step, base_lr):
    return base_lr

def step_decay(step, base_lr, decay_rate=0.1, decay_every=1000):
    """Reduce LR by decay_rate every decay_every steps"""
    # TODO
    pass

def cosine_annealing(step, base_lr, total_steps):
    """Cosine decay from base_lr to 0"""
    # TODO
    pass

def warmup_then_decay(step, base_lr, warmup_steps, total_steps):
    """Linear warmup, then cosine decay"""
    # TODO
    pass
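These schedules are short enough to sketch directly; the versions below follow the usual definitions (decay_rate applied once per decay_every steps, linear warmup from 0 to base_lr, cosine decay from base_lr to 0), which the exercise otherwise leaves open.

import math

def step_decay_sketch(step, base_lr, decay_rate=0.1, decay_every=1000):
    # Multiply the LR by decay_rate once per decay_every steps.
    return base_lr * decay_rate ** (step // decay_every)

def cosine_annealing_sketch(step, base_lr, total_steps):
    # Cosine decay from base_lr at step 0 to 0 at total_steps.
    progress = min(step / total_steps, 1.0)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

def warmup_then_decay_sketch(step, base_lr, warmup_steps, total_steps):
    # Linear warmup to base_lr, then cosine decay over the remaining steps.
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return cosine_annealing_sketch(step - warmup_steps, base_lr,
                                   total_steps - warmup_steps)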

Exercise 4.8: Gradient Clipping

Implement gradient clipping:

def clip_grad_norm(grads, max_norm):
    """
    Clip gradients so that global norm ≤ max_norm.

    global_norm = sqrt(sum(g² for all g in grads))
    if global_norm > max_norm:
        scale all gradients by max_norm / global_norm
    """
    # TODO
    pass

def clip_grad_value(grads, max_value):
    """Clip each gradient element to [-max_value, max_value]"""
    # TODO
    pass
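One possible shape for both clipping functions, assuming grads is a list of NumPy arrays (the exercise does not fix a container type):

import numpy as np

def clip_grad_norm_sketch(grads, max_norm):
    # Global L2 norm over every element of every gradient.
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        scale = max_norm / global_norm
        grads = [g * scale for g in grads]
    return grads

def clip_grad_value_sketch(grads, max_value):
    # Element-wise clipping to [-max_value, max_value].
    return [np.clip(g, -max_value, max_value) for g in grads]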

Challenge Exercises

Exercise 4.9: Optimizer Comparison

Train the same model with different optimizers:

a) SGD with different learning rates (0.001, 0.01, 0.1, 1.0)
b) SGD + Momentum (β = 0.9)
c) Adam (default hyperparameters)

Plot the training loss curves. Which optimizer converges fastest? Which is most stable?

Exercise 4.10: Learning Rate Finder

Implement the LR range test:

def find_lr(model, data, min_lr=1e-7, max_lr=10, num_steps=100):
    """
    1. Start with very small LR
    2. Train one step, record loss
    3. Increase LR exponentially
    4. Stop when loss explodes
    5. Return LR where loss decreased fastest
    """
    # TODO
    pass
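The only non-obvious detail is sweeping the learning rate exponentially in step 3; a skeleton of that loop is below. It assumes a train_step(batch, lr) callable that runs one update and returns the loss, which is an interface invented here for illustration, and it returns the raw (lr, loss) history so you can pick the LR where the loss fell fastest.

def find_lr_sketch(train_step, data, min_lr=1e-7, max_lr=10, num_steps=100):
    # Multiply the LR by a constant factor each step so it covers
    # [min_lr, max_lr] exponentially over num_steps steps.
    factor = (max_lr / min_lr) ** (1 / num_steps)
    lr, lrs, losses = min_lr, [], []
    for _, batch in zip(range(num_steps), data):
        loss = train_step(batch, lr)   # assumed: one training step at this LR
        lrs.append(lr)
        losses.append(loss)
        if loss > 4 * min(losses):     # crude "loss has exploded" test
            break
        lr *= factor
    return lrs, losses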

Exercise 4.11: Batch Size and Learning Rate

Train models with different batch sizes: 16, 32, 64, 128, 256.

a) Use the same learning rate for all. Compare convergence.
b) Scale the learning rate linearly with batch size. Does this help?
c) The "linear scaling rule" says lr ∝ batch_size. Test this.
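For parts (b) and (c), the scaling itself is one line; the base batch size of 32 below is an arbitrary reference point, not something the exercise specifies.

def scaled_lr(base_lr, batch_size, base_batch_size=32):
    # Linear scaling rule: lr grows in proportion to batch size.
    return base_lr * batch_size / base_batch_size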


Checking Your Work

  • Test suite: See code/stage-04/tests/test_optimizers.py for expected behavior
  • Reference implementation: Compare with code/stage-04/optimizers.py
  • Self-check: Verify optimizers converge on simple quadratic functions
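For the self-check, a one-dimensional quadratic is enough: drive f(θ) = θ² down for a few hundred steps and confirm θ ends up near 0. The small harness below assumes your sgd from Exercise 4.5 updates params in place; adapt it to whatever interface you chose.

def check_converges(update_fn, steps=500, tol=1e-3):
    # The optimizer should drive theta toward 0 on f(theta) = theta^2.
    # Slow settings (e.g. Adam's default lr) may need more steps or a looser tol.
    theta = {"w": 5.0}
    for _ in range(steps):
        grads = {"w": 2 * theta["w"]}   # gradient of theta^2
        update_fn(theta, grads)
    assert abs(theta["w"]) < tol, f"did not converge: {theta['w']}"

# Example usage once sgd from Exercise 4.5 exists (in-place updates assumed):
# check_converges(lambda p, g: sgd(p, g, lr=0.1))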

Mini-Project: Optimizer Comparison

Empirically compare optimizers on a challenging optimization landscape.

Requirements

  1. Implement: SGD, Momentum, RMSprop, Adam from scratch
  2. Benchmark: Compare on the Rosenbrock function (sketched after this list) and a simple neural network
  3. Visualize: Plot optimization trajectories and loss curves
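The Rosenbrock function and its gradient are compact enough to state here; the a = 1, b = 100 defaults are the conventional choice, placing the minimum at (1, 1).

import numpy as np

def rosenbrock(x, y, a=1.0, b=100.0):
    # Classic banana-shaped test function; minimum at (a, a^2).
    return (a - x) ** 2 + b * (y - x ** 2) ** 2

def rosenbrock_grad(x, y, a=1.0, b=100.0):
    # Analytic gradient, useful for plotting optimizer trajectories.
    dx = -2 * (a - x) - 4 * b * x * (y - x ** 2)
    dy = 2 * b * (y - x ** 2)
    return np.array([dx, dy])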

Deliverables

  • [ ] All 4 optimizers implemented
  • [ ] Convergence plot (steps vs. loss) for each
  • [ ] 2D trajectory plot on Rosenbrock function
  • [ ] Hyperparameter sensitivity analysis (learning rate)

Extension

Implement learning rate warmup and cosine decay. How much do they help?