
Section 2.5: Forward Mode vs Reverse Mode

There are two fundamentally different ways to apply the chain rule through a computational graph. The choice between them has massive implications for efficiency.

Bottom line: For neural networks (many inputs, one output), reverse mode is dramatically faster: it needs a single backward pass where forward mode needs one pass per input. This is why backpropagation works.

The Two Modes

Both modes compute exact derivatives. They differ in direction:

| Mode    | Computes                    | Direction      | Efficient when |
|---------|-----------------------------|----------------|----------------|
| Forward | ∂(all outputs)/∂(one input) | input → output | few inputs     |
| Reverse | ∂(one output)/∂(all inputs) | output → input | few outputs    |

The following diagram illustrates the fundamental difference:

flowchart TB
    subgraph "Forward Mode (one pass per input)"
        direction LR
        FI1["x₁<br/>∂·/∂x₁ = 1"]:::seed --> FG["Graph"] --> FO["∂f/∂x₁"]
        FI2["x₂<br/>∂·/∂x₂ = 0"] --> FG
        FI3["x₃<br/>∂·/∂x₃ = 0"] --> FG
    end

    subgraph "Reverse Mode (one pass for ALL inputs)"
        direction LR
        RI1["x₁"] --> RG["Graph"] --> RO["f<br/>∂f/∂f = 1"]:::seed
        RI2["x₂"] --> RG
        RI3["x₃"] --> RG
        RO --> |"backward"| RR["∂f/∂x₁, ∂f/∂x₂, ∂f/∂x₃"]:::result
    end

    classDef seed fill:#fff9c4,stroke:#f57f17,stroke-width:2px
    classDef result fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px

Key difference: Forward mode requires n passes for n inputs. Reverse mode requires 1 pass regardless of input count.

Forward Mode: Intuition

In forward mode, we pick one input xᵢ and compute how it affects everything downstream.

We propagate derivatives forward alongside values:

  • Each node stores: (value, ∂value/∂xᵢ)
  • This pair flows through the graph
  • At the end, we have ∂output/∂xᵢ

To get derivatives with respect to all inputs, we need n forward passes (one per input).

Forward Mode: Algorithm

Input: Computational graph, input values, index i of input to differentiate

Process:

  1. Set the derivative of xᵢ to 1 and of all other inputs to 0.
  2. For each node v in topological order:
      • Compute v's value from its parents (the standard forward pass).
      • Compute ∂v/∂xᵢ using the chain rule:
\[\frac{\partial v}{\partial x_i} = \sum_{p \in \text{parents}(v)} \frac{\partial v}{\partial p} \cdot \frac{\partial p}{\partial x_i}\]

Output: ∂output/∂xᵢ
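
To make this concrete, here is a minimal Python sketch of forward mode in which every quantity is carried as a (value, derivative) pair. The helper names (fwd_add, fwd_mul, fwd_sin) are illustrative, not a standard API. It evaluates f(x₁, x₂) = x₁·x₂ + sin(x₁), the same example worked by hand in the next subsection.

```python
import math

# Forward mode sketch: each quantity is a (value, derivative) pair,
# where the derivative is taken with respect to the seeded input.

def fwd_add(a, b):
    (va, da), (vb, db) = a, b
    return (va + vb, da + db)            # sum rule

def fwd_mul(a, b):
    (va, da), (vb, db) = a, b
    return (va * vb, da * vb + va * db)  # product rule

def fwd_sin(a):
    va, da = a
    return (math.sin(va), math.cos(va) * da)  # chain rule

# f(x1, x2) = x1*x2 + sin(x1) at (2, 3), differentiating w.r.t. x1
x1 = (2.0, 1.0)   # seed: ∂x1/∂x1 = 1
x2 = (3.0, 0.0)   # seed: ∂x2/∂x1 = 0
f = fwd_add(fwd_mul(x1, x2), fwd_sin(x1))
print(f)          # (≈ 6.909, ≈ 2.584): the value and ∂f/∂x1
```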

Forward Mode: Example

Consider f(x₁, x₂) = x₁·x₂ + sin(x₁).

Let x₁ = 2, x₂ = 3. Let's compute ∂f/∂x₁.

Graph:

x₁ ──┬──────▶ [×] ──▶ a ──┐
     │         ▲          │
     │         │          ▼
     │        x₂         [+] ──▶ f
     │                    ▲
     └──▶ [sin] ──▶ b ────┘

Forward pass, carrying the derivative with respect to x₁ alongside each value:

| Node        | Value         | Derivative ∂·/∂x₁          |
|-------------|---------------|----------------------------|
| x₁          | 2             | 1 (seed)                   |
| x₂          | 3             | 0                          |
| a = x₁·x₂   | 6             | x₂·1 + x₁·0 = 3            |
| b = sin(x₁) | sin(2) ≈ 0.91 | cos(x₁)·1 = cos(2) ≈ -0.42 |
| f = a + b   | 6.91          | 3 + (-0.42) = 2.58         |

So ∂f/∂x₁ ≈ 2.58.

To get ∂f/∂x₂, we need another forward pass with seed (0, 1).

Dual Numbers: The Math of Forward Mode

Forward mode has an elegant mathematical formulation using dual numbers.

A dual number has the form: a + bε where ε² = 0 but ε ≠ 0.

This is like complex numbers but with ε² = 0 instead of i² = -1.

Key property: f(a + bε) = f(a) + f'(a)·b·ε

The ε coefficient automatically carries the derivative!

Example: (2 + 1·ε)² = 4 + 4ε + ε² = 4 + 4ε (since ε² = 0)

This matches: d/dx(x²) at x=2 is 2·2 = 4.
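
Here is a minimal Dual class sketch in Python, supporting only addition, multiplication, and sin; a real implementation would overload more operations. It reproduces both the x² example above and the running example f(x₁, x₂) = x₁·x₂ + sin(x₁).

```python
import math

class Dual:
    """Dual number a + b*eps with eps**2 = 0 (a sketch, not a full implementation)."""
    def __init__(self, a, b=0.0):
        self.a = a   # real part (value)
        self.b = b   # epsilon part (derivative)

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a1 + b1*eps)(a2 + b2*eps) = a1*a2 + (a1*b2 + b1*a2)*eps, since eps**2 = 0
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

def dsin(x):
    # f(a + b*eps) = f(a) + f'(a)*b*eps
    return Dual(math.sin(x.a), math.cos(x.a) * x.b)

# d/dx (x**2) at x = 2: evaluate on 2 + 1*eps
x = Dual(2.0, 1.0)
print((x * x).b)   # 4.0

# f(x1, x2) = x1*x2 + sin(x1) at (2, 3), derivative w.r.t. x1
x1, x2 = Dual(2.0, 1.0), Dual(3.0, 0.0)
print((x1 * x2 + dsin(x1)).b)   # ≈ 2.584
```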

Reverse Mode: Intuition

In reverse mode, we pick one output and compute how all inputs affect it.

We propagate gradients backward from output to inputs:

  • Start with ∂output/∂output = 1
  • Work backward through the graph
  • Accumulate ∂output/∂v for each node v

One backward pass gives derivatives with respect to all inputs!

Reverse Mode: Algorithm

Input: Computational graph (already evaluated), output node

Process:

  1. Forward pass: compute all values
  2. Set gradient of output to 1
  3. For each node v in reverse topological order, for each parent p of v:
\[\frac{\partial \text{output}}{\partial p} \mathrel{+}= \frac{\partial \text{output}}{\partial v} \cdot \frac{\partial v}{\partial p}\]

Output: ∂output/∂(all inputs)
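
The following is a minimal Python sketch of this algorithm. The Node class and helper functions are illustrative, and the topological order is supplied by hand rather than recorded automatically (Section 2.6 builds the full system). It is applied to the example from the next subsection.

```python
import math

class Node:
    """One graph node: its value, its parents, and the local partials ∂v/∂p."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (parent_node, local_partial) pairs
        self.grad = 0.0          # will accumulate ∂output/∂(this node)

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def backward(output, topo_order):
    """topo_order: all nodes in topological order (inputs first, output last)."""
    output.grad = 1.0                  # seed: ∂output/∂output = 1
    for v in reversed(topo_order):     # reverse topological order
        for p, local in v.parents:
            p.grad += v.grad * local   # ∂out/∂p += ∂out/∂v · ∂v/∂p

# f(x1, x2) = x1*x2 + sin(x1) at (2, 3)
x1, x2 = Node(2.0), Node(3.0)
a, b = mul(x1, x2), sin(x1)
f = add(a, b)
backward(f, [x1, x2, a, b, f])
print(x1.grad, x2.grad)   # ≈ 2.584, 2.0
```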

Reverse Mode: Example

Same function: f(x₁, x₂) = x₁·x₂ + sin(x₁), with x₁ = 2, x₂ = 3.

Forward pass (compute values):

  • a = x₁·x₂ = 6
  • b = sin(x₁) = sin(2) ≈ 0.91
  • f = a + b ≈ 6.91

Backward pass:

| Step | Node       | Gradient ∂f/∂·     | How                         |
|------|------------|--------------------|-----------------------------|
| 1    | f          | 1                  | seed                        |
| 2    | a          | 1                  | ∂f/∂a = 1 (from addition)   |
| 3    | b          | 1                  | ∂f/∂b = 1 (from addition)   |
| 4    | x₁ (via a) | x₂ = 3             | ∂f/∂a · ∂a/∂x₁ = 1 · 3      |
| 5    | x₂ (via a) | x₁ = 2             | ∂f/∂a · ∂a/∂x₂ = 1 · 2      |
| 6    | x₁ (via b) | cos(2) ≈ -0.42     | ∂f/∂b · ∂b/∂x₁ = 1 · cos(2) |
| 7    | x₁ (total) | 3 + (-0.42) ≈ 2.58 | sum of both paths           |

Results from ONE backward pass:

  • ∂f/∂x₁ ≈ 2.58
  • ∂f/∂x₂ = 2

Complexity Comparison

Consider a function f: ℝⁿ → ℝᵐ (n inputs, m outputs).

| Mode    | Passes needed | Total work        |
|---------|---------------|-------------------|
| Forward | n passes      | O(n · graph size) |
| Reverse | m passes      | O(m · graph size) |

For neural networks:

  • n = millions of parameters
  • m = 1 (scalar loss)
  • Forward mode: millions of passes
  • Reverse mode: ONE pass

flowchart LR
    subgraph "Neural Network: n = 1,000,000 parameters, m = 1 loss"
        direction TB
        FM["Forward Mode<br/>1,000,000 passes"]:::bad
        RM["Reverse Mode<br/>1 pass"]:::good
    end

    FM --> |"1,000,000× slower"| RM

    classDef bad fill:#ffcdd2,stroke:#c62828,stroke-width:2px
    classDef good fill:#c8e6c9,stroke:#2e7d32,stroke-width:2px

Reverse mode wins by a factor of n!

This is why backpropagation (reverse mode autodiff) is the standard for training neural networks.

The Jacobian Perspective

For f: ℝⁿ → ℝᵐ, the Jacobian is an m × n matrix:

\[J_{ij} = \frac{\partial f_i}{\partial x_j}\]
  • Forward mode: Computes one column of J per pass (how one input affects all outputs)
  • Reverse mode: Computes one row of J per pass (how all inputs affect one output)

For a loss function L: ℝⁿ → ℝ, the Jacobian is 1 × n (a row vector = the gradient). Reverse mode computes the entire gradient in one pass.
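
In matrix terms, with eⱼ and eᵢ the standard basis vectors:

\[J e_j = \begin{pmatrix} \frac{\partial f_1}{\partial x_j} \\ \vdots \\ \frac{\partial f_m}{\partial x_j} \end{pmatrix} \;\; \text{(forward mode: one column)}, \qquad e_i^\top J = \left( \frac{\partial f_i}{\partial x_1}, \dots, \frac{\partial f_i}{\partial x_n} \right) \;\; \text{(reverse mode: one row)}.\]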

When to Use Each Mode

| Scenario                 | n (inputs) | m (outputs) | Best mode                          |
|--------------------------|------------|-------------|------------------------------------|
| Neural network training  | millions   | 1           | Reverse                            |
| Sensitivity analysis     | few        | many        | Forward                            |
| Jacobian products        | any        | any         | Forward for J·v, reverse for uᵀ·J  |
| Scientific computing     | varies     | varies      | Often forward                      |

Machine learning almost always uses reverse mode because we have one scalar loss.

Vector-Jacobian Products (VJP) vs Jacobian-Vector Products (JVP)

The two modes can be understood as different matrix-vector products:

Forward mode (JVP): given a tangent vector v, compute J·v. This answers: "if the inputs change in direction v, how do the outputs change?"

Reverse mode (VJP): given a cotangent vector u, compute uᵀ·J. This answers: "for the output combination weighted by u, how much does each input contribute?"

For scalar loss with u = 1:

  • VJP gives the full gradient
  • This is exactly what we need for gradient descent
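
As a toy illustration, here is a sketch with an explicit, made-up Jacobian; real autodiff computes these products without ever materializing J.

```python
import numpy as np

# Toy Jacobian of some f: R^3 -> R^2 (values are made up for illustration)
J = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

v = np.array([1.0, 0.0, 0.0])   # tangent: perturb the first input
u = np.array([1.0, 0.0])        # cotangent: care about the first output

print(J @ v)   # JVP: first column of J -> how all outputs respond to x1
print(u @ J)   # VJP: first row of J   -> gradient of f1 w.r.t. all inputs
```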

Memory Trade-offs

| Mode    | Memory requirement                                    |
|---------|-------------------------------------------------------|
| Forward | O(1) extra (just carry derivatives alongside values)  |
| Reverse | O(graph size) (must cache forward-pass values)        |

Reverse mode needs to store intermediate values for the backward pass. For deep networks, this can be significant.

Techniques to reduce memory:

  • Gradient checkpointing: recompute some intermediate values during the backward pass instead of storing them, trading extra compute for lower memory (see the sketch below)
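
A toy sketch of the checkpointing idea, ignoring the backward pass itself: store only every k-th activation, and recompute the others from the nearest stored checkpoint when they are needed. The function names and the choice of k here are illustrative, not a real implementation.

```python
def forward_with_checkpoints(x, layers, k):
    """Run the forward pass but store only every k-th activation."""
    checkpoints = {0: x}
    h = x
    for i, f in enumerate(layers, start=1):
        h = f(h)
        if i % k == 0:
            checkpoints[i] = h
    return h, checkpoints

def recompute_segment(checkpoints, layers, i, k):
    """Recover the activation after layer i from the nearest stored checkpoint."""
    start = (i // k) * k
    h = checkpoints[start]
    for f in layers[start:i]:
        h = f(h)
    return h

# Four toy "layers" that just scale their input
layers = [lambda h, c=c: h * c for c in (2.0, 3.0, 4.0, 5.0)]
y, cps = forward_with_checkpoints(1.0, layers, k=2)
print(y, recompute_segment(cps, layers, 3, 2))   # 120.0 and the activation after layer 3
```

A real reverse-mode system would call the recompute step inside the backward pass whenever an uncached activation is needed, which is exactly the compute-for-memory trade mentioned above.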

A Larger Example

Consider a 3-layer neural network:

x → [W₁] → relu → [W₂] → relu → [W₃] → y → [loss] → L

Parameters: W₁, W₂, W₃ (possibly millions of numbers)
Output: L (scalar)

Forward mode: To get ∂L/∂Wᵢⱼ for each weight, we need one pass per weight. With 1 million weights, that means 1 million forward passes.

Reverse mode: One forward pass + one backward pass = 2 total passes. Done.

The Adjoint Method

Reverse mode is also called the adjoint method in scientific computing.

The key insight: instead of propagating "how does this input affect output" forward, propagate "how does output depend on this intermediate value" backward.

This is a duality—both compute the same derivatives, just in different orders.

Summary

| Aspect           | Forward Mode                | Reverse Mode                |
|------------------|-----------------------------|-----------------------------|
| Direction        | Input → Output              | Output → Input              |
| Computes         | ∂(all outputs)/∂(one input) | ∂(one output)/∂(all inputs) |
| Passes for ℝⁿ→ℝ  | n                           | 1                           |
| Memory           | Low                         | Must cache forward pass     |
| Math             | Dual numbers                | Adjoint method              |
| Used in          | Sensitivity analysis        | Deep learning               |

The key insight: Neural network training involves computing ∂(scalar loss)/∂(all parameters). Reverse mode does this in O(1) passes; forward mode takes O(parameters) passes. This makes reverse mode essential for deep learning.

Exercises

  1. Forward vs reverse count: For f: ℝ¹⁰⁰ → ℝ¹⁰, how many passes does each mode need to compute the full Jacobian?

  2. Dual numbers: Using dual numbers, compute d/dx(x³) at x = 2 by evaluating (2 + ε)³.

  3. Memory analysis: In a 100-layer network, roughly how many intermediate values need to be cached for reverse mode?

  4. When forward wins: Give an example where forward mode is more efficient than reverse mode.

  5. Thinking question: If we have f: ℝⁿ → ℝⁿ and need the full Jacobian, which mode is better? Can we do better than n passes?

What's Next

We understand the theory. Now let's build it.

In Section 2.6, we'll implement a complete automatic differentiation system from scratch—supporting forward pass recording and reverse mode gradient computation.