# Stage 2: Automatic Differentiation
## From Calculus to Code: Building the Foundation of Deep Learning
In Stage 1, we built a Markov chain language model and found optimal parameters through counting—a closed-form solution. Neural networks are fundamentally different: there's no closed-form solution. We must search for good parameters by iteratively improving them.
This search requires knowing: if I change a parameter slightly, how does the output change?
This is the domain of automatic differentiation, the technique that makes training neural networks possible. By the end of this stage, you'll understand exactly how PyTorch and TensorFlow compute gradients, because you'll have built the same core machinery from scratch.
## What We'll Build
A complete automatic differentiation engine that can:
- Track computations as they happen
- Build computational graphs automatically
- Compute gradients via reverse-mode differentiation
- Train neural networks using gradient descent
## Sections
### 2.1: What is a Derivative?
The geometric and algebraic foundations. Why derivatives matter for optimization, and how they connect to machine learning.
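As a quick preview, the limit definition can be turned directly into code: nudge the input by a small step `h` and see how much the output moves. The function and step size below are toy choices of ours, not code from the section itself.

```python
# Approximate f'(x) straight from the limit definition:
# f'(x) ≈ (f(x + h) - f(x)) / h for a small step h.
def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x)) / h

# f(x) = x**2 has derivative 2x, so the estimate at x = 3 should be close to 6.
print(numerical_derivative(lambda x: x ** 2, 3.0))  # ≈ 6.000001
```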
### 2.2: Derivative Rules from First Principles
Deriving the power, product, quotient, and exponential rules from the limit definition. We don't just state rules—we prove them.
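As a taste of that style, here is the simplest case, f(x) = x², worked straight from the limit definition:

```latex
\frac{d}{dx}\,x^{2}
  = \lim_{h \to 0} \frac{(x+h)^{2} - x^{2}}{h}
  = \lim_{h \to 0} \frac{2xh + h^{2}}{h}
  = \lim_{h \to 0} \,(2x + h)
  = 2x
```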
### 2.3: The Chain Rule — The Heart of Backpropagation
The most important derivative rule for deep learning. How derivatives chain through compositions, and why this leads directly to backpropagation.
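For a concrete instance (our own toy example, not taken from the section): if y = sin(x²), the chain rule gives dy/dx = cos(x²) · 2x, and a finite-difference estimate should agree with that formula.

```python
import math

# Chain rule: y = f(g(x)) with f = sin and g(x) = x**2,
# so dy/dx = f'(g(x)) * g'(x) = cos(x**2) * 2x.
def analytic(x):
    return math.cos(x ** 2) * 2 * x

def numerical(x, h=1e-6):
    f = lambda t: math.sin(t ** 2)
    return (f(x + h) - f(x - h)) / (2 * h)  # central difference

x = 1.3
print(analytic(x), numerical(x))  # the two values should agree to several decimals
```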
### 2.4: Computational Graphs
Representing computation as directed acyclic graphs. Forward passes, backward passes, and gradient accumulation.
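To make the data structure concrete before we get there, here is a rough sketch of the graph for e = (a * b) + c. The `Node` class and `topo_order` helper are illustrative names of our own, not the engine we build later.

```python
# A minimal picture of a computational graph node: each node remembers the
# operation that produced it and which nodes fed into it (its parents).
class Node:
    def __init__(self, value, parents=(), op=""):
        self.value = value      # result of the forward computation
        self.parents = parents  # nodes this one was computed from
        self.op = op            # which operation produced it
        self.grad = 0.0         # gradients accumulate here during the backward pass

def topo_order(root):
    """Parents-before-children ordering; a backward pass walks it in reverse."""
    order, seen = [], set()
    def visit(node):
        if node not in seen:
            seen.add(node)
            for p in node.parents:
                visit(p)
            order.append(node)
    visit(root)
    return order

# Build the graph for e = (a * b) + c by hand.
a, b, c = Node(2.0), Node(3.0), Node(4.0)
d = Node(a.value * b.value, parents=(a, b), op="*")
e = Node(d.value + c.value, parents=(d, c), op="+")
print([n.op or n.value for n in topo_order(e)])  # [2.0, 3.0, '*', 4.0, '+']
```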
### 2.5: Forward Mode vs Reverse Mode
Two fundamentally different ways to apply the chain rule. Why reverse mode is dramatically cheaper for neural networks, which have millions of parameters but a single scalar loss.
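One way to picture forward mode (a sketch of ours, not the section's code) is a "dual number" that carries a value together with its derivative with respect to one chosen input. Each input needs its own forward pass, whereas reverse mode recovers the derivative with respect to every input in a single backward pass.

```python
# Forward mode in miniature: a Dual carries a value and the derivative of that
# value with respect to ONE chosen input. One pass per input; n inputs, n passes.
class Dual:
    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def __add__(self, other):
        return Dual(self.value + other.value, self.deriv + other.deriv)

    def __mul__(self, other):
        # Product rule, applied as the multiplication happens.
        return Dual(self.value * other.value,
                    self.deriv * other.value + self.value * other.deriv)

# d/dx of f(x, y) = x*x + x*y at (x, y) = (3, 2): seed x with derivative 1.
x, y = Dual(3.0, 1.0), Dual(2.0, 0.0)
out = x * x + x * y
print(out.value, out.deriv)  # 15.0, 8.0  (df/dx = 2x + y = 8)
```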
### 2.6: Building Autograd from Scratch
~100 lines of code that implement complete automatic differentiation. Building and training neural networks with our own engine.
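To set expectations, here is a heavily stripped-down sketch in that spirit, supporting only `+` and `*`. The actual reference implementation in `code/stage-02/value.py` covers more operations and differs in details.

```python
# A pared-down reverse-mode engine: each Value records how to push its gradient
# back to its parents, and backward() replays those rules in reverse
# topological order. Illustrative sketch only.
class Value:
    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # how to send this node's grad to its parents

    def __add__(self, other):
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad           # d(out)/d(self) = 1
            other.grad += out.grad          # d(out)/d(other) = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(out)/d(self) = other
            other.grad += self.data * out.grad   # d(out)/d(other) = self
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        order, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for p in v._parents:
                    build(p)
                order.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(order):
            v._backward()

# Gradients of (a * b + a) with respect to a and b, at a = 2, b = 3.
a, b = Value(2.0), Value(3.0)
out = a * b + a
out.backward()
print(a.grad, b.grad)  # 4.0 (= b + 1), 2.0 (= a)
```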
### 2.7: Testing and Validation
How to verify your gradients are correct. Numerical checking, property-based testing, and debugging strategies.
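The workhorse is the numerical gradient check, previewed here with plain Python functions rather than the engine (the function, point, and tolerance are illustrative choices of ours): compute a central-difference estimate and insist it agrees with the analytic gradient.

```python
import math

def f(x, y):
    return x * y + math.sin(x)

# Analytic gradient, worked out by hand: df/dx = y + cos(x), df/dy = x.
def grad_f(x, y):
    return (y + math.cos(x), x)

def numerical_grad(f, x, y, h=1e-5):
    # Central differences: perturb one argument at a time.
    dfdx = (f(x + h, y) - f(x - h, y)) / (2 * h)
    dfdy = (f(x, y + h) - f(x, y - h)) / (2 * h)
    return (dfdx, dfdy)

x, y = 0.7, -1.2
analytic = grad_f(x, y)
numeric = numerical_grad(f, x, y)
for a, n in zip(analytic, numeric):
    assert abs(a - n) < 1e-6, (a, n)   # loose tolerance; exact equality is impossible
print("gradient check passed:", analytic, numeric)
```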
## Prerequisites
- Basic calculus (we'll derive everything from limits)
- Python programming
- Completion of Stage 1 (for context on why we need this)
## Key Takeaways
By the end of this stage, you will understand:
- Derivatives from first principles: Not just rules to memorize, but why they work
- The chain rule deeply: How it enables differentiating any composition
- Computational graphs: The data structure behind modern deep learning
- Why reverse mode wins: The complexity analysis that explains backpropagation's efficiency
- How to build autograd: The ~100 lines of core code that power gradient computation
- Testing gradients: Essential techniques for verifying correctness
## The Journey So Far
| Stage | Topic | Key Insight |
|---|---|---|
| 1 | Markov Chains | Language modeling is probability estimation over sequences |
| 2 | Automatic Differentiation | Gradients enable iterative optimization—no closed-form needed |
| 3 | (Coming) | Building our first neural language model |
## Let's Begin
The derivative is where it all starts. Understanding it deeply—not just as a formula, but as a concept—unlocks everything that follows.
→ Start with Section 2.1: What is a Derivative?
## Code & Resources
| Resource | Description |
|---|---|
| `code/stage-02/value.py` | Reference implementation |
| `code/stage-02/tests/` | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |