# Building LLMs from First Principles
A rigorous, bottom-up approach to understanding language models.
This book derives every concept from first principles. No hand-waving. No "it's well known that..." Every formula is explained, every claim is proven.
## What Makes This Different
Most LLM tutorials tell you what to do. This book shows you why it works:
- Full mathematical derivations - Chain rule proved by induction, MLE derived with Lagrange multipliers, smoothing derived from Bayesian priors (a compressed example follows this list)
- Code from scratch - Every algorithm implemented in pure NumPy with comprehensive test suites
- First principles pedagogy - Each concept builds only on what's already been covered
- Exercises & common mistakes - Practice problems and debugging guides for every stage
- Modern connections - See how Markov chains connect directly to GPT-4 and Claude
- Interactive tools - Explore concepts with live visualizations
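To give a taste of the level of rigor, here is a compressed sketch of one derivation of this kind: maximum-likelihood estimation for a bigram Markov model via one Lagrange multiplier per context. The notation c(u, v), the observed count of token v following token u, is chosen for this example and may differ from the book's.

```latex
% Maximize the bigram log-likelihood over transition probabilities p(v|u),
% subject to the normalization constraint \sum_v p(v|u) = 1 for every context u.
\begin{aligned}
\mathcal{L} &= \sum_{u,v} c(u,v)\,\log p(v \mid u)
             \;+\; \sum_{u} \lambda_u \Big( 1 - \sum_{v} p(v \mid u) \Big) \\
\frac{\partial \mathcal{L}}{\partial p(v \mid u)}
            &= \frac{c(u,v)}{p(v \mid u)} - \lambda_u = 0
             \quad\Longrightarrow\quad p(v \mid u) = \frac{c(u,v)}{\lambda_u} \\
\sum_{v} p(v \mid u) = 1
            &\quad\Longrightarrow\quad \lambda_u = \sum_{v'} c(u,v'),
             \qquad p^{*}(v \mid u) = \frac{c(u,v)}{\sum_{v'} c(u,v')}
\end{aligned}
```

The familiar count-and-divide estimate falls out as the unique stationary point: relative frequency is the maximum-likelihood estimator, not a heuristic.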
## The Complete Journey
| Stage | Topic | Key Concepts | Exercises |
|---|---|---|---|
| 1 | Markov Chains | Probability, MLE, perplexity, temperature | 11 |
| 2 | Automatic Differentiation | Derivatives, chain rule, autograd | 11 |
| 3 | Neural Language Models | Embeddings, softmax, cross-entropy | 11 |
| 4 | Optimization | SGD, momentum, Adam, learning rate schedules | 11 |
| 5 | Attention | Dot-product attention, multi-head, causal masking | 11 |
| 6 | The Complete Transformer | Transformer blocks, LayerNorm, scaling laws | 11 |
| 7 | Tokenization | BPE, WordPiece, Unigram, vocabulary size | 12 |
| 8 | Training Dynamics | Loss curves, gradient statistics, debugging | 12 |
| 9 | Parameter-Efficient Fine-Tuning | LoRA, adapters, prefix tuning | 12 |
| 10 | Alignment | Reward modeling, RLHF, DPO | 12 |
| Capstone | End-to-End Transformer | Complete trainable model from scratch | - |
## Learning Paths
Choose a path based on your goals:
### The Fundamentals Path (Stages 1-6)
Best for: Understanding how transformers work
Progress through the core stages in order. By the end, you'll understand:
- How language models predict next tokens
- Why attention is the key innovation
- What makes transformers trainable and scalable
Time estimate: 20-30 hours of focused study
### The Practitioner Path (Stages 7-10)
Best for: People who want to fine-tune and deploy models
After completing the fundamentals, focus on:
- Stage 7: How tokenization affects model performance
- Stage 8: Debugging training issues
- Stage 9: Fine-tuning without training all parameters
- Stage 10: Aligning models with human preferences
Prerequisites: Stages 1-6 or equivalent experience
### The Deep Dive Path (All Stages + Capstone)
Best for: Researchers and those building from scratch
Complete all stages and the capstone project, which involves:
- Implementing every backward pass manually
- Training a complete transformer on real text
- Understanding exactly what autodiff does under the hood
Time estimate: 40-60 hours
## Quick Links
- Glossary - Key terms and notation reference
- Troubleshooting - When things go wrong
- Interactive Tools - Autograd visualizer, temperature explorer
- Further Reading - Papers, libraries, and external resources
- Capstone Project - Put it all together
## Prerequisites
- Basic Python programming
- High school algebra
- Curiosity about how things work
No deep learning experience required. We build everything from the ground up.
## How to Use This Book
Each stage is self-contained but builds on previous stages:
1. Read the theory - Understand the mathematical foundations
2. Study the code - See how theory translates to implementation
3. Do the exercises - Solidify understanding through practice
4. Review common mistakes - Learn from typical errors
5. Reflect - Connect new concepts to the bigger picture
The book follows Polya's problem-solving method:
1. Understand the problem
2. Devise a plan
3. Execute the plan
4. Reflect on the solution
## What You'll Build
By the end of this book, you will have implemented:
- A Markov chain text generator (see the short sketch after this list)
- An automatic differentiation engine
- A neural language model with embeddings
- Optimizers (SGD, Adam) and learning rate schedulers
- Multi-head self-attention from scratch
- A complete transformer architecture
- BPE tokenization
- Training diagnostics and debugging tools
- LoRA fine-tuning
- DPO alignment training
- A complete trainable transformer with manual backpropagation
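For a concrete sense of what building these in pure NumPy looks like, here is a minimal, illustrative sketch of the first item above: a bigram Markov chain generator with add-one smoothing and temperature sampling. The function names and the toy corpus are invented for this sketch and are not the book's actual code.

```python
# Illustrative sketch (not the book's code): bigram Markov chain generation
# with add-one smoothing and temperature, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)

def fit_bigram_counts(tokens, vocab_size):
    """Count how often token b follows token a in the training sequence."""
    counts = np.zeros((vocab_size, vocab_size))
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return counts

def sample_next(counts, current, temperature=1.0):
    """Sample the next token id from a smoothed, temperature-scaled row."""
    logits = np.log(counts[current] + 1.0) / temperature  # add-one smoothing, then temperature
    probs = np.exp(logits - logits.max())                  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Usage: fit on a toy corpus over the vocabulary {0, 1, 2}, then generate 10 tokens.
tokens = [0, 1, 2, 1, 2, 0, 1, 2, 1, 0]
counts = fit_bigram_counts(tokens, vocab_size=3)
current, generated = 0, [0]
for _ in range(10):
    current = sample_next(counts, current, temperature=0.8)
    generated.append(current)
print(generated)
```

Lower temperatures concentrate probability on the most frequent continuation; higher temperatures flatten the distribution toward uniform sampling.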
## Get Started
- **New to ML?** Start from the beginning with probability and Markov chains.
- **Know the basics?** Jump to attention and transformers.
- **Want to fine-tune?** Learn modern fine-tuning techniques.
- **Ready to build?** Dive into the capstone project.
## More from the Author
### The First Principles Trilogy
This book is part of a series teaching ML fundamentals from first principles:
📘 Building LLMs from First Principles (You are here) - Learn how transformers work by building them from scratch, with full math derivations, working code, and comprehensive test suites. From Markov chains to GPT.
🔬 Mechanistic Interpretability from First Principles - Reverse-engineer neural networks to understand their internal algorithms. Features, superposition, circuits, and sparse autoencoders explained from the ground up.
⚡ The Algebra of Speed - Mathematical foundations of computational performance. Why FlashAttention, LoRA, and quantization work, and how to recognize when similar optimizations apply to your problems.
### Blog
✍️ Software Bits — Short, focused essays on performance, ML, and computer science fundamentals. Subscribe for updates.
💻 GitHub: perf-bits — Blog posts with full code and interactive demos.