# Building LLMs from First Principles
A rigorous, bottom-up approach to understanding language models.
This book derives every concept from first principles. No hand-waving. No "it's well known that..." Every formula is explained, every claim is proven.
## What Makes This Different
Most LLM tutorials tell you what to do. This book shows you why it works:
- Full mathematical derivations - Chain rule proved by induction, MLE derived with Lagrange multipliers, smoothing derived from Bayesian priors (a compressed example follows this list)
- Code from scratch - Every algorithm implemented in pure NumPy with comprehensive test suites
- First principles pedagogy - Each concept builds only on what's already been covered
- Exercises & common mistakes - Practice problems and debugging guides for every stage
- Modern connections - See how Markov chains connect directly to GPT-4 and Claude
- Interactive tools - Explore concepts with live visualizations
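To give a taste of the level of rigor, here is a compressed sketch of one derivation of this kind: maximum-likelihood estimation for a bigram Markov model via one Lagrange multiplier per context. The notation c(u, v), the observed count of token v following token u, is chosen for this example and may differ from the book's.

```latex
% Maximize the bigram log-likelihood over transition probabilities p(v|u),
% subject to the normalization constraint \sum_v p(v|u) = 1 for every context u.
\begin{aligned}
\mathcal{L} &= \sum_{u,v} c(u,v)\,\log p(v \mid u)
             \;+\; \sum_{u} \lambda_u \Big( 1 - \sum_{v} p(v \mid u) \Big) \\
\frac{\partial \mathcal{L}}{\partial p(v \mid u)}
            &= \frac{c(u,v)}{p(v \mid u)} - \lambda_u = 0
             \quad\Longrightarrow\quad p(v \mid u) = \frac{c(u,v)}{\lambda_u} \\
\sum_{v} p(v \mid u) = 1
            &\quad\Longrightarrow\quad \lambda_u = \sum_{v'} c(u,v'),
             \qquad p^{*}(v \mid u) = \frac{c(u,v)}{\sum_{v'} c(u,v')}
\end{aligned}
```

The familiar count-and-divide estimate falls out as the unique stationary point: relative frequency is the maximum-likelihood estimator, not a heuristic.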
## The Complete Journey
| Stage | Topic | Key Concepts | Exercises |
|---|---|---|---|
| 1 | Markov Chains | Probability, MLE, perplexity, temperature | 11 |
| 2 | Automatic Differentiation | Derivatives, chain rule, autograd | 11 |
| 3 | Neural Language Models | Embeddings, softmax, cross-entropy | 11 |
| 4 | Optimization | SGD, momentum, Adam, learning rate schedules | 11 |
| 5 | Attention | Dot-product attention, multi-head, causal masking | 11 |
| 6 | The Complete Transformer | Transformer blocks, LayerNorm, scaling laws | 11 |
| 7 | Tokenization | BPE, WordPiece, Unigram, vocabulary size | 12 |
| 8 | Training Dynamics | Loss curves, gradient statistics, debugging | 12 |
| 9 | Parameter-Efficient Fine-Tuning | LoRA, adapters, prefix tuning | 12 |
| 10 | Alignment | Reward modeling, RLHF, DPO | 12 |
| Capstone | End-to-End Transformer | Complete trainable model from scratch | - |
## Learning Paths
Choose a path based on your goals:
### The Fundamentals Path (Stages 1-6)
Best for: Understanding how transformers work
Progress through the core stages in order. By the end, you'll understand:
- How language models predict next tokens
- Why attention is the key innovation
- What makes transformers trainable and scalable
Time estimate: 20-30 hours of focused study
### The Practitioner Path (Stages 7-10)
Best for: People who want to fine-tune and deploy models
After completing the fundamentals, focus on:
- Stage 7: How tokenization affects model performance
- Stage 8: Debugging training issues
- Stage 9: Fine-tuning without training all parameters
- Stage 10: Aligning models with human preferences
Prerequisites: Stages 1-6 or equivalent experience
### The Deep Dive Path (All Stages + Capstone)
Best for: Researchers and those building from scratch
Complete all stages and the capstone project, which involves:
- Implementing every backward pass manually
- Training a complete transformer on real text
- Understanding exactly what autodiff does under the hood
Time estimate: 40-60 hours
## Quick Links
- Glossary - Key terms and notation reference
- Troubleshooting - When things go wrong
- Interactive Tools - Autograd visualizer, temperature explorer
- Further Reading - Papers, libraries, and external resources
- Capstone Project - Put it all together
## Prerequisites
- Basic Python programming
- High school algebra
- Curiosity about how things work
No deep learning experience required. We build everything from the ground up.
## How to Use This Book
Each stage is self-contained but builds on previous stages:
1. Read the theory - Understand the mathematical foundations
2. Study the code - See how theory translates to implementation
3. Do the exercises - Solidify understanding through practice
4. Review common mistakes - Learn from typical errors
5. Reflect - Connect new concepts to the bigger picture
The book follows Polya's problem-solving method:
1. Understand the problem
2. Devise a plan
3. Execute the plan
4. Reflect on the solution
## What You'll Build
By the end of this book, you will have implemented:
- A Markov chain text generator (see the short sketch after this list)
- An automatic differentiation engine
- A neural language model with embeddings
- Optimizers (SGD, Adam) and learning rate schedulers
- Multi-head self-attention from scratch
- A complete transformer architecture
- BPE tokenization
- Training diagnostics and debugging tools
- LoRA fine-tuning
- DPO alignment training
- A complete trainable transformer with manual backpropagation
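For a concrete sense of what building these in pure NumPy looks like, here is a minimal, illustrative sketch of the first item above: a bigram Markov chain generator with add-one smoothing and temperature sampling. The function names and the toy corpus are invented for this sketch and are not the book's actual code.

```python
# Illustrative sketch (not the book's code): bigram Markov chain generation
# with add-one smoothing and temperature, using only NumPy.
import numpy as np

rng = np.random.default_rng(0)

def fit_bigram_counts(tokens, vocab_size):
    """Count how often token b follows token a in the training sequence."""
    counts = np.zeros((vocab_size, vocab_size))
    for a, b in zip(tokens[:-1], tokens[1:]):
        counts[a, b] += 1
    return counts

def sample_next(counts, current, temperature=1.0):
    """Sample the next token id from a smoothed, temperature-scaled row."""
    logits = np.log(counts[current] + 1.0) / temperature  # add-one smoothing, then temperature
    probs = np.exp(logits - logits.max())                  # numerically stable softmax
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

# Usage: fit on a toy corpus over the vocabulary {0, 1, 2}, then generate 10 tokens.
tokens = [0, 1, 2, 1, 2, 0, 1, 2, 1, 0]
counts = fit_bigram_counts(tokens, vocab_size=3)
current, generated = 0, [0]
for _ in range(10):
    current = sample_next(counts, current, temperature=0.8)
    generated.append(current)
print(generated)
```

Lower temperatures concentrate probability on the most frequent continuation; higher temperatures flatten the distribution toward uniform sampling.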
## Get Started
- **New to ML?** Start from the beginning with probability and Markov chains.
- **Know the basics?** Jump to attention and transformers.
- **Want to fine-tune?** Learn modern fine-tuning techniques.
- **Ready to build?** Dive into the capstone project.
## More from the Author
### The First Principles Trilogy
This book is part of a series teaching ML fundamentals from first principles:
📘 Building LLMs from First Principles (You are here) - Learn how transformers work by building them from scratch, with full math derivations, working code, and comprehensive test suites. From Markov chains to GPT.
🔬 Mechanistic Interpretability from First Principles - Reverse-engineer neural networks to understand their internal algorithms. Features, superposition, circuits, and sparse autoencoders explained from the ground up.
⚡ The Algebra of Speed - Mathematical foundations of computational performance. Why FlashAttention, LoRA, and quantization work, and how to recognize when similar optimizations apply to your problems.
### Blog
✍️ Software Bits — Short, focused essays on performance, ML, and computer science fundamentals. Subscribe for updates.
💻 GitHub: perf-bits — Blog posts with full code and interactive demos.