
Building LLMs from First Principles

A rigorous, bottom-up approach to understanding language models.

This book derives every concept from first principles. No hand-waving. No "it's well known that..." Every formula is explained, every claim is proven.

What Makes This Different

Most LLM tutorials tell you what to do. This book shows you why it works:

  • Full mathematical derivations - Chain rule proved by induction, MLE derived with Lagrange multipliers, smoothing derived from Bayesian priors (a preview of the MLE derivation follows this list)
  • Code from scratch - Every algorithm implemented in pure NumPy with comprehensive test suites
  • First principles pedagogy - Each concept builds only on what's already been covered
  • Exercises & common mistakes - Practice problems and debugging guides for every stage
  • Modern connections - See how Markov chains connect directly to GPT-4 and Claude
  • Interactive tools - Explore concepts with live visualizations
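To preview the derivation style, take the maximum-likelihood estimate for a categorical model: this is the standard textbook result, sketched here only as a teaser rather than the book's full treatment. With c_i the observed count of outcome i, maximize the log-likelihood subject to the probabilities summing to one:

$$
\mathcal{L}(p, \lambda) = \sum_i c_i \log p_i + \lambda\Big(1 - \sum_i p_i\Big),
\qquad
\frac{\partial \mathcal{L}}{\partial p_i} = \frac{c_i}{p_i} - \lambda = 0
\;\Rightarrow\;
p_i = \frac{c_i}{\lambda}.
$$

The constraint forces $\lambda = \sum_j c_j$, so the maximum-likelihood estimate is simply normalized counts, $\hat{p}_i = c_i / \sum_j c_j$.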

The Complete Journey

Stage | Topic | Key Concepts | Exercises
1 | Markov Chains | Probability, MLE, perplexity, temperature | 11
2 | Automatic Differentiation | Derivatives, chain rule, autograd | 11
3 | Neural Language Models | Embeddings, softmax, cross-entropy | 11
4 | Optimization | SGD, momentum, Adam, learning rate schedules | 11
5 | Attention | Dot-product attention, multi-head, causal masking | 11
6 | The Complete Transformer | Transformer blocks, LayerNorm, scaling laws | 11
7 | Tokenization | BPE, WordPiece, Unigram, vocabulary size | 12
8 | Training Dynamics | Loss curves, gradient statistics, debugging | 12
9 | Parameter-Efficient Fine-Tuning | LoRA, adapters, prefix tuning | 12
10 | Alignment | Reward modeling, RLHF, DPO | 12
Capstone | End-to-End Transformer | Complete trainable model from scratch | -
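To make Stage 1 concrete, here is a minimal pure-NumPy sketch of a bigram model: counts normalized into probabilities (the normalized-count MLE previewed above), with Laplace smoothing and temperature-controlled sampling. Function names and defaults are illustrative, not the book's actual code.

```python
import numpy as np

def fit_bigram(tokens, vocab_size, alpha=1.0):
    """Estimate P(next | current) from bigram counts.
    alpha > 0 is Laplace smoothing, so unseen bigrams keep nonzero probability."""
    counts = np.full((vocab_size, vocab_size), alpha)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev, nxt] += 1
    return counts / counts.sum(axis=1, keepdims=True)  # each row sums to 1

def generate(P, start, length, temperature=1.0, seed=0):
    """Sample a token sequence; temperature < 1 sharpens, > 1 flattens the distribution."""
    rng = np.random.default_rng(seed)
    out = [start]
    for _ in range(length):
        logits = np.log(P[out[-1]]) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        out.append(int(rng.choice(len(probs), p=probs)))
    return out

P = fit_bigram([0, 1, 2, 1, 0, 2, 2, 1], vocab_size=3)
print(generate(P, start=0, length=10))
```

Everything later in the book is still, at heart, a learned distribution over the next token given context; the stages change how that distribution is parameterized and trained.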

Learning Paths

Choose a path based on your goals:

The Fundamentals Path (Stages 1-6)

Best for: Understanding how transformers work

Progress through the core stages in order. By the end, you'll understand:

  • How language models predict next tokens
  • Why attention is the key innovation
  • What makes transformers trainable and scalable

Time estimate: 20-30 hours of focused study

The Practitioner Path (Stages 7-10)

Best for: People who want to fine-tune and deploy models

After completing fundamentals, focus on:

  • Stage 7: How tokenization affects model performance
  • Stage 8: Debugging training issues
  • Stage 9: Fine-tuning without training all parameters
  • Stage 10: Aligning models with human preferences

Prerequisites: Stages 1-6 or equivalent experience

The Deep Dive Path (All Stages + Capstone)

Best for: Researchers and those building from scratch

Complete all stages and the capstone project, which involves:

  • Implementing every backward pass manually (a small preview follows this path description)
  • Training a complete transformer on real text
  • Understanding exactly what autodiff does under the hood

Time estimate: 40-60 hours
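As a taste of what implementing a backward pass by hand looks like, here is a hedged sketch (illustrative names, not the capstone's actual code) of the softmax + cross-entropy backward, whose gradient with respect to the logits simplifies to the softmax probabilities minus the one-hot target, checked against finite differences:

```python
import numpy as np

def softmax_xent_forward(logits, target):
    """Cross-entropy loss of softmax(logits) against an integer class index."""
    shifted = logits - logits.max()                 # subtract max for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum()
    return -np.log(probs[target]), probs

def softmax_xent_backward(probs, target):
    """Hand-derived gradient w.r.t. the logits: softmax probabilities minus one-hot target."""
    grad = probs.copy()
    grad[target] -= 1.0
    return grad

# Verify the analytic gradient against a central finite-difference estimate.
logits, target, eps = np.array([2.0, -1.0, 0.5]), 2, 1e-5
loss, probs = softmax_xent_forward(logits, target)
grad = softmax_xent_backward(probs, target)
numeric = np.array([
    (softmax_xent_forward(logits + eps * np.eye(3)[i], target)[0]
     - softmax_xent_forward(logits - eps * np.eye(3)[i], target)[0]) / (2 * eps)
    for i in range(3)
])
print(np.allclose(grad, numeric, atol=1e-6))  # True
```

Gradient checking like this is a common way to validate hand-derived backward passes before trusting them in training.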

Prerequisites

  • Basic Python programming
  • High school algebra
  • Curiosity about how things work

No deep learning experience required. We build everything from the ground up.

How to Use This Book

Each stage is self-contained but builds on previous stages:

  1. Read the theory - Understand the mathematical foundations
  2. Study the code - See how theory translates to implementation
  3. Do the exercises - Solidify understanding through practice
  4. Review common mistakes - Learn from typical errors
  5. Reflect - Connect new concepts to the bigger picture

The book follows Polya's problem-solving method:

  • Understand the problem
  • Devise a plan
  • Execute the plan
  • Reflect on the solution

What You'll Build

By the end of this book, you will have implemented:

  • A Markov chain text generator
  • An automatic differentiation engine
  • A neural language model with embeddings
  • Optimizers (SGD, Adam) and learning rate schedulers
  • Multi-head self-attention from scratch (a single-head preview appears after this list)
  • A complete transformer architecture
  • BPE tokenization
  • Training diagnostics and debugging tools
  • LoRA fine-tuning
  • DPO alignment training
  • A complete trainable transformer with manual backpropagation
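For a flavor of the attention implementation, here is a minimal single-head NumPy sketch with causal masking; multi-head attention runs several of these in parallel and concatenates the results. Names and shapes are illustrative, not the book's actual code.

```python
import numpy as np

def causal_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product attention with a causal mask.
    x: (seq_len, d_model) token vectors; W_q, W_k, W_v: (d_model, d_head) projections."""
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    d_head = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_head)                      # (seq_len, seq_len) similarities
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)   # True above the diagonal
    scores = np.where(mask, -np.inf, scores)                # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
    return weights @ V                                      # weighted sum of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(causal_attention(x, W_q, W_k, W_v).shape)  # (4, 8)
```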

Get Started


More from the Author

The First Principles Trilogy

This book is part of a series teaching ML fundamentals from first principles:

📘 Building LLMs from First Principles (You are here)
Learn how transformers work by building them from scratch—full math derivations, working code, and comprehensive test suites. From Markov chains to GPT.

🔬 Mechanistic Interpretability from First Principles
Reverse-engineer neural networks to understand their internal algorithms. Features, superposition, circuits, and sparse autoencoders explained from the ground up.

The Algebra of Speed
Mathematical foundations of computational performance. Why FlashAttention, LoRA, and quantization work—and how to recognize when similar optimizations apply to your problems.

Blog

✍️ Software Bits — Short, focused essays on performance, ML, and computer science fundamentals. Subscribe for updates.

💻 GitHub: perf-bits — Blog posts with full code and interactive demos.