Capstone: End-to-End Transformer Training¶
Putting it all together: A complete, trainable transformer from first principles
Overview¶
This capstone project brings together everything from Stages 1-10 into a single, complete transformer language model that you can train from scratch. Instead of leaning on a framework such as PyTorch and its automatic differentiation, we implement every component by hand, including the backward pass.
This is the final test of your understanding: if you can follow this code, you truly understand how transformers work at a fundamental level.
What This Demonstrates¶
| Stage | Concept | Where It Appears |
|---|---|---|
| 1 | Language modeling, perplexity | Loss function, evaluation |
| 2 | Backpropagation | Manual backward() methods |
| 3 | Neural networks | Embeddings, linear layers |
| 4 | Adam optimizer | Training loop |
| 5 | Attention | Multi-head self-attention |
| 6 | Transformers | Full architecture |
| 7 | Tokenization | Character tokenizer (extendable to BPE) |
| 8 | Training dynamics | Learning rate schedules, gradient clipping |
The Core Insight¶
The key insight of this capstone is that autodiff is not magic. Every backward() method we implement is exactly what PyTorch does automatically. By writing it ourselves, we understand:
- What gets cached during the forward pass
- How gradients flow backward through each operation
- Why certain architectural choices matter for gradient flow
- The computational cost of training vs. inference
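To make the first two points concrete, here is a minimal sketch of a single linear map with a hand-written backward pass. This is not the capstone's actual code: the class name, initialization scale, and shapes are illustrative assumptions, but the caching pattern and the chain-rule algebra are the same ideas the real layers use.

```python
import numpy as np

class Linear:
    """Illustrative sketch: y = x @ W with a manual backward pass."""

    def __init__(self, d_in, d_out, rng):
        self.W = rng.standard_normal((d_in, d_out)) * 0.02
        self.W_grad = None

    def forward(self, x):
        # Cache the input: backward needs it to compute dL/dW.
        self.cache = x
        return x @ self.W

    def backward(self, grad_output):
        x = self.cache
        # Chain rule: dL/dW = x^T @ dL/dy, dL/dx = dL/dy @ W^T
        self.W_grad = x.T @ grad_output
        return grad_output @ self.W.T

rng = np.random.default_rng(0)
layer = Linear(4, 3, rng)
x = rng.standard_normal((2, 4))
y = layer.forward(x)                      # forward: (2, 4) -> (2, 3)
grad_x = layer.backward(np.ones_like(y))  # backward: gradient w.r.t. the input
```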
Architecture Choices¶
The model uses modern architectural patterns:
| Choice | What It Is | Why It Matters |
|---|---|---|
| RMSNorm | Root mean square normalization | Simpler and faster than LayerNorm |
| SwiGLU | Gated linear unit with SiLU activation | Typically outperforms a plain GELU FFN |
| Pre-norm | Normalize before sublayer | More stable training for deep networks |
| Tied embeddings | Input embedding and output head share one weight matrix | Halves the embedding parameters |
| Causal masking | Can only attend to past tokens | Enables autoregressive generation |
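To ground the table, here is a rough NumPy sketch of three of these pieces. The function and weight names (`w_gate`, `w_up`, `w_down`) and the epsilon value are assumptions for illustration, not the capstone's exact signatures.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: rescale by the root-mean-square of the features, no mean subtraction.
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return weight * x / rms

def silu(x):
    # SiLU (swish): x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: the SiLU gate branch multiplies the "up" branch elementwise.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def causal_mask(seq_len):
    # Lower-triangular mask: position i may attend only to positions <= i.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))
```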
File Structure¶
```
code/capstone/
├── model.py              # Complete trainable transformer
├── train.py              # Training script with logging
└── tests/
    └── test_capstone.py  # 23 comprehensive tests
```
Key Components¶
Parameter Container¶
Every learnable parameter is wrapped in a Parameter class that stores both data and gradients:
```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Parameter:
    data: np.ndarray
    grad: Optional[np.ndarray] = None

    def zero_grad(self):
        self.grad = np.zeros_like(self.data)
```
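To see how an optimizer consumes these containers, below is a hedged sketch of an Adam step (Stage 4) applied to a list of `Parameter` objects; the hyperparameter defaults and state layout are assumptions, and the capstone's own optimizer may differ in details.

```python
import numpy as np

class Adam:
    """Sketch of Adam over Parameter objects; defaults here are assumptions."""

    def __init__(self, params, lr=3e-4, betas=(0.9, 0.999), eps=1e-8):
        self.params, self.lr, self.betas, self.eps = params, lr, betas, eps
        self.m = [np.zeros_like(p.data) for p in params]  # first-moment estimates
        self.v = [np.zeros_like(p.data) for p in params]  # second-moment estimates
        self.t = 0

    def step(self):
        # Assumes every p.grad has already been filled in by a backward pass.
        self.t += 1
        b1, b2 = self.betas
        for i, p in enumerate(self.params):
            self.m[i] = b1 * self.m[i] + (1 - b1) * p.grad
            self.v[i] = b2 * self.v[i] + (1 - b2) * p.grad ** 2
            m_hat = self.m[i] / (1 - b1 ** self.t)  # bias correction
            v_hat = self.v[i] / (1 - b2 ** self.t)
            p.data -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```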
Manual Backward Pass¶
Each layer implements its own backward pass using the chain rule:
```python
class FeedForward:
    def forward(self, x):
        # Cache values needed for backward
        self.cache = {'x': x, 'hidden': hidden, ...}
        return output

    def backward(self, grad_output):
        # Retrieve cached values
        x = self.cache['x']
        # Compute parameter gradients
        self.w1.grad = ...
        self.w2.grad = ...
        # Compute input gradient for chain rule
        return grad_input
```
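The full model's backward pass is just these per-layer methods composed in reverse order. Here is a simplified structural sketch of that composition; the class name and block list are illustrative, not the capstone's real model.

```python
class TransformerSketch:
    """Illustrative composition only; the capstone's model.py holds the real version."""

    def __init__(self, blocks):
        self.blocks = blocks  # e.g. [block_1, ..., block_N, final_norm, output_head]

    def forward(self, x):
        for block in self.blocks:  # each block caches what its backward needs
            x = block.forward(x)
        return x

    def backward(self, grad):
        for block in reversed(self.blocks):  # chain rule: walk the layers in reverse
            grad = block.backward(grad)
        return grad
```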
Training Loop¶
The complete training loop follows the pattern you learned in Stage 4:
```python
for epoch in range(epochs):
    for inputs, targets in batches:
        # Forward pass
        logits = model.forward(inputs)
        loss, grad = cross_entropy_loss(logits, targets)

        # Backward pass
        model.zero_grad()
        model.backward(grad)

        # Gradient clipping (Stage 8)
        clip_grad_norm(params, max_norm=1.0)

        # Optimizer step (Stage 4)
        optimizer.step()
        scheduler.step()
```
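The loop leans on two Stage 8 utilities: global-norm gradient clipping and a warmup-then-cosine learning-rate schedule. Hedged sketches of both follow; the signatures, warmup handling, and minimum learning rate are assumptions, and the capstone's `clip_grad_norm` and `WarmupCosine` may be organized differently.

```python
import numpy as np

def clip_grad_norm(params, max_norm=1.0):
    # Scale all gradients down together if their global L2 norm exceeds max_norm.
    total_norm = np.sqrt(sum(float(np.sum(p.grad ** 2)) for p in params))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        for p in params:
            p.grad *= scale
    return total_norm

def warmup_cosine_lr(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    # Linear warmup to max_lr, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + np.cos(np.pi * progress))
```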
Model Sizes¶
| Configuration | Parameters | Use Case |
|---|---|---|
| Tiny (default) | ~800K | Learning, debugging |
| Small | ~3M | Character-level text |
| Medium | ~12M | Actual generation |
For comparison:

- GPT-2 Small: 117M parameters
- GPT-2 XL: 1.5B parameters
- LLaMA 7B: 7B parameters
Our capstone is a toy model for learning, not production.
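As a back-of-the-envelope check on the "Tiny" figure, the arithmetic below assumes a character vocabulary of roughly 65 symbols, the 4-layer, 128-dimensional configuration mentioned later on this page, a SwiGLU hidden size of about (8/3)·d_model, and tied embeddings; these specific values are assumptions rather than the exact defaults in train.py.

```python
vocab, d_model, n_layers = 65, 128, 4           # assumed Tiny configuration
ffn_hidden = int(8 * d_model / 3)               # SwiGLU hidden size, roughly 341

embed = vocab * d_model                         # tied with the output head, counted once
per_layer = (
    4 * d_model * d_model                       # Wq, Wk, Wv, Wo
    + 3 * d_model * ffn_hidden                  # w_gate, w_up, w_down (SwiGLU)
    + 2 * d_model                               # two RMSNorm weight vectors
)
total = embed + n_layers * per_layer + d_model  # plus a final RMSNorm
print(f"~{total / 1e3:.0f}K parameters")        # roughly 800K
```

At this scale the attention and feed-forward matrices dominate; the tied embedding table is comparatively small.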
Quick Start¶
```bash
# Navigate to capstone directory
cd code/capstone

# Run with default settings (trains on Shakespeare)
python train.py

# Custom training
python train.py --epochs 50 --lr 3e-4 --d-model 256 --n-layers 6

# Train on your own text
python train.py --text-file path/to/your/text.txt --epochs 100

# Run tests
python tests/test_capstone.py
```
Understanding Through Tests¶
The test suite (test_capstone.py) covers:
- Utility functions: softmax, silu, causal mask
- Component tests: RMSNorm, Attention, FFN
- Full model tests: Forward shape, parameter count
- Gradient tests: Numerical gradient checking
- Training tests: Convergence on simple data
Running the tests is a great way to verify your understanding.
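The gradient tests hinge on numerical gradient checking: compare each analytic gradient against a central finite difference of the loss. Here is a minimal sketch of that check (the step size and tolerance are assumptions):

```python
import numpy as np

def numerical_grad(f, x, eps=1e-5):
    # Central finite difference: perturb each entry of x and watch the scalar loss f(x).
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        orig = x[idx]
        x[idx] = orig + eps
        plus = f(x)
        x[idx] = orig - eps
        minus = f(x)
        x[idx] = orig
        grad[idx] = (plus - minus) / (2 * eps)
    return grad

# Example: verify the analytic gradient of a squared-norm loss.
x = np.random.default_rng(0).standard_normal((3, 3))
loss = lambda x: float(np.sum(x ** 2))
analytic = 2 * x
assert np.allclose(numerical_grad(loss, x), analytic, atol=1e-4)
```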
Extending the Capstone¶
After mastering the basics, try these extensions:
- Replace CharTokenizer with BPE (Stage 7)
- Add training diagnostics (Stage 8)
- Implement LoRA fine-tuning (Stage 9)
- Add KV-cache for faster generation
- Implement Flash Attention for memory efficiency
Connection to Production Systems¶
Everything in this capstone maps directly to production LLMs:
| This Capstone | Production (PyTorch/JAX) |
|---|---|
| `model.forward()` | Same, but compiled |
| `model.backward()` | Automatic differentiation |
| Adam optimizer | Same algorithm |
| `WarmupCosine` schedule | Standard practice |
| `CharTokenizer` | BPE/SentencePiece |
| 4 layers, 128 dim | 80+ layers, 4096+ dim |
| NumPy on CPU | CUDA/TPU tensors |
The only differences are:

1. Scale: More layers, larger dimensions
2. Hardware: GPUs/TPUs instead of CPU
3. Engineering: Compiled kernels, distributed training
4. Tokenization: Subword instead of character
Key Takeaways¶
- Autodiff is just chain rule automation - We can implement it manually
- Caching is essential - Forward pass saves values for backward
- Gradient flow matters - Architecture choices affect trainability
- Scale is the difference - Same algorithms, more compute
- Everything connects - Each stage builds on previous ones
Next Steps¶
Congratulations on completing the capstone! You now have a deep understanding of how language models work. Consider:
- Read the GPT-2 paper with fresh eyes
- Explore the Hugging Face Transformers library source code
- Try training larger models on cloud GPUs
- Implement modern improvements like RoPE, GQA, or MoE
- Contribute to open-source LLM projects