# Stage 6: The Complete Transformer — Putting It All Together
Estimated reading time: 4-5 hours | Prerequisites: Stages 1-5
## Overview
This stage brings together everything we've learned to build and understand the complete Transformer architecture. We'll see how attention, embeddings, and optimization combine to create the foundation of modern LLMs like GPT-4, Claude, and LLaMA.
The central question: How do we combine all our components into a trainable language model?
## What You'll Learn
By the end of this stage, you'll understand:
- Tokenization — How text becomes numbers (BPE, WordPiece); a toy BPE sketch follows this list
- The Transformer Block — Attention + FFN + Residuals + LayerNorm
- Stacking Layers — Building deep networks
- Pre-training Objectives — Causal LM, Masked LM, and variants
- Training at Scale — Batch size, learning rate, stability
- Modern Architectures — GPT, LLaMA, and design choices
- Scaling Laws — How performance relates to compute
- Implementation — Training a working Transformer
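The first bullet, tokenization, is easiest to see on a toy example. Below is a minimal byte-pair-encoding sketch, not the course's tokenizer: it repeatedly merges the most frequent adjacent pair of symbols into a new vocabulary entry. The corpus, frequencies, and helper names are made up for illustration.

```python
from collections import Counter

# Toy corpus: each word is a space-separated sequence of symbols, with its frequency.
corpus = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            pairs[pair] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, words):
    """Replace every occurrence of the pair with a single merged symbol."""
    a, b = pair
    return {word.replace(f"{a} {b}", f"{a}{b}"): freq for word, freq in words.items()}

for step in range(3):
    pair = most_frequent_pair(corpus)
    corpus = merge_pair(pair, corpus)
    print(f"merge {step + 1}: {pair}")   # on this corpus the first merge is ('e', 's') -> 'es'
print(corpus)
```

Each merge adds one entry to the subword vocabulary; running more merges trades a larger vocabulary for shorter token sequences, which is exactly the trade-off Section 6.1 examines.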
## Sections
| Section | Topic | Key Concepts |
|---|---|---|
| 6.1 | Tokenization | Subword tokenization, BPE, vocabulary size trade-offs |
| 6.2 | The Transformer Block | Complete block architecture, residual streams |
| 6.3 | Building Deep Networks | Layer stacking, initialization, gradient flow |
| 6.4 | Pre-training Objectives | Causal LM, masked LM, next sentence prediction |
| 6.5 | Training at Scale | Large batch training, mixed precision, stability |
| 6.6 | Modern Architectures | GPT, LLaMA, Mistral, architectural choices |
| 6.7 | Scaling Laws | Chinchilla, compute-optimal training (rule-of-thumb sketch below) |
| 6.8 | Implementation | Training a complete Transformer |
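As a preview of Section 6.7: the Chinchilla result is commonly summarized by two rules of thumb, training compute of roughly C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal token budget of roughly D ≈ 20·N. The sketch below is back-of-the-envelope arithmetic under those approximations (the function name and example budget are illustrative), not the fitted scaling law itself.

```python
import math

def compute_optimal(flop_budget: float) -> tuple[float, float]:
    """Given a FLOP budget C, solve C = 6*N*D with D = 20*N for the
    roughly compute-optimal parameter count N and token count D."""
    n_params = math.sqrt(flop_budget / 120)   # C = 6 * N * (20 * N) = 120 * N^2
    n_tokens = 20 * n_params
    return n_params, n_tokens

# Example: a 1e23-FLOP training budget.
N, D = compute_optimal(1e23)
print(f"~{N / 1e9:.0f}B parameters trained on ~{D / 1e9:.0f}B tokens")
```

Plugging in Chinchilla's own budget (about 5.8e23 FLOPs) recovers roughly 70B parameters and 1.4T tokens, which is a useful sanity check on the approximation.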
## The Complete Picture
┌─────────────────────────────────────────┐
│ Token Embeddings │
│ + Positional Encoding (Stage 5) │
└─────────────────┬───────────────────────┘
│
┌──────────────────────────┼──────────────────────────┐
│ ▼ │
│ ┌─────────────────────────────────────────┐ │
│ │ Multi-Head Attention │ │
│ │ (Stage 5.5) │ │
│ └─────────────────┬───────────────────────┘ │
│ │ │
│ ┌───────┴───────┐ │
×N │ │ Add & Norm │ ◄── Residual │
Layers │ └───────┬───────┘ │
│ │ │
│ ┌─────────────────┴───────────────────────┐ │
│ │ Feed-Forward Network │ │
│ │ (Stage 5.8) │ │
│ └─────────────────┬───────────────────────┘ │
│ │ │
│ ┌───────┴───────┐ │
│ │ Add & Norm │ ◄── Residual │
│ └───────┬───────┘ │
└──────────────────────┼──────────────────────────────┘
│
┌─────────────┴─────────────┐
│ Final LayerNorm │
└─────────────┬─────────────┘
│
┌─────────────┴─────────────┐
│ Output Projection │
│ (to vocabulary logits) │
└─────────────┬─────────────┘
│
┌─────────────┴─────────────┐
│ Softmax → Next Token │
│ (Stage 1, 3) │
└───────────────────────────┘
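To make the Add & Norm boxes concrete, here is a minimal NumPy sketch of one block as drawn: each sublayer's output is added back onto the residual stream and then normalized. `attn` and `ffn` stand in for the Stage 5 multi-head attention and feed-forward modules (identity placeholders here, just to show the wiring); the real modules appear in the Code Preview and Section 6.8.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    # Normalize each position's feature vector to zero mean and unit variance
    # (the learnable scale/shift parameters are omitted for brevity).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def transformer_block(x, attn, ffn):
    """One block as drawn above: sublayer -> add (residual) -> norm, twice."""
    x = layer_norm(x + attn(x))   # Multi-Head Attention, then Add & Norm
    x = layer_norm(x + ffn(x))    # Feed-Forward Network, then Add & Norm
    return x

# Wiring check with placeholder sublayers on a (seq_len=4, d_model=8) input.
x = np.random.randn(4, 8)
out = transformer_block(x, attn=lambda h: h, ffn=lambda h: h)
print(out.shape)  # (4, 8) -- the residual stream keeps its shape through every block
```

Because every block maps the residual stream back to the same shape, blocks can be stacked N times without any glue code, which is what the ×N bracket in the diagram expresses.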
## Building on Previous Stages
| Stage | Contribution to Transformers |
|---|---|
| Stage 1 | Probability foundations, perplexity, temperature sampling |
| Stage 2 | Automatic differentiation for training |
| Stage 3 | Embeddings, cross-entropy loss |
| Stage 4 | Adam optimizer, learning rate schedules |
| Stage 5 | Attention mechanism, positional encoding, masking |
| Stage 6 | Complete architecture and training |
## Key Architectural Decisions
Modern Transformers involve many design choices:
| Decision | Options | Trade-offs |
|---|---|---|
| Normalization | Pre-norm vs Post-norm | Training stability vs final performance (sketch below the table) |
| Activation | ReLU, GELU, SwiGLU | Speed vs quality |
| Positional encoding | Sinusoidal, Learned, RoPE | Generalization vs expressivity |
| Attention | Full, Sliding window, Sparse | Context length vs compute |
| Architecture | Encoder-only, Decoder-only, Enc-Dec | Task suitability |
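The first row deserves a closer look, because the two orderings differ only in where the normalization sits relative to the residual add. The sketch below illustrates both orderings with the sublayers and norms passed in as callables; it is not the reference implementation.

```python
def post_norm_block(x, attn, ffn, norm):
    # Original "Attention Is All You Need" ordering (and the one drawn in the
    # diagram above): sublayer, then Add & Norm. Normalizing the residual stream
    # itself tends to make very deep stacks harder to train without careful warmup.
    x = norm(x + attn(x))
    return norm(x + ffn(x))

def pre_norm_block(x, attn, ffn, norm1, norm2):
    # GPT-2 / LLaMA-style ordering: normalize the *input* to each sublayer and
    # leave the residual stream untouched; the "Final LayerNorm" box in the
    # diagram then cleans up the output after the last block.
    x = x + attn(norm1(x))
    return x + ffn(norm2(x))
```

Most modern decoder-only models use the pre-norm ordering precisely because it keeps gradients well-behaved as the number of layers grows; Section 6.3 returns to this point.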
## Code Preview

```python
class Transformer:
    """
    Complete decoder-only Transformer for language modeling.
    This is what GPT, LLaMA, and similar models are built on.
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        n_heads: int = 8,
        n_layers: int = 6,
        d_ff: int = 2048,
        max_seq_len: int = 512,
        dropout: float = 0.1,
    ):
        # Token + position embeddings
        self.token_embedding = Embedding(vocab_size, d_model)
        self.pos_encoding = SinusoidalPositionalEncoding(max_seq_len, d_model)

        # Stack of Transformer blocks
        self.layers = [
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ]

        # Final LayerNorm + projection to vocabulary
        self.final_norm = LayerNorm(d_model)
        self.output_proj = Linear(d_model, vocab_size)

    def forward(self, tokens, mask=None):
        # tokens has shape (..., seq_len); use the sequence length rather than
        # len(tokens), which would be the batch size for batched input.
        seq_len = tokens.shape[-1]

        # Embed tokens and add positional information
        x = self.token_embedding(tokens)
        x = x + self.pos_encoding(seq_len)

        # Build the causal mask if the caller didn't supply one
        if mask is None:
            mask = create_causal_mask(seq_len)

        # Pass through the stacked Transformer blocks
        for layer in self.layers:
            x = layer(x, mask)

        # Project to vocabulary logits
        x = self.final_norm(x)
        logits = self.output_proj(x)
        return logits
```
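What happens to those logits? Under the causal LM objective (Section 6.4), the logits at position t are scored against token t+1, and at generation time the last position's logits become a sampling distribution (Stage 1's temperature sampling). The sketch below is self-contained NumPy with random stand-in logits and an assumed (seq_len, vocab_size) shape; it is not the course's training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, vocab_size = 8, 100
tokens = rng.integers(0, vocab_size, size=seq_len)     # a training sequence
logits = rng.standard_normal((seq_len, vocab_size))    # stand-in for the Transformer's output

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Causal LM loss: position t predicts token t+1, so shift predictions and targets by one.
probs = softmax(logits[:-1])                            # predictions for positions 0..T-2
targets = tokens[1:]                                    # the tokens that actually came next
loss = -np.log(probs[np.arange(seq_len - 1), targets]).mean()
print(f"cross-entropy: {loss:.3f}  (log(vocab_size) = {np.log(vocab_size):.3f} is the uniform-guess baseline)")

# Generation: sample the next token from the last position's distribution,
# sharpened or flattened by a temperature (Stage 1).
temperature = 0.8
next_probs = softmax(logits[-1] / temperature)
next_token = rng.choice(vocab_size, p=next_probs)
print("sampled next token id:", next_token)
```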
## Prerequisites
Before starting this stage, ensure you understand:
- [ ] Attention mechanism (Stage 5)
- [ ] Multi-head attention (Stage 5.5)
- [ ] Positional encoding (Stage 5.6)
- [ ] Causal masking (Stage 5.7)
- [ ] Adam optimizer (Stage 4.5)
- [ ] Learning rate schedules (Stage 4.6)
- [ ] Cross-entropy loss (Stage 3.4)
## The Big Picture

```text
Stage 1: Markov        → Fixed context, counting
Stage 2: Autograd      → Learning via gradients
Stage 3: Neural LM     → Continuous representations
Stage 4: Optimization  → Making learning work
Stage 5: Attention     → Dynamic context
Stage 6: Transformers  → Complete architecture   ← YOU ARE HERE
```
This stage represents the culmination of our journey from first principles. After this, you'll understand the complete architecture behind modern LLMs.
## Historical Context
- 2017: Vaswani et al. publish "Attention Is All You Need"
- 2018: GPT-1 demonstrates pre-training + fine-tuning
- 2019: GPT-2 shows emergent capabilities at scale
- 2020: GPT-3 (175B parameters) enables few-shot learning
- 2022: ChatGPT brings LLMs to the mainstream
- 2023-24: GPT-4, Claude, LLaMA 2/3, Mistral push boundaries
## Exercises Preview
- Implement a Transformer: Build the complete architecture from scratch
- Train on text: Pre-train on a small corpus (Shakespeare, code, etc.)
- Ablation study: What happens with fewer layers? Fewer heads?
- Tokenizer comparison: Compare character-level vs BPE
- Scaling experiment: Plot loss vs compute for different model sizes
## Begin
→ Start with Section 6.1: Tokenization
## Code & Resources

| Resource | Description |
|---|---|
| `code/stage-06/transformer.py` | Reference implementation |
| `code/stage-06/tests/` | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |