
Stage 6: The Complete Transformer — Putting It All Together

Estimated reading time: 4-5 hours | Prerequisites: Stages 1-5

Overview

This stage brings together everything we've learned to build and understand the complete Transformer architecture. We'll see how attention, embeddings, and optimization combine to create the foundation of modern LLMs like GPT-4, Claude, and LLaMA.

The central question: How do we combine all our components into a trainable language model?

What You'll Learn

By the end of this stage, you'll understand:

  1. Tokenization — How text becomes numbers (BPE, WordPiece); a toy BPE sketch follows this list
  2. The Transformer Block — Attention + FFN + Residuals + LayerNorm
  3. Stacking Layers — Building deep networks
  4. Pre-training Objectives — Causal LM, Masked LM, and variants
  5. Training at Scale — Batch size, learning rate, stability
  6. Modern Architectures — GPT, LLaMA, and design choices
  7. Scaling Laws — How performance relates to compute
  8. Implementation — Training a working Transformer
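
Item 1 is worth a quick preview: the core of byte-pair encoding is "repeatedly merge the most frequent adjacent pair of symbols." Below is a toy, illustrative sketch of that merge loop on a hand-made word-frequency table; Section 6.1 covers the real algorithm (byte-level handling, special tokens, and so on).

# Toy sketch of the BPE merge loop (illustrative only; see Section 6.1).
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical corpus: word -> frequency, each word split into characters
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(5):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair}")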

Sections

| Section | Topic | Key Concepts |
|---------|-------|--------------|
| 6.1 | Tokenization | Subword tokenization, BPE, vocabulary size trade-offs |
| 6.2 | The Transformer Block | Complete block architecture, residual streams |
| 6.3 | Building Deep Networks | Layer stacking, initialization, gradient flow |
| 6.4 | Pre-training Objectives | Causal LM, masked LM, next sentence prediction |
| 6.5 | Training at Scale | Large batch training, mixed precision, stability |
| 6.6 | Modern Architectures | GPT, LLaMA, Mistral, architectural choices |
| 6.7 | Scaling Laws | Chinchilla, compute-optimal training |
| 6.8 | Implementation | Training a complete Transformer |

The Complete Picture

                    ┌─────────────────────────────────────────┐
                    │          Token Embeddings               │
                    │    + Positional Encoding (Stage 5)      │
                    └─────────────────┬───────────────────────┘
           ┌──────────────────────────┼──────────────────────────┐
           │                          ▼                          │
           │    ┌─────────────────────────────────────────┐      │
           │    │         Multi-Head Attention            │      │
           │    │           (Stage 5.5)                   │      │
           │    └─────────────────┬───────────────────────┘      │
           │                      │                              │
           │              ┌───────┴───────┐                      │
     ×N    │              │   Add & Norm  │ ◄── Residual         │
   Layers  │              └───────┬───────┘                      │
           │                      │                              │
           │    ┌─────────────────┴───────────────────────┐      │
           │    │         Feed-Forward Network            │      │
           │    │           (Stage 5.8)                   │      │
           │    └─────────────────┬───────────────────────┘      │
           │                      │                              │
           │              ┌───────┴───────┐                      │
           │              │   Add & Norm  │ ◄── Residual         │
           │              └───────┬───────┘                      │
           └──────────────────────┼──────────────────────────────┘
                    ┌─────────────┴─────────────┐
                    │       Final LayerNorm     │
                    └─────────────┬─────────────┘
                    ┌─────────────┴─────────────┐
                    │    Output Projection      │
                    │   (to vocabulary logits)  │
                    └─────────────┬─────────────┘
                    ┌─────────────┴─────────────┐
                    │  Softmax → Next Token     │
                    │    (Stage 1, 3)           │
                    └───────────────────────────┘
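
To make the diagram concrete, here is a compact NumPy sketch of the data flow through a single layer. It is a minimal sketch, assuming one attention head, no dropout, and the post-norm ordering as drawn; the multi-head, trainable version is what the reference implementation (code/stage-06/transformer.py) builds.

# Minimal single-layer sketch of the diagram above (NumPy, one head, post-norm)
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -1e9)
    return softmax(scores) @ v

def feed_forward(x, W1, W2):
    # ReLU FFN; modern models often use GELU or SwiGLU (see design choices below)
    return np.maximum(0, x @ W1) @ W2

def transformer_block(x, params):
    # Attention sub-layer, then Add & Norm (post-norm, as in the diagram)
    x = layer_norm(x + causal_attention(x, params["Wq"], params["Wk"], params["Wv"]))
    # Feed-forward sub-layer, then Add & Norm
    x = layer_norm(x + feed_forward(x, params["W1"], params["W2"]))
    return x

# Tiny example: 4 tokens, model width 8, FFN width 16
rng = np.random.default_rng(0)
d, d_ff, T = 8, 16, 4
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "W1": (d, d_ff), "W2": (d_ff, d)}
params = {name: rng.normal(0.0, 0.02, shape) for name, shape in shapes.items()}
x = rng.normal(size=(T, d))
print(transformer_block(x, params).shape)   # (4, 8): same shape in, same shape out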

Building on Previous Stages

| Stage | Contribution to Transformers |
|-------|------------------------------|
| Stage 1 | Probability foundations, perplexity, temperature sampling |
| Stage 2 | Automatic differentiation for training |
| Stage 3 | Embeddings, cross-entropy loss |
| Stage 4 | Adam optimizer, learning rate schedules |
| Stage 5 | Attention mechanism, positional encoding, masking |
| Stage 6 | Complete architecture and training |

Key Architectural Decisions

Modern Transformers involve many design choices. The table below summarizes the most consequential ones; a short sketch of the pre-norm vs post-norm distinction follows it.

| Decision | Options | Trade-offs |
|----------|---------|------------|
| Normalization | Pre-norm vs Post-norm | Training stability vs final performance |
| Activation | ReLU, GELU, SwiGLU | Speed vs quality |
| Positional encoding | Sinusoidal, Learned, RoPE | Generalization vs expressivity |
| Attention | Full, Sliding window, Sparse | Context length vs compute |
| Architecture | Encoder-only, Decoder-only, Enc-Dec | Task suitability |
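
To make the first row concrete: pre-norm and post-norm differ only in where LayerNorm sits relative to the residual connection. A minimal sketch, with `sublayer` standing in for either the attention or the feed-forward sub-layer:

def post_norm_step(x, sublayer, norm):
    # Original Transformer (and the diagram above): residual first, then normalize
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm):
    # GPT-2/LLaMA-style: normalize the sub-layer input; the residual stream
    # itself is only normalized once, by the final LayerNorm
    return x + sublayer(norm(x))

Pre-norm variants (used by GPT-2, LLaMA, and most recent decoder-only models) tend to train more stably as depth grows, which is the trade-off the Normalization row summarizes.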

Code Preview

class Transformer:
    """
    Complete decoder-only Transformer for language modeling.

    This is what GPT, LLaMA, and similar models are built on.
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        n_heads: int = 8,
        n_layers: int = 6,
        d_ff: int = 2048,
        max_seq_len: int = 512,
        dropout: float = 0.1,
    ):
        # Token + position embeddings
        self.token_embedding = Embedding(vocab_size, d_model)
        self.pos_encoding = SinusoidalPositionalEncoding(max_seq_len, d_model)

        # Stack of Transformer blocks
        self.layers = [
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ]

        # Final projection to vocabulary
        self.final_norm = LayerNorm(d_model)
        self.output_proj = Linear(d_model, vocab_size)

    def forward(self, tokens, mask=None):
        # tokens: (seq_len,) or (batch, seq_len) integer token ids
        seq_len = tokens.shape[-1]

        # Embed tokens and add positional information
        x = self.token_embedding(tokens)
        x = x + self.pos_encoding(seq_len)

        # Apply causal mask (each position attends only to earlier positions)
        if mask is None:
            mask = create_causal_mask(seq_len)

        # Pass through layers
        for layer in self.layers:
            x = layer(x, mask)

        # Project to vocabulary
        x = self.final_norm(x)
        logits = self.output_proj(x)

        return logits
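
Once the forward pass produces vocabulary logits, training is next-token cross-entropy: the target at position t is the token at position t + 1. The sketch below shows that shift and the loss computation in NumPy; random logits stand in for model(inputs) so the snippet runs on its own, and the numbers are purely illustrative (Section 6.4 covers the objective in full).

# Causal LM objective sketch: shifted targets + cross-entropy (Stage 3.4)
import numpy as np

rng = np.random.default_rng(0)
vocab_size, T = 50, 9
tokens = rng.integers(0, vocab_size, size=T + 1)   # one toy training sequence

inputs  = tokens[:-1]    # what the model sees:    t_0 ... t_{T-1}
targets = tokens[1:]     # what it must predict:   t_1 ... t_T

logits = rng.normal(size=(len(inputs), vocab_size))  # stand-in for model(inputs)

# Cross-entropy via log-softmax, averaged over positions
shifted = logits - logits.max(-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()
print(f"loss ≈ {loss:.3f}  (uniform-guess baseline is ln(vocab_size) ≈ {np.log(vocab_size):.3f})")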

Prerequisites

Before starting this stage, ensure you understand:

  • [ ] Attention mechanism (Stage 5)
  • [ ] Multi-head attention (Stage 5.5)
  • [ ] Positional encoding (Stage 5.6)
  • [ ] Causal masking (Stage 5.7)
  • [ ] Adam optimizer (Stage 4.5)
  • [ ] Learning rate schedules (Stage 4.6)
  • [ ] Cross-entropy loss (Stage 3.4)

The Big Picture

Stage 1: Markov         → Fixed context, counting
Stage 2: Autograd       → Learning via gradients
Stage 3: Neural LM      → Continuous representations
Stage 4: Optimization   → Making learning work
Stage 5: Attention      → Dynamic context
Stage 6: Transformers   → Complete architecture ← YOU ARE HERE

This stage represents the culmination of our journey from first principles. After this, you'll understand the complete architecture behind modern LLMs.

Historical Context

  • 2017: Vaswani et al. publish "Attention Is All You Need"
  • 2018: GPT-1 demonstrates pre-training + fine-tuning
  • 2019: GPT-2 shows emergent capabilities at scale
  • 2020: GPT-3 (175B parameters) enables few-shot learning
  • 2022: ChatGPT brings LLMs to the mainstream
  • 2023-24: GPT-4, Claude, LLaMA 2/3, Mistral push boundaries

Exercises Preview

  1. Implement a Transformer: Build the complete architecture from scratch
  2. Train on text: Pre-train on a small corpus (Shakespeare, code, etc.)
  3. Ablation study: What happens with fewer layers? Fewer heads?
  4. Tokenizer comparison: Compare character-level vs BPE
  5. Scaling experiment: Plot loss vs compute for different model sizes

Begin

Start with Section 6.1: Tokenization

Code & Resources

| Resource | Description |
|----------|-------------|
| code/stage-06/transformer.py | Reference implementation |
| code/stage-06/tests/ | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |