
Stage 6: The Complete Transformer — Putting It All Together

Estimated reading time: 4-5 hours | Prerequisites: Stages 1-5

Overview

This stage brings together everything we've learned to build and understand the complete Transformer architecture. We'll see how attention, embeddings, and optimization combine to create the foundation of modern LLMs like GPT-4, Claude, and LLaMA.

The central question: How do we combine all our components into a trainable language model?

What You'll Learn

By the end of this stage, you'll understand:

  1. Tokenization — How text becomes numbers (BPE, WordPiece); a toy BPE sketch follows this list
  2. The Transformer Block — Attention + FFN + Residuals + LayerNorm
  3. Stacking Layers — Building deep networks
  4. Pre-training Objectives — Causal LM, Masked LM, and variants
  5. Training at Scale — Batch size, learning rate, stability
  6. Modern Architectures — GPT, LLaMA, and design choices
  7. Scaling Laws — How performance relates to compute
  8. Implementation — Training a working Transformer
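
Item 1 is worth a quick preview: the core of byte-pair encoding is "repeatedly merge the most frequent adjacent pair of symbols." Below is a toy, illustrative sketch of that merge loop on a hand-made word-frequency table; Section 6.1 covers the real algorithm (byte-level handling, special tokens, and so on).

# Toy sketch of the BPE merge loop (illustrative only; see Section 6.1).
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical corpus: word -> frequency, each word split into characters
words = {tuple("lower"): 5, tuple("lowest"): 2, tuple("newer"): 6, tuple("wider"): 3}
for step in range(5):
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print(f"merge {step + 1}: {pair}")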

Sections

| Section | Topic | Key Concepts |
|---------|-------|--------------|
| 6.1 | Tokenization | Subword tokenization, BPE, vocabulary size trade-offs |
| 6.2 | The Transformer Block | Complete block architecture, residual streams |
| 6.3 | Building Deep Networks | Layer stacking, initialization, gradient flow |
| 6.4 | Pre-training Objectives | Causal LM, masked LM, next sentence prediction |
| 6.5 | Training at Scale | Large batch training, mixed precision, stability |
| 6.6 | Modern Architectures | GPT, LLaMA, Mistral, architectural choices |
| 6.7 | Scaling Laws | Chinchilla, compute-optimal training |
| 6.8 | Implementation | Training a complete Transformer |

The Complete Picture

                    ┌─────────────────────────────────────────┐
                    │          Token Embeddings               │
                    │    + Positional Encoding (Stage 5)      │
                    └─────────────────┬───────────────────────┘
           ┌──────────────────────────┼──────────────────────────┐
           │                          ▼                          │
           │    ┌─────────────────────────────────────────┐      │
           │    │         Multi-Head Attention            │      │
           │    │           (Stage 5.5)                   │      │
           │    └─────────────────┬───────────────────────┘      │
           │                      │                              │
           │              ┌───────┴───────┐                      │
     ×N    │              │   Add & Norm  │ ◄── Residual         │
   Layers  │              └───────┬───────┘                      │
           │                      │                              │
           │    ┌─────────────────┴───────────────────────┐      │
           │    │         Feed-Forward Network            │      │
           │    │           (Stage 5.8)                   │      │
           │    └─────────────────┬───────────────────────┘      │
           │                      │                              │
           │              ┌───────┴───────┐                      │
           │              │   Add & Norm  │ ◄── Residual         │
           │              └───────┬───────┘                      │
           └──────────────────────┼──────────────────────────────┘
                    ┌─────────────┴─────────────┐
                    │       Final LayerNorm     │
                    └─────────────┬─────────────┘
                    ┌─────────────┴─────────────┐
                    │    Output Projection      │
                    │   (to vocabulary logits)  │
                    └─────────────┬─────────────┘
                    ┌─────────────┴─────────────┐
                    │  Softmax → Next Token     │
                    │    (Stage 1, 3)           │
                    └───────────────────────────┘
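
To make the diagram concrete, here is a compact NumPy sketch of the data flow through a single layer. It is a minimal sketch, assuming one attention head, no dropout, and the post-norm ordering as drawn; the multi-head, trainable version is what the reference implementation (code/stage-06/transformer.py) builds.

# Minimal single-layer sketch of the diagram above (NumPy, one head, post-norm)
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(-1, keepdims=True)

def causal_attention(x, Wq, Wk, Wv):
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: position t may only attend to positions <= t
    scores = np.where(np.tril(np.ones((T, T), dtype=bool)), scores, -1e9)
    return softmax(scores) @ v

def feed_forward(x, W1, W2):
    # ReLU FFN; modern models often use GELU or SwiGLU (see design choices below)
    return np.maximum(0, x @ W1) @ W2

def transformer_block(x, params):
    # Attention sub-layer, then Add & Norm (post-norm, as in the diagram)
    x = layer_norm(x + causal_attention(x, params["Wq"], params["Wk"], params["Wv"]))
    # Feed-forward sub-layer, then Add & Norm
    x = layer_norm(x + feed_forward(x, params["W1"], params["W2"]))
    return x

# Tiny example: 4 tokens, model width 8, FFN width 16
rng = np.random.default_rng(0)
d, d_ff, T = 8, 16, 4
shapes = {"Wq": (d, d), "Wk": (d, d), "Wv": (d, d), "W1": (d, d_ff), "W2": (d_ff, d)}
params = {name: rng.normal(0.0, 0.02, shape) for name, shape in shapes.items()}
x = rng.normal(size=(T, d))
print(transformer_block(x, params).shape)   # (4, 8): same shape in, same shape out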

Building on Previous Stages

| Stage | Contribution to Transformers |
|-------|------------------------------|
| Stage 1 | Probability foundations, perplexity, temperature sampling |
| Stage 2 | Automatic differentiation for training |
| Stage 3 | Embeddings, cross-entropy loss |
| Stage 4 | Adam optimizer, learning rate schedules |
| Stage 5 | Attention mechanism, positional encoding, masking |
| Stage 6 | Complete architecture and training |

Key Architectural Decisions

Modern Transformers involve many design choices. The table below summarizes the most consequential ones; a short sketch of the pre-norm vs post-norm distinction follows it.

| Decision | Options | Trade-offs |
|----------|---------|------------|
| Normalization | Pre-norm vs Post-norm | Training stability vs final performance |
| Activation | ReLU, GELU, SwiGLU | Speed vs quality |
| Positional encoding | Sinusoidal, Learned, RoPE | Generalization vs expressivity |
| Attention | Full, Sliding window, Sparse | Context length vs compute |
| Architecture | Encoder-only, Decoder-only, Enc-Dec | Task suitability |
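
To make the first row concrete: pre-norm and post-norm differ only in where LayerNorm sits relative to the residual connection. A minimal sketch, with `sublayer` standing in for either the attention or the feed-forward sub-layer:

def post_norm_step(x, sublayer, norm):
    # Original Transformer (and the diagram above): residual first, then normalize
    return norm(x + sublayer(x))

def pre_norm_step(x, sublayer, norm):
    # GPT-2/LLaMA-style: normalize the sub-layer input; the residual stream
    # itself is only normalized once, by the final LayerNorm
    return x + sublayer(norm(x))

Pre-norm variants (used by GPT-2, LLaMA, and most recent decoder-only models) tend to train more stably as depth grows, which is the trade-off the Normalization row summarizes.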

Code Preview

class Transformer:
    """
    Complete decoder-only Transformer for language modeling.

    This is what GPT, LLaMA, and similar models are built on.
    """

    def __init__(
        self,
        vocab_size: int,
        d_model: int = 512,
        n_heads: int = 8,
        n_layers: int = 6,
        d_ff: int = 2048,
        max_seq_len: int = 512,
        dropout: float = 0.1,
    ):
        # Token + position embeddings
        self.token_embedding = Embedding(vocab_size, d_model)
        self.pos_encoding = SinusoidalPositionalEncoding(max_seq_len, d_model)

        # Stack of Transformer blocks
        self.layers = [
            TransformerBlock(d_model, n_heads, d_ff, dropout)
            for _ in range(n_layers)
        ]

        # Final projection to vocabulary
        self.final_norm = LayerNorm(d_model)
        self.output_proj = Linear(d_model, vocab_size)

    def forward(self, tokens, mask=None):
        # tokens: (seq_len,) or (batch, seq_len) integer token ids
        seq_len = tokens.shape[-1]

        # Embed tokens and add positional information
        x = self.token_embedding(tokens)
        x = x + self.pos_encoding(seq_len)

        # Apply causal mask (each position attends only to earlier positions)
        if mask is None:
            mask = create_causal_mask(seq_len)

        # Pass through layers
        for layer in self.layers:
            x = layer(x, mask)

        # Project to vocabulary
        x = self.final_norm(x)
        logits = self.output_proj(x)

        return logits
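
Once the forward pass produces vocabulary logits, training is next-token cross-entropy: the target at position t is the token at position t + 1. The sketch below shows that shift and the loss computation in NumPy; random logits stand in for model(inputs) so the snippet runs on its own, and the numbers are purely illustrative (Section 6.4 covers the objective in full).

# Causal LM objective sketch: shifted targets + cross-entropy (Stage 3.4)
import numpy as np

rng = np.random.default_rng(0)
vocab_size, T = 50, 9
tokens = rng.integers(0, vocab_size, size=T + 1)   # one toy training sequence

inputs  = tokens[:-1]    # what the model sees:    t_0 ... t_{T-1}
targets = tokens[1:]     # what it must predict:   t_1 ... t_T

logits = rng.normal(size=(len(inputs), vocab_size))  # stand-in for model(inputs)

# Cross-entropy via log-softmax, averaged over positions
shifted = logits - logits.max(-1, keepdims=True)
log_probs = shifted - np.log(np.exp(shifted).sum(-1, keepdims=True))
loss = -log_probs[np.arange(len(targets)), targets].mean()
print(f"loss ≈ {loss:.3f}  (uniform-guess baseline is ln(vocab_size) ≈ {np.log(vocab_size):.3f})")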

Prerequisites

Before starting this stage, ensure you understand:

  • [ ] Attention mechanism (Stage 5)
  • [ ] Multi-head attention (Stage 5.5)
  • [ ] Positional encoding (Stage 5.6)
  • [ ] Causal masking (Stage 5.7)
  • [ ] Adam optimizer (Stage 4.5)
  • [ ] Learning rate schedules (Stage 4.6)
  • [ ] Cross-entropy loss (Stage 3.4)

The Big Picture

Stage 1: Markov         → Fixed context, counting
Stage 2: Autograd       → Learning via gradients
Stage 3: Neural LM      → Continuous representations
Stage 4: Optimization   → Making learning work
Stage 5: Attention      → Dynamic context
Stage 6: Transformers   → Complete architecture ← YOU ARE HERE

This stage represents the culmination of our journey from first principles. After this, you'll understand the complete architecture behind modern LLMs.

Historical Context

  • 2017: Vaswani et al. publish "Attention Is All You Need"
  • 2018: GPT-1 demonstrates pre-training + fine-tuning
  • 2019: GPT-2 shows emergent capabilities at scale
  • 2020: GPT-3 (175B parameters) enables few-shot learning
  • 2022: ChatGPT brings LLMs to the mainstream
  • 2023-24: GPT-4, Claude, LLaMA 2/3, Mistral push boundaries

Exercises Preview

  1. Implement a Transformer: Build the complete architecture from scratch
  2. Train on text: Pre-train on a small corpus (Shakespeare, code, etc.)
  3. Ablation study: What happens with fewer layers? Fewer heads?
  4. Tokenizer comparison: Compare character-level vs BPE
  5. Scaling experiment: Plot loss vs compute for different model sizes

Begin

Start with Section 6.1: Tokenization

Code & Resources

| Resource | Description |
|----------|-------------|
| code/stage-06/transformer.py | Reference implementation |
| code/stage-06/tests/ | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |