Stage 5: Attention — The Key to Modern LLMs¶
Estimated reading time: 3-4 hours | Prerequisites: Stages 1-4
Overview¶
Attention is the single most important innovation in modern language models. This mechanism allows models to dynamically focus on relevant parts of the input, overcoming the fundamental limitations of fixed-context approaches.
The central question: How can a model learn which parts of the input are relevant for each output?
What You'll Learn¶
By the end of this stage, you'll understand:
- Why attention is necessary — The limitations of fixed-context models
- Dot-product attention — The mathematical foundation
- Scaled attention — Why we divide by √d
- Self-attention — A sequence attending to itself
- Multi-head attention — Multiple perspectives simultaneously
- Positional encoding — Injecting position information
- Causal masking — Preventing future information leakage
- The Transformer block — Putting it all together
Sections¶
| Section | Topic | Key Concepts |
|---|---|---|
| 5.1 | The Attention Problem | Fixed context limits, information bottleneck, alignment |
| 5.2 | Dot-Product Attention | Query, Key, Value, attention weights, softmax |
| 5.3 | Scaled Attention | Variance analysis, √d scaling, numerical stability |
| 5.4 | Self-Attention | Sequence-to-sequence, learned projections |
| 5.5 | Multi-Head Attention | Multiple heads, concatenation, different subspaces |
| 5.6 | Positional Encoding | Sinusoidal, learned, relative positions, RoPE |
| 5.7 | Causal Masking | Autoregressive models, masked attention |
| 5.8 | Implementation | Building attention from scratch |
Key Mathematical Results¶
Scaled Dot-Product Attention¶
\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
Where:
- Q (Query): What am I looking for? [n × d_k]
- K (Key): What do I contain? [m × d_k]
- V (Value): What should I return? [m × d_v]
- d_k: Key/query dimension
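A quick sanity check on the scaling factor (a sketch, assuming the components of \(q\) and \(k\) are independent with zero mean and unit variance):

\[
\operatorname{Var}(q \cdot k) = \operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k,
\]

so dividing the scores by \(\sqrt{d_k}\) keeps their variance at 1 regardless of the key dimension. Section 5.3 works through this argument in full.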
Multi-Head Attention¶
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O
\]
where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
We'll derive these formulas step by step and explain why each component is necessary.
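To preview how the multi-head formula maps onto code, here is a minimal PyTorch sketch. The class and argument names (`MultiHeadAttention`, `d_model`, `num_heads`) are illustrative choices, not the reference implementation in `code/stage-05/attention.py`.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative, not the reference code)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One projection per role; each head later gets a d_k-sized slice.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, x, mask=None):
        batch, seq_len, _ = x.shape

        def split_heads(t):
            # [batch, seq_len, d_model] -> [batch, num_heads, seq_len, d_k]
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        Q, K, V = split_heads(self.W_q(x)), split_heads(self.W_k(x)), split_heads(self.W_v(x))
        # Scaled dot-product attention, computed for all heads in parallel
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = torch.softmax(scores, dim=-1) @ V
        # Concatenate heads back into d_model, then apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.W_o(out)
```

Each head attends within its own `d_k`-dimensional subspace; concatenating the heads and applying `W_o` is the \(\text{Concat}(\ldots)W^O\) step in the formula above.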
Connection to Modern LLMs¶
This stage covers the exact attention mechanism used in:
- GPT-4, Claude, LLaMA: Decoder-only transformers with causal self-attention
- BERT: Encoder with bidirectional self-attention
- T5: Encoder-decoder with cross-attention
Understanding attention is essential for understanding how these models work.
Code Preview¶
```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention from scratch.

    Args:
        Q: Query matrix [batch, seq_len, d_k]
        K: Key matrix [batch, seq_len, d_k]
        V: Value matrix [batch, seq_len, d_v]
        mask: Optional attention mask (0 where attention is disallowed)

    Returns:
        Attention output [batch, seq_len, d_v]
    """
    d_k = Q.shape[-1]
    # Compute attention scores, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Apply mask (for causal attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Softmax over the key dimension to get attention weights
    attention_weights = torch.softmax(scores, dim=-1)
    # Weighted sum of values
    return attention_weights @ V
```
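To try the function out, something like the following should work (shapes and values are arbitrary; `torch.tril` builds the lower-triangular causal mask discussed in Section 5.7):

```python
import torch

batch, seq_len, d_k, d_v = 2, 5, 16, 16
Q = torch.randn(batch, seq_len, d_k)
K = torch.randn(batch, seq_len, d_k)
V = torch.randn(batch, seq_len, d_v)

# Lower-triangular mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)  # torch.Size([2, 5, 16])
```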
Prerequisites¶
Before starting this stage, ensure you understand:
- [ ] Neural network forward/backward passes (Stage 2-3)
- [ ] Matrix multiplication and linear algebra
- [ ] Softmax function and its properties
- [ ] Cross-entropy loss (Stage 3)
- [ ] Optimization basics (Stage 4)
The Big Picture¶
Stage 1: Markov → Fixed context, counting
Stage 2: Autograd → Learning via gradients
Stage 3: Neural LM → Continuous representations
Stage 4: Optimization → Making learning work
Stage 5: Attention → Dynamic context ← YOU ARE HERE
Stage 6: Transformers → Full architecture
Attention addresses the fundamental limitation of all the previous approaches: a fixed context window. With attention, a model dynamically decides which parts of its context are relevant for each prediction, rather than being restricted to a fixed window of recent tokens.
Historical Context¶
- 2014: Bahdanau et al. introduce attention for machine translation
- 2015: Luong et al. simplify attention mechanisms
- 2017: Vaswani et al. publish "Attention Is All You Need" — the Transformer
- 2018-present: GPT, BERT, and the age of large language models
Exercises Preview¶
- Implement attention: Build scaled dot-product attention from scratch
- Visualize patterns: Plot attention weights on sample text
- Multi-head ablation: What happens with 1 head vs 8 vs 64?
- Positional encoding: Compare sinusoidal vs learned positions (a minimal sinusoidal sketch follows this list)
- Causal masking: Implement and verify no future information leakage
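For the positional encoding exercise, a sketch of the sinusoidal variant from Vaswani et al. (2017) might look like the following; the function name `sinusoidal_positions` is an illustrative choice.

```python
import math
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings: one d_model-dimensional vector per position."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimensions 2i
    freqs = torch.exp(-math.log(10000.0) * dims / d_model)               # 1 / 10000^(2i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)  # even indices use sine
    pe[:, 1::2] = torch.cos(positions * freqs)  # odd indices use cosine
    return pe

pe = sinusoidal_positions(seq_len=128, d_model=64)
print(pe.shape)  # torch.Size([128, 64])
```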
Begin¶
→ Start with Section 5.1: The Attention Problem
Code & Resources¶
| Resource | Description |
|---|---|
| `code/stage-05/attention.py` | Reference implementation |
| `code/stage-05/tests/` | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |