Stage 5: Attention — The Key to Modern LLMs¶
Estimated reading time: 3-4 hours | Prerequisites: Stages 1-4
Overview¶
Attention is the single most important innovation in modern language models. This mechanism allows models to dynamically focus on relevant parts of the input, overcoming the fundamental limitations of fixed-context approaches.
The central question: How can a model learn which parts of the input are relevant for each output?
What You'll Learn¶
By the end of this stage, you'll understand:
- Why attention is necessary — The limitations of fixed-context models
- Dot-product attention — The mathematical foundation
- Scaled attention — Why we divide by √d
- Self-attention — A sequence attending to itself
- Multi-head attention — Multiple perspectives simultaneously
- Positional encoding — Injecting position information
- Causal masking — Preventing future information leakage
- The Transformer block — Putting it all together
Sections¶
| Section | Topic | Key Concepts |
|---|---|---|
| 5.1 | The Attention Problem | Fixed context limits, information bottleneck, alignment |
| 5.2 | Dot-Product Attention | Query, Key, Value, attention weights, softmax |
| 5.3 | Scaled Attention | Variance analysis, √d scaling, numerical stability |
| 5.4 | Self-Attention | Sequence-to-sequence, learned projections |
| 5.5 | Multi-Head Attention | Multiple heads, concatenation, different subspaces |
| 5.6 | Positional Encoding | Sinusoidal, learned, relative positions, RoPE |
| 5.7 | Causal Masking | Autoregressive models, masked attention |
| 5.8 | Implementation | Building attention from scratch |
Key Mathematical Results¶
Scaled Dot-Product Attention¶
\[
\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
\]
Where:
- Q (Query): What am I looking for? [n × d_k]
- K (Key): What do I contain? [m × d_k]
- V (Value): What should I return? [m × d_v]
- d_k: Key/query dimension
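A quick sanity check on the scaling factor (a sketch, assuming the components of \(q\) and \(k\) are independent with zero mean and unit variance):

\[
\operatorname{Var}(q \cdot k) = \operatorname{Var}\!\left(\sum_{i=1}^{d_k} q_i k_i\right) = \sum_{i=1}^{d_k} \operatorname{Var}(q_i k_i) = d_k,
\]

so dividing the scores by \(\sqrt{d_k}\) keeps their variance at 1 regardless of the key dimension. Section 5.3 works through this argument in full.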
Multi-Head Attention¶
\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O
\]
where \(\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)\)
We'll derive these formulas step by step and explain why each component is necessary.
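To preview how the multi-head formula maps onto code, here is a minimal PyTorch sketch. The class and argument names (`MultiHeadAttention`, `d_model`, `num_heads`) are illustrative choices, not the reference implementation in `code/stage-05/attention.py`.

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    """Minimal multi-head self-attention sketch (illustrative, not the reference code)."""

    def __init__(self, d_model: int, num_heads: int):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # One projection per role; each head later gets a d_k-sized slice.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)  # output projection W^O

    def forward(self, x, mask=None):
        batch, seq_len, _ = x.shape

        def split_heads(t):
            # [batch, seq_len, d_model] -> [batch, num_heads, seq_len, d_k]
            return t.view(batch, seq_len, self.num_heads, self.d_k).transpose(1, 2)

        Q, K, V = split_heads(self.W_q(x)), split_heads(self.W_k(x)), split_heads(self.W_v(x))
        # Scaled dot-product attention, computed for all heads in parallel
        scores = Q @ K.transpose(-2, -1) / self.d_k ** 0.5
        if mask is not None:
            scores = scores.masked_fill(mask == 0, float('-inf'))
        out = torch.softmax(scores, dim=-1) @ V
        # Concatenate heads back into d_model, then apply the output projection
        out = out.transpose(1, 2).contiguous().view(batch, seq_len, -1)
        return self.W_o(out)
```

Each head attends within its own `d_k`-dimensional subspace; concatenating the heads and applying `W_o` is the \(\text{Concat}(\ldots)W^O\) step in the formula above.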
Connection to Modern LLMs¶
This stage covers the exact attention mechanism used in:
- GPT-4, Claude, LLaMA: Decoder-only transformers with causal self-attention
- BERT: Encoder with bidirectional self-attention
- T5: Encoder-decoder with cross-attention
Understanding attention is essential for understanding how these models work.
Code Preview¶
```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """
    Scaled dot-product attention from scratch.

    Args:
        Q: Query matrix [batch, seq_len, d_k]
        K: Key matrix [batch, seq_len, d_k]
        V: Value matrix [batch, seq_len, d_v]
        mask: Optional attention mask (0 where attention is disallowed)

    Returns:
        Attention output [batch, seq_len, d_v]
    """
    d_k = Q.shape[-1]
    # Compute attention scores, scaled by sqrt(d_k)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)
    # Apply mask (for causal attention)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))
    # Softmax over the key dimension to get attention weights
    attention_weights = torch.softmax(scores, dim=-1)
    # Weighted sum of values
    return attention_weights @ V
```
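To try the function out, something like the following should work (shapes and values are arbitrary; `torch.tril` builds the lower-triangular causal mask discussed in Section 5.7):

```python
import torch

batch, seq_len, d_k, d_v = 2, 5, 16, 16
Q = torch.randn(batch, seq_len, d_k)
K = torch.randn(batch, seq_len, d_k)
V = torch.randn(batch, seq_len, d_v)

# Lower-triangular mask: position i may only attend to positions <= i
causal_mask = torch.tril(torch.ones(seq_len, seq_len))

out = scaled_dot_product_attention(Q, K, V, mask=causal_mask)
print(out.shape)  # torch.Size([2, 5, 16])
```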
Prerequisites¶
Before starting this stage, ensure you understand:
- [ ] Neural network forward/backward passes (Stage 2-3)
- [ ] Matrix multiplication and linear algebra
- [ ] Softmax function and its properties
- [ ] Cross-entropy loss (Stage 3)
- [ ] Optimization basics (Stage 4)
The Big Picture¶
Stage 1: Markov → Fixed context, counting
Stage 2: Autograd → Learning via gradients
Stage 3: Neural LM → Continuous representations
Stage 4: Optimization → Making learning work
Stage 5: Attention → Dynamic context ← YOU ARE HERE
Stage 6: Transformers → Full architecture
Attention addresses the fundamental limitation of all the previous approaches: a fixed context window. With attention, a model dynamically decides which parts of its context are relevant for each prediction, rather than being restricted to a fixed window of recent tokens.
Historical Context¶
- 2014: Bahdanau et al. introduce attention for machine translation
- 2015: Luong et al. simplify attention mechanisms
- 2017: Vaswani et al. publish "Attention Is All You Need" — the Transformer
- 2018-present: GPT, BERT, and the age of large language models
Exercises Preview¶
- Implement attention: Build scaled dot-product attention from scratch
- Visualize patterns: Plot attention weights on sample text
- Multi-head ablation: What happens with 1 head vs 8 vs 64?
- Positional encoding: Compare sinusoidal vs learned positions (a minimal sinusoidal sketch follows this list)
- Causal masking: Implement and verify no future information leakage
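For the positional encoding exercise, a sketch of the sinusoidal variant from Vaswani et al. (2017) might look like the following; the function name `sinusoidal_positions` is an illustrative choice.

```python
import math
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encodings: one d_model-dimensional vector per position."""
    positions = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)  # [seq_len, 1]
    dims = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimensions 2i
    freqs = torch.exp(-math.log(10000.0) * dims / d_model)               # 1 / 10000^(2i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(positions * freqs)  # even indices use sine
    pe[:, 1::2] = torch.cos(positions * freqs)  # odd indices use cosine
    return pe

pe = sinusoidal_positions(seq_len=128, d_model=64)
print(pe.shape)  # torch.Size([128, 64])
```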
Begin¶
→ Start with Section 5.1: The Attention Problem
Code & Resources¶
| Resource | Description |
|---|---|
| `code/stage-05/attention.py` | Reference implementation |
| `code/stage-05/tests/` | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |