Section 6.6: Modern Architectures — GPT, LLaMA, and Beyond¶
Reading time: 20 minutes | Difficulty: ★★★☆☆
While the core Transformer architecture has remained stable, modern LLMs incorporate numerous refinements. This section surveys the architectural choices in state-of-the-art models.
The Evolution of Architectures¶
2017: Original Transformer (encoder-decoder)
2018: GPT-1 (decoder-only)
2019: GPT-2 (larger, pre-norm)
2020: GPT-3 (175B, few-shot learning)
2022: ChatGPT (RLHF-aligned)
2023: GPT-4, LLaMA, Claude 2
2024: LLaMA 3, Claude 3, Mistral, Mixtral
Each generation brings refinements that improve efficiency, capability, or both.
Decoder-Only vs Encoder-Decoder¶
Encoder-Decoder (T5, BART)¶
Encoder (bidirectional):
Input: "Translate to French: Hello"
Output: Encoded representations
Decoder (causal):
Input: Encoded + "Bonjour"
Output: Next token probabilities
Pros: Bidirectional encoding; well suited to translation and summarization.
Cons: More complex; two separate components to train and serve.
Decoder-Only (GPT, LLaMA, Claude)¶
Pros: Simpler, unified architecture; scales well.
Cons: No bidirectional context (which appears to matter little at scale).
Winner: Decoder-only has become dominant for general-purpose LLMs.
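In practice the difference comes down to a single causal mask: a decoder-only model restricts each position to attend only to earlier positions, while an encoder omits the mask and attends bidirectionally. A minimal illustrative sketch (not from any particular codebase):
import torch
# Causal mask: position i may attend to positions 0..i only (decoder-only behaviour)
def causal_mask(seq_len):
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask(4))
# tensor([[ True, False, False, False],
#         [ True,  True, False, False],
#         [ True,  True,  True, False],
#         [ True,  True,  True,  True]])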
GPT Architecture¶
The original and still most influential decoder-only design:
GPT-2 Specifics¶
GPT2Config = {
# Sizes: small, medium, large, xl
'd_model': [768, 1024, 1280, 1600],
'n_layers': [12, 24, 36, 48],
'n_heads': [12, 16, 20, 25],
'vocab_size': 50257,
'max_seq_len': 1024,
# Architecture
'activation': 'gelu',
'layer_norm': 'pre_norm',
'tie_embeddings': True, # Input/output embeddings shared
}
Key GPT Innovations¶
| Feature | Details |
|---|---|
| Pre-norm | LayerNorm before attention/FFN |
| GELU activation | Smoother than ReLU |
| Learned positions | Not sinusoidal |
| BPE tokenization | 50K vocabulary |
LLaMA Architecture¶
Meta's open-weight models with modern refinements:
LLaMA 2 Specifics¶
LLaMA2Config = {
# 7B, 13B, 70B variants
'd_model': [4096, 5120, 8192],
'n_layers': [32, 40, 80],
'n_heads': [32, 40, 64],
'd_ff': [11008, 13824, 28672], # ~2.7× d_model (not 4×)
'vocab_size': 32000,
'max_seq_len': 4096,
# Architecture
'activation': 'silu', # SwiGLU variant
'layer_norm': 'rmsnorm', # Simpler than LayerNorm
'positional': 'rope', # Rotary embeddings
'tie_embeddings': False,
}
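As a rough sanity check, the 7B parameter count can be reproduced from this config. The back-of-the-envelope sketch below ignores the RMSNorm weights, which are negligible:
d_model, n_layers, d_ff, vocab = 4096, 32, 11008, 32000
embed = vocab * d_model            # input embeddings, ~0.13B
attn  = 4 * d_model * d_model      # Wq, Wk, Wv, Wo per layer
ffn   = 3 * d_model * d_ff         # SwiGLU needs three matrices per layer
head  = vocab * d_model            # untied output projection
total = embed + n_layers * (attn + ffn) + head
print(f"{total / 1e9:.2f}B")       # ~6.74B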
LLaMA Innovations¶
| Feature | Improvement Over GPT |
|---|---|
| RMSNorm | Faster, simpler |
| SwiGLU | Better quality |
| RoPE | Better length generalization |
| Grouped-Query Attention (GQA) | Faster inference (70B variant) |
Grouped-Query Attention¶
Standard: Each attention head has its own Q, K, V
GQA: Multiple query heads share K, V heads
Standard MHA (8 heads):
Q: 8 heads × d_k
K: 8 heads × d_k
V: 8 heads × d_k
GQA (8 query heads, 2 KV heads):
Q: 8 heads × d_k
K: 2 heads × d_k (shared by 4 query heads each)
V: 2 heads × d_k
Benefit: Smaller KV cache, faster inference
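A minimal sketch of the sharing, using the common trick of repeating each KV head to match the number of query heads before running ordinary attention (illustrative, not any model's actual code):
import torch
n_heads, n_kv_heads, d_k, seq = 8, 2, 64, 16
group = n_heads // n_kv_heads              # 4 query heads per KV head
q = torch.randn(seq, n_heads, d_k)
k = torch.randn(seq, n_kv_heads, d_k)      # only 2 heads live in the KV cache
v = torch.randn(seq, n_kv_heads, d_k)
# Expand K, V along the head dimension so shapes match Q; attention then
# proceeds exactly as in standard multi-head attention
k = k.repeat_interleave(group, dim=1)      # (seq, 8, d_k)
v = v.repeat_interleave(group, dim=1)
scores = torch.einsum('qhd,khd->hqk', q, k) / d_k ** 0.5
out = torch.einsum('hqk,khd->qhd', scores.softmax(-1), v)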
Mistral Architecture¶
Efficient model with sliding window attention:
Mistral 7B Specifics¶
MistralConfig = {
'd_model': 4096,
'n_layers': 32,
'n_heads': 32,
'n_kv_heads': 8, # GQA
'vocab_size': 32000,
'max_seq_len': 8192,
'sliding_window': 4096, # Local attention window
'activation': 'silu',
'layer_norm': 'rmsnorm',
'positional': 'rope',
}
Sliding Window Attention¶
Instead of attending to all previous tokens:
Full attention (O(n²)):
Token 1000 attends to tokens 0-999
Sliding window (O(n×w)):
Token 1000 attends to tokens 997-1000 (itself plus the previous 3, window=4)
BUT: Through multiple layers, information propagates:
Layer 1: see 4 tokens back
Layer 2: see 8 tokens back (4 × 2)
Layer 32: see 128 tokens back
This enables long contexts with less compute.
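A sketch of the corresponding attention mask, assuming a causal window of width w (hypothetical helper, not Mistral's implementation):
import torch
# Each position attends to itself and the previous (w - 1) positions
def sliding_window_mask(seq_len, w):
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - w)            # causal AND inside the window
print(sliding_window_mask(6, 3).int())
# row 5: token 5 attends only to positions 3, 4, 5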
Mixtral (Mixture of Experts)¶
Sparse model that activates only part of the network:
MoE Architecture¶
Input
│
▼
┌───────────────┐
│ Router │ → Selects 2 of 8 experts
└───────────────┘
│
┌───────┬─────┼─────┬───────┐
│ │ │ │ │
▼ ▼ ▼ ▼ ▼
Exp1 Exp2 Exp3 ... Exp8
│ │ │ │ │
└───────┴─────┼─────┴───────┘
│
Weighted sum
│
▼
Output
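A minimal sketch of the routing step, assuming a standard top-2 softmax router over 8 experts. Each expert is a plain Linear layer here to keep the example short, where Mixtral uses a full SwiGLU FFN per expert:
import torch
import torch.nn as nn
d_model, n_experts, top_k = 512, 8, 2
experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
router = nn.Linear(d_model, n_experts)
def moe_forward(x):                               # x: (tokens, d_model)
    weights = router(x).softmax(dim=-1)           # (tokens, n_experts)
    top_w, top_idx = weights.topk(top_k, dim=-1)  # choose 2 experts per token
    top_w = top_w / top_w.sum(-1, keepdim=True)   # renormalize the chosen weights
    out = torch.zeros_like(x)
    for k in range(top_k):                        # only the selected experts run
        for e in range(n_experts):
            hit = top_idx[:, k] == e
            if hit.any():
                out[hit] += top_w[hit, k:k+1] * experts[e](x[hit])
    return out
y = moe_forward(torch.randn(4, d_model))          # weighted sum of 2 expert outputs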
MoE Benefits¶
| Aspect | Dense Model | MoE Model |
|---|---|---|
| Parameters | 7B | 46B (8 experts) |
| Active params | 7B | 12B (2 experts) |
| Quality | Good | Better (more params) |
| Inference speed | Baseline (7B) | Close to a ~13B dense model (compute scales with active params only) |
MoE provides "more parameters for the same compute."
Modern Architectural Components¶
Activation Functions¶
import math
import torch
# ReLU (original Transformer)
def relu(x):
    return torch.clamp(x, min=0)
# GELU (GPT-2, BERT): tanh approximation
def gelu(x):
    return 0.5 * x * (1 + torch.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))
# SiLU/Swish (LLaMA)
def silu(x):
    return x * torch.sigmoid(x)
# SwiGLU (LLaMA FFN): gated FFN built from three weight matrices
def swiglu(x, W1, W2, W3):
    return (silu(x @ W1) * (x @ W3)) @ W2
SwiGLU requires an extra weight matrix but improves quality.
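For scale, a usage sketch of swiglu with the 7B dimensions from the LLaMA config above (randomly initialized weights, illustrative only). The third matrix is also why d_ff drops from 4× to ~2.7× d_model: it keeps the FFN parameter count roughly constant.
import torch
d_model, d_ff = 4096, 11008
W1 = torch.randn(d_model, d_ff) * 0.02   # gate projection
W3 = torch.randn(d_model, d_ff) * 0.02   # up projection
W2 = torch.randn(d_ff, d_model) * 0.02   # down projection
x = torch.randn(2, d_model)              # two token vectors
y = swiglu(x, W1, W2, W3)                # shape (2, d_model)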
Normalization¶
import torch
# LayerNorm (GPT): normalize to zero mean and unit variance, then scale and shift
def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    return gamma * (x - mean) / torch.sqrt(var + eps) + beta
# RMSNorm (LLaMA): no mean subtraction, no beta; rescale by the root mean square
def rms_norm(x, gamma, eps=1e-6):
    rms = torch.sqrt((x ** 2).mean(-1, keepdim=True) + eps)
    return gamma * x / rms
RMSNorm is ~7% faster with similar quality.
Positional Encoding¶
# Learned (GPT)
pos_emb = Embedding(max_len, d_model)
# Sinusoidal (original Transformer)
# See Stage 5.6
# RoPE (LLaMA): rotates query/key vectors by position-dependent angles
def rope(q, k, positions):
    # The rotation makes the Q·K dot product encode relative position;
    # see the runnable sketch of rotate() below
    return rotate(q, positions), rotate(k, positions)
# ALiBi (alternative)
# Add distance-based bias to attention scores
scores = Q @ K.T - m * abs(i - j)
RoPE has become the dominant choice for modern LLMs.
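A minimal runnable sketch of the rotate helper used above, pairing consecutive dimensions into 2-D rotations whose angle grows with position (illustrative, not any library's API):
import torch
# Rotate each (even, odd) pair of dimensions by a position-dependent angle
def rotate(x, positions, base=10000):
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2).float() / d)   # (d/2,)
    angles = positions.unsqueeze(-1).float() * freqs       # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                    # split into pairs
    return torch.stack((x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos), dim=-1).flatten(-2)
q = rotate(torch.randn(8, 64), torch.arange(8))
k = rotate(torch.randn(8, 64), torch.arange(8))
# q[i] · k[j] now depends on the offset i - j rather than absolute position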
Architecture Comparison¶
| Component | GPT-2 | LLaMA 2 | Mistral | Mixtral |
|---|---|---|---|---|
| Attention | Full | Full | Sliding Window | Full |
| KV Heads | MHA | GQA | GQA | GQA |
| FFN | GELU | SwiGLU | SwiGLU | SwiGLU (MoE) |
| Norm | LayerNorm | RMSNorm | RMSNorm | RMSNorm |
| Position | Learned | RoPE | RoPE | RoPE |
Emerging Techniques¶
Flash Attention¶
Not an architecture change, but an efficient attention implementation:
Standard attention: O(n²) memory
Flash attention: O(n) memory, same result
Key idea: Compute attention in tiles, never materialize full n×n matrix
This enables much longer contexts (100K+ tokens).
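In PyTorch, torch.nn.functional.scaled_dot_product_attention exposes this idea: it dispatches to a FlashAttention-style fused kernel when the backend supports it (typically fp16/bf16 on recent GPUs), so the full n×n score matrix is never materialized:
import torch
import torch.nn.functional as F
# (batch, heads, seq, head_dim); the fused kernel avoids the seq×seq matrix
q = torch.randn(1, 8, 4096, 64, device='cuda', dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)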
State Space Models (Mamba)¶
Alternative to attention with linear scaling:
Still being explored, may complement or replace attention.
Multimodal Integration¶
Modern models handle text + images + audio:
Image: → Vision Encoder → Tokens
Text: → Tokenizer → Tokens
Audio: → Audio Encoder → Tokens
All → Unified Transformer → Output
Architecture is extended but core Transformer remains.
Architectural Trade-offs¶
Quality vs Efficiency¶
- More parameters → better quality, but slower
- More layers → better quality, but harder to train
- Larger d_model → better quality, but more memory
- MoE → better quality for the same compute
- GQA → faster inference, at a small quality cost
Length vs Compute¶
- Full attention: great quality, O(n²) compute
- Sliding window: good quality, O(n×w) compute
- Linear attention: unproven quality, O(n) compute
Open vs Closed Weights¶
| Model | Open Weights | API Access |
|---|---|---|
| GPT-4 | No | Yes |
| Claude | No | Yes |
| LLaMA 2/3 | Yes | Yes |
| Mistral | Yes | Yes |
Open weights enable research and customization.
Connection to Modern LLMs¶
The current frontier:
- GPT-4: Rumored MoE with 8 experts, ~1.8T total params
- Claude 3: Architecture not disclosed
- LLaMA 3: Open weights, very strong performance
- Mistral Large: Competitive with GPT-4
The best architectures keep improving, but the core Transformer block remains recognizable.
Exercises¶
- Implement RMSNorm: Replace LayerNorm with RMSNorm. Compare speed.
- Implement GQA: Modify multi-head attention to share KV heads.
- Sliding window: Implement sliding window attention. Test on long sequences.
- Activation comparison: Compare ReLU, GELU, SiLU on a small task.
- RoPE vs learned: Compare rotary and learned positional encodings.
Summary¶
| Model | Key Innovation | Impact |
|---|---|---|
| GPT | Decoder-only, pre-norm | Foundation of modern LLMs |
| LLaMA | RMSNorm, SwiGLU, RoPE | Efficient open model |
| Mistral | Sliding window, GQA | Long context efficiency |
| Mixtral | MoE | More params, same compute |
Key takeaway: Modern LLM architectures build on the original Transformer with incremental improvements: RMSNorm for speed, SwiGLU for quality, RoPE for length generalization, and GQA for inference efficiency. More radical changes like MoE and sliding window attention address scaling challenges. Despite continuous refinement, the fundamental attention + FFN + residual structure remains unchanged since 2017.
→ Next: Section 6.7: Scaling Laws