# Stage 6 Exercises

## Conceptual Questions

### Exercise 6.1: Residual Connections
Consider a deep network without residual connections versus one with them.

- a) For a 100-layer network, what happens to gradients without residuals?
- b) How do residuals help? (Hint: what's the gradient of f(x) + x?)
- c) Why is this called a "residual" connection?
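If you want a numerical check of the hint in (b), here is a minimal finite-difference sketch; the toy layer `f` is invented purely for illustration:

```python
import numpy as np

def f(x):
    return 0.001 * np.tanh(x)   # a toy layer whose own gradient is tiny

x, h = 1.5, 1e-5
grad_plain = (f(x + h) - f(x - h)) / (2 * h)                              # ≈ f'(x), tiny
grad_residual = ((f(x + h) + (x + h)) - (f(x - h) + (x - h))) / (2 * h)   # ≈ f'(x) + 1
print(grad_plain, grad_residual)
```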
### Exercise 6.2: Layer Normalization
LayerNorm normalizes across features, BatchNorm across the batch.
- a) Why is LayerNorm preferred for transformers?
- b) What happens to BatchNorm with batch_size=1?
- c) What are the learnable parameters in LayerNorm?
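The following sketch contrasts the two normalization axes on a toy `[batch, features]` activation matrix (statistics only; the learnable parameters asked about in (c) are left out):

```python
import numpy as np

x = np.random.randn(4, 8)  # [batch=4, features=8]

# LayerNorm: statistics per example, taken across the feature axis
ln = (x - x.mean(axis=-1, keepdims=True)) / np.sqrt(x.var(axis=-1, keepdims=True) + 1e-5)

# BatchNorm (inference-style, no running statistics): statistics per feature, across the batch axis
bn = (x - x.mean(axis=0, keepdims=True)) / np.sqrt(x.var(axis=0, keepdims=True) + 1e-5)

print(ln.mean(axis=-1))  # ≈ 0 for every example
print(bn.mean(axis=0))   # ≈ 0 for every feature
```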
### Exercise 6.3: Pre-norm vs Post-norm
In pre-norm: x + Attention(LayerNorm(x))
In post-norm: LayerNorm(x + Attention(x))
- a) Which is used in GPT-2? GPT-3? LLaMA?
- b) Why has pre-norm become more popular for large models?
- c) What's the trade-off?
### Exercise 6.4: Scaling Laws
The scaling law says: Loss ≈ C / N^α, where N is the parameter count.
- a) If we double parameters, how much does loss decrease?
- b) If we want to halve the loss, how many times more parameters do we need?
- c) Why do these laws break down eventually?
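Once you have derived the relationships for (a) and (b) by hand, you can sanity-check the algebra with a small helper; the exponent used below is a placeholder, not a measured value:

```python
def loss_ratio(scale_factor, alpha):
    """Multiplicative change in loss when N is scaled by scale_factor, assuming Loss = C / N**alpha."""
    return scale_factor ** (-alpha)

def param_factor_for_loss_ratio(target_ratio, alpha):
    """Factor by which N must grow to multiply the loss by target_ratio."""
    return target_ratio ** (-1.0 / alpha)

print(loss_ratio(2.0, alpha=0.08))                    # doubling N
print(param_factor_for_loss_ratio(0.5, alpha=0.08))   # halving the loss
```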
## Implementation Exercises

### Exercise 6.5: Transformer Block
Implement a single transformer block:
```python
class TransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x, mask=None):
        """
        Pre-norm transformer block:
            x = x + Attention(Norm(x))
            x = x + FFN(Norm(x))
        """
        # TODO
        pass
```
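A possible smoke test once you fill in the block, assuming the `MultiHeadAttention` class from the previous stage plus the `FeedForward` and `LayerNorm` classes from Exercises 6.6 and 6.7 are in scope:

```python
import numpy as np

block = TransformerBlock(d_model=64, n_heads=4, d_ff=256)
x = np.random.randn(2, 10, 64)    # [batch, seq, d_model]
out = block.forward(x, mask=None)
assert out.shape == x.shape       # residual blocks preserve shape
```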
### Exercise 6.6: Feed-Forward Network
Implement the FFN sublayer:
```python
class FeedForward:
    def __init__(self, d_model, d_ff):
        self.W1 = np.random.randn(d_model, d_ff) * 0.02
        self.W2 = np.random.randn(d_ff, d_model) * 0.02
        self.b1 = np.zeros(d_ff)
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        """FFN(x) = GELU(x @ W1 + b1) @ W2 + b2"""
        # TODO
        pass
```
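The docstring assumes a GELU activation, which is not defined in this stub. One common choice is the tanh approximation used by GPT-2; a minimal sketch:

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the Gaussian Error Linear Unit."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))
```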
### Exercise 6.7: Layer Normalization
Implement LayerNorm:
```python
class LayerNorm:
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.eps = eps

    def forward(self, x):
        """
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_norm = (x - mean) / sqrt(var + eps)
        return gamma * x_norm + beta
        """
        # TODO
        pass
```
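A quick self-check: with the default gamma = 1 and beta = 0, every row of the output should have mean roughly 0 and variance roughly 1:

```python
import numpy as np

ln = LayerNorm(dim=16)
x = np.random.randn(4, 16) * 3.0 + 5.0
y = ln.forward(x)
print(np.allclose(y.mean(axis=-1), 0.0, atol=1e-6))  # True
print(np.allclose(y.var(axis=-1), 1.0, atol=1e-3))   # True
```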
### Exercise 6.8: Full Transformer Stack
Build a complete transformer:
```python
class Transformer:
    def __init__(self, vocab_size, d_model, n_heads, n_layers, d_ff, max_len):
        self.embed = Embedding(vocab_size, d_model)
        self.pos_embed = PositionalEncoding(max_len, d_model)
        self.layers = [
            TransformerBlock(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ]
        self.output = Linear(d_model, vocab_size)

    def forward(self, tokens):
        """tokens [batch, seq] -> logits [batch, seq, vocab]"""
        # TODO
        pass
```
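A possible end-to-end shape check, assuming `Embedding`, `PositionalEncoding`, and `Linear` from earlier stages are available:

```python
import numpy as np

model = Transformer(vocab_size=1000, d_model=64, n_heads=4,
                    n_layers=2, d_ff=256, max_len=128)
tokens = np.random.randint(0, 1000, size=(2, 16))  # [batch, seq]
logits = model.forward(tokens)
assert logits.shape == (2, 16, 1000)               # [batch, seq, vocab]
```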
## Challenge Exercises

### Exercise 6.9: Parameter Counting
For a transformer with:

- vocab_size = 50,000
- d_model = 768
- n_heads = 12
- n_layers = 12
- d_ff = 3072
- a) How many parameters in the embedding layer?
- b) How many in each attention sublayer?
- c) How many in each FFN sublayer?
- d) Total parameters?
Compare your answer to GPT-2 small (117M parameters).
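After working the numbers out by hand, you can compare against a counting script like the one below. The exact total depends on choices the exercise leaves open: whether you count biases and LayerNorm parameters, the positional embedding length (1,024 is assumed here), and whether the output projection is tied to the token embedding.

```python
def count_params(vocab_size, d_model, n_layers, d_ff, max_len=1024, tie_output=True):
    # n_heads does not appear: splitting into heads reuses the same projection matrices.
    embed = vocab_size * d_model + max_len * d_model   # token + learned position embeddings
    attn = 4 * d_model * d_model + 4 * d_model         # Wq, Wk, Wv, Wo plus biases
    ffn = 2 * d_model * d_ff + d_ff + d_model          # W1, W2 plus biases
    norms = 2 * (2 * d_model)                          # two LayerNorms (gamma, beta) per block
    head = 0 if tie_output else vocab_size * d_model   # untied output projection
    return embed + n_layers * (attn + ffn + norms) + head

print(count_params(50_000, 768, 12, 3072))  # compare with GPT-2 small
```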
### Exercise 6.10: RoPE Implementation
Implement Rotary Position Embeddings (RoPE):
```python
def apply_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """
    Apply rotary position embeddings.
    Instead of adding position, rotate the embedding based on position.
    """
    # TODO: Research and implement
    pass
```
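If you want a reference to compare your research against, below is one common formulation (conventions vary: some implementations rotate interleaved pairs as here, others rotate the two halves of the vector). The helper name `rope_reference` and the base of 10,000 are the usual defaults, not requirements.

```python
import numpy as np

def rope_reference(x, positions, base=10000.0):
    """Rotate pairs (2i, 2i+1) of the last dimension by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(d // 2) * 2.0 / d)   # one frequency per dimension pair
    angles = positions[..., None] * freqs            # [..., seq, d/2]
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]              # even / odd dimensions
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```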
### Exercise 6.11: SwiGLU Activation
Implement the SwiGLU activation (used in LLaMA, PaLM):
```python
class SwiGLU_FFN:
    def __init__(self, d_model, d_ff):
        # Note: SwiGLU has 3 weight matrices, not 2
        self.W_gate = ...
        self.W_up = ...
        self.W_down = ...

    def forward(self, x):
        """
        gate = silu(x @ W_gate)
        up = x @ W_up
        return (gate * up) @ W_down
        """
        # TODO
        pass
```
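The docstring assumes a `silu` function; a minimal definition (SiLU is also called swish):

```python
import numpy as np

def silu(x):
    """SiLU / swish: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))
```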
## Checking Your Work
- Test suite: See `code/stage-06/tests/test_transformer.py` for expected behavior
- Reference implementation: Compare with `code/stage-06/transformer.py`
- Self-check: Verify output shapes match expectations and gradients flow correctly
## Mini-Project: Tiny Transformer
Build a complete, working transformer language model from scratch.
### Requirements
- Architecture: At least 2 layers, 2 heads, 64 dimensions
- Training: Train on a text corpus until loss < 2.0
- Generation: Generate coherent text
### Deliverables
- [ ] Complete transformer implementation
- [ ] Training script with logging
- [ ] Loss curve showing convergence
- [ ] Generated text samples
- [ ] Parameter count breakdown
### Extension
Compare pre-norm vs. post-norm. Which trains more stably?
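For the comparison, a minimal post-norm variant of the block from Exercise 6.5 (assuming the same `MultiHeadAttention`, `FeedForward`, and `LayerNorm` classes, and that attention's `forward` takes `(x, mask)`):

```python
class PostNormTransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x, mask=None):
        # Post-norm: normalize after the residual addition
        x = self.norm1.forward(x + self.attention.forward(x, mask))
        x = self.norm2.forward(x + self.ffn.forward(x))
        return x
```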