Stage 6 Exercises

Conceptual Questions

Exercise 6.1: Residual Connections

Consider a network without residual connections vs with them.

a) For a 100-layer network, what happens to gradients without residuals?
b) How do residuals help? (Hint: what's the gradient of f(x) + x?)
c) Why is this called a "residual" connection?
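
A quick numerical sketch of part (a), using made-up per-layer gradient scale factors (an illustrative assumption, not a real network):

import numpy as np

rng = np.random.default_rng(0)

# Toy model: each of 100 layers multiplies the backward gradient by a
# random factor slightly below 1.
factors = rng.uniform(0.7, 0.99, size=100)
print(np.prod(factors))  # on the order of 1e-8: the gradient has vanished

# With residuals each layer computes x + f(x), so its local Jacobian is
# I + J_f: the identity term gives the gradient a direct additive path
# through all 100 layers instead of being multiplied away.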

Exercise 6.2: Layer Normalization

LayerNorm normalizes each example across its feature dimension; BatchNorm normalizes each feature across the batch dimension.

a) Why is LayerNorm preferred for transformers?
b) What happens to BatchNorm with batch_size=1?
c) What are the learnable parameters in LayerNorm?
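
A minimal sketch of the difference in reduction axes, which also shows why part (b) is a problem:

import numpy as np

x = np.random.randn(1, 4)  # batch_size=1, 4 features

# LayerNorm: statistics per example, across features -- well defined
# even for a single example.
ln_mean = x.mean(axis=-1, keepdims=True)
ln_var = x.var(axis=-1, keepdims=True)

# BatchNorm: statistics per feature, across the batch. With one example,
# every feature's batch variance is exactly 0, so normalization degenerates.
bn_var = x.var(axis=0, keepdims=True)
print(bn_var)  # [[0. 0. 0. 0.]]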

Exercise 6.3: Pre-norm vs Post-norm

Pre-norm:  x + Attention(LayerNorm(x))
Post-norm: LayerNorm(x + Attention(x))

a) Which is used in GPT-2? GPT-3? LLaMA?
b) Why has pre-norm become more popular for large models?
c) What's the trade-off?
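
For concreteness, the two orderings as code, with attn, ffn, and the norms standing in for sublayers you will build in Exercise 6.5 (hypothetical placeholder callables):

def pre_norm_block(x, attn, ffn, norm1, norm2):
    # Normalize *inside* the residual branch; the skip path is untouched.
    x = x + attn(norm1(x))
    x = x + ffn(norm2(x))
    return x

def post_norm_block(x, attn, ffn, norm1, norm2):
    # Normalize *after* the residual add; the skip path passes through norms.
    x = norm1(x + attn(x))
    x = norm2(x + ffn(x))
    return x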

Exercise 6.4: Scaling Laws

Parameter scaling laws take the approximate form Loss ≈ C / N^α, where N is the parameter count and C and α are empirically fitted constants.

a) If we double parameters, how much does loss decrease?
b) If we want to halve the loss, how many times more parameters do we need?
c) Why do these laws break down eventually?
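
A worked check for parts (a) and (b), assuming the parameter exponent α ≈ 0.076 reported by Kaplan et al. (2020); the exact value varies by setup:

alpha = 0.076  # empirical exponent from Kaplan et al.; setup-dependent

# Loss ∝ N^(-alpha), so multiplying N by k multiplies loss by k^(-alpha).
print(2 ** -alpha)       # ~0.95: doubling parameters cuts loss only ~5%
print(2 ** (1 / alpha))  # ~9,000x more parameters to halve the loss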


Implementation Exercises

Exercise 6.5: Transformer Block

Implement a single transformer block:

class TransformerBlock:
    def __init__(self, d_model, n_heads, d_ff):
        self.attention = MultiHeadAttention(d_model, n_heads)
        self.ffn = FeedForward(d_model, d_ff)
        self.norm1 = LayerNorm(d_model)
        self.norm2 = LayerNorm(d_model)

    def forward(self, x, mask=None):
        """
        Pre-norm transformer block:
        x = x + Attention(Norm(x))
        x = x + FFN(Norm(x))
        """
        # TODO
        pass
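
Once your forward is filled in, a quick shape self-check (assuming the sublayer classes from the later exercises, with the constructor signatures shown above):

import numpy as np

block = TransformerBlock(d_model=64, n_heads=4, d_ff=256)
x = np.random.randn(2, 10, 64)  # [batch, seq, d_model]
out = block.forward(x)
assert out.shape == x.shape     # residual blocks preserve shape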

Exercise 6.6: Feed-Forward Network

Implement the FFN sublayer:

class FeedForward:
    def __init__(self, d_model, d_ff):
        self.W1 = np.random.randn(d_model, d_ff) * 0.02
        self.W2 = np.random.randn(d_ff, d_model) * 0.02
        self.b1 = np.zeros(d_ff)
        self.b2 = np.zeros(d_model)

    def forward(self, x):
        """FFN(x) = W2 * GELU(W1 * x + b1) + b2"""
        # TODO
        pass
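
The docstring assumes a GELU activation that the stub does not define. A sketch you could drop in, using the common tanh approximation (the variant GPT-2 uses):

import numpy as np

def gelu(x):
    # Tanh approximation of GELU:
    # 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))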

Exercise 6.7: Layer Normalization

Implement LayerNorm:

class LayerNorm:
    def __init__(self, dim, eps=1e-5):
        self.gamma = np.ones(dim)
        self.beta = np.zeros(dim)
        self.eps = eps

    def forward(self, x):
        """
        mean = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        x_norm = (x - mean) / sqrt(var + eps)
        return gamma * x_norm + beta
        """
        # TODO
        pass
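
A useful self-check once forward is implemented: while gamma and beta are still at their initial values (ones and zeros), every row of the output should have roughly zero mean and unit standard deviation:

import numpy as np

ln = LayerNorm(8)
x = np.random.randn(4, 8) * 5 + 3    # arbitrary scale and shift
y = ln.forward(x)
assert np.allclose(y.mean(axis=-1), 0.0, atol=1e-6)
assert np.allclose(y.std(axis=-1), 1.0, atol=1e-3)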

Exercise 6.8: Full Transformer Stack

Build a complete transformer:

class Transformer:
    def __init__(self, vocab_size, d_model, n_heads, n_layers, d_ff, max_len):
        self.embed = Embedding(vocab_size, d_model)
        self.pos_embed = PositionalEncoding(max_len, d_model)
        self.layers = [
            TransformerBlock(d_model, n_heads, d_ff)
            for _ in range(n_layers)
        ]
        self.output = Linear(d_model, vocab_size)

    def forward(self, tokens):
        """tokens [batch, seq] -> logits [batch, seq, vocab]"""
        # TODO (don't forget the causal mask when calling each block)
        pass
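
Once the pieces exist, a minimal end-to-end shape check (assuming integer token IDs as input):

import numpy as np

model = Transformer(vocab_size=100, d_model=64, n_heads=4,
                    n_layers=2, d_ff=256, max_len=32)
tokens = np.random.randint(0, 100, size=(2, 16))  # [batch, seq]
logits = model.forward(tokens)
assert logits.shape == (2, 16, 100)  # [batch, seq, vocab]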

Challenge Exercises

Exercise 6.9: Parameter Counting

For a transformer with:

  • vocab_size = 50,000
  • d_model = 768
  • n_heads = 12
  • n_layers = 12
  • d_ff = 3072

a) How many parameters in the embedding layer?
b) How many in each attention sublayer?
c) How many in each FFN sublayer?
d) Total parameters?

Compare your answer to GPT-2 small (117M parameters).
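
As a starting point, part (a) is a single multiplication (this assumes the token embedding and output projection are not weight-tied; GPT-2 actually ties them, which is one source of discrepancy when you compare totals):

vocab_size, d_model = 50_000, 768
embed_params = vocab_size * d_model
print(f"{embed_params:,}")  # 38,400,000 -- already about a third of GPT-2 small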

Exercise 6.10: RoPE Implementation

Implement Rotary Position Embeddings (RoPE):

def apply_rope(x: np.ndarray, positions: np.ndarray) -> np.ndarray:
    """
    Apply rotary position embeddings.

    Instead of adding a position vector, rotate pairs of embedding
    dimensions by a position-dependent angle.
    """
    # TODO: Research and implement
    pass
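
As a hint (not a full solution): RoPE treats consecutive pairs of embedding dimensions as 2-D points and rotates each pair by an angle proportional to the token's position. Rotating a single pair looks like this sketch:

import numpy as np

def rotate_pair(x0, x1, pos, freq):
    # Rotate the 2-D point (x0, x1) by angle pos * freq.
    theta = pos * freq
    cos, sin = np.cos(theta), np.sin(theta)
    return x0 * cos - x1 * sin, x0 * sin + x1 * cos

# In full RoPE, pair i uses freq = base ** (-2i / d) with base = 10000,
# so different dimension pairs rotate at different speeds.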

Exercise 6.11: SwiGLU Activation

Implement the SwiGLU activation (used in LLaMA, PaLM):

class SwiGLU_FFN:
    def __init__(self, d_model, d_ff):
        # Note: SwiGLU has 3 weight matrices, not 2
        self.W_gate = ...  # shape (d_model, d_ff)
        self.W_up = ...    # shape (d_model, d_ff)
        self.W_down = ...  # shape (d_ff, d_model)

    def forward(self, x):
        """
        gate = silu(x @ W_gate)
        up = x @ W_up
        return (gate * up) @ W_down
        """
        # TODO
        pass
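
The docstring assumes a silu activation (also called Swish); a one-line sketch:

import numpy as np

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

Because SwiGLU adds a third matrix, LLaMA shrinks d_ff (to roughly 2/3 of the usual 4 × d_model) to keep the parameter count comparable to a standard FFN.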

Checking Your Work

  • Test suite: See code/stage-06/tests/test_transformer.py for expected behavior
  • Reference implementation: Compare with code/stage-06/transformer.py
  • Self-check: Verify output shapes match expectations and gradients flow correctly

Mini-Project: Tiny Transformer

Build a complete, working transformer language model from scratch.

Requirements

  1. Architecture: At least 2 layers, 2 attention heads, and d_model = 64
  2. Training: Train on a text corpus until per-token cross-entropy loss drops below 2.0
  3. Generation: Generate coherent text samples from the trained model

Deliverables

  • [ ] Complete transformer implementation
  • [ ] Training script with logging
  • [ ] Loss curve showing convergence
  • [ ] Generated text samples
  • [ ] Parameter count breakdown

Extension

Compare pre-norm vs. post-norm. Which trains more stably?