Stage 5 Exercises
Conceptual Questions
Exercise 5.1: Attention Intuition
Consider the sentence: "The cat sat on the mat because it was tired."
a) When predicting "tired", which words should receive high attention?
b) What does "it" most likely refer to? How would attention help resolve this?
c) Why is this hard for a fixed-window model?
Exercise 5.2: Scaling Factor
In scaled dot-product attention, we divide the scores by √d_k.
a) If d_k = 64, what is √d_k?
b) Without scaling, what happens to the typical magnitude of the dot products as d_k increases?
c) How does this affect the softmax distribution?
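Part (b) can be checked empirically: for query and key entries with unit variance, the dot product has variance d_k, so its typical magnitude grows like √d_k. A quick NumPy experiment (sizes and names here are just for illustration):

import numpy as np

rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.standard_normal((10_000, d_k))
    k = rng.standard_normal((10_000, d_k))
    scores = (q * k).sum(axis=1)                         # 10,000 sample dot products
    print(d_k, scores.std(), (scores / np.sqrt(d_k)).std())
    # std grows like sqrt(d_k); after scaling it stays near 1

Large-magnitude scores push the softmax toward a one-hot distribution, which is the effect part (c) is asking about.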
Exercise 5.3: Multi-Head Purpose
Why use 8 heads of dimension 64 instead of 1 head of dimension 512?
a) What can multiple heads learn that a single head cannot?
b) Is the parameter count the same?
c) Give an example of different "types" of attention that different heads might learn.
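A quick arithmetic check for part (b), assuming the standard setup where each head projects down to d_k = d_model / n_heads:

d_model, n_heads = 512, 8
d_k = d_model // n_heads                 # 64
per_head = d_model * d_k                 # one head's query projection: 512 * 64
print(n_heads * per_head)                # 262144
print(d_model * d_model)                 # 262144: same as one 512-dim head

So the two configurations cost the same parameters per projection matrix; the difference lies in what the heads can learn, not in size.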
Exercise 5.4: Causal Masking
For the sequence ["The", "cat", "sat"]:
a) Draw the attention matrix (which positions can attend to which?)
b) What value goes in the masked positions before the softmax?
c) Why is causal masking necessary for language modeling?
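As a warm-up for parts (a) and (b), leaving the three-token case to you, here is the masked score matrix for a two-token prefix ["The", "cat"], with -inf marking blocked entries (after the softmax, exp(-inf) = 0, so blocked weights are exactly zero):

import numpy as np

scores = np.array([[0.5, 0.9],
                   [0.3, 0.7]])                        # toy pre-softmax scores
mask = np.triu(np.full((2, 2), -np.inf), k=1)          # block strictly-future keys
print(scores + mask)
# [[ 0.5 -inf]    "The" may attend only to itself
#  [ 0.3  0.7]]   "cat" may attend to "The" and itself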
Implementation Exercises
Exercise 5.5: Dot-Product Attention
Implement basic scaled dot-product attention:
import numpy as np

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """
    Args:
        Q: queries [batch, seq_q, d_k]
        K: keys    [batch, seq_k, d_k]
        V: values  [batch, seq_k, d_v]
    Returns:
        output [batch, seq_q, d_v]
    """
    # scores = Q @ K^T / sqrt(d_k)
    # weights = softmax(scores)
    # output = weights @ V
    # TODO
    pass
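If you get stuck, here is one minimal solution sketch to compare against (the course reference is code/stage-05/attention.py); the softmax runs over the key axis, so each query's weights sum to 1:

import numpy as np

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)     # [batch, seq_q, seq_k]
    scores = scores - scores.max(axis=-1, keepdims=True) # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # [batch, seq_q, d_v]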
Exercise 5.6: Causal Mask
Implement causal masking:
def create_causal_mask(seq_len: int) -> np.ndarray:
    """
    Create a mask where position i can only attend to positions <= i.
    Returns:
        mask [seq_len, seq_len] with 0 for allowed, -inf for blocked
    """
    # TODO
    pass

def masked_attention(Q, K, V, mask):
    """Apply attention with a mask."""
    # TODO: add mask to scores before softmax
    pass
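One possible solution sketch, reusing the softmax pattern from Exercise 5.5; note the mask is added to the scores, not multiplied, so blocked entries become -inf and vanish after the softmax:

import numpy as np

def create_causal_mask(seq_len):
    # -inf strictly above the diagonal, 0 on and below it
    return np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

def masked_attention(Q, K, V, mask):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k) + mask
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V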
Exercise 5.7: Multi-Head Attention
Implement full multi-head attention:
class MultiHeadAttention:
    def __init__(self, d_model: int, n_heads: int):
        self.n_heads = n_heads
        self.d_k = d_model // n_heads
        self.W_q = np.random.randn(d_model, d_model) * 0.02
        self.W_k = np.random.randn(d_model, d_model) * 0.02
        self.W_v = np.random.randn(d_model, d_model) * 0.02
        self.W_o = np.random.randn(d_model, d_model) * 0.02

    def forward(self, x):
        """
        1. Project to Q, K, V
        2. Split into heads
        3. Apply attention per head
        4. Concatenate and project
        """
        # TODO
        pass
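A sketch of one way to fill in forward, assuming x has shape [batch, seq, d_model]; it is written so you can paste the body directly into the method:

import numpy as np

def forward(self, x):
    batch, seq, d_model = x.shape
    def split(t):   # [batch, seq, d_model] -> [batch, n_heads, seq, d_k]
        return t.reshape(batch, seq, self.n_heads, self.d_k).transpose(0, 2, 1, 3)
    Q, K, V = split(x @ self.W_q), split(x @ self.W_k), split(x @ self.W_v)
    scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(self.d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    out = weights @ V                                    # [batch, n_heads, seq, d_k]
    out = out.transpose(0, 2, 1, 3).reshape(batch, seq, d_model)  # merge heads
    return out @ self.W_o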
Exercise 5.8: Positional Encoding
Implement sinusoidal positional encoding:
def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    Returns:
        encoding [seq_len, d_model]
    """
    # TODO
    pass
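One compact solution sketch; it fills even columns with sines and odd columns with cosines, and assumes d_model is even:

import numpy as np

def positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, None]                   # [seq_len, 1]
    dim = np.arange(0, d_model, 2)[None, :]             # even dimension indices 2i
    angle = pos / np.power(10000.0, dim / d_model)      # pos / 10000^(2i/d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle)
    enc[:, 1::2] = np.cos(angle)
    return enc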
Challenge Exercises
Exercise 5.9: Attention Visualization
Train a small attention model and visualize the attention patterns:
a) Create a heatmap of the attention weights (a plotting sketch follows below)
b) Do different heads focus on different positions?
c) Does one head learn to attend to the previous word?
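For part (a), a minimal plotting sketch, assuming you already have a weights array of shape [seq, seq] for one head and the matching tokens (matplotlib is the only extra dependency; the function name is just a suggestion):

import matplotlib.pyplot as plt

def plot_attention(weights, tokens):
    fig, ax = plt.subplots()
    im = ax.imshow(weights, cmap="viridis")             # rows = queries, cols = keys
    ax.set_xticks(range(len(tokens)))
    ax.set_yticks(range(len(tokens)))
    ax.set_xticklabels(tokens, rotation=45)
    ax.set_yticklabels(tokens)
    ax.set_xlabel("attended to (key)")
    ax.set_ylabel("attending from (query)")
    fig.colorbar(im, ax=ax)
    plt.show()

For (b) and (c), call it once per head and compare the maps side by side.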
Exercise 5.10: Attention Backward Pass
Implement the backward pass through attention:
def attention_backward(Q, K, V, grad_output):
    """
    Given the gradient of the loss w.r.t. the output, compute gradients
    for Q, K, V. Shapes here are unbatched: Q, K are [seq, d_k], V is [seq, d_v].
    Returns:
        (grad_Q, grad_K, grad_V)
    """
    # Forward values (in practice these come from a cache)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    attn = softmax(scores)            # row-wise softmax, as in Exercise 5.5
    # output = attn @ V
    # Backward:
    # TODO: compute gradients
    pass
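One possible solution sketch. The only subtle step is backpropagating through the row-wise softmax: grad_scores = attn * (grad_attn - sum(grad_attn * attn)) per row. It is worth verifying against finite differences on small random matrices:

import numpy as np

def attention_backward(Q, K, V, grad_output):
    d_k = Q.shape[-1]
    # recompute the forward pass (in practice, read these from the cache)
    scores = Q @ K.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn = attn / attn.sum(axis=-1, keepdims=True)

    grad_V = attn.T @ grad_output                       # output = attn @ V
    grad_attn = grad_output @ V.T
    dot = (grad_attn * attn).sum(axis=-1, keepdims=True)
    grad_scores = attn * (grad_attn - dot)              # softmax backward
    grad_Q = grad_scores @ K / np.sqrt(d_k)             # scores = Q K^T / sqrt(d_k)
    grad_K = grad_scores.T @ Q / np.sqrt(d_k)
    return grad_Q, grad_K, grad_V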
Exercise 5.11: Relative Positional Encoding
Research and implement relative positional encoding (as used in Transformer-XL):
def relative_attention(Q, K, V, pos_embed):
    """
    Attention with relative position representations.
    Instead of absolute position, encode relative distance between tokens.
    """
    # TODO: Advanced exercise
    pass
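Transformer-XL's full scheme splits each score into content and position terms with extra learned biases; as a stepping stone, here is a simplified sketch that only adds a learned bias per relative distance (closer in spirit to T5-style relative bias than to Transformer-XL itself, and unbatched for clarity). pos_embed is interpreted here as a [2*seq - 1] vector of scalars, one per distance:

import numpy as np

def relative_attention(Q, K, V, pos_embed):
    seq, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    rel = np.arange(seq)[None, :] - np.arange(seq)[:, None]   # key_pos - query_pos
    scores = scores + pos_embed[rel + seq - 1]                # bias per distance
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V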
Checking Your Work
- Test suite: See code/stage-05/tests/test_attention.py for expected behavior
- Reference implementation: Compare with code/stage-05/attention.py
- Self-check: Verify that attention weights sum to 1 and that the causal mask blocks future positions (a runnable check follows this list)
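A runnable version of the self-check (self-contained; the causal mask is built inline with np.triu):

import numpy as np

rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
mask = np.triu(np.full((4, 4), -np.inf), k=1)          # causal mask
scores = Q @ K.T / np.sqrt(8) + mask
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)
assert np.allclose(weights.sum(axis=-1), 1.0)          # each row sums to 1
assert np.allclose(np.triu(weights, k=1), 0.0)         # no attention to the future
print("self-checks passed")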
Mini-Project: Attention Visualizer
Build a tool that visualizes attention patterns in trained models.
Requirements
- Model: Train a small attention-based language model
- Extraction: Extract attention weights during inference (one pattern is sketched after this list)
- Visualization: Create heatmaps of attention patterns
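For the extraction step, one simple pattern is to have each attention layer hand its weight matrix to a small recorder object during the forward pass; the class below is illustrative, not part of the course code:

import numpy as np

class AttentionRecorder:
    """Collects attention maps during a forward pass for later plotting."""
    def __init__(self):
        self.maps = []                  # one [n_heads, seq, seq] array per layer

    def record(self, weights):
        self.maps.append(np.asarray(weights).copy())

# inside each layer's forward: recorder.record(weights)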
Deliverables
- [ ] Trained attention model (even if small)
- [ ] Attention weight extraction function
- [ ] Heatmap visualization for sample sentences
- [ ] Analysis: What patterns do you observe?
Extension
Compare attention patterns across different heads. Do different heads specialize?