Stage 3 Exercises¶
Conceptual Questions¶
Exercise 3.1: Embedding Geometry¶
Word embeddings place words in a continuous space.
a) If "king" - "man" + "woman" ≈ "queen", what does this tell us about the embedding space? b) Why can't we do this with one-hot encodings? c) What would "Paris" - "France" + "Germany" approximate?
Exercise 3.2: Hidden Layer Purpose¶
A neural language model has architecture: Embedding → Hidden → Softmax
a) What does the hidden layer learn to do?
b) What happens if we remove it (Embedding → Softmax directly)?
c) What happens if we add 10 more hidden layers?
Exercise 3.3: Cross-Entropy Intuition¶
For a vocabulary of 10,000 words:
a) What is the cross-entropy loss if the model assigns probability 0.5 to the correct word?
b) What if it assigns 0.0001?
c) What is the minimum possible loss (perfect prediction)?
Exercise 3.4: Softmax Temperature¶
The softmax function is: softmax(z)_i = exp(z_i) / Σ_j exp(z_j)
a) If all logits are equal, what is the output distribution?
b) If z = [10, 0, 0], approximately what are the probabilities?
c) What happens to the distribution as the largest logit grows to infinity?
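You can check your answers numerically with a minimal sketch like the one below; the optional temperature argument is the usual scaled form softmax(z / T) hinted at by this exercise's title, not something the questions require.

```python
import numpy as np

def softmax(z: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Temperature-scaled softmax: softmax(z / T). T = 1 gives the plain form above."""
    e = np.exp(z / temperature)
    return e / e.sum()

print(softmax(np.array([1.0, 1.0, 1.0])))        # (a) equal logits
print(softmax(np.array([10.0, 0.0, 0.0])))       # (b) one dominant logit
print(softmax(np.array([10.0, 0.0, 0.0]), 0.1))  # lower T sharpens toward one-hot
```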
Implementation Exercises¶
Exercise 3.5: Embedding from Scratch¶
Implement an embedding layer:
```python
import numpy as np

class Embedding:
    def __init__(self, vocab_size: int, dim: int):
        self.W = np.random.randn(vocab_size, dim) * 0.01
        self.W_grad = np.zeros_like(self.W)  # gradient buffer filled by backward()

    def forward(self, indices: np.ndarray) -> np.ndarray:
        """
        Args:
            indices: token indices [batch_size, seq_len]
        Returns:
            embeddings [batch_size, seq_len, dim]
        """
        # TODO: Implement (hint: fancy indexing)
        pass

    def backward(self, grad: np.ndarray, indices: np.ndarray) -> None:
        """Accumulate gradients into self.W_grad"""
        # TODO: Implement (hint: np.add.at)
        pass
```
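Once `forward` and `backward` are implemented, a quick shape check along these lines (the sizes are arbitrary) helps catch indexing mistakes:

```python
import numpy as np

emb = Embedding(vocab_size=100, dim=16)
indices = np.array([[1, 5, 7], [2, 2, 9]])   # [batch_size=2, seq_len=3]
out = emb.forward(indices)
assert out.shape == (2, 3, 16)               # [batch_size, seq_len, dim]

# Backward with an all-ones upstream gradient: rows 1, 5, 7, 9 should each
# receive a gradient of ones, and row 2 (looked up twice) should accumulate to twos.
emb.backward(np.ones_like(out), indices)
assert np.allclose(emb.W_grad[2], 2.0)
```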
Exercise 3.6: Two-Layer Network¶
Implement a complete 2-layer neural language model:
```python
class NeuralLM:
    def __init__(self, vocab_size, embed_dim, hidden_dim, context_size):
        # TODO: Initialize weights
        pass

    def forward(self, context_indices):
        """
        context_indices: [batch, context_size]
        Returns: logits [batch, vocab_size]
        """
        # TODO: embedding → flatten → hidden → output
        pass

    def loss(self, logits, targets):
        """Cross-entropy loss"""
        # TODO
        pass
```
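Before training, a forward-pass check with arbitrary sizes (and assuming `targets` are integer token indices) will catch most wiring mistakes:

```python
import numpy as np

model = NeuralLM(vocab_size=1000, embed_dim=32, hidden_dim=64, context_size=2)
context = np.random.randint(0, 1000, size=(8, 2))   # [batch=8, context_size=2]
targets = np.random.randint(0, 1000, size=(8,))     # next-token indices

logits = model.forward(context)
assert logits.shape == (8, 1000)                    # [batch, vocab_size]

# An untrained, small-init model is close to uniform, so the loss should be
# roughly ln(1000) ≈ 6.9.
print(model.loss(logits, targets))
```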
Exercise 3.7: Softmax Stability¶
Implement numerically stable softmax and log-softmax:
```python
import numpy as np

def stable_softmax(z: np.ndarray) -> np.ndarray:
    """Softmax that doesn't overflow."""
    # TODO: subtract max before exp
    pass

def log_softmax(z: np.ndarray) -> np.ndarray:
    """Log of softmax, computed stably."""
    # TODO: avoid computing softmax then taking log
    pass
```
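To see what goes wrong without stabilization, here is a small demonstration of the failure mode (this is the naive version, not a solution):

```python
import numpy as np

def naive_softmax(z: np.ndarray) -> np.ndarray:
    e = np.exp(z)          # overflows to inf once z is in the hundreds
    return e / e.sum()

z = np.array([1000.0, 999.0, 998.0])
print(naive_softmax(z))    # prints [nan nan nan] plus an overflow warning
# A correct stable_softmax(z) returns finite probabilities that sum to 1,
# and log_softmax(z) stays finite even where softmax underflows to 0.
```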
Exercise 3.8: Perplexity Tracking¶
Add perplexity tracking to training:
```python
def train_epoch(model, data, optimizer):
    total_loss = 0.0
    n_tokens = 0
    for batch in data:
        loss = model.train_step(batch)
        total_loss += loss * batch.n_tokens
        n_tokens += batch.n_tokens
    avg_loss = total_loss / n_tokens
    perplexity = ...  # TODO: compute from avg_loss
    return perplexity
```
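A useful self-check for your formula: a model that spreads probability uniformly over a vocabulary of 10,000 words has an average cross-entropy of ln(10,000) ≈ 9.21, and its perplexity should come out to exactly 10,000. Informally, perplexity is the effective number of equally likely next tokens the model is choosing between.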
Challenge Exercises¶
Exercise 3.9: Embedding Visualization¶
Train a small language model on a corpus, then:
a) Extract the learned embeddings
b) Apply t-SNE or PCA to reduce to 2D
c) Plot and look for clusters: do similar words cluster together? (A starting sketch follows below.)
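A minimal plotting sketch, assuming scikit-learn and matplotlib are available and that your trained model exposes its embedding matrix as `model.embedding.W` (shape `[vocab_size, dim]`) alongside a list `vocab` of the corresponding words; both names are placeholders for whatever your code actually uses:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA   # or sklearn.manifold.TSNE for t-SNE

coords = PCA(n_components=2).fit_transform(model.embedding.W)  # [vocab_size, 2]

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for i, word in enumerate(vocab[:200]):   # label a subset to keep the plot readable
    plt.annotate(word, coords[i])
plt.title("Learned word embeddings (PCA to 2D)")
plt.show()
```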
Exercise 3.10: Context Window Analysis¶
Train models with different context sizes (1, 2, 4, 8) on the same data.
a) Plot perplexity vs. context size
b) At what point do diminishing returns set in?
c) Why doesn't context size 1000 help much more than 8?
Exercise 3.11: Compare to N-gram¶
On the same dataset:
a) Train a trigram model with smoothing
b) Train a neural LM with context size 2
c) Compare perplexity on the test set
d) Compare generation quality
Checking Your Work¶
- Test suite: See `code/stage-03/tests/test_neural_lm.py` for expected behavior
- Reference implementation: Compare with `code/stage-03/neural_lm.py`
- Self-check: Verify that loss decreases during training and that perplexity is reasonable
Mini-Project: Neural Bigram Model¶
Build a neural language model that outperforms your Markov chain from Stage 1.
Requirements¶
- Architecture: Embedding → Linear → Softmax
- Training: Train on the same Shakespeare data from Stage 1 (a data-preparation sketch follows after this list)
- Comparison: Compare perplexity with your Stage 1 model
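For the training requirement, a bigram training set is just (current token, next token) pairs. A minimal sketch, assuming `token_ids` holds the Stage 1 Shakespeare corpus encoded as a sequence of token ids (the name is a placeholder):

```python
import numpy as np

token_ids = np.asarray(token_ids)           # assumed: Shakespeare corpus as token ids
contexts = token_ids[:-1].reshape(-1, 1)    # [n - 1, context_size=1]: the previous token
targets = token_ids[1:]                     # [n - 1]: the token that actually followed
```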
Deliverables¶
- [ ] Working neural bigram model
- [ ] Training loop with loss plotting
- [ ] Perplexity comparison table:

    | Model | Perplexity |
    |---------------|------------|
    | Bigram Markov | ? |
    | Neural Bigram | ? |

- [ ] Generated text samples
Extension¶
Extend to trigrams (a 2-token context). How much does adding context help?