
Stage 1 Exercises

Conceptual Questions

Exercise 1.1: Order Trade-offs

Consider a language model for autocomplete suggestions.

Question: You're building autocomplete for a search engine. Users type 2-5 characters and expect instant suggestions. Would you use a unigram, bigram, or trigram model? Justify your choice considering:

  • Response time requirements
  • Quality of suggestions
  • Memory constraints on mobile devices

Exercise 1.2: Probability Chains

Given this tiny corpus: "the cat sat on the mat"

a) Calculate P("the" | "on") under a bigram model
b) Calculate P("cat" | "the") under a bigram model
c) Why is P("dog" | "the") = 0 a problem?
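If you want to check your hand calculations, a counting sketch along these lines (the names are illustrative, not part of the exercise solution) tabulates the bigram statistics of the tiny corpus:

from collections import Counter, defaultdict

corpus = "the cat sat on the mat".split()

# Count how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

# P(word | context) = count(context, word) / count(context, *)
def bigram_prob(word, context):
    total = sum(bigram_counts[context].values())
    return bigram_counts[context][word] / total if total else 0.0

print(bigram_prob("the", "on"))   # part (a)
print(bigram_prob("cat", "the"))  # part (b)
print(bigram_prob("dog", "the"))  # part (c): unseen continuation -> 0.0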

Exercise 1.3: Information Content

A language model assigns these probabilities:

  • P("Paris" | "The capital of France is") = 0.85
  • P("London" | "The capital of France is") = 0.05
  • P("Berlin" | "The capital of France is") = 0.03

a) Which word carries the most "surprise" (information)?
b) Calculate the surprisal (in bits) for each word
c) If the model assigns P("Paris") = 0.001 without context, how much does context help?
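Surprisal is −log₂ P, so once you have your answers you can check them with a short sketch like this (the probabilities are the ones listed above):

import math

probs = {"Paris": 0.85, "London": 0.05, "Berlin": 0.03}

# Surprisal (self-information) in bits: -log2 P(word | context)
for word, p in probs.items():
    print(f"{word}: {-math.log2(p):.2f} bits")

# Part (c): bits gained from context
# = surprisal without context - surprisal with context
print(-math.log2(0.001) - (-math.log2(0.85)))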

Exercise 1.4: Temperature Intuition

Without running code, predict what happens:

a) T = 0.1 applied to the uniform distribution [0.25, 0.25, 0.25, 0.25]
b) T = 10.0 applied to the peaked distribution [0.9, 0.05, 0.03, 0.02]
c) At what temperature does any distribution become uniform?
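After committing to your predictions, you can verify them with a minimal temperature-scaling sketch. This version works directly on probability vectors rather than logits, which is equivalent for strictly positive probabilities:

def apply_temperature(probs, T):
    """Rescale a probability vector by temperature T and renormalize."""
    # Equivalent to softmax(log(p) / T) for strictly positive p.
    scaled = [p ** (1.0 / T) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

print(apply_temperature([0.25, 0.25, 0.25, 0.25], T=0.1))  # part (a)
print(apply_temperature([0.9, 0.05, 0.03, 0.02], T=10.0))  # part (b)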


Implementation Exercises

Exercise 1.5: Trigram Implementation

Extend the bigram model to a trigram model.

from typing import Tuple


class TrigramModel:
    def __init__(self):
        self.counts = {}          # (w1, w2) -> {w3: count}
        self.context_totals = {}  # (w1, w2) -> total count of that context

    def train(self, text: str) -> None:
        # TODO: Implement trigram counting
        pass

    def probability(self, word: str, context: Tuple[str, str]) -> float:
        # TODO: Return P(word | context)
        pass

Test: Train on "the cat sat on the mat the cat ran" and verify that P("cat" | ("mat", "the")) is higher than P("sat" | ("mat", "the")).
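Expressed as code, and assuming your TrigramModel follows the skeleton above, the check might look like:

model = TrigramModel()
model.train("the cat sat on the mat the cat ran")

# The context ("mat", "the") is followed only by "cat" in the training text.
assert model.probability("cat", ("mat", "the")) > model.probability("sat", ("mat", "the"))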

Exercise 1.6: Perplexity Calculator

Implement a function to compute perplexity on held-out data:

import math  # you will likely need math.log and math.exp


def compute_perplexity(model, test_text: str) -> float:
    """
    Compute perplexity of model on test_text.

    Perplexity = exp(-1/N * sum(log P(w_i | context)))
    where N is the number of tokens scored.
    """
    # TODO: Implement
    pass

Test: A model that assigns uniform probability 1/V should have perplexity = V
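A sanity check along these lines catches most off-by-one and log-base mistakes; UniformModel is a hypothetical stand-in that you may need to adapt to whatever interface your compute_perplexity expects:

class UniformModel:
    """Toy model that assigns probability 1/V to every word, regardless of context."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def probability(self, word, context):
        return 1.0 / self.vocab_size

ppl = compute_perplexity(UniformModel(vocab_size=100), "the cat sat on the mat")
print(ppl)  # should be (approximately) 100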

Exercise 1.7: Add-k Smoothing

Implement add-k smoothing with tunable k:

def smoothed_probability(
    word: str,
    context: str,
    counts: dict,
    vocab_size: int,
    k: float = 1.0
) -> float:
    """
    P_smooth(word | context) = (count + k) / (total + k * V)
    """
    # TODO: Implement
    pass

Experiment: Plot perplexity vs k for k in [0.001, 0.01, 0.1, 1.0, 10.0]
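A sketch of the experiment might look like the loop below; build_smoothed_model and compute_perplexity are placeholders for your own code from Exercises 1.6 and 1.7, and train_text / val_text are assumed to exist:

import matplotlib.pyplot as plt

ks = [0.001, 0.01, 0.1, 1.0, 10.0]
perplexities = []
for k in ks:
    model = build_smoothed_model(train_text, k=k)  # placeholder: your smoothed model
    perplexities.append(compute_perplexity(model, val_text))

plt.plot(ks, perplexities, marker="o")
plt.xscale("log")  # k spans four orders of magnitude
plt.xlabel("k (add-k smoothing)")
plt.ylabel("perplexity")
plt.show()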

Exercise 1.8: Interpolation

Implement linear interpolation between unigram and bigram:

def interpolated_probability(
    word: str,
    context: str,
    unigram_model,
    bigram_model,
    lambda_1: float = 0.5
) -> float:
    """
    P(w|c) = λ₁ * P_bi(w|c) + (1-λ₁) * P_uni(w)
    """
    # TODO: Implement
    pass

Challenge Exercises

Exercise 1.9: Optimal λ via Cross-Validation

Implement a function that finds the optimal interpolation weight:

from typing import List


def find_optimal_lambda(
    train_text: str,
    val_text: str,
    lambdas: List[float]
) -> float:
    """
    Find λ that minimizes perplexity on the validation set.
    """
    # TODO: Train models, evaluate each λ, return best
    pass

Exercise 1.10: Corpus Analysis

Download a real corpus (e.g., first chapter of a book from Project Gutenberg).

a) Train unigram, bigram, and trigram models
b) Compare perplexity on held-out text
c) Generate 100 characters from each model with T = 1.0
d) Which produces more coherent text? Why?
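One way to fetch a corpus with only the standard library is sketched below; the URL is just an example, so substitute any plain-text Project Gutenberg file:

import urllib.request

# Example Project Gutenberg plain-text URL (substitute the book you want).
url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
with urllib.request.urlopen(url) as response:
    text = response.read().decode("utf-8", errors="ignore")

# Keep roughly a chapter's worth of text for a small experiment.
corpus = text[:20000]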

Exercise 1.11: Context Length Explosion

For a vocabulary of size V=10,000:

a) How many possible bigram contexts are there?
b) How many possible 5-gram contexts?
c) If you have 1 million tokens of training data, what fraction of 5-gram contexts will you observe?
d) This is called the "curse of dimensionality." How does it motivate neural approaches?
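The arithmetic is easy to miss by an order of magnitude, so a quick sketch like this (which assumes an n-gram model conditions on n − 1 previous words) can confirm your answers after you work them out by hand:

V = 10_000          # vocabulary size
tokens = 1_000_000  # training tokens

bigram_contexts = V ** 1    # a bigram conditions on 1 previous word
fivegram_contexts = V ** 4  # a 5-gram conditions on 4 previous words

# At most one new context per training position, so this is an upper bound.
max_observed_fraction = tokens / fivegram_contexts
print(bigram_contexts, fivegram_contexts, max_observed_fraction)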


Checking Your Work

  • Test suite: See code/stage-01/tests/test_markov.py for expected behavior
  • Reference implementation: Compare with code/stage-01/markov.py
  • Self-check: Verify perplexity calculation matches examples in documentation

Mini-Project: Shakespeare Generator

Build a complete Markov chain text generator trained on Shakespeare.

Requirements

  1. Data: Download a Shakespeare text file (e.g., from Project Gutenberg)
  2. Model: Implement a trigram Markov model with Laplace smoothing
  3. Generation: Generate 100-word passages at different temperatures
  4. Evaluation: Compute perplexity on a held-out test set

Deliverables

  • [ ] Working trigram model with smoothing
  • [ ] Generation function with temperature control
  • [ ] Perplexity evaluation on test data
  • [ ] Comparison: bigram vs. trigram perplexity
  • [ ] Generated samples at T=0.5, T=1.0, T=2.0

Extension

Try 4-grams and 5-grams. Plot perplexity vs. n-gram order. At what point do you see diminishing returns or overfitting?
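A sweep over n-gram order could look like the sketch below; NGramModel(n) is a hypothetical generalization of your trigram model, and compute_perplexity is your function from Exercise 1.6:

import matplotlib.pyplot as plt

orders = [2, 3, 4, 5]
perplexities = []
for n in orders:
    model = NGramModel(n)  # hypothetical: your generalized n-gram model
    model.train(train_text)
    perplexities.append(compute_perplexity(model, test_text))

plt.plot(orders, perplexities, marker="o")
plt.xlabel("n-gram order")
plt.ylabel("test perplexity")
plt.show()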