
Stage 1 Exercises

Conceptual Questions

Exercise 1.1: Order Trade-offs

Consider a language model for autocomplete suggestions.

Question: You're building autocomplete for a search engine. Users type 2-5 characters and expect instant suggestions. Would you use a unigram, bigram, or trigram model? Justify your choice considering:

  • Response time requirements
  • Quality of suggestions
  • Memory constraints on mobile devices

Exercise 1.2: Probability Chains

Given this tiny corpus: "the cat sat on the mat"

a) Calculate P("the" | "on") under a bigram model
b) Calculate P("cat" | "the") under a bigram model
c) Why is P("dog" | "the") = 0 a problem?
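If you want to check your hand calculations, a counting sketch along these lines (the names are illustrative, not part of the exercise solution) tabulates the bigram statistics of the tiny corpus:

from collections import Counter, defaultdict

corpus = "the cat sat on the mat".split()

# Count how often each word follows each context word.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

# P(word | context) = count(context, word) / count(context, *)
def bigram_prob(word, context):
    total = sum(bigram_counts[context].values())
    return bigram_counts[context][word] / total if total else 0.0

print(bigram_prob("the", "on"))   # part (a)
print(bigram_prob("cat", "the"))  # part (b)
print(bigram_prob("dog", "the"))  # part (c): unseen continuation -> 0.0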

Exercise 1.3: Information Content

A language model assigns these probabilities:

  • P("Paris" | "The capital of France is") = 0.85
  • P("London" | "The capital of France is") = 0.05
  • P("Berlin" | "The capital of France is") = 0.03

a) Which word carries the most "surprise" (information)?
b) Calculate the surprisal (in bits) for each word
c) If the model assigns P("Paris") = 0.001 without context, how much does context help?
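Surprisal is −log₂ P, so once you have your answers you can check them with a short sketch like this (the probabilities are the ones listed above):

import math

probs = {"Paris": 0.85, "London": 0.05, "Berlin": 0.03}

# Surprisal (self-information) in bits: -log2 P(word | context)
for word, p in probs.items():
    print(f"{word}: {-math.log2(p):.2f} bits")

# Part (c): bits gained from context
# = surprisal without context - surprisal with context
print(-math.log2(0.001) - (-math.log2(0.85)))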

Exercise 1.4: Temperature Intuition

Without running code, predict what happens:

a) T = 0.1 applied to the uniform distribution [0.25, 0.25, 0.25, 0.25]
b) T = 10.0 applied to the peaked distribution [0.9, 0.05, 0.03, 0.02]
c) At what temperature does any distribution become uniform?
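After committing to your predictions, you can verify them with a minimal temperature-scaling sketch. This version works directly on probability vectors rather than logits, which is equivalent for strictly positive probabilities:

def apply_temperature(probs, T):
    """Rescale a probability vector by temperature T and renormalize."""
    # Equivalent to softmax(log(p) / T) for strictly positive p.
    scaled = [p ** (1.0 / T) for p in probs]
    total = sum(scaled)
    return [s / total for s in scaled]

print(apply_temperature([0.25, 0.25, 0.25, 0.25], T=0.1))  # part (a)
print(apply_temperature([0.9, 0.05, 0.03, 0.02], T=10.0))  # part (b)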


Implementation Exercises

Exercise 1.5: Trigram Implementation

Extend the bigram model to a trigram model.

from typing import Tuple


class TrigramModel:
    def __init__(self):
        self.counts = {}          # (w1, w2) -> {w3: count}
        self.context_totals = {}  # (w1, w2) -> total count of that context

    def train(self, text: str) -> None:
        # TODO: Implement trigram counting
        pass

    def probability(self, word: str, context: Tuple[str, str]) -> float:
        # TODO: Return P(word | context)
        pass

Test: Train on "the cat sat on the mat the cat ran" and verify that P("cat" | ("mat", "the")) is higher than P("sat" | ("mat", "the")).
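Expressed as code, and assuming your TrigramModel follows the skeleton above, the check might look like:

model = TrigramModel()
model.train("the cat sat on the mat the cat ran")

# The context ("mat", "the") is followed only by "cat" in the training text.
assert model.probability("cat", ("mat", "the")) > model.probability("sat", ("mat", "the"))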

Exercise 1.6: Perplexity Calculator

Implement a function to compute perplexity on held-out data:

import math  # you will likely need math.log and math.exp


def compute_perplexity(model, test_text: str) -> float:
    """
    Compute perplexity of model on test_text.

    Perplexity = exp(-1/N * sum(log P(w_i | context)))
    where N is the number of tokens scored.
    """
    # TODO: Implement
    pass

Test: A model that assigns uniform probability 1/V should have perplexity = V
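A sanity check along these lines catches most off-by-one and log-base mistakes; UniformModel is a hypothetical stand-in that you may need to adapt to whatever interface your compute_perplexity expects:

class UniformModel:
    """Toy model that assigns probability 1/V to every word, regardless of context."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def probability(self, word, context):
        return 1.0 / self.vocab_size

ppl = compute_perplexity(UniformModel(vocab_size=100), "the cat sat on the mat")
print(ppl)  # should be (approximately) 100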

Exercise 1.7: Add-k Smoothing

Implement add-k smoothing with tunable k:

def smoothed_probability(
    word: str,
    context: str,
    counts: dict,
    vocab_size: int,
    k: float = 1.0
) -> float:
    """
    P_smooth(word | context) = (count + k) / (total + k * V)
    """
    # TODO: Implement
    pass

Experiment: Plot perplexity vs k for k in [0.001, 0.01, 0.1, 1.0, 10.0]
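A sketch of the experiment might look like the loop below; build_smoothed_model and compute_perplexity are placeholders for your own code from Exercises 1.6 and 1.7, and train_text / val_text are assumed to exist:

import matplotlib.pyplot as plt

ks = [0.001, 0.01, 0.1, 1.0, 10.0]
perplexities = []
for k in ks:
    model = build_smoothed_model(train_text, k=k)  # placeholder: your smoothed model
    perplexities.append(compute_perplexity(model, val_text))

plt.plot(ks, perplexities, marker="o")
plt.xscale("log")  # k spans four orders of magnitude
plt.xlabel("k (add-k smoothing)")
plt.ylabel("perplexity")
plt.show()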

Exercise 1.8: Interpolation

Implement linear interpolation between unigram and bigram:

def interpolated_probability(
    word: str,
    context: str,
    unigram_model,
    bigram_model,
    lambda_1: float = 0.5
) -> float:
    """
    P(w|c) = λ₁ * P_bi(w|c) + (1-λ₁) * P_uni(w)
    """
    # TODO: Implement
    pass

Challenge Exercises

Exercise 1.9: Optimal λ via Cross-Validation

Implement a function that finds the optimal interpolation weight:

from typing import List


def find_optimal_lambda(
    train_text: str,
    val_text: str,
    lambdas: List[float]
) -> float:
    """
    Find λ that minimizes perplexity on the validation set.
    """
    # TODO: Train models, evaluate each λ, return best
    pass

Exercise 1.10: Corpus Analysis

Download a real corpus (e.g., first chapter of a book from Project Gutenberg).

a) Train unigram, bigram, and trigram models
b) Compare perplexity on held-out text
c) Generate 100 characters from each model with T = 1.0
d) Which produces more coherent text? Why?
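One way to fetch a corpus with only the standard library is sketched below; the URL is just an example, so substitute any plain-text Project Gutenberg file:

import urllib.request

# Example Project Gutenberg plain-text URL (substitute the book you want).
url = "https://www.gutenberg.org/cache/epub/1342/pg1342.txt"
with urllib.request.urlopen(url) as response:
    text = response.read().decode("utf-8", errors="ignore")

# Keep roughly a chapter's worth of text for a small experiment.
corpus = text[:20000]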

Exercise 1.11: Context Length Explosion

For a vocabulary of size V=10,000:

a) How many possible bigram contexts are there?
b) How many possible 5-gram contexts?
c) If you have 1 million tokens of training data, what fraction of 5-gram contexts will you observe?
d) This is called the "curse of dimensionality." How does it motivate neural approaches?
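The arithmetic is easy to miss by an order of magnitude, so a quick sketch like this (which assumes an n-gram model conditions on n − 1 previous words) can confirm your answers after you work them out by hand:

V = 10_000          # vocabulary size
tokens = 1_000_000  # training tokens

bigram_contexts = V ** 1    # a bigram conditions on 1 previous word
fivegram_contexts = V ** 4  # a 5-gram conditions on 4 previous words

# At most one new context per training position, so this is an upper bound.
max_observed_fraction = tokens / fivegram_contexts
print(bigram_contexts, fivegram_contexts, max_observed_fraction)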


Checking Your Work

  • Test suite: See code/stage-01/tests/test_markov.py for expected behavior
  • Reference implementation: Compare with code/stage-01/markov.py
  • Self-check: Verify perplexity calculation matches examples in documentation

Mini-Project: Shakespeare Generator

Build a complete Markov chain text generator trained on Shakespeare.

Requirements

  1. Data: Download a Shakespeare text file (e.g., from Project Gutenberg)
  2. Model: Implement a trigram Markov model with Laplace smoothing
  3. Generation: Generate 100-word passages at different temperatures
  4. Evaluation: Compute perplexity on a held-out test set

Deliverables

  • [ ] Working trigram model with smoothing
  • [ ] Generation function with temperature control
  • [ ] Perplexity evaluation on test data
  • [ ] Comparison: bigram vs. trigram perplexity
  • [ ] Generated samples at T=0.5, T=1.0, T=2.0

Extension

Try 4-grams and 5-grams. Plot perplexity vs. n-gram order. At what point do you see diminishing returns or overfitting?
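A sweep over n-gram order could look like the sketch below; NGramModel(n) is a hypothetical generalization of your trigram model, and compute_perplexity is your function from Exercise 1.6:

import matplotlib.pyplot as plt

orders = [2, 3, 4, 5]
perplexities = []
for n in orders:
    model = NGramModel(n)  # hypothetical: your generalized n-gram model
    model.train(train_text)
    perplexities.append(compute_perplexity(model, test_text))

plt.plot(orders, perplexities, marker="o")
plt.xlabel("n-gram order")
plt.ylabel("test perplexity")
plt.show()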