Interactive Visualizations

Explore the concepts from each stage with these interactive tools. All visualizations run entirely in your browser—no server required.

  • Attention Visualizer


    Explore how self-attention works. See how queries match keys, observe attention weights, and understand causal masking.

    Launch Attention Visualizer

    From Stage 5: Attention

  • Gradient Descent Visualizer


    Watch optimizers navigate loss landscapes. Compare SGD, momentum, RMSprop, and Adam on different surfaces.

    Launch Optimizer Visualizer

    From Stage 4: Optimization

  • Autograd Visualizer


    Watch automatic differentiation in action. Build computational graphs, run forward passes, and see gradients flow backward.

    Launch Autograd Visualizer

    From Stage 2: Automatic Differentiation

  • Temperature Sampling


    See how temperature transforms probability distributions. Experiment with different temperatures and sample tokens.

    Launch Temperature Explorer

    From Stage 1: Markov Chains

  • N-gram State Machine


    Visualize Markov chains as state machines. Train on custom text, watch state transitions, and generate text step-by-step.

    Launch N-gram Visualizer

    From Stage 1: Markov Chains

What These Visualizations Teach

Attention Visualizer (Stage 5) — NEW

The attention visualizer demonstrates the core concepts from Sections 5.1-5.7:

  • Query-Key matching: See how queries find relevant keys
  • Attention weights: Observe the softmax distribution over positions
  • Causal masking: Enable GPT-style masking to prevent future attention
  • Temperature effects: Watch attention sharpen or soften with temperature
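
To connect these controls to code, here is a minimal NumPy sketch of scaled dot-product attention with an optional causal mask and a temperature parameter. It is an illustrative stand-in written for this page, not the visualizer's actual JavaScript, and the example values are arbitrary.

```python
import numpy as np

def attention(Q, K, V, causal=False, temperature=1.0):
    """Scaled dot-product attention over one sequence of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)            # query-key similarity for every pair of positions
    if causal:                               # GPT-style mask: a position cannot attend to the future
        future = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores / temperature            # low T sharpens each row, high T softens it
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V, weights

# Four random positions with causal masking on: the upper triangle of the weights is zero.
rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
_, w = attention(Q, K, V, causal=True, temperature=0.5)
print(np.round(w, 2))
```

Lowering the temperature concentrates each row of weights on its largest score; the causal flag zeroes out everything above the diagonal, which is exactly what the masked view in the visualizer shows.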

Try these experiments:

Pattern         Masking  What it demonstrates
--------------  -------  -----------------------------------------
Random          None     Untrained attention, spread distribution
Self-Attention  None     Each position attends to itself
Previous Token  Causal   Local context pattern
Syntactic       None     Content word relationships
Distance-Based  Causal   Positional attention decay

Gradient Descent Visualizer (Stage 4)

The optimizer visualizer demonstrates the core concepts from Sections 4.2-4.5:

  • Loss landscapes: See how different surfaces create different optimization challenges
  • Optimizer comparison: Watch how SGD, momentum, and Adam behave differently
  • Hyperparameter effects: Explore how learning rate and momentum coefficients affect convergence
  • Condition number: Observe zigzagging on elongated valleys
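
As a rough code counterpart to the animation, the sketch below applies textbook SGD, heavy-ball momentum, and Adam updates to a hand-picked elongated quadratic. The surface, learning rates, and step count are illustrative assumptions, chosen only to make the zigzag and the speed difference visible.

```python
import numpy as np

def grad(p):
    """Gradient of an elongated valley: f(x, y) = 0.5*(x**2 + 100*y**2)."""
    x, y = p
    return np.array([x, 100.0 * y])

def run(optimizer, lr, steps=100):
    p = np.array([4.0, 2.0])
    m, v = np.zeros(2), np.zeros(2)          # momentum buffer / Adam moment estimates
    path = [p.copy()]
    for t in range(1, steps + 1):
        g = grad(p)
        if optimizer == "sgd":
            p = p - lr * g
        elif optimizer == "momentum":        # heavy-ball momentum, beta = 0.9
            m = 0.9 * m + g
            p = p - lr * m
        elif optimizer == "adam":            # Adam with standard betas and bias correction
            m = 0.9 * m + 0.1 * g
            v = 0.999 * v + 0.001 * g**2
            p = p - lr * (m / (1 - 0.9**t)) / (np.sqrt(v / (1 - 0.999**t)) + 1e-8)
        path.append(p.copy())
    return np.array(path)

# Learning rates are illustrative; SGD and momentum sit near the stability limit
# of the steep y direction, which is what forces the zigzag.
for name, lr in [("sgd", 0.019), ("momentum", 0.019), ("adam", 0.1)]:
    path = run(name, lr)
    print(f"{name:9s} first y values {np.round(path[:4, 1], 2)}  final {np.round(path[-1], 3)}")
```

SGD's y coordinate flips sign every step while its x coordinate crawls along the valley floor; momentum and Adam make much faster progress toward the minimum at (0, 0).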

Try these experiments:

Surface           Optimizer  What it demonstrates
----------------  ---------  ----------------------------------
Elongated Valley  SGD        Zigzag problem, slow convergence
Elongated Valley  Momentum   Dampens oscillation, faster
Rosenbrock        Adam       Navigates curved valleys
Saddle Point      Any        Escape behavior (or getting stuck)
Rastrigin         Adam       Local minima challenges

N-gram State Machine (Stage 1)

The n-gram visualizer demonstrates the core concepts from Sections 1.1-1.3:

  • State machine view: Markov chains are finite state automata
  • Training = counting: Watch how observations become transition probabilities
  • Generation: Sample from the model one token at a time
  • Context dependence: See how history determines the next token distribution
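
A minimal sketch of that counting-and-sampling loop, assuming character tokens and a bigram (order-1) model, which is a simplification of what the visualizer shows:

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Training really is just counting: tally transitions between consecutive tokens."""
    counts = defaultdict(Counter)
    for current, nxt in zip(text, text[1:]):
        counts[current][nxt] += 1
    # Normalize each state's counts into a probability distribution.
    model = {}
    for state, ctr in counts.items():
        total = sum(ctr.values())
        model[state] = {tok: c / total for tok, c in ctr.items()}
    return model

def generate(model, start, length=20):
    """Sample one token at a time from the current state's distribution."""
    out = [start]
    for _ in range(length):
        dist = model.get(out[-1])
        if not dist:                      # dead end: this state was never followed by anything
            break
        tokens, probs = zip(*dist.items())
        out.append(random.choices(tokens, weights=probs)[0])
    return "".join(out)

model = train_bigram("abab")
print(model)                    # {'a': {'b': 1.0}, 'b': {'a': 1.0}} -- a deterministic loop
print(generate(model, "a"))     # always "ababab..." because every transition has probability 1
```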

Try these experiments:

Training Text           What it demonstrates
----------------------  ------------------------------
abab                    Deterministic patterns
the cat sat on the mat  Natural language structure
to be or not to be      Repeated patterns create loops
aaaaabbbbb              Imbalanced distributions

Autograd Visualizer (Stage 2)

The autograd visualizer demonstrates the core concepts from Sections 2.4-2.6:

  • Computational graphs: See how mathematical expressions become directed acyclic graphs
  • Forward pass: Watch values propagate from inputs to outputs
  • Backward pass: Observe gradients flow in reverse via the chain rule
  • Local gradients: Each operation contributes its local derivative
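
The full Value class lives in Stage 2 (and is ported to JavaScript for this page); the stripped-down sketch below supports only + and * but shows the same forward-build / backward-flow mechanics. It is an illustration, not the Stage 2 implementation.

```python
class Value:
    """Minimal scalar autograd node supporting + and * (a sketch, not the Stage 2 class)."""
    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # nodes this value was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self, upstream=1.0):
        # Chain rule: accumulate the upstream gradient, then push it to each parent.
        self.grad += upstream
        for parent, local in zip(self._parents, self._local_grads):
            parent.backward(upstream * local)

x, y, z = Value(2.0), Value(3.0), Value(4.0)
out = (x + y) * z        # forward pass builds the computational graph
out.backward()           # backward pass: d(out)/dx = z = 4, d(out)/dz = x + y = 5
print(x.grad, y.grad, z.grad)   # 4.0 4.0 5.0
```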

Try these expressions to explore different patterns:

Expression         What it demonstrates
-----------------  ---------------------------------------
(x + y) * z        Basic operations, gradient accumulation
x * x + y * y      Sum of squares, independent gradients
(x * y) + (y * z)  Shared variable (y appears twice)
x * x * x          Power rule in action

Temperature Sampling (Stage 1)

The temperature explorer demonstrates concepts from Section 1.6:

  • Probability distributions: How language models represent uncertainty
  • Temperature scaling: The formula P_T(t) = P(t)^(1/T) / Z
  • Entropy: How "spread out" the distribution is
  • Effective vocabulary: Perplexity as the "equivalent uniform vocabulary size"
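
A small sketch of the formula above together with the entropy and perplexity readouts, using a made-up four-token distribution:

```python
import numpy as np

def apply_temperature(p, T):
    """Rescale a distribution: P_T(t) = P(t)^(1/T) / Z."""
    scaled = p ** (1.0 / T)
    return scaled / scaled.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log2(p))          # in bits

p = np.array([0.5, 0.25, 0.15, 0.10])       # a toy next-token distribution
for T in (0.5, 1.0, 2.0):
    q = apply_temperature(p, T)
    H = entropy(q)
    print(f"T={T}: {np.round(q, 3)}  entropy={H:.2f} bits  perplexity={2 ** H:.2f}")
```

Below T = 1 the mass concentrates on the most likely token and the perplexity drops toward 1; at large T the distribution flattens toward uniform and the perplexity approaches the vocabulary size.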

Key insights to discover:

Temperature  Effect                 Use case
-----------  ---------------------  ---------------------
T → 0        Greedy (argmax)        Deterministic outputs
T = 0.7      Slightly focused       Coherent generation
T = 1.0      Original distribution  Balanced sampling
T = 1.5      More random            Creative writing
T → ∞        Uniform                Maximum diversity

Technical Notes

These visualizations are built with:

  • Vanilla JavaScript: No build tools required, matching the "from scratch" philosophy
  • D3.js: For data-driven visualization
  • Portable design: Works offline and can be embedded anywhere

The autograd visualizer is a direct port of the Python Value class from Stage 2, demonstrating that the same concepts work across languages.