Interactive Visualizations¶
Explore the concepts from each stage with these interactive tools. All visualizations run entirely in your browser—no server required.
- **Attention Visualizer**: Explore how self-attention works. See how queries match keys, observe attention weights, and understand causal masking. (From Stage 5: Attention)
- **Gradient Descent Visualizer**: Watch optimizers navigate loss landscapes. Compare SGD, momentum, RMSprop, and Adam on different surfaces. (From Stage 4: Optimization)
- **Autograd Visualizer**: Watch automatic differentiation in action. Build computational graphs, run forward passes, and see gradients flow backward. (From Stage 2: Automatic Differentiation)
- **Temperature Sampling**: See how temperature transforms probability distributions. Experiment with different temperatures and sample tokens. (From Stage 1: Markov Chains)
- **N-gram State Machine**: Visualize Markov chains as state machines. Train on custom text, watch state transitions, and generate text step-by-step. (From Stage 1: Markov Chains)
What These Visualizations Teach¶
Attention Visualizer (Stage 5) — NEW¶
The attention visualizer demonstrates the core concepts from Sections 5.1-5.7:
- Query-Key matching: See how queries find relevant keys
- Attention weights: Observe the softmax distribution over positions
- Causal masking: Enable GPT-style masking to prevent future attention
- Temperature effects: Watch attention sharpen or soften with temperature
Try these experiments:
| Pattern | Masking | What it demonstrates |
|---|---|---|
| Random | None | Untrained attention, spread distribution |
| Self-Attention | None | Each position attends to itself |
| Previous Token | Causal | Local context pattern |
| Syntactic | None | Content word relationships |
| Distance-Based | Causal | Positional attention decay |
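For reference, the computation the visualizer animates fits in a few lines of NumPy: scaled dot-product attention with optional causal masking and temperature scaling. This is a minimal sketch for illustration, not the visualizer's JavaScript code; the sequence length, head dimension, and the `attention` helper are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False, temperature=1.0):
    """Scaled dot-product attention for one sequence of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # query-key matching
    scores = scores / temperature          # lower T sharpens, higher T softens
    if causal:
        # Each position may attend only to itself and earlier positions.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    weights = softmax(scores, axis=-1)     # one distribution per query position
    return weights @ V, weights

# Example: 4 positions, 8-dimensional vectors, GPT-style causal masking.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out, w = attention(Q, K, V, causal=True, temperature=0.5)
print(np.round(w, 2))   # upper triangle is zero; each row sums to 1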
Gradient Descent Visualizer (Stage 4)¶
The optimizer visualizer demonstrates the core concepts from Sections 4.2-4.5:
- Loss landscapes: See how different surfaces create different optimization challenges
- Optimizer comparison: Watch how SGD, momentum, and Adam behave differently
- Hyperparameter effects: Explore how learning rate and momentum coefficients affect convergence
- Condition number: Observe zigzagging on elongated valleys
Try these experiments:
| Surface | Optimizer | What it demonstrates |
|---|---|---|
| Elongated Valley | SGD | Zigzag problem, slow convergence |
| Elongated Valley | Momentum | Dampens oscillation, faster |
| Rosenbrock | Adam | Navigates curved valleys |
| Saddle Point | Any | Escape behavior (or getting stuck) |
| Rastrigin | Adam | Local minima challenges |
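The zigzag experiment can also be reproduced numerically. The sketch below compares plain SGD with momentum on an elongated quadratic valley; the specific loss, learning rate, and momentum coefficient are illustrative choices for this example, not the visualizer's presets.

```python
import numpy as np

def grad(p):
    # Gradient of an elongated valley f(x, y) = 0.5*x**2 + 50*y**2.
    # The large condition number (100) is what makes plain SGD struggle.
    x, y = p
    return np.array([x, 100.0 * y])

def sgd(p, lr, steps):
    for _ in range(steps):
        p = p - lr * grad(p)
    return p

def momentum(p, lr, beta, steps):
    v = np.zeros_like(p)
    for _ in range(steps):
        v = beta * v - lr * grad(p)   # velocity accumulates past gradients
        p = p + v
    return p

start = np.array([-4.0, 2.0])
lr = 0.019   # just below the stability limit (2/100) set by the steep direction

# The first few SGD steps in y show the zigzag: the sign flips every step.
p, ys = start.copy(), []
for _ in range(5):
    ys.append(p[1])
    p = p - lr * grad(p)
print("SGD y per step:", np.round(ys, 2))

# Distance from the minimum (0, 0) after 100 steps, same learning rate.
print("SGD:     ", round(float(np.linalg.norm(sgd(start, lr, 100))), 3))
print("Momentum:", round(float(np.linalg.norm(momentum(start, lr, 0.9, 100))), 3))
```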
N-gram State Machine (Stage 1)¶
The n-gram visualizer demonstrates the core concepts from Sections 1.1-1.3:
- State machine view: Markov chains are finite state automata
- Training = counting: Watch how observations become transition probabilities
- Generation: Sample from the model one token at a time
- Context dependence: See how history determines the next token distribution
Try these experiments:
| Training Text | What it demonstrates |
|---|---|
| `abab` | Deterministic patterns |
| `the cat sat on the mat` | Natural language structure |
| `to be or not to be` | Repeated patterns create loops |
| `aaaaabbbbb` | Imbalanced distributions |
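The "training = counting" idea fits in a few lines of Python. The sketch below builds a character-level bigram model and samples from it; it mirrors the visualizer's behavior in spirit, but the character-level tokenization and helper names are assumptions for this example, not the visualizer's actual code.

```python
import random
from collections import Counter, defaultdict

def train_bigram(text):
    """Training is just counting: how often does each token follow each state?"""
    tokens = list(text)                       # character-level for simplicity
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    # Normalize counts into transition probabilities.
    return {state: {t: c / sum(ctr.values()) for t, c in ctr.items()}
            for state, ctr in counts.items()}

def generate(model, start, length=20, seed=0):
    random.seed(seed)
    state, out = start, [start]
    for _ in range(length):
        probs = model.get(state)
        if not probs:                         # dead end: no observed successor
            break
        tokens, weights = zip(*probs.items())
        state = random.choices(tokens, weights=weights)[0]
        out.append(state)
    return "".join(out)

model = train_bigram("the cat sat on the mat")
print(model["t"])            # P(next char | current char = 't'): {'h': 0.5, ' ': 0.5}
print(generate(model, "t"))
```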
Autograd Visualizer (Stage 2)¶
The autograd visualizer demonstrates the core concepts from Sections 2.4-2.6:
- Computational graphs: See how mathematical expressions become directed acyclic graphs
- Forward pass: Watch values propagate from inputs to outputs
- Backward pass: Observe gradients flow in reverse via the chain rule
- Local gradients: Each operation contributes its local derivative
Try these expressions to explore different patterns:
| Expression | What it demonstrates |
|---|---|
| `(x + y) * z` | Basic operations, gradient accumulation |
| `x * x + y * y` | Sum of squares, independent gradients |
| `(x * y) + (y * z)` | Shared variable (`y` appears twice) |
| `x * x * x` | Power rule in action |
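For readers who want the idea in code, here is a minimal scalar autograd sketch in the spirit of the Stage 2 Value class. It is a simplified illustration (addition and multiplication only), not the book's exact implementation.

```python
class Value:
    """A scalar that records how it was computed, so gradients can flow back."""

    def __init__(self, data, children=()):
        self.data = data
        self.grad = 0.0
        self._children = children
        self._backward = lambda: None   # pushes this node's grad to its children

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():
            self.grad += out.grad        # d(a+b)/da = 1
            other.grad += out.grad       # d(a+b)/db = 1
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():
            self.grad += other.data * out.grad   # d(a*b)/da = b
            other.grad += self.data * out.grad   # d(a*b)/db = a
        out._backward = _backward
        return out

    def backward(self):
        # Topologically sort the graph, then apply the chain rule in reverse.
        topo, seen = [], set()
        def build(v):
            if v not in seen:
                seen.add(v)
                for child in v._children:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for node in reversed(topo):
            node._backward()

# (x + y) * z with x=2, y=3, z=4: d/dx = d/dy = z = 4, d/dz = x + y = 5.
x, y, z = Value(2.0), Value(3.0), Value(4.0)
out = (x + y) * z
out.backward()
print(out.data, x.grad, y.grad, z.grad)   # 20.0 4.0 4.0 5.0
```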
Temperature Sampling (Stage 1)¶
The temperature explorer demonstrates concepts from Section 1.6:
- Probability distributions: How language models represent uncertainty
- Temperature scaling: The formula P_T(t) = P(t)^(1/T) / Z
- Entropy: How "spread out" the distribution is
- Effective vocabulary: Perplexity as the "equivalent uniform vocabulary size"
Key insights to discover:
| Temperature | Effect | Use case |
|---|---|---|
| T → 0 | Greedy (argmax) | Deterministic outputs |
| T = 0.7 | Slightly focused | Coherent generation |
| T = 1.0 | Original distribution | Balanced sampling |
| T = 1.5 | More random | Creative writing |
| T → ∞ | Uniform | Maximum diversity |
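The temperature formula is easy to experiment with directly. The sketch below applies P_T(t) = P(t)^(1/T) / Z to a made-up next-token distribution and reports the resulting entropy and 2^entropy "effective vocabulary"; the distribution and the temperature values are arbitrary illustrative choices.

```python
import numpy as np

def apply_temperature(p, T):
    """Rescale a probability distribution: P_T(t) = P(t)^(1/T) / Z."""
    scaled = p ** (1.0 / T)
    return scaled / scaled.sum()

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()            # in bits

p = np.array([0.5, 0.25, 0.15, 0.07, 0.03])   # made-up next-token probabilities
for T in [0.2, 0.7, 1.0, 1.5, 5.0]:
    q = apply_temperature(p, T)
    H = entropy(q)
    print(f"T={T:<4} top prob={q.max():.2f}  entropy={H:.2f} bits  "
          f"effective vocab={2**H:.2f}")
```

Running it shows the table above in miniature: low temperatures concentrate mass on the top token, T = 1 leaves the distribution unchanged, and high temperatures push it toward uniform.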
Technical Notes¶
These visualizations are built with:
- Vanilla JavaScript: No build tools required, matching the "from scratch" philosophy
- D3.js: For data-driven visualization
- Portable design: Work offline, embed anywhere
The autograd visualizer is a direct port of the Python Value class from Stage 2, demonstrating that the same concepts work across languages.