# Stage 10: Alignment

*Making models helpful, harmless, and honest*
## Overview
Pre-trained LLMs are powerful pattern matchers that predict next tokens. But prediction isn't enough—we want models that:
- Help users accomplish their goals
- Avoid harm even when prompted to cause it
- Are honest about uncertainty and limitations
This gap between "predicts well" and "behaves well" is the alignment problem.
"A model that predicts text perfectly might still write harmful content perfectly."
## The Core Challenge
**Pre-training objective:** Maximize \(P(\text{next token} \mid \text{context})\)

**What we actually want:** Maximize \(P(\text{human approves of response} \mid \text{context})\)
These are different! A model trained only on prediction might:
- Generate toxic content if that's what the context suggests
- Confidently state falsehoods
- Help with harmful requests
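To make the mismatch concrete, here is a minimal sketch of the pre-training objective, assuming PyTorch; the function name and tensor shapes are illustrative, not taken from the stage code. The loss only rewards matching the training distribution, so it contains no term that distinguishes helpful from harmful continuations.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy. logits: (batch, seq, vocab), tokens: (batch, seq).

    Nothing here scores helpfulness or honesty; the model is only rewarded
    for predicting whatever the training data would have said next.
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # predict token t+1 from the prefix
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```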
## The Solution: Preference Learning
Instead of defining "good behavior" with rules, we learn it from human preferences:
1. Show humans two model responses
2. Ask which is better
3. Train the model to produce more preferred responses
This is the foundation of both RLHF and DPO.
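In the reward-modeling step, this is typically formalized with a Bradley-Terry loss over preference pairs. A minimal sketch follows, assuming PyTorch; the function and argument names are illustrative. The model's scalar score for the preferred response is pushed above its score for the rejected one.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Scalar rewards for the preferred and rejected responses, shape (batch,).

    Bradley-Terry models P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    minimizing the negative log-sigmoid maximizes that likelihood.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```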
## Methods We'll Cover
| Method | Approach | Complexity |
|---|---|---|
| Reward Modeling | Learn a scalar "goodness" score from human preference pairs | Medium |
| RLHF (PPO) | Optimize the policy against a learned reward model with RL | High |
| DPO | Optimize the policy directly on preference pairs, skipping the reward model and RL loop | Low |
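For contrast with the RLHF pipeline, here is a minimal sketch of the DPO loss, assuming PyTorch; the argument names are illustrative. The inputs are summed log-probabilities of each full response under the policy being trained and under a frozen reference model, and `beta` controls how far the policy may drift from that reference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """All inputs are per-example summed log-probs of a full response, shape (batch,)."""
    # Implicit reward of each response: beta * (log pi_policy - log pi_ref).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Same Bradley-Terry form as the reward model, but applied to implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```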
## Why This Matters
Alignment is what makes the difference between:
- A text generator and an assistant
- A pattern matcher and a helpful tool
- A liability and a product
Most "AI safety" concerns are really about alignment.
## Learning Objectives
By the end of this stage, you will:
- Understand why alignment is necessary
- Implement reward modeling with the Bradley-Terry model
- Understand RLHF and PPO basics
- Implement DPO from scratch
- Know when to use each approach
## Sections
- The Alignment Problem - Why prediction isn't enough
- Reward Modeling - Learning from preferences
- RLHF with PPO - Reinforcement learning approach
- Direct Preference Optimization - A simpler alternative
- Choosing a Method - Trade-offs and recommendations
- Implementation - Building alignment from scratch
## Prerequisites
- Understanding of neural network training (Stages 2-3)
- Familiarity with language model architecture (Stage 6)
- Basic probability and optimization concepts
## Key Insight
Alignment doesn't require solving the "hard problem" of defining what's good. It requires learning from human judgments—which humans are very good at providing.
## Code & Resources
| Resource | Description |
|---|---|
| `code/stage-10/alignment.py` | Reward Model, RLHF, and DPO |
| `code/stage-10/tests/` | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |