Section 10.5: Choosing a Method¶
Reading time: 8 minutes
The Decision Framework¶
Three main approaches to alignment:
- Reward Modeling + Best-of-N: Simple, no RL
- RLHF (PPO): Full RL loop
- DPO: Direct optimization
Quick Reference¶
| Situation | Recommendation |
|---|---|
| Getting started | DPO |
| Production system with feedback loop | RLHF |
| Limited compute | DPO |
| Custom reward signals | RLHF |
| Offline preference data | DPO |
| Maximum flexibility | RLHF |
Method Comparison¶
Complexity¶
Lower ←——————————————————————————→ Higher
  Best-of-N        DPO          RLHF
      ●—————————————●—————————————●
Quality (given enough compute)¶
Lower ←——————————————————————————→ Higher
  Best-of-N        DPO          RLHF
      ●—————————————●—————————————●
                    (DPO and RLHF often comparable)
Stability¶
Lower ←——————————————————————————→ Higher
    RLHF               DPO   Best-of-N
      ●—————————————————●—————————●
Detailed Trade-offs¶
Best-of-N Sampling¶
How it works:
- Train a reward model
- Generate N responses
- Select the highest-reward response
Pros:
- Very simple
- No policy training needed
- Can use any reward signal
Cons:
- Expensive at inference: N full generations (plus reward scoring) per prompt
- Doesn't improve the policy itself
- Quality is capped by what the base policy can sample
Best for: Quick experiments, when inference cost doesn't matter.
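A minimal sketch of best-of-N selection, assuming a Hugging Face causal LM as the policy and a `score_fn` callable wrapping whatever reward model you trained; the model names, `score_fn`, and the sampling settings are placeholders, not fixed choices from this chapter:

```python
# Best-of-N sketch: sample N candidates, keep the one the reward model prefers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # assumed policy backend

def best_of_n(prompt, policy, tokenizer, score_fn, n=8, max_new_tokens=256):
    """Return the highest-reward response among n sampled candidates."""
    inputs = tokenizer(prompt, return_tensors="pt").to(policy.device)
    with torch.no_grad():
        outputs = policy.generate(
            **inputs,
            do_sample=True,            # stochastic decoding so candidates differ
            temperature=0.8,
            top_p=0.95,
            max_new_tokens=max_new_tokens,
            num_return_sequences=n,    # all n candidates in one batched call
        )
    prompt_len = inputs["input_ids"].shape[1]
    candidates = tokenizer.batch_decode(outputs[:, prompt_len:], skip_special_tokens=True)
    scores = [score_fn(prompt, c) for c in candidates]   # reward model scores each pair
    return candidates[max(range(n), key=lambda i: scores[i])]

# Hypothetical usage:
# tok = AutoTokenizer.from_pretrained("my-sft-model")
# lm = AutoModelForCausalLM.from_pretrained("my-sft-model")
# best = best_of_n("Explain RLHF in one paragraph.", lm, tok, my_reward_fn)
```

Because the base policy is never updated, the only knob is n: quality tends to improve with larger n, but so does inference cost.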
RLHF with PPO¶
How it works:
- Train reward model
- Generate responses from policy
- Score with reward model
- Update policy with PPO
- Repeat
Pros:
- Online learning (can improve from new feedback)
- Flexible reward signals
- Well-studied algorithm
Cons:
- Complex: multiple models to manage (policy, reference, reward model, plus a value head)
- Unstable: RL training is sensitive to hyperparameters and reward scaling
- Sample inefficient: needs many generations per update
Best for: Production systems with continuous feedback, custom rewards.
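A full PPO implementation is too long to show here, so the sketch below compresses the loop into a single KL-penalized policy-gradient (REINFORCE-style) update; real PPO adds a clipped surrogate objective, a value model, and several optimization epochs per rollout batch. `reward_fn`, `prompt_len`, and the `beta` value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids, prompt_len):
    """Sum of per-token log-probs for the response tokens of each sequence."""
    logits = model(input_ids).logits[:, :-1, :]           # predict token t+1 from prefix
    targets = input_ids[:, 1:]
    logps = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return logps[:, prompt_len - 1:].sum(dim=-1)          # response portion only

def rlhf_step(policy, ref_policy, reward_fn, optimizer, input_ids, prompt_len, beta=0.05):
    """One simplified update: KL-shaped reward + REINFORCE with a mean baseline."""
    rewards = reward_fn(input_ids)                        # [batch] scores from the reward model
    logp = sequence_logprob(policy, input_ids, prompt_len)
    with torch.no_grad():
        ref_logp = sequence_logprob(ref_policy, input_ids, prompt_len)
    # Penalize drifting away from the frozen reference model (the KL term in RLHF).
    shaped = rewards - beta * (logp.detach() - ref_logp)
    advantage = shaped - shaped.mean()                    # crude baseline
    loss = -(advantage * logp).mean()                     # policy-gradient objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice `input_ids` would come from `policy.generate` on a batch of prompts, which is the "generate responses from policy" step above.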
DPO¶
How it works:
- Collect preference pairs
- Compute log probs under policy and reference
- Apply DPO loss
- Update policy
Pros:
- Simple: supervised learning style
- Stable: no RL instabilities
- Efficient: no generation during training
Cons:
- Offline only: trains on a fixed preference dataset, with no on-policy generation during training
- Requires a good reference model
- Less flexible than an explicit reward model (preferences only, no arbitrary reward signals)
Best for: Most use cases, especially when starting out.
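The core of DPO is a single logistic loss over preference pairs. Here is a minimal PyTorch sketch; the per-sequence log-probabilities are assumed to already be summed over response tokens, and the β=0.1 default mirrors the recommendation in the Common Mistakes table below.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a batch of preference pairs (all inputs are [batch] tensors)."""
    # How much more the policy prefers each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Logistic loss on the scaled margin; beta controls the implicit KL strength.
    logits = beta * (chosen_logratio - rejected_logratio)
    loss = -F.logsigmoid(logits).mean()
    # Implicit reward margin: a handy training-progress signal.
    reward_margin = beta * (chosen_logratio - rejected_logratio).detach().mean()
    return loss, reward_margin
```

Training is then ordinary mini-batch gradient descent over the preference dataset; no generation, sampling, or separate reward model is needed.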
Decision Tree¶
Start
 │
 ├─ Do you need online learning from new preferences?
 │    │
 │    Yes ──▶ RLHF
 │    │
 │    No ──▶ Continue
 │
 ├─ Do you have custom reward signals (not just preferences)?
 │    │
 │    Yes ──▶ RLHF (or Reward Model + Best-of-N)
 │    │
 │    No ──▶ Continue
 │
 ├─ Are simplicity and stability important?
 │    │
 │    Yes ──▶ DPO
 │    │
 │    No ──▶ Continue
 │
 └─ Default ──▶ DPO (it's almost always a good choice)
Practical Recommendations¶
Starting a New Project¶
- Start with DPO
- Get a baseline working
- Only add RLHF complexity if needed
Production System¶
- DPO for initial alignment
- Add RLHF if you have continuous feedback
- Consider Best-of-N for safety-critical applications
Research¶
- DPO for quick experiments
- RLHF for studying online learning
- Both for comparing methods
Combining Methods¶
You can combine approaches:
DPO + Online Feedback¶
- Train initial policy with DPO
- Collect new preferences from deployed model
- Fine-tune with more DPO
RLHF + DPO Initialization¶
- Pre-train policy with DPO
- Continue with RLHF for online learning
Multi-Stage Alignment¶
- Stage 1: DPO for broad offline preference alignment
- Stage 2: RLHF once continuous feedback or custom rewards are available
- Stage 3: Best-of-N reranking at inference for safety-critical outputs
Common Mistakes¶
| Mistake | Why It's Wrong | Fix |
|---|---|---|
| Starting with RLHF | Unnecessary complexity | Start with DPO |
| No reference model in DPO | KL constraint is essential | Always use reference |
| Too low beta in DPO | Model diverges from reference | Start with β=0.1 |
| Poor preference data | Garbage in, garbage out | Invest in data quality |
| Ignoring evaluation | Can't tell if it's working | Measure continuously |
Evaluation¶
Whatever method you choose, evaluate on:
- Preference accuracy: Does the model match human preferences?
- Safety: Does it refuse harmful requests?
- Helpfulness: Does it solve user problems?
- Capability: Did it lose pre-training abilities?
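The first of these can be measured directly on held-out preference pairs. A small sketch, assuming each example is a dict with "prompt", "chosen", and "rejected" keys and a `logp_fn` that returns (policy, reference) log-probabilities for a prompt-response pair; both the data layout and `logp_fn` are assumptions for illustration.

```python
import torch

@torch.no_grad()
def preference_accuracy(pairs, logp_fn):
    """Fraction of held-out pairs where the model's implicit reward prefers the chosen response."""
    correct = 0
    for ex in pairs:                                   # pairs: list of preference dicts
        pol_c, ref_c = logp_fn(ex["prompt"], ex["chosen"])
        pol_r, ref_r = logp_fn(ex["prompt"], ex["rejected"])
        # Implicit reward = log-prob gain over the reference model.
        if (pol_c - ref_c) > (pol_r - ref_r):
            correct += 1
    return correct / len(pairs)
```

Safety, helpfulness, and capability usually need separate benchmarks or human review; the point is to track all four continuously rather than only the preference metric you optimized.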
Summary¶
| Method | Complexity | Stability | Flexibility | Best For |
|---|---|---|---|---|
| Best-of-N | Low | High | Medium | Quick experiments |
| DPO | Medium | High | Medium | Most use cases |
| RLHF | High | Low | High | Online learning |
Default recommendation: Start with DPO. It's simple, stable, and effective.
Key insight: You don't need the most complex method—you need the method that works for your situation.
Next: We'll implement all these methods from scratch.