
Section 10.5: Choosing a Method

Reading time: 8 minutes

The Decision Framework

Three main approaches to alignment:

  1. Reward Modeling + Best-of-N: Simple, no RL
  2. RLHF (PPO): Full RL loop
  3. DPO: Direct optimization

Quick Reference

Situation                               Recommendation
Getting started                         DPO
Production system with feedback loop    RLHF
Limited compute                         DPO
Custom reward signals                   RLHF
Offline preference data                 DPO
Maximum flexibility                     RLHF

Method Comparison

Complexity

Simple ←——————————————————————————→ Complex

Best-of-N     DPO           RLHF
  ●————————————●—————————————●

Quality (given enough compute)

Lower ←——————————————————————————→ Higher

Best-of-N     DPO           RLHF
  ●————————————●—————————————●
            (often comparable)

Stability

Less Stable ←————————————————→ More Stable

RLHF           Best-of-N       DPO
  ●—————————————●———————————————●

Detailed Trade-offs

Best-of-N Sampling

How it works (sketched below):

  1. Train a reward model
  2. Generate N responses
  3. Select the highest-reward response
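
A minimal sketch of these three steps. The names policy.generate(prompt) and reward_model.score(prompt, response) are placeholders for whatever generation and reward-model interfaces you actually use; generate is assumed to return one sampled response string and score a single scalar.

  def best_of_n(policy, reward_model, prompt, n=16):
      # Sample N candidate responses from the frozen policy.
      candidates = [policy.generate(prompt) for _ in range(n)]
      # Score each candidate with the trained reward model.
      scores = [reward_model.score(prompt, c) for c in candidates]
      # Return the candidate with the highest reward.
      return max(zip(scores, candidates), key=lambda pair: pair[0])[1]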

Pros:

  • Very simple
  • No policy training needed
  • Can use any reward signal

Cons:

  • Expensive at inference (N full generations per prompt)
  • Doesn't improve the policy itself
  • Quality capped by what the base policy can already sample

Best for: Quick experiments, when inference cost doesn't matter.

RLHF with PPO

How it works (loop sketched below):

  1. Train reward model
  2. Generate responses from policy
  3. Score with reward model
  4. Update policy with PPO
  5. Repeat
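
A schematic version of this loop. Every argument here is a placeholder rather than a specific library's API; in practice a framework such as TRL packages the generation, KL penalty, and PPO step for you.

  def rlhf_ppo_loop(policy, ref_model, reward_model, prompt_loader,
                    ppo_update, kl_to_reference, kl_coef=0.05):
      for prompts in prompt_loader:          # prompts only, no labels needed
          # Generate responses from the current policy.
          responses = [policy.generate(p) for p in prompts]
          # Reward = reward-model score minus a KL penalty that keeps the
          # policy close to the frozen reference model.
          rewards = [
              reward_model.score(p, r)
              - kl_coef * kl_to_reference(policy, ref_model, p, r)
              for p, r in zip(prompts, responses)
          ]
          # One PPO update on the policy using these rollouts, then repeat.
          ppo_update(policy, prompts, responses, rewards)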

Pros:

  • Online learning (can improve from new feedback)
  • Flexible reward signals
  • Well-studied algorithm

Cons:

  • Complex: several models to juggle (policy, reference, reward model, and usually a critic)
  • Unstable: PPO training is sensitive to hyperparameters
  • Sample inefficient: needs many fresh generations from the policy

Best for: Production systems with continuous feedback, custom rewards.

DPO

How it works (loss sketched below):

  1. Collect preference pairs
  2. Compute log-probabilities of each response under the policy and a frozen reference model
  3. Apply DPO loss
  4. Update policy
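
A minimal PyTorch sketch of the loss, assuming the inputs are summed per-sequence log-probabilities of the chosen and rejected responses under the policy and the frozen reference model:

  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logps, policy_rejected_logps,
               ref_chosen_logps, ref_rejected_logps, beta=0.1):
      # Log-ratios of policy vs. reference for each response in the pair.
      chosen_ratio = policy_chosen_logps - ref_chosen_logps
      rejected_ratio = policy_rejected_logps - ref_rejected_logps
      # DPO loss: -log sigmoid(beta * (chosen ratio - rejected ratio)).
      # Larger beta keeps the policy closer to the reference.
      return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

The default beta=0.1 matches the starting point recommended in the Common Mistakes table below.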

Pros:

  • Simple: supervised learning style
  • Stable: no RL instabilities
  • Efficient: no generation during training

Cons:

  • Offline: trains on a fixed preference dataset, so it can't exploit fresh samples from the current policy
  • Requires a good reference model
  • Less flexible than an explicit reward model (preference pairs only, no arbitrary reward signals)

Best for: Most use cases, especially when starting out.

Decision Tree

Start
  ├─ Do you need online learning from new preferences?
  │    │
  │    Yes ──▶ RLHF
  │    │
  │    No ──▶ Continue
  ├─ Do you have custom reward signals (not just preferences)?
  │    │
  │    Yes ──▶ RLHF (or Reward Model + Best-of-N)
  │    │
  │    No ──▶ Continue
  ├─ Are simplicity and stability important?
  │    │
  │    Yes ──▶ DPO
  │    │
  │    No ──▶ Continue
  └─ Default ──▶ DPO (it's almost always a good choice)
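
The same logic as a small helper function, purely illustrative:

  def choose_method(needs_online_learning: bool, has_custom_rewards: bool) -> str:
      # Mirrors the decision tree above; the remaining branches all land on DPO.
      if needs_online_learning:
          return "RLHF"
      if has_custom_rewards:
          return "RLHF (or reward model + Best-of-N)"
      return "DPO"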

Practical Recommendations

Starting a New Project

  1. Start with DPO
  2. Get a baseline working
  3. Only add RLHF complexity if needed

Production System

  1. DPO for initial alignment
  2. Add RLHF if you have continuous feedback
  3. Consider Best-of-N for safety-critical applications

Research

  1. DPO for quick experiments
  2. RLHF for studying online learning
  3. Both for comparing methods

Combining Methods

You can combine approaches:

DPO + Online Feedback

  1. Train initial policy with DPO
  2. Collect new preferences from deployed model
  3. Fine-tune with another round of DPO (see the loop sketch below)
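
A rough sketch of that loop, where train_dpo, deploy, and collect_preferences are placeholders for your own training and data-collection pipeline:

  import copy

  def iterative_dpo(base_model, initial_prefs, train_dpo, deploy,
                    collect_preferences, num_rounds=3):
      # First round: plain offline DPO against the base/SFT model as reference.
      policy = train_dpo(base_model, initial_prefs, reference=base_model)
      for _ in range(num_rounds):
          deploy(policy)
          # Gather fresh preference pairs over the deployed model's outputs.
          new_prefs = collect_preferences(policy)
          # Another offline DPO round, with a frozen copy of the current
          # policy as the new reference.
          policy = train_dpo(policy, new_prefs, reference=copy.deepcopy(policy))
      return policy

Each round is still offline training on a fixed batch of preferences; the "online" part is only the repeated data collection between rounds.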

RLHF + DPO Initialization

  1. Pre-train policy with DPO
  2. Continue with RLHF for online learning

Multi-Stage Alignment

SFT → DPO (general alignment) → RLHF (task-specific refinement)

Common Mistakes

Mistake                      Why It's Wrong                     Fix
Starting with RLHF           Unnecessary complexity             Start with DPO
No reference model in DPO    The KL constraint is essential     Always use a reference model
Beta set too low in DPO      Model drifts from the reference    Start with β = 0.1
Poor preference data         Garbage in, garbage out            Invest in data quality
Ignoring evaluation          Can't tell if it's working         Measure continuously

Evaluation

Whatever method you choose, evaluate on:

  1. Preference accuracy: Does the model match human preferences?
  2. Safety: Does it refuse harmful requests?
  3. Helpfulness: Does it solve user problems?
  4. Capability: Did it lose pre-training abilities?
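
For the first of these, a minimal offline check is the fraction of held-out preference pairs where your scorer ranks the chosen response above the rejected one. Here score_fn is a placeholder: a reward model, or the DPO implicit reward beta * (log pi - log pi_ref).

  def preference_accuracy(pairs, score_fn):
      # pairs: list of (prompt, chosen, rejected) tuples from a held-out set.
      correct = sum(
          score_fn(prompt, chosen) > score_fn(prompt, rejected)
          for prompt, chosen, rejected in pairs
      )
      return correct / len(pairs)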

Summary

Method      Complexity   Stability   Flexibility   Best For
Best-of-N   Low          High        Medium        Quick experiments
DPO         Medium       High        Medium        Most use cases
RLHF        High         Low         High          Online learning

Default recommendation: Start with DPO. It's simple, stable, and effective.

Key insight: You don't need the most complex method—you need the method that works for your situation.

Next: We'll implement all these methods from scratch.