Section 10.4: Direct Preference Optimization (DPO)¶
Reading time: 12 minutes
The DPO Revolution¶
Rafailov et al. (2023) asked: Do we actually need RL?
The answer: No.
DPO achieves the same goal as RLHF with a simple supervised loss.
The Key Insight¶
The optimal RLHF policy has a closed form:
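\[
\pi^*(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\left(\frac{1}{\beta}\,r(x, y)\right)
\]
where \(Z(x)\) is the partition function that normalizes the distribution over responses.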
Rearranging for the reward:
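\[
r(x, y) = \beta \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \beta \log Z(x)
\]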
The reward is just a log ratio! (Plus a term \(\beta \log Z(x)\) that depends only on the prompt, so it drops out as soon as we compare two responses to the same prompt.)
From Reward to Loss¶
Substituting the implicit reward into Bradley-Terry:
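Under Bradley-Terry, \(p(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)\). The prompt-only \(\beta \log Z(x)\) terms cancel, leaving
\[
p^*(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)
\]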
The DPO loss:
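\[
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
\]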
What This Means¶
RLHF:
- Train reward model
- Generate samples
- Compute rewards
- PPO update
- Repeat
DPO:
- Compute log probs under policy and reference
- Apply DPO loss
- Update
That's it. No reward model. No RL. No generation during training.
Implementation¶
```python
import numpy as np


def sigmoid(x: np.ndarray) -> np.ndarray:
    """Logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


class DPOTrainer:
    """Direct Preference Optimization trainer."""

    def __init__(self, beta: float = 0.1):
        self.beta = beta

    def compute_loss(
        self,
        policy_chosen_logps: np.ndarray,    # log π(y_w|x)
        policy_rejected_logps: np.ndarray,  # log π(y_l|x)
        ref_chosen_logps: np.ndarray,       # log π_ref(y_w|x)
        ref_rejected_logps: np.ndarray,     # log π_ref(y_l|x)
    ):
        # Log ratios of policy vs. frozen reference
        chosen_logratios = policy_chosen_logps - ref_chosen_logps
        rejected_logratios = policy_rejected_logps - ref_rejected_logps

        # Implicit rewards (scaled by beta)
        chosen_rewards = self.beta * chosen_logratios
        rejected_rewards = self.beta * rejected_logratios

        # DPO loss: negative log-sigmoid of the reward margin
        logits = chosen_rewards - rejected_rewards
        loss = -np.mean(np.log(sigmoid(logits)))

        # Accuracy: how often do we correctly predict the preference?
        accuracy = np.mean(logits > 0)
        return loss, accuracy
```
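A quick sanity check with toy numbers, continuing from the class above (the log-probabilities are illustrative, not from a real model):

```python
# Toy sequence log-probs for a batch of 3 preference pairs (illustrative values only)
policy_chosen = np.array([-12.0, -8.5, -20.0])
policy_rejected = np.array([-14.0, -9.0, -18.0])
ref_chosen = np.array([-13.0, -9.0, -19.5])
ref_rejected = np.array([-13.5, -8.8, -18.5])

trainer = DPOTrainer(beta=0.1)
loss, accuracy = trainer.compute_loss(
    policy_chosen, policy_rejected, ref_chosen, ref_rejected
)
print(f"loss={loss:.4f}  accuracy={accuracy:.2f}")  # accuracy 0.67: 2 of 3 pairs ranked correctly
```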
The DPO Training Loop¶
```python
import copy

# Freeze the reference model (usually the SFT checkpoint)
ref_model = copy.deepcopy(policy_model)
ref_model.freeze()

for batch in preference_data:
    prompt = batch.prompt
    chosen = batch.chosen
    rejected = batch.rejected

    # Compute sequence log probs (no generation needed!)
    # Schematic: log_prob returns the summed log-probability of the response tokens.
    policy_chosen_logps = policy_model.log_prob(prompt, chosen)
    policy_rejected_logps = policy_model.log_prob(prompt, rejected)
    ref_chosen_logps = ref_model.log_prob(prompt, chosen)      # no gradients flow here
    ref_rejected_logps = ref_model.log_prob(prompt, rejected)

    # DPO loss
    loss, accuracy = dpo_trainer.compute_loss(
        policy_chosen_logps,
        policy_rejected_logps,
        ref_chosen_logps,
        ref_rejected_logps,
    )

    # Standard supervised update (in a real framework, loss is an autograd tensor)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
Why DPO Works¶
Mathematical Equivalence¶
DPO optimizes the same objective as RLHF:
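\[
\max_{\pi_\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big[\pi_\theta(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big]
\]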
But directly, without the RL machinery.
Implicit Reward¶
The trained policy implicitly defines a reward:
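\[
\hat{r}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}
\]
(up to a prompt-only term \(\beta \log Z(x)\) that never affects comparisons between responses).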
You can extract this for analysis if needed.
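For instance, reusing the schematic `log_prob` interface from the training loop above (a sketch; `policy_model` and `ref_model` are the same placeholder objects, not a specific library API):

```python
def implicit_reward(prompt, response, policy_model, ref_model, beta: float = 0.1) -> float:
    """Score a response with the reward implicitly defined by the DPO-trained policy."""
    return beta * (
        policy_model.log_prob(prompt, response) - ref_model.log_prob(prompt, response)
    )
```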
The Beta Parameter¶
\(\beta\) controls the trade-off:
| Beta | Effect |
|---|---|
| High (0.5+) | Stay close to reference, conservative updates |
| Medium (0.1) | Balanced (common default) |
| Low (0.01) | Aggressively optimize preferences, may diverge |
Default recommendation: Start with \(\beta = 0.1\).
DPO vs RLHF¶
| Aspect | RLHF | DPO |
|---|---|---|
| Reward model | Required | Not needed |
| RL loop | Yes (PPO) | No |
| Generation during training | Yes | No |
| Models needed | 3 (policy, ref, reward) | 2 (policy, ref) |
| Stability | Tricky | Very stable |
| Sample efficiency | Low | High |
| Flexibility | High | Medium |
When to Choose DPO¶
Use DPO when:
- You have offline preference data
- You want simplicity and stability
- You don't need online learning
Use RLHF when:
- You need online learning from new preferences
- You want to use custom reward signals
- You have complex reward functions
Variations¶
IPO (Identity Preference Optimization)¶
Addresses potential overfitting in DPO by replacing the log-sigmoid with a squared loss that regresses the log-ratio margin toward a fixed target instead of pushing it ever larger.
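Concretely, the IPO objective (Azar et al., 2023) can be sketched as, with regularization strength \(\tau\):
\[
\mathcal{L}_{\text{IPO}} = \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[\left(\log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \frac{1}{2\tau}\right)^{2}\right]
\]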
KTO (Kahneman-Tversky Optimization)¶
Works with unpaired binary feedback (good/bad) on individual responses instead of pairwise comparisons.
ORPO (Odds Ratio Preference Optimization)¶
Folds preference optimization into the supervised fine-tuning objective via an odds-ratio penalty on rejected responses, removing the need for a separate reference model.
Practical Tips¶
1. Reference Model¶
Must be frozen. Usually the SFT model.
2. Learning Rate¶
Typically lower than SFT (1e-6 to 1e-5).
3. Batch Size¶
Larger batches help with stability.
4. Data Quality¶
DPO is very sensitive to preference data quality.
5. Label Smoothing¶
Can help prevent overconfidence when some preference labels are noisy; see the sketch below.
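A minimal sketch of a label-smoothed ("conservative") DPO loss, assuming the same numpy-style setup as the `DPOTrainer` above; `epsilon` is the assumed probability that a preference label is flipped:

```python
import numpy as np


def dpo_loss_label_smoothed(logits: np.ndarray, epsilon: float = 0.1) -> float:
    """DPO loss where each preference label is assumed flipped with probability epsilon."""
    # -log sigmoid(z) written as log(1 + exp(-z)) for numerical stability
    nll_correct = np.logaddexp(0.0, -logits)  # label as given
    nll_flipped = np.logaddexp(0.0, logits)   # label flipped
    return float(np.mean((1.0 - epsilon) * nll_correct + epsilon * nll_flipped))
```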
Metrics to Track¶
| Metric | What It Tells You |
|---|---|
| Loss | Training progress |
| Accuracy | Preference prediction |
| Chosen reward | Average implicit reward (\(\beta\) × log-ratio) of the preferred responses |
| Rejected reward | Average implicit reward of the rejected responses |
| Reward margin | chosen - rejected (should increase) |
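One way to log these, reusing the implicit rewards already computed inside `compute_loss` (a sketch; the names mirror the trainer above):

```python
import numpy as np


def dpo_metrics(chosen_rewards: np.ndarray, rejected_rewards: np.ndarray) -> dict:
    """Per-batch statistics worth logging alongside the loss."""
    margin = chosen_rewards - rejected_rewards
    return {
        "accuracy": float(np.mean(margin > 0)),              # preference prediction
        "chosen_reward": float(np.mean(chosen_rewards)),     # implicit reward of preferred responses
        "rejected_reward": float(np.mean(rejected_rewards)),
        "reward_margin": float(np.mean(margin)),             # should grow during training
    }
```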
Summary¶
| Concept | Description |
|---|---|
| Core idea | Optimal policy has closed form |
| Loss | Log sigmoid of reward difference |
| Beta | KL penalty strength |
| Advantage | Simpler, more stable than RLHF |
| Limitation | Offline only |
Key insight: DPO shows that alignment doesn't require RL—it can be done with supervised learning.
Next: We'll compare the methods and discuss when to use each.