Section 10.4: Direct Preference Optimization (DPO)

Reading time: 12 minutes

The DPO Revolution

Rafailov et al. (2023) asked: Do we actually need RL?

The answer: No.

DPO achieves the same goal as RLHF with a simple supervised loss.

The Key Insight

The optimal policy for the KL-regularized RLHF objective has a closed form:

\[\pi^*(y|x) \propto \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)\]

Rearranging for the reward:

\[r(x,y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)\]

where \(Z(x) = \sum_y \pi_{\text{ref}}(y|x) \exp\left(r(x,y)/\beta\right)\) is the partition function.

The reward is just a log ratio, plus a term that depends only on the prompt!

From Reward to Loss

Substituting the implicit reward into the Bradley-Terry preference model (the \(\beta \log Z(x)\) terms cancel because both responses share the same prompt):

\[P(\text{chosen} > \text{rejected}) = \sigma\left(\beta \log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\]

The DPO loss:

\[\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)}\right)\right]\]

What This Means

RLHF:

  1. Train reward model
  2. Generate samples
  3. Compute rewards
  4. PPO update
  5. Repeat

DPO:

  1. Compute log probs under policy and reference
  2. Apply DPO loss
  3. Update

That's it. No reward model. No RL. No generation during training.

Implementation

import numpy as np


class DPOTrainer:
    """Direct Preference Optimization trainer."""

    def __init__(self, beta: float = 0.1):
        self.beta = beta

    def compute_loss(
        self,
        policy_chosen_logps: np.ndarray,    # log π(y_w|x)
        policy_rejected_logps: np.ndarray,  # log π(y_l|x)
        ref_chosen_logps: np.ndarray,       # log π_ref(y_w|x)
        ref_rejected_logps: np.ndarray,     # log π_ref(y_l|x)
    ):
        # Log ratios of policy to reference for each response
        chosen_logratios = policy_chosen_logps - ref_chosen_logps
        rejected_logratios = policy_rejected_logps - ref_rejected_logps

        # Implicit rewards (scaled by beta)
        chosen_rewards = self.beta * chosen_logratios
        rejected_rewards = self.beta * rejected_logratios

        # DPO loss: -mean(log sigmoid(logits)), written with logaddexp
        # for numerical stability (-log σ(z) = log(1 + e^{-z}))
        logits = chosen_rewards - rejected_rewards
        loss = np.mean(np.logaddexp(0.0, -logits))

        # Accuracy: how often the implicit reward prefers the chosen response
        accuracy = np.mean(logits > 0)

        return loss, accuracy
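
A quick sanity check on toy inputs (the log-probabilities below are made up for illustration; in practice they come from summing per-token log-probs, as sketched later):

trainer = DPOTrainer(beta=0.1)
loss, acc = trainer.compute_loss(
    policy_chosen_logps=np.array([-12.0, -8.5]),
    policy_rejected_logps=np.array([-15.0, -9.0]),
    ref_chosen_logps=np.array([-13.0, -8.7]),
    ref_rejected_logps=np.array([-14.0, -8.8]),
)
print(f"loss={loss:.3f}, accuracy={acc:.2f}")  # loss ≈ 0.64, accuracy = 1.00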

The DPO Training Loop

# Framework-agnostic pseudocode: freeze, log_prob, and backward stand in for your
# framework's equivalents; policy_model, preference_data, optimizer, and `import copy`
# are assumed to exist elsewhere.

# Freeze a copy of the SFT model as the reference; it is never updated
ref_model = copy.deepcopy(policy_model)
ref_model.freeze()

for batch in preference_data:
    prompt = batch.prompt
    chosen = batch.chosen
    rejected = batch.rejected

    # Compute sequence log probs (no generation needed!)
    # Reference log probs should be computed without gradients.
    policy_chosen_logps = policy_model.log_prob(prompt, chosen)
    policy_rejected_logps = policy_model.log_prob(prompt, rejected)
    ref_chosen_logps = ref_model.log_prob(prompt, chosen)
    ref_rejected_logps = ref_model.log_prob(prompt, rejected)

    # DPO loss
    loss, accuracy = dpo_trainer.compute_loss(
        policy_chosen_logps,
        policy_rejected_logps,
        ref_chosen_logps,
        ref_rejected_logps,
    )

    # Standard supervised update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
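
The log_prob calls hide one detail: the sequence log-probability is the sum of per-token log-probabilities of the response given the prompt, with prompt and padding positions masked out. A minimal NumPy sketch under that assumption (the function name and argument layout are illustrative, not a library API):

def sequence_log_prob(token_logits: np.ndarray, response_ids: np.ndarray,
                      response_mask: np.ndarray) -> float:
    """Sum of log-probabilities of the response tokens.

    token_logits:  (seq_len, vocab_size) logits at the positions predicting each token
    response_ids:  (seq_len,) the token ids that actually appeared
    response_mask: (seq_len,) 1 for response tokens, 0 for prompt/padding
    """
    # Log-softmax over the vocabulary at each position (max-shifted for stability)
    shifted = token_logits - token_logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

    # Pick the log-prob of the token that actually appeared at each position
    token_logps = log_probs[np.arange(len(response_ids)), response_ids]

    # Sum only over response positions
    return float((token_logps * response_mask).sum())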

Why DPO Works

Mathematical Equivalence

DPO optimizes the same objective as RLHF:

\[\max_\pi \mathbb{E}[r(x,y)] - \beta \cdot \text{KL}(\pi || \pi_{\text{ref}})\]

But directly, without the RL machinery.

Implicit Reward

The trained policy implicitly defines a reward (up to a prompt-dependent constant that cancels in comparisons):

\[r(x,y) = \beta \log \frac{\pi(y|x)}{\pi_{\text{ref}}(y|x)}\]

You can extract this for analysis if needed.
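
Extracting it is just two forward passes, reusing the log_prob calls from the training loop (prompt and response are placeholders here):

# Implicit reward of a response, up to a prompt-dependent constant
implicit_reward = beta * (
    policy_model.log_prob(prompt, response)
    - ref_model.log_prob(prompt, response)
)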

The Beta Parameter

\(\beta\) controls the trade-off:

Beta           Effect
High (0.5+)    Stay close to the reference; conservative updates
Medium (0.1)   Balanced (common default)
Low (0.01)     Aggressively optimize preferences; may diverge

Default recommendation: Start with \(\beta = 0.1\).
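
A toy illustration of why this is the case: for the same chosen-minus-rejected log-ratio margin, a larger \(\beta\) saturates the loss at a much smaller margin, so the policy needs to drift less from the reference (the numbers below are purely illustrative):

margins = np.array([0.5, 2.0, 10.0])  # chosen-minus-rejected log-ratio margins
for beta in (0.01, 0.1, 0.5):
    losses = np.logaddexp(0.0, -beta * margins)  # per-pair DPO loss
    print(f"beta={beta}: losses={np.round(losses, 3)}")
# beta=0.5 is near zero loss at margin 10; beta=0.01 barely drops,
# so optimization keeps pushing the policy further from the reference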

DPO vs RLHF

Aspect                       RLHF                      DPO
Reward model                 Required                  Not needed
RL loop                      Yes (PPO)                 No
Generation during training   Yes                       No
Models needed                3 (policy, ref, reward)   2 (policy, ref)
Stability                    Tricky                    Very stable
Sample efficiency            Low                       High
Flexibility                  High                      Medium

When to Choose DPO

Use DPO when:

  • You have offline preference data
  • You want simplicity and stability
  • You don't need online learning

Use RLHF when:

  • You need online learning from new preferences
  • You want to use custom reward signals
  • You have complex reward functions

Variations

IPO (Identity Preference Optimization)

Addresses DPO's tendency to overfit when preferences are nearly deterministic, by regressing the log-ratio margin toward a fixed target instead of pushing it ever larger:

\[\mathcal{L}_{\text{IPO}} = \mathbb{E}\left[\left(\log \frac{\pi(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \log \frac{\pi(y_l|x)}{\pi_{\text{ref}}(y_l|x)} - \frac{1}{2\beta}\right)^2\right]\]
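
A minimal sketch of this loss in the same NumPy style as compute_loss above (squared error toward a margin of \(1/(2\beta)\) instead of a log-sigmoid):

def ipo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Same log-ratio margin as DPO ...
    margin = ((policy_chosen_logps - ref_chosen_logps)
              - (policy_rejected_logps - ref_rejected_logps))
    # ... but regressed toward the fixed target 1/(2*beta) rather than pushed to infinity
    return float(np.mean((margin - 1.0 / (2.0 * beta)) ** 2))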

KTO (Kahneman-Tversky Optimization)

Works with binary feedback (good/bad) instead of comparisons.

ORPO (Odds Ratio Preference Optimization)

Adds an odds-ratio preference term directly to the SFT loss, giving a single supervised objective with no separate reference model.

Practical Tips

1. Reference Model

Must be frozen. Usually the SFT model.

2. Learning Rate

Typically lower than SFT (1e-6 to 1e-5).

3. Batch Size

Larger batches help with stability.

4. Data Quality

DPO is very sensitive to preference data quality.

5. Label Smoothing

Can help prevent overconfidence:

labels = 1.0 - label_smoothing  # e.g., 0.99 instead of 1.0
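
In loss form, one common way to apply this (sometimes called conservative DPO) trusts each preference label only with probability \(1 - \epsilon\); a sketch in the same NumPy style, with logits as in compute_loss:

def dpo_loss_label_smoothed(logits: np.ndarray, eps: float = 0.01) -> float:
    """DPO loss where each preference label is trusted with probability 1 - eps."""
    # -[(1 - eps) * log σ(z) + eps * log σ(-z)], written with logaddexp for stability
    per_pair = (1.0 - eps) * np.logaddexp(0.0, -logits) + eps * np.logaddexp(0.0, logits)
    return float(np.mean(per_pair))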

Metrics to Track

Metric            What It Tells You
Loss              Training progress
Accuracy          How often the implicit reward prefers the chosen response
Chosen reward     Implicit reward (beta × log ratio) of the chosen response
Rejected reward   Implicit reward (beta × log ratio) of the rejected response
Reward margin     Chosen minus rejected (should increase over training)
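
All of these fall out of quantities already computed in compute_loss; a sketch of a batch-level metrics helper with the same inputs:

def dpo_metrics(policy_chosen_logps, policy_rejected_logps,
                ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    logits = chosen_rewards - rejected_rewards
    return {
        "loss": float(np.mean(np.logaddexp(0.0, -logits))),
        "accuracy": float(np.mean(logits > 0)),
        "chosen_reward": float(np.mean(chosen_rewards)),
        "rejected_reward": float(np.mean(rejected_rewards)),
        "reward_margin": float(np.mean(logits)),
    }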

Summary

Concept       Description
Core idea     The optimal RLHF policy has a closed form
Loss          Negative log-sigmoid of the implicit reward difference
Beta          KL penalty strength
Advantage     Simpler and more stable than RLHF
Limitation    Offline only (no online learning from new preferences)

Key insight: DPO shows that alignment doesn't require RL—it can be done with supervised learning.

Next: We'll compare the methods and discuss when to use each.