# Section 10.2: Reward Modeling
Reading time: 12 minutes
## The Core Idea
We want to optimize for "human preferences," but neural networks need numbers. Reward models bridge this gap:
**Reward model:** a function \(r(x, y)\) that predicts how much humans would prefer response \(y\) as an answer to prompt \(x\).
## Preference Data
The fundamental unit of alignment data:
```python
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # The preferred response
    rejected: str  # The less preferred response
```
Human annotators see both responses and choose which is better.
### Example

- **Prompt:** "How do I cook pasta?"
- **Response A:** "Boil water, add pasta, cook 8-10 minutes, drain."
- **Response B:** "Pasta is a type of Italian food made from wheat."
- **Human preference:** A > B (A is more helpful)
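The same example as a `PreferencePair` instance:

```python
pair = PreferencePair(
    prompt="How do I cook pasta?",
    chosen="Boil water, add pasta, cook 8-10 minutes, drain.",
    rejected="Pasta is a type of Italian food made from wheat.",
)
```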
## The Bradley-Terry Model

How do we turn preferences into a trainable objective?

The Bradley-Terry model says:

\[
P(y_A \succ y_B \mid x) = \sigma\big(r(x, y_A) - r(x, y_B)\big)
\]

where \(\sigma\) is the sigmoid function.
Intuition: If A has higher reward, it's more likely to be preferred.
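For example, with illustrative rewards \(r_A = 2.0\) and \(r_B = 0.5\):

\[
P(A \succ B) = \sigma(2.0 - 0.5) = \sigma(1.5) \approx 0.82
\]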
The Loss Function¶
Given preference pair (chosen, rejected), we want:
Loss:
This is just binary cross-entropy where we predict "chosen > rejected."
### Gradient

Differentiating with respect to each reward:

\[
\frac{\partial \mathcal{L}}{\partial r_c} = -\big(1 - \sigma(r_c - r_r)\big),
\qquad
\frac{\partial \mathcal{L}}{\partial r_r} = 1 - \sigma(r_c - r_r)
\]

The gradient pushes \(r_c\) up and \(r_r\) down until the model correctly predicts preferences, shrinking toward zero as \(\sigma(r_c - r_r) \to 1\).
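A quick finite-difference check of that derivative (a standalone sketch, not part of the training code):

```python
import numpy as np

def bt_loss(r_c: float, r_r: float) -> float:
    # -log(sigmoid(r_c - r_r)), written stably as log(1 + exp(-(r_c - r_r)))
    return np.logaddexp(0.0, -(r_c - r_r))

r_c, r_r, eps = 1.2, 0.3, 1e-6
sigmoid = 1 / (1 + np.exp(-(r_c - r_r)))
analytic = -(1 - sigmoid)                                              # dL/dr_c
numeric = (bt_loss(r_c + eps, r_r) - bt_loss(r_c - eps, r_r)) / (2 * eps)
print(analytic, numeric)  # the two values should agree to ~1e-9
```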
## Architecture
A reward model typically:
- Takes the same architecture as the language model
- Replaces the output head with a scalar "reward head"
- Uses the final token's representation
```
Input: [prompt, response]
        ↓
Transformer (shared with LM)
        ↓
Final token representation
        ↓
Linear(d_model, 1)
        ↓
Scalar reward
```
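A minimal PyTorch sketch of this design. The `backbone` module is a stand-in for the pretrained transformer and is assumed to return per-token hidden states of shape `(batch, seq_len, d_model)`; all names here are illustrative.

```python
import torch
import torch.nn as nn

class TransformerRewardModel(nn.Module):
    """LM backbone with the vocabulary head replaced by a scalar reward head."""

    def __init__(self, backbone: nn.Module, d_model: int):
        super().__init__()
        self.backbone = backbone                   # pretrained transformer (assumed)
        self.reward_head = nn.Linear(d_model, 1)   # scalar "reward head"

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids)          # (batch, seq_len, d_model)
        last = hidden[:, -1, :]                    # final token's representation
        return self.reward_head(last).squeeze(-1)  # (batch,) scalar rewards
```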
## Implementation
```python
import numpy as np

class RewardModel:
    """Reward model that predicts human preferences."""

    def __init__(self, input_dim: int, hidden_dim: int = 256):
        # Simple MLP reward head
        self.W1 = np.random.randn(input_dim, hidden_dim) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = np.random.randn(hidden_dim, 1) * 0.01
        self.b2 = np.zeros(1)

    def forward(self, x: np.ndarray) -> np.ndarray:
        """Compute a scalar reward for each input representation."""
        h = np.maximum(0, x @ self.W1 + self.b1)  # ReLU
        return h @ self.W2 + self.b2


def reward_loss(reward_chosen: np.ndarray, reward_rejected: np.ndarray) -> float:
    """Bradley-Terry preference loss: -log(sigmoid(r_chosen - r_rejected))."""
    diff = reward_chosen - reward_rejected
    # logaddexp(0, -diff) equals -log(sigmoid(diff)) but avoids overflow
    return float(np.mean(np.logaddexp(0, -diff)))
```
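A toy usage example with random "representations" standing in for real LM encodings (hypothetical data, just to show the shapes):

```python
rng = np.random.default_rng(0)
model = RewardModel(input_dim=64)

chosen_repr = rng.normal(size=(4, 64))    # pretend encodings of (prompt, chosen)
rejected_repr = rng.normal(size=(4, 64))  # pretend encodings of (prompt, rejected)

r_c = model.forward(chosen_repr)          # shape (4, 1)
r_r = model.forward(rejected_repr)
print(reward_loss(r_c, r_r))              # ~log(2) ≈ 0.693 at init, since rewards start near 0
```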
## Training Procedure

A PyTorch-style sketch of the training loop (assumes `encode`, `reward_model`, `optimizer`, and `preference_data` are defined elsewhere):

```python
import torch.nn.functional as F

for batch in preference_data:
    # Encode prompt + response pairs
    chosen_repr = encode(batch.prompt, batch.chosen)
    rejected_repr = encode(batch.prompt, batch.rejected)

    # Get scalar rewards
    r_chosen = reward_model(chosen_repr)
    r_rejected = reward_model(rejected_repr)

    # Bradley-Terry loss: -log(sigmoid(r_chosen - r_rejected))
    loss = -F.logsigmoid(r_chosen - r_rejected).mean()

    # Update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
## Data Collection
### Human Annotation Process

1. Sample prompts from user queries
2. Generate multiple responses from the model
3. Present pairs to annotators
4. Collect preferences
### Guidelines
Annotators typically receive instructions like:
- "Choose the response that is more helpful"
- "Prefer responses that are accurate and honest"
- "Avoid responses that are harmful or offensive"
### Quality Control
- Multiple annotators per comparison
- Inter-annotator agreement metrics (a minimal version is sketched after this list)
- Clear guidelines and training
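The simplest agreement metric is the raw agreement rate between two annotators; production pipelines often prefer chance-corrected statistics such as Cohen's kappa. A minimal sketch:

```python
import numpy as np

def pairwise_agreement(labels_a: np.ndarray, labels_b: np.ndarray) -> float:
    """Fraction of comparisons where two annotators picked the same response.

    labels_*: 0/1 arrays over the same comparisons (1 = "first response preferred").
    """
    return float(np.mean(labels_a == labels_b))
```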
## What Makes a Good Reward Model?
### Generalization
Should work on prompts/responses not seen during training.
### Calibration

Reward gaps should translate into predictable win rates: \(\sigma(r_A - r_B)\) should match how often humans actually prefer A to B.
### Robustness
Shouldn't be easily fooled by surface features:
- Length (longer ≠ better)
- Confidence (certain ≠ correct)
- Verbosity (more words ≠ more helpful)
## Common Pitfalls
### 1. Length Bias

Models often prefer longer responses. Fix: normalize rewards by length or train on length-balanced data. A quick diagnostic is sketched below.
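One cheap check, assuming you already have held-out rewards and token counts:

```python
import numpy as np

def length_reward_correlation(rewards: np.ndarray, lengths: np.ndarray) -> float:
    """Pearson correlation between reward and response length (in tokens).

    A strong positive correlation on held-out data is a red flag for length bias.
    """
    return float(np.corrcoef(rewards, lengths)[0, 1])
```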
### 2. Sycophancy
Reward models might prefer responses that agree with the user, even when wrong.
### 3. Reward Hacking

The policy might find ways to maximize reward without actually being helpful.
Example: Adding "I hope this helps!" increases reward without improving quality.
### 4. Distribution Shift

The reward model is trained on responses generated by one model snapshot, but it must later score diverse user queries and, during RLHF, outputs from a policy that keeps changing; accuracy degrades off-distribution.
## Reward Model Evaluation
### Accuracy

On held-out preference pairs, measure how often the model ranks the chosen response higher:

\[
\text{accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\big[\, r(x_i, y_{c,i}) > r(x_i, y_{r,i}) \,\big]
\]

Target: 70%+ (50% is random guessing).
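In code, assuming precomputed rewards on a held-out set:

```python
import numpy as np

def preference_accuracy(r_chosen: np.ndarray, r_rejected: np.ndarray) -> float:
    """Fraction of held-out pairs where the chosen response scores higher."""
    return float(np.mean(r_chosen > r_rejected))
```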
### Calibration
Does high reward actually mean human preference?
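One way to check, assuming per-pair annotator vote fractions are available (a hypothetical setup; real pipelines vary):

```python
import numpy as np

def calibration_report(r_chosen, r_rejected, human_agreement, n_bins=5):
    """Compare predicted win probability sigma(r_c - r_r) to the observed
    fraction of annotators who preferred the chosen response."""
    pred = 1 / (1 + np.exp(-(r_chosen - r_rejected)))
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (pred >= lo) & (pred < hi)
        if mask.any():
            print(f"pred [{lo:.1f}, {hi:.1f}): "
                  f"mean predicted {pred[mask].mean():.2f}, "
                  f"observed {human_agreement[mask].mean():.2f}")
```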
### Qualitative Analysis
Manually inspect high-reward and low-reward responses.
## From Reward Model to Policy

Once we have a reward model, we can:

- **RLHF**: Use RL to optimize the policy for reward
- **Best-of-N**: Generate N responses, pick the one with the highest reward (sketched below)
- **DPO**: Skip the reward model entirely (next section)
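Best-of-N is the simplest to implement. A sketch, assuming hypothetical `generate` and `score` helpers for sampling from the policy and querying the reward model:

```python
def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n candidate responses and return the one the reward model rates highest."""
    candidates = [generate(prompt) for _ in range(n)]  # hypothetical policy sampler
    rewards = [score(prompt, c) for c in candidates]   # hypothetical reward query
    return candidates[rewards.index(max(rewards))]
```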
## Summary
| Component | Purpose |
|---|---|
| Preference pairs | Training data format |
| Bradley-Terry | Turn preferences into probabilities |
| Reward head | Output scalar reward |
| BCE loss | Train to predict preferences |
**Key insight:** Reward models distill human preferences into a trainable signal.
Next: We'll use the reward model to train policies with RLHF.