Section 10.1: The Alignment Problem

Reading time: 10 minutes

What Goes Wrong?

A language model trained only on next-token prediction learns to complete text. Not to be helpful. Not to be safe.

Example 1: Following Harmful Instructions

User: "How do I make a bomb?"

Unaligned model: [Provides detailed instructions]

The model learned from text that includes such information. It's just completing the pattern.

Example 2: Confident Falsehoods

User: "What year was the Eiffel Tower built?"

Unaligned model: "The Eiffel Tower was built in 1892 and stands 350 meters tall."

Wrong on both counts (the tower was completed in 1889 and stands about 330 meters including antennas), but stated with complete confidence.

Example 3: Unhelpful Responses

User: "Can you help me write an email?"

Unaligned model: "Email is a method of electronic communication..."

Technically relevant, completely useless.

The Gap

Pre-training learns:

\[P(\text{next token} \mid \text{context})\]

What we want:

\[P(\text{helpful, honest, harmless response} \mid \text{user intent})\]

These objectives are not the same.
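
To make the gap concrete, here is a minimal NumPy sketch of the pre-training objective (the array names are illustrative, not from any specific codebase). Nothing in this loss mentions helpfulness, honesty, or harm; the model is rewarded only for matching the next token in the training text.

```python
# Minimal sketch of the pre-training objective: next-token cross-entropy.
import numpy as np

def next_token_loss(logits: np.ndarray, targets: np.ndarray) -> float:
    """Average negative log-likelihood of the true next token.

    logits:  (seq_len, vocab_size) unnormalized scores from the model
    targets: (seq_len,) integer ids of the actual next tokens
    """
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
    token_probs = probs[np.arange(len(targets)), targets]     # P(actual next token | context)
    return float(-np.log(token_probs).mean())

# Toy usage: 3 positions, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
print(next_token_loss(rng.normal(size=(3, 5)), np.array([1, 4, 2])))
```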

Why Can't We Just Use Rules?

Attempt 1: "Never output harmful content"

Problem: What counts as "harmful"? Is a chemistry textbook harmful? A security research paper?

Attempt 2: "Always be helpful"

Problem: Helping with harmful requests is itself harmful.

Attempt 3: "Be accurate"

Problem: Models don't know what they don't know.

The RLHF Insight

Instead of encoding rules, learn preferences from humans.

Human preferences capture:

  • Context-dependent judgments
  • Trade-offs between competing values
  • Cultural and situational nuances

Things that are nearly impossible to specify with rules.
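
To see how raw comparisons become a training signal, here is a minimal sketch of the Bradley-Terry formulation commonly used to fit a reward model to pairwise preferences (the next section covers reward models in detail; the function names here are illustrative):

```python
# Minimal sketch: turning "A is better than B" judgments into a trainable loss.
# Bradley-Terry model: P(A preferred over B) = sigmoid(reward(A) - reward(B)).
import math

def preference_probability(reward_a: float, reward_b: float) -> float:
    """Probability that a human prefers response A over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))

def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's observed choice.

    Minimizing this pushes the learned reward for the chosen response
    above the reward for the rejected one.
    """
    return -math.log(preference_probability(reward_chosen, reward_rejected))

# Example: the model currently scores the rejected answer higher,
# so the loss is large and training would correct the ordering.
print(preference_loss(reward_chosen=0.2, reward_rejected=1.1))  # ~1.24
```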

The Three H's

Modern alignment targets three goals:

1. Helpful

The model should:

  • Understand what the user actually wants
  • Provide useful, actionable responses
  • Complete tasks effectively

2. Harmless

The model should:

  • Refuse genuinely harmful requests
  • Not produce toxic content
  • Consider second-order effects

3. Honest

The model should:

  • Express uncertainty when appropriate
  • Not hallucinate facts
  • Acknowledge limitations

The Training Pipeline

1. Pre-training (Stages 1-6)
   Raw language model (predicts tokens)

2. Supervised Fine-Tuning (SFT)
   Model follows instructions

3. Alignment (RLHF or DPO)
   Model is helpful, harmless, honest

Each stage builds on the previous.
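
The sketch below lays the pipeline out as plain Python stubs. It is purely schematic: the function names and string placeholders are illustrative, not a real training API.

```python
# Schematic view of the three training stages; nothing here trains a real model.

def pretrain(corpus):
    """Stage 1: learn P(token | context) from raw text."""
    return {"stage": "base", "objective": "next-token prediction"}

def supervised_finetune(model, demonstrations):
    """Stage 2: imitate human-written (prompt, response) pairs."""
    return {**model, "stage": "sft", "objective": "imitate demonstrations"}

def align(model, preference_pairs):
    """Stage 3: optimize against human preferences (RLHF or DPO)."""
    return {**model, "stage": "aligned", "objective": "helpful, harmless, honest"}

model = pretrain(corpus="web-scale text")
model = supervised_finetune(model, demonstrations="instruction data")
model = align(model, preference_pairs="A-vs-B comparisons")
print(model)  # {'stage': 'aligned', 'objective': 'helpful, harmless, honest'}
```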

What Alignment Actually Changes

Before Alignment

User: Write me a poem about war
Model: [Writes any poem about war, possibly glorifying violence]

After Alignment

User: Write me a poem about war
Model: [Writes a thoughtful poem about war's human cost]

The model's capabilities are similar, but its choices are different.

The Role of Human Feedback

Human annotators provide:

Preference comparisons: "Response A is better than Response B"

This is easier than:

  • Defining "better" mathematically
  • Rating responses on absolute scales
  • Specifying all possible edge cases

Humans are far better at comparing two responses than at scoring a single response on an absolute scale. RLHF leverages that.
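
A preference dataset is therefore just a collection of records like the one below. This is a sketch; the field names are illustrative rather than any particular dataset's schema.

```python
# Sketch of a single preference record (field names are illustrative).
from dataclasses import dataclass

@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator ranked lower

example = PreferencePair(
    prompt="Can you help me write an email?",
    chosen="Sure. Who is the email for, and what should it say?",
    rejected="Email is a method of electronic communication...",
)
# Note what is absent: no absolute scores, no rules, just a comparison.
```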

Challenges

1. Preference Inconsistency

Different humans have different preferences. Even the same human might be inconsistent.

Solution: Use many annotators, average preferences.

2. Reward Hacking

Models can find unexpected ways to maximize reward without actually being helpful.

Solution: A KL penalty that keeps the policy close to the reference model, plus diverse evaluation (see the sketch below, after challenge 4).

3. Specification Gaming

"Be concise" might lead to responses that are too short.

Solution: Multi-objective optimization, careful reward design.

4. Distributional Shift

Training preferences might not match deployment scenarios.

Solution: Diverse training data, robust evaluation.
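
To illustrate the KL penalty mentioned under challenge 2, here is a minimal sketch. The beta value and log-probabilities are placeholders; real implementations estimate the KL term per token inside the RL training loop.

```python
# Minimal sketch of a KL-penalized reward (all numbers below are placeholders).
def penalized_reward(task_reward, logprobs_policy, logprobs_reference, beta=0.1):
    """Reward-model score minus a penalty for drifting from the reference model.

    The KL estimate keeps the policy close to the SFT model, which makes it
    harder to exploit quirks of the reward model.
    """
    kl_estimate = sum(p - r for p, r in zip(logprobs_policy, logprobs_reference))
    return task_reward - beta * kl_estimate

# The policy assigns much higher log-probability to its output than the
# reference model does, so the penalty offsets part of the reward-model score.
print(penalized_reward(1.5, [-0.2, -0.1], [-1.0, -0.9]))  # 1.5 - 0.1*1.6 = 1.34
```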

Success Story: InstructGPT

OpenAI's InstructGPT (2022) showed:

| Model | Training | Human Preference |
|---|---|---|
| GPT-3 (175B) | Pre-training only | Baseline |
| InstructGPT (1.3B) | Pre-training + RLHF | 85% preferred |

A 100x smaller model was preferred because it was aligned.

Key insight: Alignment > Scale (for user preference)

Summary

| Problem | Root Cause | Solution |
|---|---|---|
| Harmful outputs | Training on harmful data | Learn to refuse |
| Unhelpful responses | Wrong objective | Optimize for helpfulness |
| Confident errors | No uncertainty signal | Learn to express uncertainty |

Key insight: Alignment bridges the gap between "predicts well" and "behaves well."

Next: We'll learn how reward models capture human preferences.