# Stage 10: Alignment

*Making models helpful, harmless, and honest*
## Overview
Pre-trained LLMs are powerful pattern matchers that predict next tokens. But prediction isn't enough—we want models that:
- Help users accomplish their goals
- Avoid harm even when prompted to cause it
- Are honest about uncertainty and limitations
This gap between "predicts well" and "behaves well" is the alignment problem.
"A model that predicts text perfectly might still write harmful content perfectly."
## The Core Challenge
**Pre-training objective:** Maximize \(P(\text{next token} \mid \text{context})\)

**What we actually want:** Maximize \(P(\text{human approves of response} \mid \text{context})\)
These are different! A model trained only on prediction might:
- Generate toxic content if that's what the context suggests
- Confidently state falsehoods
- Help with harmful requests
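To make the mismatch concrete, here is a minimal sketch of the pre-training objective, assuming PyTorch; the function name and tensor shapes are illustrative, not taken from the stage code. The loss only rewards matching the training distribution, so it contains no term that distinguishes helpful from harmful continuations.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token cross-entropy. logits: (batch, seq, vocab), tokens: (batch, seq).

    Nothing here scores helpfulness or honesty; the model is only rewarded
    for predicting whatever the training data would have said next.
    """
    pred = logits[:, :-1, :].reshape(-1, logits.size(-1))  # predict token t+1 from the prefix
    target = tokens[:, 1:].reshape(-1)
    return F.cross_entropy(pred, target)
```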
## The Solution: Preference Learning
Instead of defining "good behavior" with rules, we learn it from human preferences:
1. Show humans two model responses
2. Ask which is better
3. Train the model to produce more preferred responses
This is the foundation of both RLHF and DPO.
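In the reward-modeling step, this is typically formalized with a Bradley-Terry loss over preference pairs. A minimal sketch follows, assuming PyTorch; the function and argument names are illustrative. The model's scalar score for the preferred response is pushed above its score for the rejected one.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Scalar rewards for the preferred and rejected responses, shape (batch,).

    Bradley-Terry models P(chosen beats rejected) = sigmoid(r_chosen - r_rejected);
    minimizing the negative log-sigmoid maximizes that likelihood.
    """
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```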
## Methods We'll Cover
| Method | Approach | Complexity |
|---|---|---|
| Reward Modeling | Learn a scalar "goodness" score from human preference pairs | Medium |
| RLHF (PPO) | Optimize the policy against a learned reward model with RL | High |
| DPO | Optimize the policy directly on preference pairs, skipping the reward model and RL loop | Low |
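For contrast with the RLHF pipeline, here is a minimal sketch of the DPO loss, assuming PyTorch; the argument names are illustrative. The inputs are summed log-probabilities of each full response under the policy being trained and under a frozen reference model, and `beta` controls how far the policy may drift from that reference.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp: torch.Tensor,
             policy_rejected_logp: torch.Tensor,
             ref_chosen_logp: torch.Tensor,
             ref_rejected_logp: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """All inputs are per-example summed log-probs of a full response, shape (batch,)."""
    # Implicit reward of each response: beta * (log pi_policy - log pi_ref).
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # Same Bradley-Terry form as the reward model, but applied to implicit rewards.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```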
## Why This Matters
Alignment is what makes the difference between:
- A text generator and an assistant
- A pattern matcher and a helpful tool
- A liability and a product
Most "AI safety" concerns are really about alignment.
## Learning Objectives
By the end of this stage, you will:
- Understand why alignment is necessary
- Implement reward modeling with the Bradley-Terry model
- Understand RLHF and PPO basics
- Implement DPO from scratch
- Know when to use each approach
## Sections
- The Alignment Problem - Why prediction isn't enough
- Reward Modeling - Learning from preferences
- RLHF with PPO - Reinforcement learning approach
- Direct Preference Optimization - A simpler alternative
- Choosing a Method - Trade-offs and recommendations
- Implementation - Building alignment from scratch
## Prerequisites
- Understanding of neural network training (Stages 2-3)
- Familiarity with language model architecture (Stage 6)
- Basic probability and optimization concepts
## Key Insight
Alignment doesn't require solving the "hard problem" of defining what's good. It requires learning from human judgments—which humans are very good at providing.
## Code & Resources
| Resource | Description |
|---|---|
| `code/stage-10/alignment.py` | Reward Model, RLHF, and DPO |
| `code/stage-10/tests/` | Test suite |
| Exercises | Practice problems |
| Common Mistakes | Debugging guide |