Section 9.1: The Fine-Tuning Problem¶
Reading time: 10 minutes
Why Fine-Tune?¶
Pretrained LLMs know a lot about language but nothing about your specific task:
- They don't know your company's terminology
- They don't follow your preferred output format
- They haven't seen your private data
Fine-tuning adapts a pretrained model to your specific needs.
The Naive Approach: Full Fine-Tuning¶
Just train all parameters on your data:
# PyTorch-style sketch: every parameter is trainable and gets its own gradient
for batch in task_data:
    loss = model(batch)                            # forward pass returns the loss
    loss.backward()                                # gradients for ALL parameters
    with torch.no_grad():
        for param in model.parameters():
            param -= learning_rate * param.grad    # update ALL parameters
            param.grad = None
The Problems¶
1. Memory
A 7B parameter model in fp32:
- Weights: 7B × 4 bytes = 28GB
- Gradients: 7B × 4 bytes = 28GB
- Optimizer states (Adam): 7B × 8 bytes = 56GB
- Total: ~112GB just for training!
Even with fp16/bf16, you need expensive hardware.
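As a back-of-envelope check, here is the same arithmetic as a few lines of Python (the byte counts are the fp32/Adam assumptions above):

```python
PARAMS = 7e9      # 7B parameters
FP32 = 4          # bytes per value in fp32

weights = PARAMS * FP32        # 28 GB of weights
grads   = PARAMS * FP32        # 28 GB of gradients
adam    = PARAMS * FP32 * 2    # 56 GB: Adam keeps two extra values (m, v) per parameter

print(f"~{(weights + grads + adam) / 1e9:.0f} GB")   # ~112 GB
```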
2. Storage
Each fine-tuned version is a complete copy:
- Customer support bot: 28GB
- Code assistant: 28GB
- Medical Q&A: 28GB
- 10 tasks = 280GB of model storage
3. Catastrophic Forgetting
Fine-tuning too aggressively destroys general capabilities:
Before: "What is the capital of France?" → "Paris"
After: "What is the capital of France?" → "Our return policy..."
The model forgets what it knew.
4. Overfitting
With billions of trainable parameters and only a small task dataset, massive overfitting is almost guaranteed.
The Core Insight¶
Research by Aghajanyan et al. (2021) showed:
The intrinsic dimensionality of fine-tuning is remarkably low.
What does this mean?
The difference between the pretrained weights and the fine-tuned weights lives in a low-dimensional subspace:

\(W_{\text{fine-tuned}} = W_{\text{pretrained}} + \Delta W\)

where \(\Delta W\) can be represented with far fewer than \(m \times n\) parameters.
Visualizing the Insight¶
Imagine weight space as a high-dimensional landscape:
Full parameter space (7B dimensions)
                  │
                  ▼
╔══════════════════════════════════════════╗
║                                          ║
║   ● pretrained                           ║
║    \                                     ║
║     \──────▶ fine-tuned                  ║
║      (small update)                      ║
║                                          ║
╚══════════════════════════════════════════╝
The actual update path is **low-rank**: it doesn't explore most of the 7B dimensions.
The Solution: PEFT¶
Instead of updating all parameters, update only:
- A small number of new parameters (LoRA, adapters)
- Soft prompt embeddings prepended to the input (prompt tuning)
- Trainable prefix vectors added to each layer's attention keys and values (prefix tuning)
These methods exploit the low intrinsic dimensionality of fine-tuning.
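A minimal sketch of that shared recipe (illustrative code, not any particular library's API; the bottleneck adapter below is just a stand-in for whatever small module a given method adds):

```python
import torch
import torch.nn as nn

base_model = nn.Linear(4096, 4096)   # stand-in for one pretrained layer of the LLM

# 1. Freeze every pretrained parameter: no gradients, no optimizer state, never updated.
for param in base_model.parameters():
    param.requires_grad = False

# 2. Add a small new trainable module (here: a tiny bottleneck adapter).
adapter = nn.Sequential(nn.Linear(4096, 16), nn.ReLU(), nn.Linear(16, 4096))

# 3. The optimizer only ever sees the new parameters.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

trainable = sum(p.numel() for p in adapter.parameters())
total = trainable + sum(p.numel() for p in base_model.parameters())
print(f"trainable: {trainable:,} of {total:,} ({100 * trainable / total:.2f}%)")
```

How the new parameters are wired into the forward pass is exactly what distinguishes LoRA, adapters, and prompt/prefix tuning from one another.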
Comparison¶
| Approach | Trainable Params | Memory | Storage per Task | Risk |
|---|---|---|---|---|
| Full fine-tuning | 100% | Very high | Full model | High |
| LoRA | ~0.1% | Low | ~10MB | Low |
| Adapters | ~1% | Medium | ~100MB | Low |
| Prompt tuning | ~0.001% | Very low | ~100KB | Very low |
The PEFT Philosophy¶
Don't change what works—add what's needed.
The pretrained model already knows:
- Grammar and syntax
- World knowledge
- Reasoning patterns
Fine-tuning only needs to teach:
- Task-specific behavior
- Domain terminology
- Output format
This can be done with a tiny fraction of the parameters.
Mathematical Foundation¶
For a weight matrix \(W \in \mathbb{R}^{m \times n}\):
Full fine-tuning: Learn \(W' = W + \Delta W\) where \(\Delta W\) has \(mn\) parameters
LoRA: Learn \(W' = W + BA\) where:
- \(B \in \mathbb{R}^{m \times r}\)
- \(A \in \mathbb{R}^{r \times n}\)
- \(r \ll \min(m, n)\)
Parameters: \(r(m + n)\) instead of \(mn\)
Example:
- \(m = n = 4096\) (typical LLM dimension)
- \(r = 8\) (common LoRA rank)
- Full: \(4096 \times 4096 = 16.7M\) parameters
- LoRA: \(8 \times (4096 + 4096) = 65K\) parameters
- 256× reduction!
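The parameter counts are easy to verify with a small sketch of the \(W + BA\) parameterization (simplified: real LoRA also adds a scaling factor and specific initialization, covered in the next section):

```python
import torch
import torch.nn as nn

class LowRankUpdate(nn.Module):
    """y = x (W + BA)^T with W frozen; only the thin factors B and A are trained."""
    def __init__(self, m: int = 4096, n: int = 4096, r: int = 8):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, n), requires_grad=False)  # pretrained weight (frozen)
        self.B = nn.Parameter(torch.zeros(m, r))                       # m x r, trainable
        self.A = nn.Parameter(torch.randn(r, n) * 0.01)                # r x n, trainable

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ (self.W + self.B @ self.A).T

layer = LowRankUpdate()
full = layer.W.numel()                      # m * n       = 16,777,216
lora = layer.B.numel() + layer.A.numel()    # r * (m + n) = 65,536
print(f"full: {full:,}   LoRA: {lora:,}   reduction: {full // lora}x")   # 256x
```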
Summary¶
| Problem | PEFT Solution |
|---|---|
| Memory | Freeze most params, train few |
| Storage | Save only new params |
| Forgetting | Pretrained weights untouched |
| Overfitting | Far fewer params than data |
Key insight: Fine-tuning doesn't require exploring the full parameter space—just the right low-dimensional subspace.
Next: We'll dive into LoRA, the most popular PEFT method.