Section 9.2: LoRA (Low-Rank Adaptation)¶
Reading time: 15 minutes
The Big Idea¶
LoRA (Hu et al., 2021) is elegantly simple:
Instead of updating \(W\) directly, learn the update \(\Delta W\) as a product of two low-rank matrices:

\[
W' = W + \Delta W = W + BA
\]

Where:
- \(W \in \mathbb{R}^{d \times k}\): Frozen pretrained weights
- \(B \in \mathbb{R}^{d \times r}\): Trainable up-projection
- \(A \in \mathbb{R}^{r \times k}\): Trainable down-projection
- \(r \ll \min(d, k)\): The rank (typically 4-64)
Why It Works¶
The Low-Rank Hypothesis¶
The change in weights during fine-tuning has low intrinsic rank.
Think of it this way: fine-tuning adjusts the model slightly to perform a new task. These adjustments form patterns—they're not random noise.
Patterns can be captured with low-rank matrices.
Visual Intuition¶
```
Original W (4096 × 4096 = 16.7M params)
┌────────────────────────────────────────┐
│                                        │
│         [Full matrix - frozen]         │
│                                        │
└────────────────────────────────────────┘
                    +
LoRA: B × A  (r=8: 65K params)
┌────┐   ┌────────────────────────────────┐
│    │   │                                │
│ B  │ × │               A                │
│    │   │                                │
└────┘   └────────────────────────────────┘
(4096×8)  (8×4096)
```
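The savings are easy to verify with a quick back-of-the-envelope check (assuming a single 4096 × 4096 projection and \(r = 8\)):

```python
d = 4096
r = 8

full_params = d * d            # 16,777,216 frozen parameters
lora_params = d * r + r * d    # 65,536 trainable parameters (B plus A)

print(full_params, lora_params, full_params // lora_params)  # 256x fewer trainable params
```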
The Math¶
Forward Pass¶
Standard linear layer:

\[
h = Wx
\]

With LoRA:

\[
h = Wx + \frac{\alpha}{r} BAx
\]

Where \(\frac{\alpha}{r}\) is a scaling factor.
Why Scaling?¶
When you change rank \(r\), you want behavior to stay consistent.
- Higher \(r\): More capacity, but \(BA\) product changes magnitude
- \(\alpha\) controls the "strength" of adaptation
- Dividing by \(r\) normalizes across different ranks
Rule of thumb: Set \(\alpha = 2r\) (so scaling = 2)
Initialization¶
Critical for stable training:
- \(A\): Random Gaussian with small variance
- \(B\): Zeros
Why zeros for \(B\)? So that \(BA = 0\) at initialization. The model starts exactly as the pretrained model.
Implementation¶
```python
import numpy as np

class LoRALayer:
    """Low-Rank Adaptation layer."""

    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 8,
        alpha: float = 16.0,
    ):
        self.rank = rank
        self.scaling = alpha / rank

        # LoRA matrices
        self.A = np.random.randn(rank, in_features) * 0.01  # small Gaussian
        self.B = np.zeros((out_features, rank))              # start at zero!

    def forward(self, x: np.ndarray, W: np.ndarray) -> np.ndarray:
        """h = Wx + scaling * BAx"""
        base = x @ W.T                       # original path (frozen)
        lora = (x @ self.A.T) @ self.B.T     # LoRA path
        return base + self.scaling * lora

    def backward(self, grad_output: np.ndarray, x: np.ndarray, W: np.ndarray) -> np.ndarray:
        """Compute gradients for A and B only; W stays frozen."""
        Ax = x @ self.A.T                                 # (batch, rank)

        # Gradient w.r.t. B: scaling * grad^T @ (x A^T)
        self.B_grad = self.scaling * grad_output.T @ Ax

        # Gradient w.r.t. A: scaling * (grad B)^T @ x
        grad_Ax = grad_output @ self.B                    # (batch, rank)
        self.A_grad = self.scaling * grad_Ax.T @ x

        # Gradient w.r.t. the input flows through both paths
        return grad_output @ W + self.scaling * grad_Ax @ self.A

    def merge_weights(self, W: np.ndarray) -> np.ndarray:
        """Merge LoRA into base weights for inference."""
        return W + self.scaling * (self.B @ self.A)
```
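A quick sanity check of the layer above (the shapes and weight values here are illustrative, not from a real checkpoint):

```python
d = 4096
W = np.random.randn(d, d) * 0.02                 # stand-in for a frozen pretrained weight
lora = LoRALayer(in_features=d, out_features=d, rank=8, alpha=16.0)
x = np.random.randn(2, d)                        # a batch of 2 inputs

# At initialization B = 0, so the LoRA path contributes nothing
assert np.allclose(lora.forward(x, W), x @ W.T)

# Merging is exact: the merged weight reproduces the adapted forward pass
W_merged = lora.merge_weights(W)
assert np.allclose(lora.forward(x, W), x @ W_merged.T)
```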
Where to Apply LoRA¶
LoRA can be applied to any linear layer. Common choices:
Attention Layers¶
| Target | Effect |
|---|---|
| Query (Q) | How tokens attend |
| Key (K) | What tokens are attended to |
| Value (V) | What information is retrieved |
| Output (O) | How attention output is projected |
Most common: Apply to Q and V only.
Why Q and V?
- Q determines what the model looks for
- V determines what information is extracted
- K and O are often less important for task adaptation
Feed-Forward Layers¶
Less common, but can help for knowledge-intensive tasks.
Hyperparameters¶
Rank (r)¶
| Rank | Params (4096 dim) | Use Case |
|---|---|---|
| 4 | 32K | Simple tasks, small data |
| 8 | 65K | General purpose |
| 16 | 131K | Complex tasks |
| 64 | 524K | Very complex tasks |
Start with r=8. Increase if underfitting.
Alpha (α)¶
Controls adaptation strength:
- α = r: Scaling = 1 (conservative)
- α = 2r: Scaling = 2 (common default)
Rule: If using learning rate that worked for full fine-tuning, start with α = 2r.
Target Modules¶
Which layers to apply LoRA:
```python
# Conservative (fewer params, less risk)
target_modules = ['q_proj', 'v_proj']

# Aggressive (more params, more capacity)
target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                  'up_proj', 'down_proj']
```
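For reference, these choices map directly onto a configuration in the Hugging Face `peft` library (a sketch assuming `peft` and `transformers` are installed; the checkpoint name is illustrative and argument names may vary slightly across versions):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative checkpoint

config = LoraConfig(
    r=8,                                    # rank
    lora_alpha=16,                          # alpha = 2r, so scaling = 2
    target_modules=["q_proj", "v_proj"],    # conservative choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()          # reports roughly 0.1% trainable
```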
The Magic: Weight Merging¶
After training, you can merge LoRA weights into the base model:

\[
W_{\text{merged}} = W + \frac{\alpha}{r} BA
\]
Benefits:
- Zero inference overhead: Same speed as original model
- Deployable anywhere: Just swap the weights
- Stackable: Train multiple LoRAs, merge selectively
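Because the adapter is just an additive term, you can keep several task-specific LoRAs on disk and merge whichever one you need. A minimal sketch reusing the `LoRALayer` class above (adapter names and shapes are hypothetical):

```python
W = np.random.randn(4096, 4096) * 0.02           # shared frozen base weight (illustrative)

lora_summarize = LoRALayer(4096, 4096, rank=8)   # would be trained on summarization data
lora_translate = LoRALayer(4096, 4096, rank=8)   # would be trained on translation data

# Deploy by merging only the adapter you need; the base weights never change
W_for_summaries = lora_summarize.merge_weights(W)
W_for_translation = lora_translate.merge_weights(W)
```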
Training Procedure¶
```python
# 1. Load pretrained model
model = load_pretrained("llama-7b")

# 2. Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# 3. Add LoRA layers (wrap the Q and V projections)
for layer in model.attention_layers:
    layer.q_proj = LoRALinear(layer.q_proj, rank=8)
    layer.v_proj = LoRALinear(layer.v_proj, rank=8)

# 4. Train only the LoRA parameters
optimizer = Adam(model.lora_parameters(), lr=1e-4)

for batch in dataset:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()

# 5. Optionally merge for deployment
for layer in model.attention_layers:
    layer.q_proj = layer.q_proj.merge()
    layer.v_proj = layer.v_proj.merge()
```
Comparison to Full Fine-Tuning¶
| Aspect | Full Fine-Tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1% |
| GPU memory | Very high | Low |
| Training speed | Slower | Faster |
| Storage per task | Full model | ~10MB |
| Inference speed | Same | Same (merged) |
| Quality | Baseline | 95-100% |
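The "~10MB" storage figure follows from a quick calculation (assuming a 7B model with 32 layers, LoRA on Q and V only, \(r = 8\), and fp16 storage):

```python
layers = 32
d = 4096
r = 8
modules = 2                                        # q_proj and v_proj per layer

lora_params = layers * modules * (d * r + r * d)   # ≈ 4.2M parameters
size_mb = lora_params * 2 / 1e6                    # fp16 = 2 bytes per parameter

print(lora_params, size_mb)                        # ~4.2M params, ~8.4 MB on disk
```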
When LoRA Works Best¶
Good for:
- Instruction following
- Style adaptation
- Domain-specific tasks
- Limited compute/memory
Might need more than LoRA:
- Learning entirely new knowledge
- Very different output formats
- Extreme domain shift
Common Mistakes¶
- Forgetting to freeze base weights: Training everything defeats the purpose
- Rank too low: Underfitting. Try increasing rank.
- Rank too high: Overfitting. Also slower and more memory.
- Wrong learning rate: LoRA often needs higher LR than full fine-tuning (try 1e-4 to 1e-3)
- Not merging for production: Keep LoRA separate during experimentation, merge for deployment
Summary¶
| Component | Purpose |
|---|---|
| A matrix | Projects input to low-rank space |
| B matrix | Projects from low-rank space to output |
| Rank r | Controls capacity (trade-off) |
| Alpha α | Controls adaptation strength |
| Scaling | Normalizes across ranks |
| Merging | Eliminates inference overhead |
Key insight: LoRA exploits the fact that fine-tuning updates are inherently low-rank. By parameterizing them as such, we get massive efficiency gains with minimal quality loss.
Next: We'll explore adapters, another approach to parameter-efficient fine-tuning.