17 Investigation: LoRA
The Low-Rank Structure of Fine-Tuning
GPT-3 has 175 billion parameters. Fine-tuning them requires 700+ GB of memory.
LoRA fine-tunes with 0.01% of the parameters—and often works just as well.
How is this possible?
This chapter is a case study in separability—the second property from our Algebraic Framework.
When a structure \(A\) can be expressed as \(UV^T\) where \(U\) and \(V\) are much smaller than \(A\), we have a separable (low-rank) structure. LoRA’s insight is that fine-tuning updates \(\Delta W\) are separable: they live in a low-dimensional subspace.
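As a tiny concrete illustration (a NumPy sketch, not tied to any particular model), a matrix built from two thin factors has rank at most \(r\) and needs far fewer numbers to store:
import numpy as np

d, k, r = 512, 512, 8
U = np.random.randn(d, r)
V = np.random.randn(k, r)
A = U @ V.T                             # separable: the product of two thin factors

print(np.linalg.matrix_rank(A))         # 8 — every column lives in an r-dimensional subspace
print(A.size, "vs", U.size + V.size)    # 262144 vs 8192 stored numbers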
This chapter investigates why fine-tuning is low-rank and how to exploit that structure.
17.1 The Puzzle
Let’s put numbers to the puzzle.
Full fine-tuning of a 7B model:
- Weights: 7B × 4 bytes = 28 GB
- Gradients: 28 GB
- Optimizer states (AdamW): 56 GB
- Total: ~112 GB minimum

LoRA fine-tuning of the same model (rank 8):
- Original weights (frozen): 28 GB (inference only, no gradients)
- LoRA adapters: ~17M parameters × 4 bytes ≈ 68 MB
- LoRA gradients: 68 MB
- LoRA optimizer states: 136 MB
- Total: ~28 GB (mostly the frozen inference weights)
LoRA reduces training memory by 4× while training 0.24% as many parameters.
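A back-of-the-envelope check of these figures (a sketch assuming FP32 storage for weights, gradients, and the two AdamW moment buffers; mixed-precision setups shift the exact numbers):
def training_memory_gb(total_params, trainable_params, bytes_per_value=4):
    """Rough estimate: weights + gradients + two AdamW moments for trainable params."""
    weights = total_params * bytes_per_value
    gradients = trainable_params * bytes_per_value
    optimizer_states = 2 * trainable_params * bytes_per_value
    return (weights + gradients + optimizer_states) / 1e9

print(f"Full fine-tuning: ~{training_memory_gb(7e9, 7e9):.0f} GB")   # ~112 GB
print(f"LoRA, rank 8:     ~{training_memory_gb(7e9, 17e6):.0f} GB")  # ~28 GB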
The remarkable part: this often works just as well, or better.
17.2 The Hypothesis
LoRA’s hypothesis: fine-tuning is low-rank.
When you fine-tune a pretrained model, the weight update \(\Delta W\) doesn’t need all \(d \times k\) degrees of freedom. It lives in a much lower-dimensional subspace.
Mathematically, instead of:
\[W_{new} = W_{pretrained} + \Delta W\]
where \(\Delta W \in \mathbb{R}^{d \times k}\) has \(dk\) parameters, we parameterize:
\[W_{new} = W_{pretrained} + BA\]
where \(B \in \mathbb{R}^{d \times r}\) and \(A \in \mathbb{R}^{r \times k}\), with \(r \ll \min(d, k)\).
Parameters: \(r(d + k)\) instead of \(dk\).
For a typical transformer linear layer (\(d = k = 4096\)) with rank 8:
- Full: 16.8M parameters
- LoRA: 65K parameters
- Compression: 256×
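A quick way to verify these counts (a sketch using plain nn.Linear modules to stand in for the dense update \(\Delta W\) and the two LoRA factors):
import torch.nn as nn

d, k, r = 4096, 4096, 8
dense_update = nn.Linear(k, d, bias=False)   # stands in for a full-rank ΔW
lora_A = nn.Linear(k, r, bias=False)         # A: r × k
lora_B = nn.Linear(r, d, bias=False)         # B: d × r

def count(module):
    return sum(p.numel() for p in module.parameters())

print(count(dense_update))                   # 16,777,216 ≈ 16.8M
print(count(lora_A) + count(lora_B))         # 65,536 ≈ 65K (256× fewer)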
17.3 Testing the Hypothesis
Before trusting LoRA, let’s test whether fine-tuning actually is low-rank.
17.3.1 Experiment 1: Singular Value Decay
Take a model, fine-tune it fully, and analyze the weight change:
import torch
import numpy as np
from transformers import AutoModelForCausalLM

# Load the model before and after full fine-tuning.
# "gpt2-finetuned-sst2" is a placeholder for your own fine-tuned checkpoint.
model_before = AutoModelForCausalLM.from_pretrained("gpt2")
model_after = AutoModelForCausalLM.from_pretrained("gpt2-finetuned-sst2")

def analyze_weight_change(name, before, after):
    """Analyze the rank structure of a weight change."""
    delta_W = (after - before).float().cpu().numpy()

    # SVD of the change
    U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)

    # Normalized singular values (handy if you want to plot the decay)
    S_norm = S / S[0] if S[0] > 0 else S

    # Effective rank at a given energy threshold
    energy = np.cumsum(S**2) / np.sum(S**2)
    rank_90 = np.searchsorted(energy, 0.90) + 1
    rank_99 = np.searchsorted(energy, 0.99) + 1

    print(f"{name}:")
    print(f"  Shape: {delta_W.shape}")
    print(f"  Rank for 90% energy: {rank_90}")
    print(f"  Rank for 99% energy: {rank_99}")
    print(f"  σ₁/σ₁₀: {S[0]/S[9]:.1f}")
    return S

# Analyze the attention layers
for (name, p_before), (_, p_after) in zip(
    model_before.named_parameters(),
    model_after.named_parameters(),
):
    if 'attn' in name and 'weight' in name:
        S = analyze_weight_change(name, p_before.data, p_after.data)

Typical results show dramatic singular value decay:
transformer.h.0.attn.c_attn.weight:
  Shape: (2304, 768)
  Rank for 90% energy: 4
  Rank for 99% energy: 12
  σ₁/σ₁₀: 47.3

transformer.h.5.attn.c_attn.weight:
  Shape: (2304, 768)
  Rank for 90% energy: 6
  Rank for 99% energy: 18
  σ₁/σ₁₀: 31.2
Observation: 90% of the fine-tuning update lives in a rank-4 to rank-6 subspace. The update is very low-rank.
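To connect the energy numbers to approximation quality, here is a small follow-up sketch: given a weight change delta_W computed as in analyze_weight_change above, it measures how well the best rank-\(r\) approximation (keeping only the top \(r\) singular triples) reproduces the full update:
def truncation_error(delta_W, r):
    """Relative Frobenius error of the best rank-r approximation of delta_W."""
    U, S, Vt = np.linalg.svd(delta_W, full_matrices=False)
    delta_W_r = (U[:, :r] * S[:r]) @ Vt[:r, :]   # keep only the top r singular triples
    return np.linalg.norm(delta_W - delta_W_r) / np.linalg.norm(delta_W)

for r in [1, 4, 8, 16]:
    print(f"rank {r:2d}: relative error {truncation_error(delta_W, r):.3f}")

By the Eckart–Young theorem this truncation is the best possible rank-\(r\) approximation, and a layer whose top four singular values carry 90% of the energy shows a relative error of roughly \(\sqrt{0.1} \approx 0.32\) at rank 4.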
17.3.2 Experiment 2: Intrinsic Dimensionality
A more rigorous test: what’s the minimum rank needed to match full fine-tuning?
def test_intrinsic_rank(model, dataset, ranks=[1, 2, 4, 8, 16, 32, 64]):
    """Test different LoRA ranks on the same task."""
    results = {}
    for rank in ranks:
        # Apply LoRA with the given rank (apply_lora stands in for the
        # helper defined in Section 17.5)
        lora_model = apply_lora(model, rank=rank)

        # Fine-tune
        trainer = Trainer(lora_model, dataset)
        trainer.train()

        # Evaluate
        accuracy = evaluate(lora_model, dataset)
        results[rank] = accuracy
        print(f"Rank {rank:3d}: {accuracy:.1%}")
    return results

# Results from Aghajanyan et al. (2020)
# Task: RTE (small classification dataset)
# Model: RoBERTa-base
#
# Rank   Accuracy
#    1   60.2%
#    4   64.5%
#    8   66.1%
#   64   67.0%
#  256   67.3%
# Full   67.5%   (768 × 768 = 589K per layer)

Key finding: Rank 64 captures 99.3% of full fine-tuning performance. Rank 8 captures 98%.
The intrinsic dimensionality of fine-tuning is much lower than the parameter count suggests.
17.4 Why Is Fine-Tuning Low-Rank?
Several factors explain this phenomenon:
17.4.1 1. Pretrained Knowledge
The pretrained model already knows general representations:
- Syntax and grammar
- World knowledge
- Reasoning patterns
Fine-tuning isn’t rebuilding these capabilities—it’s steering existing capabilities toward a specific task.
Steering requires fewer dimensions than building from scratch.
Analogy:
Building a house: Need full 3D freedom (millions of choices)
Rearranging furniture: Need only a few dimensions (positions, orientations)
Pretraining: Builds the house
Fine-tuning: Rearranges the furniture
17.4.2 2. Task-Specific Information is Sparse
Most fine-tuning tasks add relatively little new information:
- Classification: Learn to map existing features to labels
- Style transfer: Adjust generation patterns slightly
- Domain adaptation: Shift probabilities within the existing vocabulary
The “new” information is low-dimensional relative to the model’s total capacity.
17.4.3 3. Overparameterization
Large language models are massively overparameterized. They have far more parameters than needed to fit their training data (in an information-theoretic sense).
This redundancy means:
- Many weight configurations produce similar behavior
- Small perturbations (low-rank updates) can produce significant behavior changes
- The model is robust to low-rank approximations of the update
17.4.4 4. Implicit Regularization
LoRA’s low-rank constraint acts as regularization:
- Prevents overfitting to small fine-tuning datasets
- Encourages smooth interpolation between pretrained and fine-tuned behavior
- Often improves generalization
17.5 The LoRA Implementation
Here’s a complete, production-quality LoRA implementation:
import torch
import torch.nn as nn
import torch.nn.functional as F
import math

class LoRALayer(nn.Module):
    """
    Low-Rank Adaptation layer.

    Adds a low-rank update to a frozen linear layer.
    """
    def __init__(
        self,
        original_layer: nn.Linear,
        rank: int = 8,
        alpha: float = 16,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.original = original_layer
        self.rank = rank
        self.alpha = alpha

        in_features = original_layer.in_features
        out_features = original_layer.out_features

        # Freeze original weights
        for param in self.original.parameters():
            param.requires_grad = False

        # Low-rank matrices
        self.A = nn.Linear(in_features, rank, bias=False)
        self.B = nn.Linear(rank, out_features, bias=False)
        self.dropout = nn.Dropout(dropout) if dropout > 0 else nn.Identity()

        # Scaling factor
        self.scaling = alpha / rank

        # Initialize: A with Kaiming, B with zeros, so BA = 0 at the start
        nn.init.kaiming_uniform_(self.A.weight, a=math.sqrt(5))
        nn.init.zeros_(self.B.weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Original forward (frozen)
        original_output = self.original(x)

        # LoRA forward (trainable)
        lora_output = self.B(self.A(self.dropout(x)))

        return original_output + lora_output * self.scaling

    def merge_weights(self) -> nn.Linear:
        """Merge LoRA weights into the original layer for efficient inference."""
        merged = nn.Linear(
            self.original.in_features,
            self.original.out_features,
            bias=self.original.bias is not None,
        )
        # W_merged = W_original + B @ A * scaling
        delta_W = self.B.weight @ self.A.weight * self.scaling
        merged.weight.data = self.original.weight.data + delta_W
        if self.original.bias is not None:
            merged.bias.data = self.original.bias.data
        return merged

def apply_lora_to_model(model, target_modules=['q_proj', 'v_proj'], rank=8, alpha=16):
    """Apply LoRA to the specified linear modules in a model."""
    # Collect targets first so we don't mutate the module tree while iterating
    # (which would otherwise re-wrap the `original` layers we just inserted).
    targets = [
        (name, module)
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
        and any(target in name for target in target_modules)
    ]
    for name, module in targets:
        parent_name = '.'.join(name.split('.')[:-1])
        child_name = name.split('.')[-1]
        parent = model.get_submodule(parent_name)
        setattr(parent, child_name, LoRALayer(module, rank=rank, alpha=alpha))
    return model

17.5.1 Key Design Choices
Initialization: A is initialized with Kaiming, B with zeros.
This means at initialization, \(BA = 0\), so the model starts exactly at the pretrained weights. Training gradually adds the low-rank update.
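A quick sanity check of this property, reusing the imports and the LoRALayer class above (layer sizes are arbitrary):
base = nn.Linear(64, 64)
lora = LoRALayer(base, rank=8, alpha=16)

x = torch.randn(2, 64)
# B is zero-initialized, so the low-rank path contributes nothing at first
assert torch.allclose(lora(x), base(x))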
Scaling: The alpha / rank scaling controls the magnitude of updates.
Higher alpha → larger updates. Dividing by the rank keeps the effective magnitude of the update roughly constant as the rank changes, so you can compare different ranks without retuning the learning rate.
Which layers to adapt: Typically attention projections (Q, K, V, O).
The original LoRA paper found that adapting Q and V together gave the best results per parameter among the attention matrices; adapting K alone helped less. Later work often adapts the FFN projections as well when the parameter budget allows.
# Typical target modules for different architectures

# LLaMA-style decoders (LLaMA, Mistral)
target_modules = ['q_proj', 'v_proj']                      # Conservative
target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj']  # Full attention
target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                  'gate_proj', 'up_proj', 'down_proj']      # All linear layers

# BERT-style encoders
target_modules = ['query', 'value']                  # Conservative
target_modules = ['query', 'key', 'value', 'dense']  # Full attention

17.6 When LoRA Fails
LoRA isn’t magic. It fails when:
17.6.1 1. Target Task Is Far from Pretraining
If the fine-tuning task requires fundamentally new knowledge:
# This works well (close to pretraining):
# - Sentiment classification (model already knows sentiment)
# - Summarization (model already knows language)
# - Translation between seen language pairs
# This may need higher rank or full fine-tuning:
# - New language not in pretraining
# - Specialized scientific domains
# - Novel reasoning patterns (math, code)

17.6.2 2. Fine-Tuning Dataset Is Very Large
For large datasets, the optimal update may be higher rank:
# Dataset size vs. optimal rank (empirical)
# 1K examples: rank 4-8 sufficient
# 10K examples: rank 8-16 optimal
# 100K examples: rank 16-64 may help
# 1M+ examples: Consider full fine-tuning

17.6.3 3. The Task Requires Precise Numerical Reasoning
Mathematical reasoning, formal verification, and similar tasks often benefit from more capacity:
# GSM8K (math word problems)
# Rank 8: +2.3% over base
# Rank 64: +4.1% over base
# Full FT: +5.2% over base
# Unlike classification, math reasoning benefits from higher ranks

17.6.4 Diagnostic: Compare LoRA Ranks
When in doubt, try multiple ranks:
import copy

def diagnose_rank_requirement(model, dataset):
    """Find the minimum LoRA rank that matches full fine-tuning."""
    # `train` and `evaluate` stand in for your own training and evaluation loops.
    results = {}
    for rank in [4, 8, 16, 32, 64, 128]:
        model_copy = copy.deepcopy(model)
        lora_model = apply_lora_to_model(model_copy, rank=rank)
        train(lora_model, dataset)
        results[rank] = evaluate(lora_model, dataset)

    # Also run full fine-tuning as the reference point
    model_full = copy.deepcopy(model)
    train(model_full, dataset)
    results['full'] = evaluate(model_full, dataset)

    # Find the minimum rank that gets within 1% of full fine-tuning
    for rank in sorted(r for r in results if isinstance(r, int)):
        if results[rank] >= results['full'] * 0.99:
            print(f"Rank {rank} is sufficient (within 1% of full fine-tuning)")
            break
    return results

17.7 QLoRA: Combining with Quantization
QLoRA combines LoRA with 4-bit quantization for even greater memory savings:
# Memory comparison for LLaMA-65B

# Full fine-tuning (FP16)
#   Weights:   130 GB
#   Gradients: 130 GB
#   Optimizer: 260 GB
#   Total:    ~520 GB  (needs 8× A100 80GB)

# LoRA (FP16 base + FP16 adapters)
#   Weights:   130 GB  (frozen, inference only)
#   Adapters:  0.4 GB
#   Gradients: 0.4 GB
#   Optimizer: 0.8 GB
#   Total:    ~132 GB  (needs 2× A100 80GB)

# QLoRA (4-bit base + FP16 adapters)
#   Weights:   33 GB   (4-bit quantized)
#   Adapters:  0.4 GB
#   Gradients: 0.4 GB
#   Optimizer: 0.8 GB
#   Total:    ~35 GB   (fits on 1× A100 80GB!)

QLoRA enables fine-tuning 65B models on a single GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import get_peft_model, LoraConfig

# QLoRA configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,  # Nested quantization
)

# Load the base model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Add LoRA adapters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Total trainable: ~0.02% of parameters

17.8 The Hardware Perspective
LoRA’s efficiency isn’t just about parameter count—it’s about memory access patterns.
17.8.1 Training Memory
Full fine-tuning:
- Store gradients for all 7B parameters
- Store optimizer states (2× parameters for Adam)
- Memory scales with model size

LoRA:
- Store gradients only for adapter parameters (~17M)
- Optimizer states only for adapters
- Memory scales with adapter size, not model size
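The difference shows up directly in code. A short sketch (assuming model is an already-loaded transformer whose attention projections are named q_proj and v_proj, and reusing apply_lora_to_model from Section 17.5):
model = apply_lora_to_model(model, target_modules=['q_proj', 'v_proj'], rank=8)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable: {trainable:,} / {total:,} ({trainable / total:.2%})")

# The optimizer tracks gradients and moment buffers only for the adapters
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad),
    lr=1e-4,
)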
17.8.2 Inference Efficiency
Option 1: Keep adapters separate
# Forward pass with separate adapters (x here is a single input vector)
def forward_with_adapters(x, W_frozen, A, B, alpha, rank):
    y = W_frozen @ x + (B @ A @ x) * (alpha / rank)
    return y

# Pro: Can switch adapters easily
# Con: Extra compute and memory for the adapter forward pass

Option 2: Merge adapters into weights
# Merge once at load time
W_merged = W_frozen + B @ A * (alpha / rank)

def forward_merged(x, W_merged):
    return W_merged @ x

# Pro: Same speed as the original model
# Con: Can't switch adapters without reloading

For serving multiple LoRA adapters (multi-tenant):
# S-LoRA approach: batch requests that use different adapters
# Each request specifies which adapter to use; the routing below is a
# simplified PyTorch sketch of what the custom GPU kernels do
def batched_lora_forward(x_batch, adapter_ids, W_frozen, adapters_A, adapters_B, scaling):
    """
    x_batch:     (batch, seq, dim)
    adapter_ids: (batch,) - which adapter each example uses
    adapters_A:  per-adapter A matrices, each of shape (rank, dim)
    adapters_B:  per-adapter B matrices, each of shape (out_dim, rank)
    """
    # Base output from the shared frozen weights
    y = x_batch @ W_frozen.T

    # Add each adapter's contribution to the examples that requested it
    for adapter_id in adapter_ids.unique().tolist():
        mask = adapter_ids == adapter_id
        A = adapters_A[adapter_id]
        B = adapters_B[adapter_id]
        y[mask] += (x_batch[mask] @ A.T @ B.T) * scaling
    return y

17.9 The Derivation Pattern
How would you discover LoRA if it didn’t exist?
1. Observe the problem: Fine-tuning is expensive; most users can’t afford it.
2. Ask the key question: Does fine-tuning really need all those parameters?
3. Test the hypothesis: Measure the singular value decay of weight updates. Find that it’s very low-rank.
4. Understand why: Pretrained models already have capabilities; fine-tuning is steering, not building.
5. Design the solution: Parameterize the update as low-rank directly. This is LoRA.
6. Validate empirically: Compare across tasks, ranks, and model sizes. Find that it works remarkably well.
17.10 Key Takeaways
Fine-tuning is low-rank: The weight updates during fine-tuning live in a low-dimensional subspace. This is empirically verifiable.
LoRA exploits this structure: By parameterizing updates as low-rank, we reduce parameters by 100-1000× with minimal accuracy loss.
The bet can fail: When tasks are far from pretraining or datasets are large, higher ranks or full fine-tuning may be needed.
QLoRA extends the idea: Combine with quantization for even greater memory savings.
Multiple adapters are efficient: LoRA enables multi-tenant serving where different users have different fine-tunes.
17.11 Connections
Chapter 5 (Factoring): LoRA is matrix factorization applied to weight updates. The same mathematical machinery as SVD and MobileNets.
Chapter 13 (Quantization): QLoRA combines the insights of low-rank structure and reduced precision.
Chapter 4 (Chunking): LoRA training can be parallelized—adapters for different layers are independent.
17.12 Further Reading
- Hu et al. (2021). “LoRA: Low-Rank Adaptation of Large Language Models”
- Aghajanyan et al. (2020). “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”
- Dettmers et al. (2023). “QLoRA: Efficient Finetuning of Quantized LLMs”
- Sheng et al. (2023). “S-LoRA: Serving Thousands of Concurrent LoRA Adapters”