Section 9.2: LoRA (Low-Rank Adaptation)¶
Reading time: 15 minutes
The Big Idea¶
LoRA (Hu et al., 2021) is elegantly simple:
Instead of updating \(W\) directly, learn the update \(\Delta W\) as a product of two low-rank matrices:

\[
W' = W + \Delta W = W + BA
\]

Where:
- \(W \in \mathbb{R}^{d \times k}\): Frozen pretrained weights
- \(B \in \mathbb{R}^{d \times r}\): Trainable up-projection
- \(A \in \mathbb{R}^{r \times k}\): Trainable down-projection
- \(r \ll \min(d, k)\): The rank (typically 4-64)
Why It Works¶
The Low-Rank Hypothesis¶
The change in weights during fine-tuning has low intrinsic rank.
Think of it this way: fine-tuning adjusts the model slightly to perform a new task. These adjustments form patterns—they're not random noise.
Patterns can be captured with low-rank matrices.
Visual Intuition¶
```
Original W (4096 × 4096 = 16.7M params)
┌────────────────────────────────────────┐
│                                        │
│         [Full matrix - frozen]         │
│                                        │
└────────────────────────────────────────┘
                    +
LoRA: B × A  (r=8: 65K params)
┌────┐   ┌────────────────────────────────┐
│    │   │                                │
│ B  │ × │               A                │
│    │   │                                │
└────┘   └────────────────────────────────┘
(4096×8)  (8×4096)
```
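The savings are easy to verify with a quick back-of-the-envelope check (assuming a single 4096 × 4096 projection and \(r = 8\)):

```python
d = 4096
r = 8

full_params = d * d            # 16,777,216 frozen parameters
lora_params = d * r + r * d    # 65,536 trainable parameters (B plus A)

print(full_params, lora_params, full_params // lora_params)  # 256x fewer trainable params
```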
The Math¶
Forward Pass¶
Standard linear layer:

\[
h = Wx
\]

With LoRA:

\[
h = Wx + \frac{\alpha}{r} BAx
\]

Where \(\frac{\alpha}{r}\) is a scaling factor.
Why Scaling?¶
When you change rank \(r\), you want behavior to stay consistent.
- Higher \(r\): More capacity, but \(BA\) product changes magnitude
- \(\alpha\) controls the "strength" of adaptation
- Dividing by \(r\) normalizes across different ranks
Rule of thumb: Set \(\alpha = 2r\) (so scaling = 2)
Initialization¶
Critical for stable training:
- \(A\): Random Gaussian with small variance
- \(B\): Zeros
Why zeros for \(B\)? So that \(BA = 0\) at initialization. The model starts exactly as the pretrained model.
Implementation¶
```python
import numpy as np

class LoRALayer:
    """Low-Rank Adaptation layer."""

    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 8,
        alpha: float = 16.0,
    ):
        self.rank = rank
        self.scaling = alpha / rank

        # LoRA matrices
        self.A = np.random.randn(rank, in_features) * 0.01  # small Gaussian
        self.B = np.zeros((out_features, rank))              # start at zero!

    def forward(self, x: np.ndarray, W: np.ndarray) -> np.ndarray:
        """h = Wx + scaling * BAx"""
        base = x @ W.T                       # original path (frozen)
        lora = (x @ self.A.T) @ self.B.T     # LoRA path
        return base + self.scaling * lora

    def backward(self, grad_output: np.ndarray, x: np.ndarray, W: np.ndarray) -> np.ndarray:
        """Compute gradients for A and B only; W stays frozen."""
        Ax = x @ self.A.T                                 # (batch, rank)

        # Gradient w.r.t. B: scaling * grad^T @ (x A^T)
        self.B_grad = self.scaling * grad_output.T @ Ax

        # Gradient w.r.t. A: scaling * (grad B)^T @ x
        grad_Ax = grad_output @ self.B                    # (batch, rank)
        self.A_grad = self.scaling * grad_Ax.T @ x

        # Gradient w.r.t. the input flows through both paths
        return grad_output @ W + self.scaling * grad_Ax @ self.A

    def merge_weights(self, W: np.ndarray) -> np.ndarray:
        """Merge LoRA into base weights for inference."""
        return W + self.scaling * (self.B @ self.A)
```
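A quick sanity check of the layer above (the shapes and weight values here are illustrative, not from a real checkpoint):

```python
d = 4096
W = np.random.randn(d, d) * 0.02                 # stand-in for a frozen pretrained weight
lora = LoRALayer(in_features=d, out_features=d, rank=8, alpha=16.0)
x = np.random.randn(2, d)                        # a batch of 2 inputs

# At initialization B = 0, so the LoRA path contributes nothing
assert np.allclose(lora.forward(x, W), x @ W.T)

# Merging is exact: the merged weight reproduces the adapted forward pass
W_merged = lora.merge_weights(W)
assert np.allclose(lora.forward(x, W), x @ W_merged.T)
```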
Where to Apply LoRA¶
LoRA can be applied to any linear layer. Common choices:
Attention Layers¶
| Target | Effect |
|---|---|
| Query (Q) | How tokens attend |
| Key (K) | What tokens are attended to |
| Value (V) | What information is retrieved |
| Output (O) | How attention output is projected |
Most common: Apply to Q and V only.
Why Q and V?
- Q determines what the model looks for
- V determines what information is extracted
- K and O are often less important for task adaptation
Feed-Forward Layers¶
Less common, but can help for knowledge-intensive tasks.
Hyperparameters¶
Rank (r)¶
| Rank | Params (4096 dim) | Use Case |
|---|---|---|
| 4 | 32K | Simple tasks, small data |
| 8 | 65K | General purpose |
| 16 | 131K | Complex tasks |
| 64 | 524K | Very complex tasks |
Start with r=8. Increase if underfitting.
Alpha (α)¶
Controls adaptation strength:
- α = r: Scaling = 1 (conservative)
- α = 2r: Scaling = 2 (common default)
Rule: If using learning rate that worked for full fine-tuning, start with α = 2r.
Target Modules¶
Which layers to apply LoRA:
```python
# Conservative (fewer params, less risk)
target_modules = ['q_proj', 'v_proj']

# Aggressive (more params, more capacity)
target_modules = ['q_proj', 'k_proj', 'v_proj', 'o_proj',
                  'up_proj', 'down_proj']
```
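For reference, these choices map directly onto a configuration in the Hugging Face `peft` library (a sketch assuming `peft` and `transformers` are installed; the checkpoint name is illustrative and argument names may vary slightly across versions):

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative checkpoint

config = LoraConfig(
    r=8,                                    # rank
    lora_alpha=16,                          # alpha = 2r, so scaling = 2
    target_modules=["q_proj", "v_proj"],    # conservative choice
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, config)
model.print_trainable_parameters()          # reports roughly 0.1% trainable
```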
The Magic: Weight Merging¶
After training, you can merge LoRA weights into the base model:

\[
W_{\text{merged}} = W + \frac{\alpha}{r} BA
\]
Benefits:
- Zero inference overhead: Same speed as original model
- Deployable anywhere: Just swap the weights
- Stackable: Train multiple LoRAs, merge selectively
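Because the adapter is just an additive term, you can keep several task-specific LoRAs on disk and merge whichever one you need. A minimal sketch reusing the `LoRALayer` class above (adapter names and shapes are hypothetical):

```python
W = np.random.randn(4096, 4096) * 0.02           # shared frozen base weight (illustrative)

lora_summarize = LoRALayer(4096, 4096, rank=8)   # would be trained on summarization data
lora_translate = LoRALayer(4096, 4096, rank=8)   # would be trained on translation data

# Deploy by merging only the adapter you need; the base weights never change
W_for_summaries = lora_summarize.merge_weights(W)
W_for_translation = lora_translate.merge_weights(W)
```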
Training Procedure¶
```python
# 1. Load pretrained model
model = load_pretrained("llama-7b")

# 2. Freeze all parameters
for param in model.parameters():
    param.requires_grad = False

# 3. Add LoRA layers (wrap the Q and V projections)
for layer in model.attention_layers:
    layer.q_proj = LoRALinear(layer.q_proj, rank=8)
    layer.v_proj = LoRALinear(layer.v_proj, rank=8)

# 4. Train only the LoRA parameters
optimizer = Adam(model.lora_parameters(), lr=1e-4)

for batch in dataset:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()

# 5. Optionally merge for deployment
for layer in model.attention_layers:
    layer.q_proj = layer.q_proj.merge()
    layer.v_proj = layer.v_proj.merge()
```
Comparison to Full Fine-Tuning¶
| Aspect | Full Fine-Tuning | LoRA |
|---|---|---|
| Trainable params | 100% | ~0.1% |
| GPU memory | Very high | Low |
| Training speed | Slower | Faster |
| Storage per task | Full model | ~10MB |
| Inference speed | Same | Same (merged) |
| Quality | Baseline | 95-100% |
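The "~10MB" storage figure follows from a quick calculation (assuming a 7B model with 32 layers, LoRA on Q and V only, \(r = 8\), and fp16 storage):

```python
layers = 32
d = 4096
r = 8
modules = 2                                        # q_proj and v_proj per layer

lora_params = layers * modules * (d * r + r * d)   # ≈ 4.2M parameters
size_mb = lora_params * 2 / 1e6                    # fp16 = 2 bytes per parameter

print(lora_params, size_mb)                        # ~4.2M params, ~8.4 MB on disk
```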
When LoRA Works Best¶
Good for:
- Instruction following
- Style adaptation
- Domain-specific tasks
- Limited compute/memory
Might need more than LoRA:
- Learning entirely new knowledge
- Very different output formats
- Extreme domain shift
Common Mistakes¶
- Forgetting to freeze base weights: Training everything defeats the purpose
- Rank too low: Underfitting. Try increasing rank.
- Rank too high: Overfitting. Also slower and more memory.
- Wrong learning rate: LoRA often needs higher LR than full fine-tuning (try 1e-4 to 1e-3)
- Not merging for production: Keep LoRA separate during experimentation, merge for deployment
Summary¶
| Component | Purpose |
|---|---|
| A matrix | Projects input to low-rank space |
| B matrix | Projects from low-rank space to output |
| Rank r | Controls capacity (trade-off) |
| Alpha α | Controls adaptation strength |
| Scaling | Normalizes across ranks |
| Merging | Eliminates inference overhead |
Key insight: LoRA exploits the fact that fine-tuning updates are inherently low-rank. By parameterizing them as such, we get massive efficiency gains with minimal quality loss.
Next: We'll explore adapters, another approach to parameter-efficient fine-tuning.