# Section 9.6: Implementation
Reading time: 15 minutes
## Overview
In this section, we implement four PEFT methods from scratch:
- LoRA: Low-rank adaptation
- Adapters: Bottleneck layers
- Prefix Tuning: Learned key/value prefixes
- Prompt Tuning: Soft input prompts
All code is available in code/stage-09/peft.py.
## LoRA Implementation
The core LoRA layer with full forward and backward passes:
```python
import numpy as np
from typing import Any, Dict, List, Optional


class LoRALayer:
    """Low-Rank Adaptation layer."""

    def __init__(
        self,
        in_features: int,
        out_features: int,
        rank: int = 8,
        alpha: float = 16.0,
    ):
        self.in_features = in_features
        self.out_features = out_features
        self.rank = rank
        self.scaling = alpha / rank

        # LoRA matrices
        # A: small random init, B: zero init, so B @ A = 0 and the adapted
        # layer initially behaves exactly like the frozen base layer
        self.A = np.random.randn(rank, in_features) * 0.01
        self.B = np.zeros((out_features, rank))

        self.cache = {}

    def forward(self, x: np.ndarray, W: np.ndarray) -> np.ndarray:
        """h = Wx + scaling * BAx"""
        # Original path (frozen)
        base_output = x @ W.T

        # LoRA path
        lora_output = (x @ self.A.T) @ self.B.T

        # Cache for backward
        self.cache = {'x': x, 'W': W, 'Ax': x @ self.A.T}

        return base_output + self.scaling * lora_output

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """Compute gradients for A and B."""
        x = self.cache['x']
        W = self.cache['W']
        Ax = self.cache['Ax']

        scaled_grad = grad_output * self.scaling

        # Handle batched inputs
        if x.ndim == 3:
            x_flat = x.reshape(-1, self.in_features)
            Ax_flat = Ax.reshape(-1, self.rank)
            grad_flat = scaled_grad.reshape(-1, self.out_features)
        else:
            x_flat, Ax_flat, grad_flat = x, Ax, scaled_grad

        # Gradient w.r.t. B
        self.B_grad = grad_flat.T @ Ax_flat

        # Gradient w.r.t. A
        grad_Ax = grad_flat @ self.B
        self.A_grad = grad_Ax.T @ x_flat

        # Gradient w.r.t. input
        grad_x = grad_output @ W + scaled_grad @ self.B @ self.A
        return grad_x

    def merge_weights(self, W: np.ndarray) -> np.ndarray:
        """Merge LoRA into base weights for inference."""
        return W + self.scaling * (self.B @ self.A)

    def num_parameters(self) -> int:
        return self.A.size + self.B.size
```
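A quick sanity check, assuming the `LoRALayer` class above is in scope (the shapes and the randomly filled `B` are placeholders standing in for a trained adapter): merging `B @ A` into the frozen weight must reproduce the unmerged forward pass up to floating-point error.

```python
import numpy as np

# Hypothetical shapes for illustration only
d_in, d_out, rank = 64, 64, 4
W = np.random.randn(d_out, d_in) * 0.02       # frozen pretrained weight
x = np.random.randn(2, 8, d_in)               # [batch, seq, d_in]

lora = LoRALayer(d_in, d_out, rank=rank, alpha=8.0)
lora.B = np.random.randn(d_out, rank) * 0.01  # pretend B has been trained

y_adapted = lora.forward(x, W)                # frozen path + scaling * BAx
W_merged = lora.merge_weights(W)              # W + scaling * (B @ A)
y_merged = x @ W_merged.T                     # single matmul at inference time

assert np.allclose(y_adapted, y_merged, atol=1e-6)
print("merged and unmerged outputs match")
```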
## LoRA-Enhanced Linear Layer
Wrapping a pretrained layer with LoRA:
```python
class LoRALinear:
    """Linear layer with LoRA adaptation."""

    def __init__(
        self,
        weight: np.ndarray,
        bias: Optional[np.ndarray] = None,
        lora_rank: int = 8,
        lora_alpha: float = 16.0,
    ):
        self.weight = weight  # Frozen
        self.bias = bias      # Frozen
        self.out_features, self.in_features = weight.shape

        self.lora = LoRALayer(
            in_features=self.in_features,
            out_features=self.out_features,
            rank=lora_rank,
            alpha=lora_alpha,
        )

    def forward(self, x):
        output = self.lora.forward(x, self.weight)
        if self.bias is not None:
            output = output + self.bias
        return output

    def merge(self):
        """Return merged weight and bias."""
        return self.lora.merge_weights(self.weight), self.bias
```
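A short usage sketch, assuming `LoRALinear` from above; the 512-dimensional weight and bias are arbitrary stand-ins for a pretrained projection. Only the `A` and `B` matrices inside the wrapper would receive updates; the wrapped weight and bias stay frozen.

```python
import numpy as np

d = 512
pretrained_W = np.random.randn(d, d) * 0.02
pretrained_b = np.zeros(d)

layer = LoRALinear(pretrained_W, bias=pretrained_b, lora_rank=8, lora_alpha=16.0)

x = np.random.randn(4, 32, d)
out = layer.forward(x)
print(out.shape)                          # (4, 32, 512)

trainable = layer.lora.num_parameters()   # only A and B would be updated
frozen = pretrained_W.size + pretrained_b.size
print(f"trainable: {trainable:,} vs frozen: {frozen:,}")
```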
## Adapter Implementation
```python
class Adapter:
    """Bottleneck adapter layer."""

    def __init__(
        self,
        d_model: int,
        bottleneck_dim: int = 64,
    ):
        self.d_model = d_model
        self.bottleneck_dim = bottleneck_dim

        # Small initialization (near-identity at start)
        scale = 0.01
        self.W_down = np.random.randn(d_model, bottleneck_dim) * scale
        self.W_up = np.random.randn(bottleneck_dim, d_model) * scale

        self.cache = {}

    def forward(self, x: np.ndarray) -> np.ndarray:
        """x + adapter(x)"""
        # Down-project
        down = x @ self.W_down

        # ReLU activation
        activated = np.maximum(0, down)

        # Up-project
        up = activated @ self.W_up

        # Cache for backward
        self.cache = {'x': x, 'down': down, 'activated': activated}

        # Residual connection
        return x + up

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """Compute gradients."""
        x = self.cache['x']
        down = self.cache['down']
        activated = self.cache['activated']

        # Gradient through up-projection
        activated_flat = activated.reshape(-1, self.bottleneck_dim)
        grad_flat = grad_output.reshape(-1, self.d_model)
        self.W_up_grad = activated_flat.T @ grad_flat

        # Gradient through ReLU
        grad_activated = grad_output @ self.W_up.T
        grad_down = grad_activated * (down > 0)

        # Gradient through down-projection
        x_flat = x.reshape(-1, self.d_model)
        grad_down_flat = grad_down.reshape(-1, self.bottleneck_dim)
        self.W_down_grad = x_flat.T @ grad_down_flat

        # Gradient to input (residual + adapter path)
        return grad_output + grad_down @ self.W_down.T

    def num_parameters(self) -> int:
        return self.W_down.size + self.W_up.size
```
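A minimal usage sketch, assuming the `Adapter` class above and arbitrary shapes: because of the small random initialization, the adapter barely perturbs its input at the start of training, and the backward pass populates `W_down_grad` and `W_up_grad`.

```python
import numpy as np

adapter = Adapter(d_model=256, bottleneck_dim=64)
x = np.random.randn(2, 16, 256)

out = adapter.forward(x)
print(out.shape)                               # (2, 16, 256)
print(np.max(np.abs(out - x)))                 # small: near-identity at init

grad_in = adapter.backward(np.ones_like(out))  # fills W_down_grad / W_up_grad
print(adapter.W_down_grad.shape, adapter.W_up_grad.shape)  # (256, 64) (64, 256)
print(grad_in.shape)                           # (2, 16, 256)
```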
## Prefix Tuning Implementation
```python
class PrefixTuning:
    """Learned key/value prefixes for attention."""

    def __init__(
        self,
        num_layers: int,
        num_heads: int,
        d_head: int,
        prefix_length: int = 10,
    ):
        self.num_layers = num_layers
        self.num_heads = num_heads
        self.d_head = d_head
        self.prefix_length = prefix_length

        # Shape: [num_layers, 2 (K,V), prefix_length, num_heads, d_head]
        self.prefix = np.random.randn(
            num_layers, 2, prefix_length, num_heads, d_head
        ) * 0.01

    def get_prefix(self, layer_idx: int):
        """Get K and V prefixes for a layer."""
        prefix_k = self.prefix[layer_idx, 0]
        prefix_v = self.prefix[layer_idx, 1]
        return prefix_k, prefix_v

    def num_parameters(self) -> int:
        return self.prefix.size
```
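The class only stores and hands out the prefixes; how they are consumed depends on the attention implementation. The following is a sketch under an assumed `[batch, heads, seq, d_head]` layout, not code from `peft.py`: the learned prefixes are broadcast over the batch and concatenated in front of the keys and values, while the queries are left untouched.

```python
import numpy as np

prefix = PrefixTuning(num_layers=4, num_heads=8, d_head=32, prefix_length=10)

# Hypothetical per-layer keys/values laid out as [batch, heads, seq, d_head]
batch, heads, seq, d_head = 2, 8, 16, 32
K = np.random.randn(batch, heads, seq, d_head)
V = np.random.randn(batch, heads, seq, d_head)

pk, pv = prefix.get_prefix(layer_idx=0)      # each: [prefix_len, heads, d_head]

# Move heads to the front and broadcast over the batch dimension
target = (batch, heads, prefix.prefix_length, d_head)
pk = np.broadcast_to(pk.transpose(1, 0, 2), target)
pv = np.broadcast_to(pv.transpose(1, 0, 2), target)

# Attention then runs over [prefix; sequence]: queries stay the same length,
# keys and values grow by prefix_length positions
K_ext = np.concatenate([pk, K], axis=2)
V_ext = np.concatenate([pv, V], axis=2)
print(K_ext.shape, V_ext.shape)              # (2, 8, 26, 32) (2, 8, 26, 32)
```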
## Prompt Tuning Implementation
```python
class PromptTuning:
    """Learned soft prompts prepended to input."""

    def __init__(
        self,
        d_model: int,
        prompt_length: int = 20,
        init_from_vocab: Optional[np.ndarray] = None,
    ):
        self.d_model = d_model
        self.prompt_length = prompt_length

        if init_from_vocab is not None:
            # Initialize from random vocabulary embeddings
            indices = np.random.choice(len(init_from_vocab), prompt_length)
            self.prompt = init_from_vocab[indices].copy()
        else:
            self.prompt = np.random.randn(prompt_length, d_model) * 0.01

    def forward(self, input_embeds: np.ndarray) -> np.ndarray:
        """Prepend soft prompts to input."""
        batch_size = input_embeds.shape[0]

        # Expand prompt for batch
        prompt_expanded = np.broadcast_to(
            self.prompt[np.newaxis, :, :],
            (batch_size, self.prompt_length, self.d_model)
        ).copy()

        return np.concatenate([prompt_expanded, input_embeds], axis=1)

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        """Compute gradient for prompts."""
        # Gradient for prompt (sum over batch)
        self.prompt_grad = grad_output[:, :self.prompt_length].sum(axis=0)

        # Pass through gradient for actual input
        return grad_output[:, self.prompt_length:]

    def num_parameters(self) -> int:
        return self.prompt.size
```
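A brief usage sketch, assuming `PromptTuning` from above and arbitrary embedding shapes: the forward pass lengthens every sequence by `prompt_length` positions, and the backward pass splits the incoming gradient into a prompt gradient and a pass-through gradient for the real tokens.

```python
import numpy as np

prompt = PromptTuning(d_model=256, prompt_length=20)
input_embeds = np.random.randn(4, 16, 256)          # [batch, seq, d_model]

extended = prompt.forward(input_embeds)
print(extended.shape)                               # (4, 36, 256): 20 soft + 16 real tokens

grad = np.random.randn(*extended.shape)
grad_inputs = prompt.backward(grad)
print(prompt.prompt_grad.shape, grad_inputs.shape)  # (20, 256) (4, 16, 256)
```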
## Utility Functions

### Apply LoRA to Attention
```python
def apply_lora_to_attention(
    wq: np.ndarray,
    wk: np.ndarray,
    wv: np.ndarray,
    wo: np.ndarray,
    rank: int = 8,
    alpha: float = 16.0,
    target_modules: List[str] = ['q', 'v'],
) -> Dict[str, Any]:
    """Apply LoRA to attention weight matrices."""
    layers = {}
    weights = {'q': wq, 'k': wk, 'v': wv, 'o': wo}

    for name, weight in weights.items():
        if name in target_modules:
            layers[name] = LoRALinear(weight, lora_rank=rank, lora_alpha=alpha)
        else:
            layers[name] = weight  # Keep frozen
    return layers
```
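For example (a sketch with made-up weight shapes), we can adapt only the query and value projections of a single attention block while K and O remain plain frozen arrays:

```python
import numpy as np

d_model = 256
wq, wk, wv, wo = (np.random.randn(d_model, d_model) * 0.02 for _ in range(4))

layers = apply_lora_to_attention(wq, wk, wv, wo, rank=8, alpha=16.0,
                                 target_modules=['q', 'v'])

print(type(layers['q']).__name__)   # LoRALinear (has a trainable LoRA path)
print(type(layers['k']).__name__)   # ndarray (left frozen)

x = np.random.randn(2, 16, d_model)
q = layers['q'].forward(x)          # adapted query projection
k = x @ layers['k'].T               # frozen key projection
print(q.shape, k.shape)             # (2, 16, 256) (2, 16, 256)
```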
### Compare PEFT Methods
```python
def compare_peft_methods(
    d_model: int = 4096,
    num_layers: int = 32,
    num_heads: int = 32,
    lora_rank: int = 8,
    adapter_bottleneck: int = 64,
    prefix_length: int = 10,
    prompt_length: int = 20,
) -> Dict[str, Dict]:
    """Compare parameter counts across PEFT methods."""
    d_head = d_model // num_heads

    # Full fine-tuning: all attention weights (Q, K, V, O per layer)
    full_params = num_layers * 4 * d_model * d_model

    # LoRA on Q and V: each adapted matrix adds rank * (in + out) parameters
    lora_params = num_layers * 2 * lora_rank * (d_model + d_model)

    # Adapters (one after attention, one after FFN), two projections each
    adapter_params = num_layers * 2 * 2 * d_model * adapter_bottleneck

    # Prefix tuning: K and V prefixes per layer
    prefix_params = num_layers * 2 * prefix_length * num_heads * d_head

    # Prompt tuning
    prompt_params = prompt_length * d_model

    return {
        'full': {'params': full_params, 'ratio': 1.0},
        'lora': {'params': lora_params, 'ratio': lora_params / full_params},
        'adapters': {'params': adapter_params, 'ratio': adapter_params / full_params},
        'prefix': {'params': prefix_params, 'ratio': prefix_params / full_params},
        'prompt': {'params': prompt_params, 'ratio': prompt_params / full_params},
    }
```
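Called with its defaults this corresponds to a 7B-scale configuration and yields the counts shown in the demo output below; the printing loop here is just one possible way to format the result:

```python
stats = compare_peft_methods()   # d_model=4096, num_layers=32, num_heads=32

for method, info in stats.items():
    print(f"{method:<10} {info['params']:>15,} {info['ratio']:>10.4%}")
```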
## Running the Demo

Running the demo in `code/stage-09/peft.py` produces the following output:
```text
============================================================
Stage 9: Fine-tuning & Parameter-Efficient Methods
============================================================

1. LoRA (Low-Rank Adaptation)
----------------------------------------
Original weight shape: (256, 256)
LoRA A shape: (8, 256)
LoRA B shape: (256, 8)
LoRA parameters: 4,096
Full parameters: 65,536
Compression: 6.25%
Input shape: (2, 16, 256)
Output shape: (2, 16, 256)
Gradients computed: A=(8, 256), B=(256, 8)

2. Adapter Layers
----------------------------------------
Adapter parameters: 32,768
Bottleneck dimension: 64

3. Prompt Tuning
----------------------------------------
Original sequence length: 16
With soft prompt: 36
Prompt parameters: 5,120

4. PEFT Method Comparison (7B model scale)
----------------------------------------
Method               Parameters       Ratio      Description
----------------------------------------------------------------------
full_fine_tuning     2,147,483,648    100.0000%  Update all attention weights
lora                 4,194,304        0.1953%    LoRA rank=8 on Q,V
adapters             33,554,432       1.5625%    Bottleneck=64
prefix_tuning        2,621,440        0.1221%    10 prefix tokens
prompt_tuning        81,920           0.0038%    20 soft prompts
```
## Summary

| Method | Key Classes | Parameters |
|---|---|---|
| LoRA | `LoRALayer`, `LoRALinear` | `A`, `B` matrices |
| Adapters | `Adapter` | `W_down`, `W_up` |
| Prefix Tuning | `PrefixTuning` | prefix tensor |
| Prompt Tuning | `PromptTuning` | prompt embeddings |
All implementations follow the same pattern:
- Initialize trainable parameters
- Implement forward pass
- Implement backward pass
- Provide parameter count
## Exercises
- Extend LoRA: Apply to feed-forward layers
- LoRA dropout: Add dropout to the LoRA path
- Adapter variants: Implement parallel adapters
- Prefix MLP: Generate prefixes from a learned embedding via MLP
- Combination: Implement LoRA + Prompt Tuning together
- Quantization: Add 4-bit LoRA (QLoRA)
## What's Next
With PEFT methods mastered, we're ready for Stage 10: Alignment—teaching models to be helpful, harmless, and honest through RLHF and related techniques.