Section 9.5: Choosing a Method¶
Reading time: 8 minutes
The Decision Framework¶
With multiple PEFT methods available, how do you choose? Consider these factors:
- Task complexity: harder tasks generally need more trainable capacity (higher rank, more target modules)
- Available compute: prompt tuning is the cheapest to train; full fine-tuning the most expensive
- Inference requirements: whether you can tolerate extra latency, or need updates that merge back into the base weights
- Data size: small datasets overfit high-capacity methods more easily
Quick Reference¶
| If You Need... | Use |
|---|---|
| Maximum efficiency | Prompt Tuning |
| Balance of quality and efficiency | LoRA |
| Maximum quality (near full fine-tuning) | Adapters or LoRA with high rank |
| Zero inference overhead | LoRA (merged) |
| Quick experiments | Prompt Tuning |
| Production deployment | LoRA (merged) |
Decision Tree¶
Start
│
├─ Is inference latency critical?
│ │
│ Yes ──▶ LoRA (can merge weights)
│ │
│ No ──▶ Continue
│
├─ Is compute extremely limited?
│ │
│ Yes ──▶ Prompt Tuning
│ │
│ No ──▶ Continue
│
├─ Is the task simple (classification, etc.)?
│ │
│ Yes ──▶ Prompt Tuning or LoRA (r=4)
│ │
│ No ──▶ Continue
│
├─ Do you need to combine multiple tasks?
│ │
│ Yes ──▶ Adapters (easy to swap/combine)
│ │
│ No ──▶ Continue
│
└─ Default recommendation ──▶ LoRA (r=8, α=16)
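The same logic can be expressed as a small helper. This is purely a restatement of the tree above (the function name and boolean flags are hypothetical, not part of any library):
```python
def choose_peft_method(latency_critical: bool, compute_limited: bool,
                       simple_task: bool, multi_task: bool) -> str:
    """Mirror of the decision tree: returns a suggested starting method."""
    if latency_critical:
        return "LoRA (merge weights after training)"
    if compute_limited:
        return "Prompt Tuning"
    if simple_task:
        return "Prompt Tuning or LoRA (r=4)"
    if multi_task:
        return "Adapters (easy to swap/combine)"
    return "LoRA (r=8, alpha=16)"
```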
Detailed Comparison¶
Parameter Efficiency¶
| Method | Params (7B model) | Storage |
|---|---|---|
| Prompt Tuning (20 tokens) | ~80K | ~300KB |
| Prefix Tuning (10 per layer) | ~2.6M | ~10MB |
| LoRA (r=8, Q+V) | ~4M | ~16MB |
| Adapters (bottleneck=64) | ~50M | ~200MB |
| Full Fine-Tuning | ~7B | ~28GB |
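These counts follow from a quick back-of-the-envelope calculation. A sketch assuming a 7B-class architecture (d_model = 4096, 32 layers) and fp32 storage at 4 bytes per parameter; adapter counts are omitted because they depend on where the adapters are inserted and whether biases are counted:
```python
# Back-of-the-envelope parameter counts behind the table above
d_model, n_layers, bytes_per_param = 4096, 32, 4

counts = {
    "Prompt Tuning (20 tokens)": 20 * d_model,                    # ~82K
    "Prefix Tuning (10/layer)":  n_layers * 10 * 2 * d_model,     # K and V prefixes -> ~2.6M
    "LoRA (r=8, Q+V)":           n_layers * 2 * 2 * d_model * 8,  # A and B per module -> ~4.2M
}

for name, n in counts.items():
    print(f"{name:28s} {n / 1e6:5.2f}M params  ~{n * bytes_per_param / 1e6:6.1f}MB")
```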
Quality vs Efficiency Trade-off¶
Approximate quality (relative to full fine-tuning) versus the fraction of trainable parameters, using the 7B counts above:
| Method | Trainable Params (share of 7B) | Approx. Quality vs Full Fine-Tuning |
|---|---|---|
| Prompt Tuning | ~0.001% | ~85% |
| Prefix Tuning | ~0.04% | ~90% |
| LoRA (r=8) | ~0.06% | ~95-98% |
| Adapters | ~0.7% | ~98% |
| Full Fine-Tuning | 100% | 100% |
Most of the quality gap closes within the first fraction of a percent of trainable parameters; beyond that, returns diminish quickly.
Inference Overhead¶
| Method | Extra Compute | Can Merge? |
|---|---|---|
| Prompt Tuning | Minimal (longer sequence) | N/A |
| Prefix Tuning | Small (attention overhead) | No |
| LoRA | None if merged | Yes |
| Adapters | Small (extra sequential layers) | No (nonlinear; removing them requires distillation) |
Task-Specific Recommendations¶
Classification¶
Best: Prompt Tuning or LoRA (r=4)
- Simple task, doesn't need many parameters
- Quick training
- Low risk of overfitting
Instruction Following¶
Best: LoRA (r=8-16)
- Needs moderate capacity
- Benefits from merged deployment
- Well-studied for this use case
Domain Adaptation¶
Best: LoRA (r=16-32) or Adapters
- Need to learn new knowledge
- May need more capacity
- Consider combining with continued pretraining
Style/Persona Transfer¶
Best: LoRA (r=8) or Prefix Tuning
- Mainly steering behavior, not adding knowledge
- Works well with fewer parameters
Code Generation¶
Best: LoRA (r=8-16) on all attention weights
- Complex task
- Benefits from adapting Q, K, V, and O
- May also benefit from FFN adaptation
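One convenient way to keep these per-task choices straight is a small table of presets. This is a sketch only: the module names follow the Llama-style convention (q_proj/k_proj/v_proj/o_proj for attention, gate_proj/up_proj/down_proj for the FFN), and the values are starting points to tune, not tuned results:
```python
# Illustrative per-task LoRA presets
TASK_PRESETS = {
    "classification":        {"r": 4,  "target_modules": ["q_proj", "v_proj"]},
    "instruction_following": {"r": 8,  "target_modules": ["q_proj", "v_proj"]},
    "domain_adaptation":     {"r": 32, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]},
    "style_transfer":        {"r": 8,  "target_modules": ["q_proj", "v_proj"]},
    "code_generation":       {"r": 16, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                                                          "gate_proj", "up_proj", "down_proj"]},
}
```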
Combining Methods¶
Sometimes one method alone isn't enough. You can combine them:
LoRA + Prompt Tuning¶
from torch.optim import Adam

# Learn soft prompts AND low-rank weight updates.
# PromptTuning and add_lora_to_model are the helpers built when we implement
# these methods from scratch; `model` is the frozen pretrained base model.
soft_prompt = PromptTuning(d_model=4096, prompt_length=20)
lora_layers = add_lora_to_model(model, rank=8)

# Training: optimize both parameter sets jointly (the base weights stay frozen)
optimizer = Adam(
    list(soft_prompt.parameters()) + list(lora_layers.parameters())
)
Multiple LoRAs¶
Train separate LoRAs for different aspects:
- LoRA A: Style adaptation
- LoRA B: Domain knowledge
- Combine at inference with weighted sum
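Because each LoRA is just a low-rank update to the same base weight, they can be mixed with a weighted sum at load time. A minimal sketch for a single weight matrix (the random tensors stand in for trained values, and the mixing weights are illustrative):
```python
import torch

d, r, alpha = 4096, 8, 16
W = torch.randn(d, d)                                         # shared base weight
A_style,  B_style  = torch.randn(r, d), torch.randn(d, r)     # LoRA A: style adaptation
A_domain, B_domain = torch.randn(r, d), torch.randn(d, r)     # LoRA B: domain knowledge

# Weighted sum of the two low-rank updates, folded into the base weight
w_style, w_domain = 0.7, 0.3
W_combined = W + (alpha / r) * (
    w_style * (B_style @ A_style) + w_domain * (B_domain @ A_domain)
)
```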
Scaling Behavior¶
As models get larger, PEFT methods become relatively more efficient:
| Model Size | Full FT Memory | LoRA Memory | Savings |
|---|---|---|---|
| 1B | ~4GB | ~1GB | 4x |
| 7B | ~28GB | ~8GB | 3.5x |
| 70B | ~280GB | ~40GB | 7x |
The absolute memory saved grows with model size, so PEFT becomes increasingly attractive at scale.
Practical Recommendations¶
Starting Point¶
1. Try LoRA (r=8, α=16) on Q and V matrices
2. Train for a few epochs
3. Evaluate
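If you use the Hugging Face peft library rather than a from-scratch implementation, this starting point looks roughly like the sketch below (module names assume a Llama-style model; the checkpoint name is a placeholder):
```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                   # start low; raise to 16/32 if underfitting
    lora_alpha=16,
    lora_dropout=0.0,                      # add dropout (e.g. 0.05) if overfitting
    target_modules=["q_proj", "v_proj"],   # Q and V projections only, to start
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()    # sanity-check the trainable fraction
```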
If underfitting:
- Increase rank to 16 or 32
- Add more target modules (K, O, FFN)
If overfitting:
- Decrease rank to 4
- Add dropout
- Reduce training epochs
Production Deployment¶
- Train with LoRA
- Validate performance
- Merge weights for deployment
- No inference overhead
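To make the merge step concrete, here is a sketch for a single weight matrix, assuming the standard LoRA scaling W' = W + (α/r)·B·A (the peft library does this for every adapted module via merge_and_unload):
```python
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               rank: int = 8, alpha: int = 16) -> torch.Tensor:
    """Fold one LoRA pair into its base weight: W' = W + (alpha/rank) * B @ A."""
    return W + (alpha / rank) * (B @ A)

# Illustrative shapes for one 4096x4096 projection adapted with r=8
W = torch.randn(4096, 4096)
B = torch.randn(4096, 8)
A = torch.randn(8, 4096)
W_merged = merge_lora(W, A, B)   # same shape as W, so inference cost is unchanged
```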
Research/Experimentation¶
- Start with Prompt Tuning (fastest)
- Graduate to LoRA if needed
- Use Adapters for multi-task scenarios
Common Pitfalls¶
| Mistake | Why It's Wrong | Fix |
|---|---|---|
| Starting with full FT | Wasteful, often unnecessary | Start with LoRA |
| Rank too high | Overfitting, slow | Start low, increase if needed |
| Wrong target modules | Missing important weights | Start with Q+V, expand |
| Ignoring inference | Adapters add latency | Use LoRA for latency-sensitive apps |
| One-size-fits-all | Tasks have different needs | Tune hyperparams per task |
Summary¶
| Factor | Recommendation |
|---|---|
| Simple task + low compute | Prompt Tuning |
| General purpose | LoRA (r=8) |
| Complex task | LoRA (r=16+) or Adapters |
| Zero latency overhead | LoRA (merged) |
| Multi-task | Adapters or multiple LoRAs |
Default recommendation: Start with LoRA (r=8, α=16) on Q and V projections. Adjust from there.
Key insight: There's no single best method—the right choice depends on your specific constraints and requirements.
Next: We'll implement all these methods from scratch.