Section 9.5: Choosing a Method¶
Reading time: 8 minutes
The Decision Framework¶
With multiple PEFT methods available, how do you choose? Consider these factors:
- Task complexity: harder tasks generally need more trainable capacity (higher rank, more target modules)
- Available compute: prompt tuning is the cheapest to train; full fine-tuning the most expensive
- Inference requirements: whether you can tolerate extra latency, or need updates that merge back into the base weights
- Data size: small datasets overfit high-capacity methods more easily
Quick Reference¶
| If You Need... | Use |
|---|---|
| Maximum efficiency | Prompt Tuning |
| Balance of quality and efficiency | LoRA |
| Maximum quality (near full fine-tuning) | Adapters or LoRA with high rank |
| Zero inference overhead | LoRA (merged) |
| Quick experiments | Prompt Tuning |
| Production deployment | LoRA (merged) |
Decision Tree¶
Start
│
├─ Is inference latency critical?
│ │
│ Yes ──▶ LoRA (can merge weights)
│ │
│ No ──▶ Continue
│
├─ Is compute extremely limited?
│ │
│ Yes ──▶ Prompt Tuning
│ │
│ No ──▶ Continue
│
├─ Is the task simple (classification, etc.)?
│ │
│ Yes ──▶ Prompt Tuning or LoRA (r=4)
│ │
│ No ──▶ Continue
│
├─ Do you need to combine multiple tasks?
│ │
│ Yes ──▶ Adapters (easy to swap/combine)
│ │
│ No ──▶ Continue
│
└─ Default recommendation ──▶ LoRA (r=8, α=16)
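The same logic can be expressed as a small helper. This is purely a restatement of the tree above (the function name and boolean flags are hypothetical, not part of any library):
```python
def choose_peft_method(latency_critical: bool, compute_limited: bool,
                       simple_task: bool, multi_task: bool) -> str:
    """Mirror of the decision tree: returns a suggested starting method."""
    if latency_critical:
        return "LoRA (merge weights after training)"
    if compute_limited:
        return "Prompt Tuning"
    if simple_task:
        return "Prompt Tuning or LoRA (r=4)"
    if multi_task:
        return "Adapters (easy to swap/combine)"
    return "LoRA (r=8, alpha=16)"
```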
Detailed Comparison¶
Parameter Efficiency¶
| Method | Params (7B model) | Storage |
|---|---|---|
| Prompt Tuning (20 tokens) | ~80K | ~300KB |
| Prefix Tuning (10 per layer) | ~2.6M | ~10MB |
| LoRA (r=8, Q+V) | ~4M | ~16MB |
| Adapters (bottleneck=64) | ~50M | ~200MB |
| Full Fine-Tuning | ~7B | ~28GB |
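These counts follow from a quick back-of-the-envelope calculation. A sketch assuming a 7B-class architecture (d_model = 4096, 32 layers) and fp32 storage at 4 bytes per parameter; adapter counts are omitted because they depend on where the adapters are inserted and whether biases are counted:
```python
# Back-of-the-envelope parameter counts behind the table above
d_model, n_layers, bytes_per_param = 4096, 32, 4

counts = {
    "Prompt Tuning (20 tokens)": 20 * d_model,                    # ~82K
    "Prefix Tuning (10/layer)":  n_layers * 10 * 2 * d_model,     # K and V prefixes -> ~2.6M
    "LoRA (r=8, Q+V)":           n_layers * 2 * 2 * d_model * 8,  # A and B per module -> ~4.2M
}

for name, n in counts.items():
    print(f"{name:28s} {n / 1e6:5.2f}M params  ~{n * bytes_per_param / 1e6:6.1f}MB")
```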
Quality vs Efficiency Trade-off¶
Approximate quality (relative to full fine-tuning) versus the fraction of trainable parameters, using the 7B counts above:
| Method | Trainable Params (share of 7B) | Approx. Quality vs Full Fine-Tuning |
|---|---|---|
| Prompt Tuning | ~0.001% | ~85% |
| Prefix Tuning | ~0.04% | ~90% |
| LoRA (r=8) | ~0.06% | ~95-98% |
| Adapters | ~0.7% | ~98% |
| Full Fine-Tuning | 100% | 100% |
Most of the quality gap closes within the first fraction of a percent of trainable parameters; beyond that, returns diminish quickly.
Inference Overhead¶
| Method | Extra Compute | Can Merge? |
|---|---|---|
| Prompt Tuning | Minimal (longer sequence) | N/A |
| Prefix Tuning | Small (attention overhead) | No |
| LoRA | None if merged | Yes |
| Adapters | Small (extra sequential layers) | No (nonlinear; removing them requires distillation) |
Task-Specific Recommendations¶
Classification¶
Best: Prompt Tuning or LoRA (r=4)
- Simple task, doesn't need many parameters
- Quick training
- Low risk of overfitting
Instruction Following¶
Best: LoRA (r=8-16)
- Needs moderate capacity
- Benefits from merged deployment
- Well-studied for this use case
Domain Adaptation¶
Best: LoRA (r=16-32) or Adapters
- Need to learn new knowledge
- May need more capacity
- Consider combining with continued pretraining
Style/Persona Transfer¶
Best: LoRA (r=8) or Prefix Tuning
- Mainly steering behavior, not adding knowledge
- Works well with fewer parameters
Code Generation¶
Best: LoRA (r=8-16) on all attention weights
- Complex task
- Benefits from adapting Q, K, V, and O
- May also benefit from FFN adaptation
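One convenient way to keep these per-task choices straight is a small table of presets. This is a sketch only: the module names follow the Llama-style convention (q_proj/k_proj/v_proj/o_proj for attention, gate_proj/up_proj/down_proj for the FFN), and the values are starting points to tune, not tuned results:
```python
# Illustrative per-task LoRA presets
TASK_PRESETS = {
    "classification":        {"r": 4,  "target_modules": ["q_proj", "v_proj"]},
    "instruction_following": {"r": 8,  "target_modules": ["q_proj", "v_proj"]},
    "domain_adaptation":     {"r": 32, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"]},
    "style_transfer":        {"r": 8,  "target_modules": ["q_proj", "v_proj"]},
    "code_generation":       {"r": 16, "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj",
                                                          "gate_proj", "up_proj", "down_proj"]},
}
```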
Combining Methods¶
Sometimes one method alone isn't enough. You can combine them:
LoRA + Prompt Tuning¶
from torch.optim import Adam

# Learn soft prompts AND low-rank weight updates.
# PromptTuning and add_lora_to_model are the helpers built when we implement
# these methods from scratch; `model` is the frozen pretrained base model.
soft_prompt = PromptTuning(d_model=4096, prompt_length=20)
lora_layers = add_lora_to_model(model, rank=8)

# Training: optimize both parameter sets jointly (the base weights stay frozen)
optimizer = Adam(
    list(soft_prompt.parameters()) + list(lora_layers.parameters())
)
Multiple LoRAs¶
Train separate LoRAs for different aspects:
- LoRA A: Style adaptation
- LoRA B: Domain knowledge
- Combine at inference with weighted sum
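Because each LoRA is just a low-rank update to the same base weight, they can be mixed with a weighted sum at load time. A minimal sketch for a single weight matrix (the random tensors stand in for trained values, and the mixing weights are illustrative):
```python
import torch

d, r, alpha = 4096, 8, 16
W = torch.randn(d, d)                                         # shared base weight
A_style,  B_style  = torch.randn(r, d), torch.randn(d, r)     # LoRA A: style adaptation
A_domain, B_domain = torch.randn(r, d), torch.randn(d, r)     # LoRA B: domain knowledge

# Weighted sum of the two low-rank updates, folded into the base weight
w_style, w_domain = 0.7, 0.3
W_combined = W + (alpha / r) * (
    w_style * (B_style @ A_style) + w_domain * (B_domain @ A_domain)
)
```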
Scaling Behavior¶
As models get larger, PEFT methods become relatively more efficient:
| Model Size | Full FT Memory | LoRA Memory | Savings |
|---|---|---|---|
| 1B | ~4GB | ~1GB | 4x |
| 7B | ~28GB | ~8GB | 3.5x |
| 70B | ~280GB | ~40GB | 7x |
The absolute memory saved grows with model size, so PEFT becomes increasingly attractive at scale.
Practical Recommendations¶
Starting Point¶
1. Try LoRA (r=8, α=16) on Q and V matrices
2. Train for a few epochs
3. Evaluate
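If you use the Hugging Face peft library rather than a from-scratch implementation, this starting point looks roughly like the sketch below (module names assume a Llama-style model; the checkpoint name is a placeholder):
```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

config = LoraConfig(
    r=8,                                   # start low; raise to 16/32 if underfitting
    lora_alpha=16,
    lora_dropout=0.0,                      # add dropout (e.g. 0.05) if overfitting
    target_modules=["q_proj", "v_proj"],   # Q and V projections only, to start
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, config)
peft_model.print_trainable_parameters()    # sanity-check the trainable fraction
```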
If underfitting:
- Increase rank to 16 or 32
- Add more target modules (K, O, FFN)
If overfitting:
- Decrease rank to 4
- Add dropout
- Reduce training epochs
Production Deployment¶
- Train with LoRA
- Validate performance
- Merge weights for deployment
- No inference overhead
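To make the merge step concrete, here is a sketch for a single weight matrix, assuming the standard LoRA scaling W' = W + (α/r)·B·A (the peft library does this for every adapted module via merge_and_unload):
```python
import torch

def merge_lora(W: torch.Tensor, A: torch.Tensor, B: torch.Tensor,
               rank: int = 8, alpha: int = 16) -> torch.Tensor:
    """Fold one LoRA pair into its base weight: W' = W + (alpha/rank) * B @ A."""
    return W + (alpha / rank) * (B @ A)

# Illustrative shapes for one 4096x4096 projection adapted with r=8
W = torch.randn(4096, 4096)
B = torch.randn(4096, 8)
A = torch.randn(8, 4096)
W_merged = merge_lora(W, A, B)   # same shape as W, so inference cost is unchanged
```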
Research/Experimentation¶
- Start with Prompt Tuning (fastest)
- Graduate to LoRA if needed
- Use Adapters for multi-task scenarios
Common Pitfalls¶
| Mistake | Why It's Wrong | Fix |
|---|---|---|
| Starting with full FT | Wasteful, often unnecessary | Start with LoRA |
| Rank too high | Overfitting, slow | Start low, increase if needed |
| Wrong target modules | Missing important weights | Start with Q+V, expand |
| Ignoring inference | Adapters add latency | Use LoRA for latency-sensitive apps |
| One-size-fits-all | Tasks have different needs | Tune hyperparams per task |
Summary¶
| Factor | Recommendation |
|---|---|
| Simple task + low compute | Prompt Tuning |
| General purpose | LoRA (r=8) |
| Complex task | LoRA (r=16+) or Adapters |
| Zero latency overhead | LoRA (merged) |
| Multi-task | Adapters or multiple LoRAs |
Default recommendation: Start with LoRA (r=8, α=16) on Q and V projections. Adjust from there.
Key insight: There's no single best method—the right choice depends on your specific constraints and requirements.
Next: We'll implement all these methods from scratch.