27 The Art of Hypothesis
Finding the Bottleneck Before Optimizing
George Pólya’s first principle of problem-solving: “Understand the problem.”
Before you optimize, you must answer: what is slow, and why?
Getting this wrong means optimizing the wrong thing. Getting this right means the optimization is often obvious.
27.1 The Premature Optimization Trap
Everyone knows Knuth’s quote: “Premature optimization is the root of all evil.”
Fewer people know the full quote: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”
The question is: how do you find that critical 3%?
Not by guessing. By forming and testing hypotheses.
27.2 The Diagnostic Questions
Before writing any optimization code, answer these questions:
27.2.1 1. What Is the Bottleneck?
Is your code slow because of:
□ CPU computation (compute-bound)
□ Memory access (memory-bound)
□ I/O operations (I/O-bound)
□ Waiting for something (latency-bound)
□ Synchronization (lock-bound)
For any given hot spot, one of these dominates, and each calls for a different remedy. Optimizing CPU computation when you're memory-bound is wasted effort.
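A cheap first split between "busy" and "waiting" is to compare wall-clock time with process CPU time. A minimal sketch (cpu_vs_wall and fn are illustrative names, not a library API):
import time

def cpu_vs_wall(fn, *args):
    """Compare wall-clock time to process CPU time for one call."""
    wall0, cpu0 = time.perf_counter(), time.process_time()
    fn(*args)
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    print(f"wall: {wall:.2f}s, cpu: {cpu:.2f}s")
    # cpu ≈ wall → the process was computing (CPU- or memory-bound)
    # cpu ≪ wall → the process was waiting (I/O-, latency-, or lock-bound)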
27.2.2 2. Where Is the Time Going?
What percentage of time is spent in each component?
- 80% in matrix multiply → optimize matmul
- 80% in data loading → optimize I/O
- 40% here, 30% there, 30% elsewhere → systemic issue
The 80/20 rule is real: 80% of time is usually in 20% of code.
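To get that percentage breakdown, the standard-library profiler is enough for a first pass. A minimal sketch, with my_function and data standing in for your workload:
import cProfile
import pstats

cProfile.run('my_function(data)', 'profile.out')
stats = pstats.Stats('profile.out')
stats.sort_stats('cumulative').print_stats(10)  # top 10 entries by cumulative time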
27.2.3 3. What Resource Is Constrained?
Resource        Symptom                     Check
─────────────────────────────────────────────────────────
CPU             100% utilization            top, htop
GPU compute     High SM utilization         nvidia-smi, Nsight
GPU memory BW   Low SM util, high BW        Roofline analysis
Host memory     Swapping, high RSS          free, vmstat
GPU memory      OOM errors                  nvidia-smi
Disk            High iowait                 iostat
Network         High bandwidth, latency     iftop, ping
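If you would rather collect the host-side rows from Python, a sketch using the third-party psutil package (pip install psutil):
import psutil

print(f"CPU: {psutil.cpu_percent(interval=1):.0f}% busy")
vm = psutil.virtual_memory()
print(f"Host memory: {vm.percent:.0f}% used, swap: {psutil.swap_memory().percent:.0f}% used")
io = psutil.disk_io_counters()
print(f"Disk since boot: {io.read_bytes >> 20} MiB read, {io.write_bytes >> 20} MiB written")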
27.2.4 4. What Changed?
If it was fast before and is slow now:
- What code changed?
- What data changed?
- What environment changed?
Often the best hypothesis comes from the diff.
27.3 The Hypothesis-Driven Workflow
27.3.1 Step 1: Observe
Measure without judgment. Just gather data.
# Initial observation
import time

start = time.perf_counter()
result = my_function(data)
elapsed = time.perf_counter() - start
print(f"Total time: {elapsed:.2f}s")
# This tells us: it's slow
# This doesn't tell us: why
27.3.2 Step 2: Hypothesize
Form a specific, testable hypothesis.
Bad hypothesis: "It's slow because it's inefficient."
(Vague, not testable)
Good hypothesis: "It's slow because the inner loop has
cache misses from strided memory access."
(Specific, testable, actionable)
A good hypothesis:
- Is specific about what and why
- Makes a prediction you can test
- Suggests what to measure
- Points toward a solution
27.3.3 Step 3: Test
Design an experiment that could falsify your hypothesis.
# Hypothesis: Cache misses from strided access
# Test 1: Profile cache behavior
import subprocess

result = subprocess.run(
    ['perf', 'stat', '-e', 'cache-misses,cache-references',
     'python', 'my_script.py'],
    capture_output=True
)
# Check the cache miss rate (perf stat writes its counters to stderr)

# Test 2: Compare with sequential access
def sequential_version(data):
    # Reorder so memory is accessed sequentially
    return process(data.T.contiguous())

time_strided = benchmark(original_version)
time_sequential = benchmark(sequential_version)
# If sequential is faster, the hypothesis is supported
27.3.4 Step 4: Conclude
Did the evidence support or refute your hypothesis?
Possible outcomes:
1. Hypothesis supported, improvement found
→ Ship it
2. Hypothesis refuted
→ Good! You learned something. Form new hypothesis.
3. Inconclusive
→ Need better measurement. Refine experiment.
27.3.5 Step 5: Iterate
Performance optimization is rarely one shot. Each conclusion leads to new observations.
Observation: Forward pass is slow
Hypothesis 1: Memory-bound on attention
Test: Profile memory bandwidth → Low utilization
Conclusion: Not memory-bound. Refuted.
Hypothesis 2: Compute-bound on FFN
Test: Profile compute utilization → 100% on FFN
Conclusion: Compute-bound. Supported.
Hypothesis 3: FFN is doing redundant work
Test: Examine FFN code → Unnecessary recomputation
Conclusion: Yes! Fix it.
Result: 2× speedup
27.4 Case Study: The Slow Training Loop
Let’s work through a real debugging session.
27.4.1 Observation
# Training loop takes 30 minutes per epoch
# Expected: 10 minutes per epoch
# 3× slower than expected
for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
27.4.2 Hypothesis 1: GPU Is Underutilized
Reasoning: Training is usually GPU-bound. If GPU utilization is low, something is starving it.
Test:
watch -n 1 nvidia-smi
# Output:
# GPU-Util: 30%  ← Very low!
# Memory: 50%
Conclusion: Supported. GPU is only 30% utilized. Something else is the bottleneck.
27.4.3 Hypothesis 2: Data Loading Is the Bottleneck
Reasoning: Low GPU utilization often means CPU can’t feed data fast enough.
Test:
# Profile data loading vs compute
import time

load_times = []
compute_times = []
it = iter(dataloader)
while True:
    start = time.perf_counter()
    try:
        batch = next(it)          # time spent fetching/collating the batch
    except StopIteration:
        break
    batch = batch.to('cuda')      # include the host→device copy in "load"
    load_times.append(time.perf_counter() - start)

    start = time.perf_counter()
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()      # wait for queued GPU work before stopping the clock
    compute_times.append(time.perf_counter() - start)

print(f"Load: {sum(load_times):.1f}s, Compute: {sum(compute_times):.1f}s")
# Output: Load: 25.3s, Compute: 4.7s
Conclusion: Supported. Data loading is 5× slower than compute!
27.4.4 Hypothesis 3: DataLoader Workers Are Insufficient
Reasoning: The default is num_workers=0, so batches are loaded serially in the training process. Adding workers enables parallel loading.
Test:
# Current
dataloader = DataLoader(dataset, batch_size=32, num_workers=0)
# Time: 30 min/epoch
# Proposed fix
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
# Time: ???
# Result: 12 min/epoch (2.5× faster)
Conclusion: Partially supported. 2.5× faster, but still short of the 10 min target.
27.4.5 Hypothesis 4: Pin Memory Is Not Enabled
Reasoning: pin_memory=True stages batches in page-locked host memory, which speeds up CPU→GPU transfers (and makes asynchronous copies possible).
Test:
dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True  # ← Add this
)
# Time: 10.5 min/epoch
Conclusion: Supported. Now at target.
27.4.6 Summary
Initial: 30 min/epoch
H1: GPU underutilized → Confirmed
H2: Data loading slow → Confirmed (25s load, 5s compute)
H3: Need more workers → Partial (30 → 12 min)
H4: Enable pin_memory → Confirmed (12 → 10.5 min)
Final: 10.5 min/epoch (2.9× faster)
Total optimization time: ~20 minutes of investigation, 2 lines of code changed.
27.5 The USE Method as Hypothesis Generator
Brendan Gregg’s USE method (Chapter 15) is also a hypothesis generator:
For each resource, check:
Utilization → "Is X being used heavily?"
Saturation → "Is X overwhelmed with queued work?"
Errors → "Is X failing?"
Each check either confirms or rules out that resource as the bottleneck.
Systematic application:
def use_checklist():
    """Generate hypotheses using the USE method."""
    resources = ['CPU', 'GPU', 'Memory', 'Disk', 'Network']
    for resource in resources:
        print(f"\n{resource}:")
        print("  Utilization: [measure % busy]")
        print("  Saturation:  [measure queue depth]")
        print("  Errors:      [measure error count]")

# High utilization + high saturation → bottleneck
# High utilization + low saturation  → well-utilized
# Low utilization                    → not the bottleneck
27.6 Common Hypothesis Patterns
27.6.1 Pattern: High Latency, Low Utilization
Symptom: Operation is slow, but CPU/GPU utilization is low
Diagnosis: Waiting for something
Common causes:
- I/O (disk, network)
- Synchronization (locks, barriers)
- Memory stalls (cache misses)
- Kernel launch overhead (GPU)
Hypothesis template:
"The operation is slow because it's waiting for [X]."
27.6.2 Pattern: High Utilization, High Latency
Symptom: CPU/GPU at 100%, but still slow
Diagnosis: Actually compute-bound (doing too much work)
Common causes:
- Algorithm complexity too high
- Redundant computation
- No caching of repeated work
- Wrong precision (FP64 instead of FP16)
Hypothesis template:
"The operation is compute-bound because [X]."
27.6.3 Pattern: Memory Usage Grows Over Time
Symptom: Memory increases during execution
Diagnosis: Memory leak or accumulation
Common causes:
- Accumulating tensors in list
- Computation graph retained (accumulating loss tensors instead of loss.item())
- Cache not bounded
Hypothesis template:
"Memory is accumulating because [X] is not being freed."
27.6.4 Pattern: GPU Memory Full, Low Compute
Symptom: GPU memory near limit, but SM utilization low
Diagnosis: Memory-bound or wrong batch size
Common causes:
- Batch size too small (underutilizing compute)
- Memory-bound kernels (need algorithmic change)
- Fragmentation (need memory pool)
Hypothesis template:
"GPU is underutilized because [memory/batch size issue]."
27.7 The Roofline as Hypothesis Framework
The roofline model (Chapter 2) provides a structured way to generate hypotheses:
Performance
   ↑
   │          ╱───────────────────────  Peak compute (flat roof)
   │        ╱
   │      ╱    Memory-bound      Compute-bound
   │    ╱      region            region
   │  ╱   slope = peak memory bandwidth
   └─────────┬────────────────────────→ Arithmetic intensity (FLOPs/byte)
             ↑
       Ridge point (balance point)
Your operation is either:
- Left of ridge: memory-bound → optimize memory access
- Right of ridge: compute-bound → optimize computation
- Below roofline: inefficient → optimize implementation
This immediately suggests hypotheses:
def roofline_hypothesis(measured_flops, measured_bytes, peak_flops, peak_bw):
    """Generate a hypothesis from the operation's roofline position.

    measured_flops: total FLOPs the operation performs
    measured_bytes: total bytes it moves to/from memory
    peak_flops:     hardware peak throughput (FLOP/s)
    peak_bw:        hardware peak memory bandwidth (bytes/s)
    """
    ridge_point = peak_flops / peak_bw                       # FLOPs/byte
    arithmetic_intensity = measured_flops / measured_bytes   # FLOPs/byte
    if arithmetic_intensity < ridge_point:
        return ("Memory-bound. Hypotheses:\n"
                "  1. Improve data reuse (tiling)\n"
                "  2. Reduce data movement (fusion)\n"
                "  3. Use lower precision (quantization)")
    else:
        return ("Compute-bound. Hypotheses:\n"
                "  1. Reduce operations (algorithm change)\n"
                "  2. Use specialized hardware (tensor cores)\n"
                "  3. Increase parallelism")
27.8 When Hypotheses Fail
Sometimes your hypothesis is wrong. That’s good—you learned something.
27.8.1 Wrong Hypothesis: “It’s Compute-Bound”
# Hypothesis: Matrix multiply is slow because of compute
# Test: Profile compute utilization
# Result: 20% GPU utilization
# Hypothesis refuted. It's NOT compute-bound.
# New hypothesis needed.
27.8.2 Wrong Hypothesis: "It's a Slow Function"
# Hypothesis: slow_function() is the bottleneck
# Test: Remove slow_function(), measure total time
# Result: No change (!)
# Hypothesis refuted. slow_function() isn't on critical path.
# Maybe it's running in parallel with something else.
27.8.3 Wrong Hypothesis: "The Fix Will Help"
# Hypothesis: Vectorization will speed up the loop
# Test: Vectorize and benchmark
# Result: 2% slower (!)
# Hypothesis refuted. Why?
# Investigation: Vectorized version uses more memory,
# causes cache eviction, and memory becomes the bottleneck.
Each failed hypothesis teaches you about the system.
27.9 The Art of Good Hypotheses
27.9.1 Be Specific
Bad: "It's slow."
Good: "The forward pass is slow because attention is recomputing
softmax for cached values."
27.9.2 Be Quantitative
Bad: "Memory is high."
Good: "Memory usage is 15GB, expected 8GB. The 7GB excess is
in the optimizer states (checked with memory snapshot)."
27.9.3 Be Mechanistic
Bad: "The GPU is slow."
Good: "The GPU is underutilized because kernel launches are
serialized through a single CUDA stream."
27.9.4 Be Falsifiable
Bad: "The code is inefficient."
(How would you prove this wrong?)
Good: "The code is doing 2× more FLOPs than necessary due to
recomputation of X."
(Count the FLOPs, compare to optimal.)
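To make "count the FLOPs, compare to optimal" concrete, a sketch for a dense matmul (assumes a CUDA device; 2mnk is the standard FLOP count):
import time
import torch

m = n = k = 4096
a = torch.randn(m, k, device='cuda')
b = torch.randn(k, n, device='cuda')

torch.cuda.synchronize()
start = time.perf_counter()
c = a @ b
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops = 2 * m * n * k  # one multiply + one add per inner-product term
print(f"{flops / elapsed / 1e12:.2f} TFLOP/s achieved")
# Compare against the device's peak to quantify "inefficient"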
27.10 Key Takeaways
Understand before optimizing: The wrong optimization is worse than no optimization. Find the bottleneck first.
Form specific hypotheses: Vague hunches aren’t testable. “It’s memory-bound because X” is testable.
Test to falsify: Design experiments that could prove you wrong. That’s how you learn.
Iterate: Each test teaches you something. Use it to form better hypotheses.
Use frameworks: USE method, roofline model, and systematic checklists generate hypotheses you might miss.
The best performance engineers aren’t the ones who optimize the fastest. They’re the ones who find the right thing to optimize.
27.11 Further Reading
- Pólya (1945). “How to Solve It”
- Gregg (2020). “Systems Performance” (Chapter 2: Methodologies)
- Hennessy & Patterson (2017). “Computer Architecture: A Quantitative Approach”
- The Scientific Method (yes, that one from school)