27  The Art of Hypothesis

Finding the Bottleneck Before Optimizing

George Pólya’s first principle of problem-solving: “Understand the problem.”

Before you optimize, you must answer: what is slow, and why?

Getting this wrong means optimizing the wrong thing. Getting this right means the optimization is often obvious.

27.1 The Premature Optimization Trap

Everyone knows Knuth’s quote: “Premature optimization is the root of all evil.”

Fewer people know the full quote: “We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3%.”

The question is: how do you find that critical 3%?

Not by guessing. By forming and testing hypotheses.

27.2 The Diagnostic Questions

Before writing any optimization code, answer these questions:

27.2.1 1. What Is the Bottleneck?

Is your code slow because of:
  □ CPU computation (compute-bound)
  □ Memory access (memory-bound)
  □ I/O operations (I/O-bound)
  □ Waiting for something (latency-bound)
  □ Synchronization (lock-bound)

Each of these calls for a different remedy. Optimizing CPU computation when you’re memory-bound is wasted effort.

27.2.2 2. Where Is the Time Going?

What percentage of time is spent in each component?
  - 80% in matrix multiply → optimize matmul
  - 80% in data loading → optimize I/O
  - 40% here, 30% there, 30% elsewhere → systemic issue

The 80/20 rule is real: 80% of time is usually in 20% of code.
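
One quick way to get that split is Python’s built-in profiler; a minimal sketch, where my_function and data stand in for the code under investigation:

import cProfile
import pstats

# Profile the function and keep the stats in memory
profiler = cProfile.Profile()
profiler.enable()
result = my_function(data)      # placeholder for the code under investigation
profiler.disable()

# Print the 10 functions with the most cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats('cumulative').print_stats(10)

# If one or two entries dominate, you have found your 20%.
# If time is spread thin, suspect a systemic issue instead.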

27.2.3 3. What Resource Is Constrained?

  Resource        Symptom                    Check
  ──────────────────────────────────────────────────
  CPU             100% utilization           top, htop
  GPU compute     High SM utilization        nvidia-smi, Nsight
  GPU memory BW   Low SM util, high BW       Roofline analysis
  Host memory     Swapping, high RSS         free, vmstat
  GPU memory      OOM errors                 nvidia-smi
  Disk            High iowait                iostat
  Network         High bandwidth, latency    iftop, ping
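
The tools above answer these questions interactively. For a quick programmatic snapshot, a rough sketch, assuming psutil is installed and nvidia-smi is on the PATH:

import subprocess
import psutil  # assumed installed: pip install psutil

# Host-side snapshot
print(f"CPU util:  {psutil.cpu_percent(interval=1.0):.0f}%")
print(f"Host mem:  {psutil.virtual_memory().percent:.0f}% used")

# GPU-side snapshot via nvidia-smi's CSV query interface
gpu = subprocess.run(
    ['nvidia-smi',
     '--query-gpu=utilization.gpu,memory.used,memory.total',
     '--format=csv,noheader'],
    capture_output=True, text=True
)
print(f"GPU:       {gpu.stdout.strip()}")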

27.2.4 4. What Changed?

If it was fast before and is slow now:
  - What code changed?
  - What data changed?
  - What environment changed?

Often the best hypothesis comes from the diff.

27.3 The Hypothesis-Driven Workflow

27.3.1 Step 1: Observe

Measure without judgment. Just gather data.

# Initial observation
import time

start = time.perf_counter()
result = my_function(data)
elapsed = time.perf_counter() - start
print(f"Total time: {elapsed:.2f}s")

# This tells us: it's slow
# This doesn't tell us: why

27.3.2 Step 2: Hypothesize

Form a specific, testable hypothesis.

Bad hypothesis:  "It's slow because it's inefficient."
                 (Vague, not testable)

Good hypothesis: "It's slow because the inner loop has
                  cache misses from strided memory access."
                 (Specific, testable, actionable)

A good hypothesis:
  - Is specific about what and why
  - Makes a prediction you can test
  - Suggests what to measure
  - Points toward a solution

27.3.3 Step 3: Test

Design an experiment that could falsify your hypothesis.

# Hypothesis: Cache misses from strided access

# Test 1: Profile cache behavior (perf writes its stats to stderr)
import subprocess
result = subprocess.run(
    ['perf', 'stat', '-e', 'cache-misses,cache-references',
     'python', 'my_script.py'],
    capture_output=True, text=True
)
print(result.stderr)  # check the cache miss rate

# Test 2: Compare with sequential access
def sequential_version(data):
    # Reorder so the hot loop walks memory contiguously
    return process(data.T.contiguous())

time_strided = benchmark(original_version)
time_sequential = benchmark(sequential_version)
# If sequential is significantly faster, the hypothesis is supported

27.3.4 Step 4: Conclude

Did the evidence support or refute your hypothesis?

Possible outcomes:

1. Hypothesis supported, improvement found
   → Ship it

2. Hypothesis refuted
   → Good! You learned something. Form new hypothesis.

3. Inconclusive
   → Need better measurement. Refine experiment.

27.3.5 Step 5: Iterate

Performance optimization is rarely one shot. Each conclusion leads to new observations.

Observation: Forward pass is slow
Hypothesis 1: Memory-bound on attention
Test: Profile memory bandwidth → Low utilization
Conclusion: Not memory-bound. Refuted.

Hypothesis 2: Compute-bound on FFN
Test: Profile compute utilization → 100% on FFN
Conclusion: Compute-bound. Supported.

Hypothesis 3: FFN is doing redundant work
Test: Examine FFN code → Unnecessary recomputation
Conclusion: Yes! Fix it.

Result: 2× speedup

27.4 Case Study: The Slow Training Loop

Let’s work through a real debugging session.

27.4.1 Observation

# Training loop takes 30 minutes per epoch
# Expected: 10 minutes per epoch
# 3× slower than expected

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()

27.4.2 Hypothesis 1: GPU Is Underutilized

Reasoning: Training is usually GPU-bound. If GPU utilization is low, something is starving it.

Test:

watch -n 1 nvidia-smi

# Output:
# GPU-Util: 30%  ← Very low!
# Memory:   50%

Conclusion: Supported. GPU is only 30% utilized. Something else is the bottleneck.

27.4.3 Hypothesis 2: Data Loading Is the Bottleneck

Reasoning: Low GPU utilization often means CPU can’t feed data fast enough.

Test:

# Profile data loading vs compute
import time

load_times = []
compute_times = []

end = time.perf_counter()
for batch in dataloader:
    # Everything since the previous step ended: waiting on the
    # dataloader plus the host-to-device copy
    batch = batch.to('cuda')
    load_times.append(time.perf_counter() - end)

    start = time.perf_counter()
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()
    torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    compute_times.append(time.perf_counter() - start)
    end = time.perf_counter()

print(f"Load: {sum(load_times):.1f}s, Compute: {sum(compute_times):.1f}s")
# Output: Load: 25.3s, Compute: 4.7s

Conclusion: Supported. Data loading is 5× slower than compute!

27.4.4 Hypothesis 3: DataLoader Workers Are Insufficient

Reasoning: Default is num_workers=0 (single process). Adding workers enables parallel loading.

Test:

# Current
dataloader = DataLoader(dataset, batch_size=32, num_workers=0)
# Time: 30 min/epoch

# Proposed fix
dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
# Time: ???

# Result: 12 min/epoch (2.5× faster)

Conclusion: Partially supported. 2.5× faster, but still not 10 min target.

27.4.5 Hypothesis 4: Pin Memory Is Not Enabled

Reasoning: pin_memory=True enables faster CPU→GPU transfer.

Test:

dataloader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,
    pin_memory=True  # ← Add this
)
# Time: 10.5 min/epoch

Conclusion: Supported. Now at target.
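
Pinned memory pays off most when the host-to-device copy is also made asynchronous. A minimal sketch of the transfer, assuming the same loop as before:

for batch in dataloader:
    # With a pinned source buffer, non_blocking=True lets the copy
    # overlap with GPU work that is already queued
    batch = batch.to('cuda', non_blocking=True)
    optimizer.zero_grad()
    loss = model(batch)
    loss.backward()
    optimizer.step()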

27.4.6 Summary

Initial: 30 min/epoch

H1: GPU underutilized → Confirmed
H2: Data loading slow → Confirmed (25s load, 5s compute)
H3: Need more workers → Partial (30 → 12 min)
H4: Enable pin_memory → Confirmed (12 → 10.5 min)

Final: 10.5 min/epoch (2.9× faster)

Total optimization time: ~20 minutes of investigation, 2 lines of code changed.

27.5 The USE Method as Hypothesis Generator

Brendan Gregg’s USE method (Chapter 15) is also a hypothesis generator:

For each resource, check:

Utilization → "Is X being used heavily?"
Saturation  → "Is X overwhelmed with queued work?"
Errors      → "Is X failing?"

Each check either confirms or rules out that resource as the bottleneck.

Systematic application:

def use_checklist():
    """Generate hypotheses using the USE method"""
    resources = ['CPU', 'GPU', 'Memory', 'Disk', 'Network']

    for resource in resources:
        print(f"\n{resource}:")
        print("  Utilization: [measure % busy]")
        print("  Saturation:  [measure queue depth]")
        print("  Errors:      [measure error count]")

        # High utilization + high saturation → bottleneck
        # High utilization + low saturation → well-utilized
        # Low utilization → not the bottleneck

27.6 Common Hypothesis Patterns

27.6.1 Pattern: High Latency, Low Utilization

Symptom:   Operation is slow, but CPU/GPU utilization is low
Diagnosis: Waiting for something

Common causes:
  - I/O (disk, network)
  - Synchronization (locks, barriers)
  - Memory stalls (cache misses)
  - Kernel launch overhead (GPU)

Hypothesis template:
  "The operation is slow because it's waiting for [X]."

27.6.2 Pattern: High Utilization, High Latency

Symptom:   CPU/GPU at 100%, but still slow
Diagnosis: Actually compute-bound (doing too much work)

Common causes:
  - Algorithm complexity too high
  - Redundant computation
  - No caching of repeated work
  - Wrong precision (FP64 instead of FP16)

Hypothesis template:
  "The operation is compute-bound because [X]."

27.6.3 Pattern: Memory Usage Grows Over Time

Symptom:   Memory increases during execution
Diagnosis: Memory leak or accumulation

Common causes:
  - Accumulating tensors in list
  - Computation graph kept alive (e.g., accumulating loss instead of loss.item())
  - Cache not bounded

Hypothesis template:
  "Memory is accumulating because [X] is not being freed."

27.6.4 Pattern: GPU Memory Full, Low Compute

Symptom:   GPU memory near limit, but SM utilization low
Diagnosis: Memory-bound or wrong batch size

Common causes:
  - Batch size too small (underutilizing compute)
  - Memory-bound kernels (need algorithmic change)
  - Fragmentation (need memory pool)

Hypothesis template:
  "GPU is underutilized because [memory/batch size issue]."

27.7 The Roofline as Hypothesis Framework

The roofline model (Chapter 2) provides a structured way to generate hypotheses:

          Peak Compute ─────────────────────────────────
                       ╱
                      ╱
                     ╱ Memory-bound region
                    ╱
                   ╱
        ─────────╱───────────────────────────────────────
                ↑
          Ridge point (balance point)

Your operation is either:
  - Left of ridge: memory-bound → optimize memory access
  - Right of ridge: compute-bound → optimize computation
  - Below roofline: inefficient → optimize implementation

This immediately suggests hypotheses:

def roofline_hypothesis(measured_flops, measured_bandwidth, peak_flops, peak_bw):
    """Generate a hypothesis from roofline position.

    measured_flops:     achieved FLOP/s of the operation
    measured_bandwidth: achieved memory traffic in bytes/s
    peak_flops:         hardware peak FLOP/s
    peak_bw:            hardware peak memory bandwidth in bytes/s
    """
    ridge_point = peak_flops / peak_bw                          # FLOPs/byte
    arithmetic_intensity = measured_flops / measured_bandwidth  # FLOPs/byte

    if arithmetic_intensity < ridge_point:
        return ("Memory-bound. Hypotheses:\n"
                "  1. Improve data reuse (tiling)\n"
                "  2. Reduce data movement (fusion)\n"
                "  3. Use lower precision (quantization)")
    else:
        return ("Compute-bound. Hypotheses:\n"
                "  1. Reduce operations (algorithm change)\n"
                "  2. Use specialized hardware (tensor cores)\n"
                "  3. Increase parallelism")

27.8 When Hypotheses Fail

Sometimes your hypothesis is wrong. That’s good—you learned something.

27.8.1 Wrong Hypothesis: “It’s Compute-Bound”

# Hypothesis: Matrix multiply is slow because of compute
# Test: Profile compute utilization
# Result: 20% GPU utilization

# Hypothesis refuted. It's NOT compute-bound.
# New hypothesis needed.

27.8.2 Wrong Hypothesis: “It’s a Slow Function”

# Hypothesis: slow_function() is the bottleneck
# Test: Remove slow_function(), measure total time
# Result: No change (!)

# Hypothesis refuted. slow_function() isn't on critical path.
# Maybe it's running in parallel with something else.

27.8.3 Wrong Hypothesis: “The Fix Will Help”

# Hypothesis: Vectorization will speed up the loop
# Test: Vectorize and benchmark
# Result: 2% slower (!)

# Hypothesis refuted. Why?
# Investigation: Vectorized version uses more memory,
# causes cache eviction, memory becomes bottleneck.

Each failed hypothesis teaches you about the system.

27.9 The Art of Good Hypotheses

27.9.1 Be Specific

Bad:  "It's slow."
Good: "The forward pass is slow because attention is recomputing
       softmax for cached values."

27.9.2 Be Quantitative

Bad:  "Memory is high."
Good: "Memory usage is 15GB, expected 8GB. The 7GB excess is
       in the optimizer states (checked with memory snapshot)."

27.9.3 Be Mechanistic

Bad:  "The GPU is slow."
Good: "The GPU is underutilized because kernel launches are
       serialized through a single CUDA stream."

27.9.4 Be Falsifiable

Bad:  "The code is inefficient."
       (How would you prove this wrong?)

Good: "The code is doing 2× more FLOPs than necessary due to
       recomputation of X."
       (Count the FLOPs, compare to optimal.)
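
Counting the FLOPs is often simple enough to do by hand. A sketch for a dense matrix multiply, with illustrative shapes:

# A matmul of (M, K) @ (K, N) does roughly 2*M*K*N FLOPs
# (one multiply and one add per inner-product term)
M, K, N = 4096, 4096, 4096
expected_flops = 2 * M * K * N          # ≈ 1.37e11 FLOPs

# If a profiler reports ~2× this for the layer, the "redundant
# recomputation" hypothesis gains support
print(f"Expected: {expected_flops / 1e9:.1f} GFLOPs")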

27.10 Key Takeaways

  1. Understand before optimizing: The wrong optimization is worse than no optimization. Find the bottleneck first.

  2. Form specific hypotheses: Vague hunches aren’t testable. “It’s memory-bound because X” is testable.

  3. Test to falsify: Design experiments that could prove you wrong. That’s how you learn.

  4. Iterate: Each test teaches you something. Use it to form better hypotheses.

  5. Use frameworks: USE method, roofline model, and systematic checklists generate hypotheses you might miss.

The best performance engineers aren’t the ones who optimize the fastest. They’re the ones who find the right thing to optimize.

Note: Try It Yourself

The accompanying notebook lets you:

  • Practice the hypothesis-driven workflow on sample problems
  • Use the USE method checklist
  • Analyze roofline position and generate hypotheses
  • Debug simulated performance issues


27.11 Further Reading

  • Pólya (1945). “How to Solve It”
  • Gregg (2020). “Systems Performance” (Chapter 2: Methodologies)
  • Hennessy & Patterson (2017). “Computer Architecture: A Quantitative Approach”
  • The Scientific Method (yes, that one from school)