Interlude: The Measurement Toolkit
Before we dive into the mathematical properties that enable optimization, we need to establish a crucial skill: the ability to measure performance accurately. You’ll need these tools starting in Part I, not in some later chapter.
The First Principle
Performance optimization follows a simple rule:
Measure, then optimize. Never the reverse.
This sounds obvious, but it’s violated constantly. Developers optimize code based on intuition, past experience, or “common knowledge”—and often make things worse or optimize the wrong thing entirely.
The Three Diagnostic Questions
Every performance investigation starts with three questions:
- Where is time spent? → Identify the hotspot
- Why is it slow? → Determine the limiting resource (compute, memory, I/O)
- What’s the theoretical limit? → Know how close you are and when to stop
Essential Measurement Toolkit
Timing: Do It Right
```python
import time

def benchmark(fn, *args, warmup=3, trials=10):
    """Robust benchmarking pattern."""
    # Warmup: let JIT compile, caches warm, etc.
    for _ in range(warmup):
        fn(*args)

    # Measure multiple trials
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    # Report median (robust to outliers from GC, OS scheduling)
    times.sort()
    return times[len(times) // 2]
```

Why median, not mean? Performance measurements are typically right-skewed — occasional spikes from garbage collection, thermal throttling, or OS scheduling inflate the mean. The median represents the “typical” run.
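A quick illustration of that skew, with toy numbers (assuming a single GC pause inflates one of ten trials):

```python
from statistics import mean, median

# Nine typical 1.00 s runs plus one 5.00 s outlier (e.g., a GC pause)
times = [1.00] * 9 + [5.00]

print(f"mean:   {mean(times):.2f} s")    # dragged up to 1.40 s by one outlier
print(f"median: {median(times):.2f} s")  # still reports the typical 1.00 s run
```

One bad trial shifts the mean by 40% while leaving the median untouched, which is exactly the robustness the benchmark above relies on.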
GPU Timing: Synchronize!
```python
import time

import torch

def gpu_benchmark(fn, *args, warmup=5, trials=20):
    """GPU benchmarking — must synchronize."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()  # Wait for GPU to finish

    times = []
    for _ in range(trials):
        torch.cuda.synchronize()
        start = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()  # Don't time async launches!
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    times.sort()
    return times[len(times) // 2]
```

Without torch.cuda.synchronize(), you measure the time to launch the kernel, not the time to execute it. GPU operations are asynchronous — torch.matmul() returns immediately while the GPU is still computing. This can make operations appear 100× faster than they actually are.
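To get a feel for the size of that distortion, compare a kernel-launch overhead to the actual execution time. The numbers below are illustrative: the ~10 µs launch cost is an assumption (actual overhead varies by driver and platform), while the matmul cost follows from the A100 peak used throughout this section:

```python
# A 4096^3 FP16 matmul on an A100 at 312 TFLOPS peak:
flops = 2 * 4096**3           # ~1.37e11 FLOPs
exec_time = flops / 312e12    # ~440 microseconds even at full peak

launch_time = 10e-6           # assumed ~10 microsecond kernel-launch overhead

print(f"execution: {exec_time * 1e6:.0f} us")
print(f"launch:    {launch_time * 1e6:.0f} us")
print(f"apparent speedup without synchronize: {exec_time / launch_time:.0f}x")
```

Even at full peak, timing only the launch under-reports this matmul by roughly 40×; for slower kernels the illusion grows accordingly.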
The Roofline Quick Check
The next chapter (?sec-bandwidth) introduces the roofline model in detail. For now, here’s the essential check:
\[\text{Performance} \leq \min(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Arithmetic Intensity})\]
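Plugging in the A100 figures used in this section’s code, the crossover between the two terms (the “ridge point”) lands at:

\[\text{AI}^{*} = \frac{\text{Peak FLOPS}}{\text{Bandwidth}} = \frac{312\ \text{TFLOPS}}{2000\ \text{GB/s}} = 156\ \text{FLOP/byte}\]

Kernels with arithmetic intensity below 156 sit under the bandwidth roof (memory-bound); above it, under the compute roof.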
```python
def roofline_check(measured_tflops, arithmetic_intensity):
    """Quick check: are you compute-bound or memory-bound?"""
    # Example: A100 specs
    peak_tflops = 312    # FP16 tensor core
    peak_bw_gb_s = 2000  # HBM bandwidth

    compute_bound = peak_tflops
    memory_bound = peak_bw_gb_s * arithmetic_intensity / 1000  # Convert to TFLOPS
    roofline = min(compute_bound, memory_bound)
    utilization = measured_tflops / roofline

    # Ridge point = peak FLOPS / bandwidth, in FLOP per byte
    if arithmetic_intensity < peak_tflops * 1000 / peak_bw_gb_s:
        regime = "MEMORY-BOUND"
    else:
        regime = "COMPUTE-BOUND"

    print(f"Regime: {regime}")
    print(f"Roofline: {roofline:.1f} TFLOPS")
    print(f"Measured: {measured_tflops:.1f} TFLOPS")
    print(f"Utilization: {utilization:.0%}")
```

If utilization > 50%, you’re doing well. If < 10%, there’s a large gap to investigate.
The Hypothesis Pattern
When performance disappoints, form a hypothesis and test it:
```
Observation: My matmul achieves 5% of peak

Hypothesis 1: Memory-bound (low arithmetic intensity)
Test: Calculate AI = FLOPs / bytes. AI = N/6 for matmul. At N=4096, AI = 683.
Result: That's above the ridge point. Not memory-bound.

Hypothesis 2: Poor memory access pattern (uncoalesced)
Test: Profile with Nsight. Check "Global Load Efficiency."
Result: 12% efficiency — confirmed! Fix coalescing.
```
This scientific approach — hypothesis, experiment, analysis — prevents wasted effort on ineffective optimizations. We’ll develop it fully in ?sec-hypothesis.
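The arithmetic behind Hypothesis 1 can be checked in a few lines. For an FP32 square matmul, each of the three N×N matrices holds 4-byte elements, giving AI = 2N³ / 12N² = N/6:

```python
N = 4096
flops = 2 * N**3
bytes_moved = 3 * N**2 * 4     # A, B, and C in FP32 (4 bytes/element)
ai = flops / bytes_moved       # simplifies to N / 6

ridge_point = 312e12 / 2000e9  # A100: peak FLOPS / bandwidth, in FLOP/byte

print(f"AI = {ai:.0f} FLOP/byte")          # 683
print(f"ridge point = {ridge_point:.0f}")  # 156
print("compute-bound" if ai >= ridge_point else "memory-bound")
```

At N=4096 the intensity (683) sits well above the ridge point (156), which is what rules out Hypothesis 1 and sends the investigation to the memory access pattern.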
The Investigation Pattern
Each investigation in this book follows a consistent pattern:
- Baseline: Measure the naive implementation
- Bound: Calculate the theoretical limit (roofline, algorithmic complexity)
- Gap analysis: Why is actual performance << theoretical?
- Property audit: Which of the six properties apply?
- Derive: The matching property suggests the optimization
- Implement and measure: Did it work?
- Iterate: New bottleneck? Back to step 3.
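Steps 1 through 3 reduce to a small calculation. Here is a minimal sketch; the helper name and the toy numbers are illustrative, not from the book:

```python
def gap_analysis(measured_seconds, flops, roofline_tflops):
    """Steps 1-3: compare achieved throughput to the theoretical bound."""
    achieved_tflops = flops / measured_seconds / 1e12  # step 1: baseline
    gap = roofline_tflops / achieved_tflops            # step 3: distance to the roof
    return achieved_tflops, gap

# Toy example: a 4096^3 matmul (2*N^3 FLOPs) measured at 10 ms,
# against a 312 TFLOPS compute roof (step 2)
achieved, gap = gap_analysis(10e-3, 2 * 4096**3, 312)
print(f"achieved: {achieved:.1f} TFLOPS, {gap:.0f}x below the roofline")
```

A gap in the tens, as here, signals that steps 4 and 5 — the property audit and the derived optimization — have plenty of headroom to work with.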
Microbenchmark Pitfalls
Microbenchmarks measure small, isolated pieces of code. They’re useful but dangerous:
- Warmup matters: First runs include JIT compilation, cache warming, and other one-time costs. Always discard warmup iterations.
- Context matters: Code behaves differently in isolation vs. real workloads. A kernel that’s fast alone may be slow when competing for cache with other operations.
- Statistics matter: Report medians and percentiles, not single numbers. If results vary by >10% across runs, your measurement setup has a problem.
- GPU clocks drift: Thermal throttling can reduce GPU clocks by 10-20% during sustained workloads. Profile steady-state, not burst performance.
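The “>10% across runs” rule from the list above is easy to automate. A minimal sketch, with an illustrative helper name and threshold:

```python
from statistics import median

def spread_check(times, threshold=0.10):
    """Flag a noisy measurement setup: relative spread across trials."""
    med = median(times)
    spread = (max(times) - min(times)) / med
    if spread > threshold:
        print(f"WARNING: {spread:.0%} spread across trials; fix the setup first")
    return spread

stable = [1.00, 1.01, 1.02, 0.99, 1.00]
s = spread_check(stable)  # ~3% spread: measurements are trustworthy
```

Run this before trusting any benchmark number; if it warns, fix the measurement environment (background load, clocks, warmup) before comparing implementations.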
Part V (?sec-measurement, ?sec-hypothesis) provides comprehensive coverage of measurement techniques and hypothesis-driven debugging. Part VI (?sec-profiling-tools) covers profiling tools (perf, Nsight, PyTorch Profiler) in depth.
The toolkit above is sufficient for all investigations in Parts I through IV. Return to Parts V-VI when you need advanced methodology or tool-specific guidance.