Interlude: The Measurement Toolkit
Before we dive into the mathematical properties that enable optimization, we need to establish a crucial skill: the ability to measure performance accurately. You’ll need these tools starting in Part I, not in some later chapter.
The First Principle
Performance optimization follows a simple rule:
Measure, then optimize. Never the reverse.
This sounds obvious, but it’s violated constantly. Developers optimize code based on intuition, past experience, or “common knowledge”—and often make things worse or optimize the wrong thing entirely.
The Three Diagnostic Questions
Every performance investigation starts with three questions:
- Where is time spent? → Identify the hotspot
- Why is it slow? → Determine the limiting resource (compute, memory, I/O)
- What’s the theoretical limit? → Know how close you are and when to stop
Essential Measurement Toolkit
Timing: Do It Right
```python
import time

def benchmark(fn, *args, warmup=3, trials=10):
    """Robust benchmarking pattern."""
    # Warmup: let JIT compile, caches warm, etc.
    for _ in range(warmup):
        fn(*args)

    # Measure multiple trials
    times = []
    for _ in range(trials):
        start = time.perf_counter()
        fn(*args)
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    # Report median (robust to outliers from GC, OS scheduling)
    times.sort()
    return times[len(times) // 2]
```

Why median, not mean? Performance measurements are typically right-skewed — occasional spikes from garbage collection, thermal throttling, or OS scheduling inflate the mean. The median represents the “typical” run.
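A quick illustration of that skew, with toy numbers (assuming a single GC pause inflates one of ten trials):

```python
from statistics import mean, median

# Nine typical 1.00 s runs plus one 5.00 s outlier (e.g., a GC pause)
times = [1.00] * 9 + [5.00]

print(f"mean:   {mean(times):.2f} s")    # dragged up to 1.40 s by one outlier
print(f"median: {median(times):.2f} s")  # still reports the typical 1.00 s run
```

One bad trial shifts the mean by 40% while leaving the median untouched, which is exactly the robustness the benchmark above relies on.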
GPU Timing: Synchronize!
```python
import time

import torch

def gpu_benchmark(fn, *args, warmup=5, trials=20):
    """GPU benchmarking — must synchronize."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()  # Wait for GPU to finish

    times = []
    for _ in range(trials):
        torch.cuda.synchronize()
        start = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()  # Don't time async launches!
        elapsed = time.perf_counter() - start
        times.append(elapsed)

    times.sort()
    return times[len(times) // 2]
```

Without torch.cuda.synchronize(), you measure the time to launch the kernel, not the time to execute it. GPU operations are asynchronous — torch.matmul() returns immediately while the GPU is still computing. This can make operations appear 100× faster than they actually are.
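To get a feel for the size of that distortion, compare a kernel-launch overhead to the actual execution time. The numbers below are illustrative: the ~10 µs launch cost is an assumption (actual overhead varies by driver and platform), while the matmul cost follows from the A100 peak used throughout this section:

```python
# A 4096^3 FP16 matmul on an A100 at 312 TFLOPS peak:
flops = 2 * 4096**3           # ~1.37e11 FLOPs
exec_time = flops / 312e12    # ~440 microseconds even at full peak

launch_time = 10e-6           # assumed ~10 microsecond kernel-launch overhead

print(f"execution: {exec_time * 1e6:.0f} us")
print(f"launch:    {launch_time * 1e6:.0f} us")
print(f"apparent speedup without synchronize: {exec_time / launch_time:.0f}x")
```

Even at full peak, timing only the launch under-reports this matmul by roughly 40×; for slower kernels the illusion grows accordingly.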
The Roofline Quick Check
The next chapter (?sec-bandwidth) introduces the roofline model in detail. For now, here’s the essential check:
\[\text{Performance} \leq \min(\text{Peak FLOPS}, \text{Bandwidth} \times \text{Arithmetic Intensity})\]
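Plugging in the A100 figures used in this section’s code, the crossover between the two terms (the “ridge point”) lands at:

\[\text{AI}^{*} = \frac{\text{Peak FLOPS}}{\text{Bandwidth}} = \frac{312\ \text{TFLOPS}}{2000\ \text{GB/s}} = 156\ \text{FLOP/byte}\]

Kernels with arithmetic intensity below 156 sit under the bandwidth roof (memory-bound); above it, under the compute roof.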
```python
def roofline_check(measured_tflops, arithmetic_intensity):
    """Quick check: are you compute-bound or memory-bound?"""
    # Example: A100 specs
    peak_tflops = 312    # FP16 tensor core
    peak_bw_gb_s = 2000  # HBM bandwidth

    compute_bound = peak_tflops
    memory_bound = peak_bw_gb_s * arithmetic_intensity / 1000  # Convert to TFLOPS
    roofline = min(compute_bound, memory_bound)
    utilization = measured_tflops / roofline

    # Ridge point = peak FLOPS / bandwidth, in FLOP per byte
    if arithmetic_intensity < peak_tflops * 1000 / peak_bw_gb_s:
        regime = "MEMORY-BOUND"
    else:
        regime = "COMPUTE-BOUND"

    print(f"Regime: {regime}")
    print(f"Roofline: {roofline:.1f} TFLOPS")
    print(f"Measured: {measured_tflops:.1f} TFLOPS")
    print(f"Utilization: {utilization:.0%}")
```

If utilization > 50%, you’re doing well. If < 10%, there’s a large gap to investigate.
The Hypothesis Pattern
When performance disappoints, form a hypothesis and test it:
```
Observation: My matmul achieves 5% of peak

Hypothesis 1: Memory-bound (low arithmetic intensity)
Test: Calculate AI = FLOPs / bytes. AI = N/6 for matmul. At N=4096, AI = 683.
Result: That's above the ridge point. Not memory-bound.

Hypothesis 2: Poor memory access pattern (uncoalesced)
Test: Profile with Nsight. Check "Global Load Efficiency."
Result: 12% efficiency — confirmed! Fix coalescing.
```
This scientific approach — hypothesis, experiment, analysis — prevents wasted effort on ineffective optimizations. We’ll develop it fully in ?sec-hypothesis.
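The arithmetic behind Hypothesis 1 can be checked in a few lines. For an FP32 square matmul, each of the three N×N matrices holds 4-byte elements, giving AI = 2N³ / 12N² = N/6:

```python
N = 4096
flops = 2 * N**3
bytes_moved = 3 * N**2 * 4     # A, B, and C in FP32 (4 bytes/element)
ai = flops / bytes_moved       # simplifies to N / 6

ridge_point = 312e12 / 2000e9  # A100: peak FLOPS / bandwidth, in FLOP/byte

print(f"AI = {ai:.0f} FLOP/byte")          # 683
print(f"ridge point = {ridge_point:.0f}")  # 156
print("compute-bound" if ai >= ridge_point else "memory-bound")
```

At N=4096 the intensity (683) sits well above the ridge point (156), which is what rules out Hypothesis 1 and sends the investigation to the memory access pattern.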
The Investigation Pattern
Each investigation in this book follows a consistent pattern:
- Baseline: Measure the naive implementation
- Bound: Calculate the theoretical limit (roofline, algorithmic complexity)
- Gap analysis: Why is actual performance << theoretical?
- Property audit: Which of the six properties apply?
- Derive: The matching property suggests the optimization
- Implement and measure: Did it work?
- Iterate: New bottleneck? Back to step 3.
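Steps 1 through 3 reduce to a small calculation. Here is a minimal sketch; the helper name and the toy numbers are illustrative, not from the book:

```python
def gap_analysis(measured_seconds, flops, roofline_tflops):
    """Steps 1-3: compare achieved throughput to the theoretical bound."""
    achieved_tflops = flops / measured_seconds / 1e12  # step 1: baseline
    gap = roofline_tflops / achieved_tflops            # step 3: distance to the roof
    return achieved_tflops, gap

# Toy example: a 4096^3 matmul (2*N^3 FLOPs) measured at 10 ms,
# against a 312 TFLOPS compute roof (step 2)
achieved, gap = gap_analysis(10e-3, 2 * 4096**3, 312)
print(f"achieved: {achieved:.1f} TFLOPS, {gap:.0f}x below the roofline")
```

A gap in the tens, as here, signals that steps 4 and 5 — the property audit and the derived optimization — have plenty of headroom to work with.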
Microbenchmark Pitfalls
Microbenchmarks measure small, isolated pieces of code. They’re useful but dangerous:
- Warmup matters: First runs include JIT compilation, cache warming, and other one-time costs. Always discard warmup iterations.
- Context matters: Code behaves differently in isolation vs. real workloads. A kernel that’s fast alone may be slow when competing for cache with other operations.
- Statistics matter: Report medians and percentiles, not single numbers. If results vary by >10% across runs, your measurement setup has a problem.
- GPU clocks drift: Thermal throttling can reduce GPU clocks by 10-20% during sustained workloads. Profile steady-state, not burst performance.
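The “>10% across runs” rule from the list above is easy to automate. A minimal sketch, with an illustrative helper name and threshold:

```python
from statistics import median

def spread_check(times, threshold=0.10):
    """Flag a noisy measurement setup: relative spread across trials."""
    med = median(times)
    spread = (max(times) - min(times)) / med
    if spread > threshold:
        print(f"WARNING: {spread:.0%} spread across trials; fix the setup first")
    return spread

stable = [1.00, 1.01, 1.02, 0.99, 1.00]
s = spread_check(stable)  # ~3% spread: measurements are trustworthy
```

Run this before trusting any benchmark number; if it warns, fix the measurement environment (background load, clocks, warmup) before comparing implementations.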
Part V (?sec-measurement, ?sec-hypothesis) provides comprehensive coverage of measurement techniques and hypothesis-driven debugging. Part VI (?sec-profiling-tools) covers profiling tools (perf, Nsight, PyTorch Profiler) in depth.
The toolkit above is sufficient for all investigations in Parts I through IV. Return to Parts V-VI when you need advanced methodology or tool-specific guidance.