27 The Art of Measurement
How to Know What’s Actually Happening
“If you can’t measure it, you can’t improve it.” — Peter Drucker
But what Drucker didn’t say: if you measure the wrong thing, or measure it wrongly, you’ll improve the wrong thing.
Measurement is the foundation of performance engineering. Get it wrong, and nothing else matters.
27.1 The Streetlight Effect
A classic parable:
A police officer finds a drunk looking for his keys under a streetlight. “Where did you drop them?” the officer asks. “Over in the park,” the drunk replies. “Then why are you looking here?” “Because this is where the light is.”
Performance optimization has the same trap. We measure what’s easy to measure, not what matters.
# The streetlight measurement
import time
start = time.time()
result = expensive_operation()
elapsed = time.time() - start
print(f"Operation took {elapsed:.3f}s")

This measures wall-clock time. But is the operation slow because:
- It’s compute-bound? (the CPU is the bottleneck)
- It’s memory-bound? (waiting for data)
- It’s I/O-bound? (waiting for disk/network)
- Something else is running on the machine?
Wall clock time doesn’t tell you. You’re looking where the light is.
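One quick way to narrow this down without a profiler is to compare wall-clock time against CPU time: `time.process_time` only advances while the process is executing on a CPU, so a large gap between the two means the operation is mostly waiting. A minimal sketch (the `classify_time` helper is illustrative, not a standard API):

```python
import time

def classify_time(fn):
    """Compare wall-clock and CPU time to see whether fn mostly computes or waits."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    # cpu close to wall -> compute-bound; cpu far below wall -> waiting (I/O, locks, sleep)
    return wall, cpu

wall, cpu = classify_time(lambda: time.sleep(0.1))
print(f"wall={wall:.3f}s cpu={cpu:.3f}s")  # cpu far below wall: this is a wait
```

This still doesn’t distinguish memory-bound from compute-bound (both accumulate CPU time); for that you need hardware counters, covered below.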
27.2 What Profilers Actually Measure
Different profilers measure different things. Understanding what each measures is critical.
27.2.1 Sampling Profilers
How they work: Periodically interrupt the program, record where it is, build a histogram.
Time:      |----x-------x-------x-------x-------x----|
                ↓       ↓       ↓       ↓       ↓
Function:       A       A       B       A       C

Histogram: A: 60%, B: 20%, C: 20%
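The mechanism can be illustrated in pure Python: a background thread periodically inspects the main thread’s current frame via `sys._current_frames` and counts function names. This is a toy sketch, not a real profiler (no stack unwinding, coarse resolution), but it shows how the histogram emerges from samples:

```python
import collections
import sys
import threading
import time

def sample_main_thread(duration_s=0.3, interval_s=0.001):
    """Toy sampling profiler: periodically record which function the main thread is in."""
    main_id = threading.main_thread().ident
    counts = collections.Counter()
    stop = time.perf_counter() + duration_s

    def sampler():
        while time.perf_counter() < stop:
            frame = sys._current_frames().get(main_id)
            if frame is not None:
                counts[frame.f_code.co_name] += 1
            time.sleep(interval_s)

    t = threading.Thread(target=sampler)
    t.start()
    return t, counts

def hot_function(deadline):
    x = 0
    while time.perf_counter() < deadline:  # busy loop: where the CPU actually is
        x += 1
    return x

t, counts = sample_main_thread(duration_s=0.3)
hot_function(time.perf_counter() + 0.35)  # keep the main thread busy while sampling
t.join()
print(counts.most_common(3))  # hot_function should dominate the histogram
```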
What they measure: Where the CPU is, not how long things take.
Artifacts:
- Statistical noise (small samples)
- Can miss short functions (never sampled)
- May measure wall time rather than CPU time, depending on the tool (so waits are included)
Tools: perf (Linux), Instruments (macOS), py-spy (Python)
# Linux perf
perf record -g ./my_program
perf report
# Python py-spy
py-spy top --pid 12345
py-spy record -o profile.svg -- python script.py

27.2.2 Instrumentation Profilers
How they work: Insert timing code at function entry/exit.
def profiled_function():
    __profiler_enter("profiled_function")
    # ... actual code ...
    __profiler_exit("profiled_function")

What they measure: Exact time spent in each function (including overhead).
Artifacts:
- Overhead (each call adds microseconds)
- Can change behavior (inlining prevented, caches polluted)
- Very accurate for coarse-grained measurements
Tools: cProfile (Python), gprof (C/C++)
import cProfile
cProfile.run('my_function()', sort='cumulative')

27.2.3 Hardware Counters
How they work: Read CPU’s built-in performance monitoring unit (PMU).
perf stat ./my_program
# Output:
# 2,000,000,000 cycles
# 1,000,000,000 instructions (0.5 IPC)
# 100,000,000 cache-misses
# 1,000,000 branch-misses

What they measure: Actual hardware events such as cycles, cache misses, and branch mispredictions.
Why this matters: These reveal the cause of slowness, not just the symptom.
Low IPC (instructions per cycle)?
→ Memory-bound or stalled on something
High cache miss rate?
→ Data layout or access pattern problem
High branch misprediction?
→ Unpredictable control flow
Tools: perf stat, Intel VTune, AMD uProf
27.2.4 GPU Profilers
GPU profiling is different. Key metrics:
Compute utilization: Are the cores busy?
Memory bandwidth: Are we saturating HBM?
Occupancy: How many warps active vs possible?
Kernel time breakdown: Where does time go?
Tools: NVIDIA Nsight, PyTorch Profiler, nvprof
# PyTorch profiler
import torch.profiler
with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as prof:
    model(input)
print(prof.key_averages().table(sort_by="cuda_time_total"))

27.3 Statistical Validity
Performance measurements are noisy. You’re measuring a random variable, not a constant.
27.3.1 Sources of Noise
System noise:
- Other processes
- Kernel scheduling
- Power management
- Thermal throttling
Measurement noise:
- Timer resolution
- Profiler overhead
- JIT compilation warmup
Data-dependent variation:
- Cache state
- Branch prediction history
- Input distribution
27.3.2 How Many Runs?
One measurement is not enough. But how many?
import time

import numpy as np
from scipy import stats

def benchmark(fn, n_runs=30):
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        elapsed = time.perf_counter() - start
        times.append(elapsed)
    times = np.array(times)
    return {
        'mean': np.mean(times),
        'std': np.std(times),
        'median': np.median(times),
        'min': np.min(times),
        'max': np.max(times),
        'ci_95': stats.t.interval(0.95, len(times) - 1,
                                  loc=np.mean(times),
                                  scale=stats.sem(times)),
    }

Rule of thumb: 30 runs gives reasonable confidence intervals.
27.3.3 Which Statistic?
Mean, median, or minimum?
Mean: Sensitive to outliers. Use for throughput-focused analysis.
Median: Robust to outliers. Use for latency-focused analysis.
Minimum: "Best case" (least noise). Controversial but useful.
The minimum is often most reproducible:
- Noise almost always makes things slower, not faster
- Minimum approximates "performance without interference"
- But hides tail latency (important for user experience)
The right choice depends on what you’re optimizing for.
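A small simulation makes the trade-off concrete. If interference only ever adds time, samples are right-skewed: the mean is pulled upward by outliers, the median less so, and the minimum sits closest to the undisturbed cost. A sketch with synthetic data (the 10 ms "true cost" and the exponential noise model are assumptions for illustration):

```python
import random
import statistics

random.seed(0)
TRUE_COST_MS = 10.0
# Interference only ever adds time: model noise as a non-negative exponential
samples = [TRUE_COST_MS + random.expovariate(1 / 2.0) for _ in range(200)]

mean = statistics.mean(samples)
median = statistics.median(samples)
minimum = min(samples)
print(f"min={minimum:.2f}  median={median:.2f}  mean={mean:.2f}")
# min <= median <= mean for right-skewed noise; min sits closest to the true 10ms
```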
27.3.4 Comparing Two Implementations
Is A faster than B? This requires statistical testing.
def is_faster(times_a, times_b, alpha=0.05):
    """
    Test whether A is statistically significantly faster than B.

    Returns a dict with the means, the speedup, the p-value, and
    whether we can reject the null hypothesis (A >= B).
    """
    # Mann-Whitney U test (doesn't assume normality)
    stat, p_value = stats.mannwhitneyu(
        times_a, times_b,
        alternative='less'  # Test if A < B
    )
    return {
        'a_mean': np.mean(times_a),
        'b_mean': np.mean(times_b),
        'speedup': np.mean(times_b) / np.mean(times_a),
        'p_value': p_value,
        'significant': p_value < alpha,
    }

Avoid: “A is 5% faster because the mean was 5% lower in one run.”
Do: “A is faster with p < 0.01; the 95% confidence interval for speedup is [1.03, 1.07].”
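The test above gives a p-value but not the confidence interval for the speedup itself. One way to get it with only the standard library is a bootstrap over the ratio of means (the `bootstrap_speedup_ci` helper and the synthetic timing data are illustrative, not from the source):

```python
import random
import statistics

def bootstrap_speedup_ci(times_a, times_b, n_boot=2000, alpha=0.05, seed=0):
    """Bootstrap a confidence interval for speedup = mean(B) / mean(A)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        resample_a = [rng.choice(times_a) for _ in times_a]
        resample_b = [rng.choice(times_b) for _ in times_b]
        ratios.append(statistics.mean(resample_b) / statistics.mean(resample_a))
    ratios.sort()
    lo = ratios[int(n_boot * alpha / 2)]
    hi = ratios[int(n_boot * (1 - alpha / 2))]
    return lo, hi

# Synthetic example: A is slightly faster than B
rng = random.Random(42)
times_a = [10.0 + rng.expovariate(1.0) for _ in range(30)]
times_b = [10.5 + rng.expovariate(1.0) for _ in range(30)]
lo, hi = bootstrap_speedup_ci(times_a, times_b)
print(f"95% CI for speedup of A over B: [{lo:.3f}, {hi:.3f}]")
```

If the interval excludes 1.0, the speedup is significant at the chosen level; if it straddles 1.0, more runs are needed.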
27.4 Micro vs. Macro Benchmarks
27.4.1 Microbenchmarks
Measure one specific operation in isolation.
# Microbenchmark: how fast is element access?
import timeit
setup = "import numpy as np; x = np.random.randn(10000)"
stmt = "x[5000]"
time_per_op = timeit.timeit(stmt, setup, number=1000000) / 1000000
print(f"{time_per_op * 1e9:.2f} ns per access")

Pros:
- Isolated: measures exactly what you intend
- Fast: can run many iterations
- Reproducible: minimal external factors

Cons:
- Unrealistic: no cache pollution from surrounding code
- Optimized away: the compiler may remove “useless” code
- Miss the forest: 1% of total runtime optimized 10× is… 1% total improvement
27.4.2 Macrobenchmarks
Measure the full application or realistic workload.
# Macrobenchmark: end-to-end training time
start = time.time()
for epoch in range(10):
    train_one_epoch(model, data)
print(f"Total training time: {time.time() - start:.1f}s")

Pros:
- Realistic: captures actual bottlenecks
- Holistic: includes all interactions
- Meaningful: directly measures what you care about

Cons:
- Slow: fewer iterations possible
- Noisy: many sources of variation
- Hard to attribute: where did the improvement come from?
27.4.3 The Right Level
Use microbenchmarks to:
- Test hypotheses (“is memory the bottleneck?”)
- Compare specific implementations
- Understand component behavior

Use macrobenchmarks to:
- Validate that micro-improvements transfer to real workloads
- Find the actual bottleneck
- Make go/no-go decisions

The workflow:
1. Profile the macro workload → find the bottleneck
2. Microbenchmark the bottleneck → understand it
3. Optimize → validate with the microbenchmark
4. Verify the improvement in the macro workload
27.5 The Brendan Gregg Methodologies
Brendan Gregg has codified several systematic approaches to performance analysis.
27.5.1 The USE Method
For every resource (CPU, memory, disk, network):
- Utilization: % of time the resource is busy
- Saturation: Degree to which work is queued
- Errors: Count of error events
Resource    Utilization          Saturation           Errors
─────────────────────────────────────────────────────────────
CPU         % non-idle           run queue length     -
Memory      % used               swap rate            OOM kills
Disk        % busy               queue depth          device errors
Network     % bandwidth used     dropped packets      errors
GPU         SM utilization       queue depth          ECC errors
The beauty: systematic coverage. You won’t miss a resource.
# Linux commands for USE metrics
# CPU
top # Utilization
cat /proc/loadavg # Saturation (run queue)
# Memory
free -m # Utilization
vmstat 1 # Saturation (si/so for swap)
# Disk
iostat -x 1 # Utilization (%util), Saturation (avgqu-sz)
# Network
sar -n DEV 1 # Utilization
netstat -s          # Errors

27.5.2 The RED Method
For services (request-based systems):
- Rate: Requests per second
- Errors: Failed requests per second
- Duration: Latency of requests
import numpy as np

class ServiceMetrics:
    def __init__(self):
        self.request_count = 0
        self.error_count = 0
        self.latency_histogram = []

    def record_request(self, latency, success):
        self.request_count += 1
        if not success:
            self.error_count += 1
        self.latency_histogram.append(latency)

    def report(self, period_seconds):
        return {
            'rate': self.request_count / period_seconds,
            'error_rate': self.error_count / period_seconds,
            'p50_latency': np.percentile(self.latency_histogram, 50),
            'p99_latency': np.percentile(self.latency_histogram, 99),
        }

27.5.3 Flame Graphs
A visualization that shows where time is spent across the call stack.
┌──────────────────────────────────────────────────────────────┐
│ main │
├──────────────────────────────────────────────────────────────┤
│ train_epoch │ eval │
├─────────────────────────────────────────┼────────────────────┤
│ forward │ backward │ optim │ forward │ metrics│
├──────────────┼──────────────┼──────────┼───────────┼────────┤
│ attention│ffn│ attention│ffn│ │ │ │
└──────────────┴──────────────┴──────────┴───────────┴────────┘
Width = time spent in that function (including children)
Stack = call path from root
How to read:
- Width indicates time
- Look for wide “plateaus” (where time is spent)
- Narrow spikes are function call overhead
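Flame graph tooling typically consumes “collapsed stacks”: one line per unique stack, frames joined by semicolons, followed by a sample count. A small sketch of the aggregation step (the `frame_widths` helper and the example stacks are illustrative) shows how each frame’s width comes to include its children:

```python
import collections

def frame_widths(collapsed_lines):
    """Sum sample counts per frame: each frame's 'width' includes its children."""
    widths = collections.Counter()
    total = 0
    for line in collapsed_lines:
        stack, count = line.rsplit(' ', 1)
        count = int(count)
        total += count
        # Every frame on the stack accumulates this sample
        for frame in stack.split(';'):
            widths[frame] += count
    return widths, total

lines = [
    "main;train_epoch;forward;attention 30",
    "main;train_epoch;forward;ffn 15",
    "main;train_epoch;backward 35",
    "main;eval;forward 20",
]
widths, total = frame_widths(lines)
print(f"main: {widths['main'] / total:.0%}")        # 100%: the root spans everything
print(f"forward: {widths['forward'] / total:.0%}")  # 65%: both call paths combined
```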
# Generate flame graph with perf
perf record -g ./my_program
perf script | stackcollapse-perf.pl | flamegraph.pl > profile.svg

27.6 Common Measurement Mistakes
27.6.1 1. Measuring Startup, Not Steady State
# BAD: First run includes JIT compilation
time_one_run(model)  # Includes torch.compile, CUDA kernel loading

# GOOD: Warm up, then measure
for _ in range(5):  # Warmup
    model(input)
times = [time_one_run(model) for _ in range(100)]  # Measure

27.6.2 2. Forgetting CUDA Synchronization
GPU operations are asynchronous. This measures nothing useful:
# BAD: Measures kernel launch time, not execution time
start = time.time()
result = model(input)
print(f"Time: {time.time() - start}") # GPU still working!
# GOOD: Wait for GPU
torch.cuda.synchronize()
start = time.time()
result = model(input)
torch.cuda.synchronize()
print(f"Time: {time.time() - start}")

27.6.3 3. Dead Code Elimination
Compilers remove code with no observable effect:
// BAD: Compiler may eliminate this entirely
for (int i = 0; i < N; i++) {
result = compute(data[i]); // result never used after loop
}
// GOOD: Make every result observable
volatile int sink;
for (int i = 0; i < N; i++) {
    sink = compute(data[i]);  // volatile write can't be eliminated
}

# Python version with the timeit library
import timeit
# CPython does not eliminate dead code, but JIT-compiled Python (PyPy, Numba) can;
# consume the result explicitly if benchmarking under a JIT
timeit.timeit(lambda: compute(data), number=1000)

27.6.4 4. Measuring the Measurement
Profiling has overhead. Heavy instrumentation can change what you’re measuring.
# Heisenberg effect: observing changes behavior
def profile_overhead_demo():
    # Uninstrumented: 100ms
    # With line-by-line profiler: 5000ms (50× overhead!)
    # Solution: use sampling profilers for hot paths;
    # reserve instrumentation for coarse-grained analysis
    pass

27.6.5 5. Cache State Differences
Cold cache vs. warm cache can differ 10×:
# BAD: Inconsistent cache state
times = []
for i in range(100):
    start = time.time()
    process(data)  # First run: cold cache. Rest: warm cache.
    times.append(time.time() - start)
# times[0] >> times[1:99]

# GOOD: Consistent cache state
for _ in range(5):
    process(data)  # Warmup
times = []
for i in range(100):
    flush_cache()  # or accept a warm cache consistently
    start = time.time()
    process(data)
    times.append(time.time() - start)

27.7 GPU Benchmarking: The do_bench Pattern
GPU benchmarking has unique challenges that require specialized methodology. The approach used by Triton’s testing module represents current best practices.
27.7.1 Why GPU Benchmarking Is Different
CPU benchmarking:
- Measure wall time
- Variance from OS scheduling, caches
- Min/mean/median all useful
GPU benchmarking:
- Kernel launch is async (must synchronize!)
- Variance from thermal throttling, memory contention, PCIe
- Minimum is often most informative
- Cache state between runs matters significantly
27.7.2 The Case for Minimum
For GPU benchmarks, the minimum time is often the most reproducible and meaningful:
Sources of noise in GPU timing:
✗ Thermal throttling (adds time)
✗ Memory bus contention (adds time)
✗ Other processes (adds time)
✗ Power state transitions (adds time)
Notice: noise almost always ADDS time, never removes it.
The minimum represents:
✓ "Performance without interference"
✓ Most reproducible across runs
✓ What the hardware can actually achieve
Caveat: For latency-sensitive applications (user-facing inference), you care about tail latency (p99). Use minimum for optimization comparisons; use percentiles for SLA compliance.
27.7.3 A Production-Ready Benchmark Utility
Here’s a utility incorporating all best practices, inspired by Triton’s do_bench:
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

import numpy as np
import torch

@dataclass
class BenchmarkResult:
    """Results from GPU benchmarking."""
    min_ms: float
    max_ms: float
    mean_ms: float
    median_ms: float
    std_ms: float
    n_runs: int
    warmup_runs: int
    quantiles: Optional[Dict[float, float]] = None

    def __repr__(self):
        return (f"BenchmarkResult(min={self.min_ms:.3f}ms, "
                f"mean={self.mean_ms:.3f}ms, "
                f"median={self.median_ms:.3f}ms, "
                f"n={self.n_runs})")

    @property
    def throughput(self) -> float:
        """Operations per second based on min time."""
        return 1000.0 / self.min_ms

def do_bench(
    fn: Callable,
    warmup_ms: float = 25.0,
    target_time_ms: float = 100.0,
    min_runs: int = 10,
    max_runs: int = 1000,
    quantiles: Optional[List[float]] = None,
    clear_cache: bool = False,
    cache_size_mb: int = 40,
) -> BenchmarkResult:
    """
    Benchmark a GPU function with proper methodology.

    Key features:
    - Warmup until stable (time-based, not count-based)
    - Proper CUDA synchronization
    - Time-based repetitions for consistent results
    - Optional cache clearing between runs
    - Returns comprehensive statistics

    Args:
        fn: Function to benchmark (should include all GPU ops)
        warmup_ms: Time to spend on warmup (default 25ms)
        target_time_ms: Target total benchmarking time (default 100ms)
        min_runs: Minimum number of timed runs
        max_runs: Maximum number of timed runs
        quantiles: If provided, return these percentiles
        clear_cache: If True, flush GPU cache between runs
        cache_size_mb: Size of the cache-clearing buffer

    Returns:
        BenchmarkResult with timing statistics
    """
    # Cache-clearing buffer (if needed)
    if clear_cache:
        cache_buffer = torch.empty(
            cache_size_mb * 1024 * 1024 // 4,
            dtype=torch.float32,
            device='cuda'
        )

    def run_once():
        if clear_cache:
            cache_buffer.zero_()  # Flush L2 cache
        torch.cuda.synchronize()
        start = time.perf_counter()
        fn()
        torch.cuda.synchronize()
        return (time.perf_counter() - start) * 1000  # ms

    # Phase 1: Warmup (time-based)
    warmup_runs = 0
    warmup_start = time.perf_counter()
    while (time.perf_counter() - warmup_start) * 1000 < warmup_ms:
        fn()
        warmup_runs += 1
    torch.cuda.synchronize()

    # Phase 2: Estimate single-run time
    estimates = [run_once() for _ in range(3)]
    estimate_ms = np.median(estimates)

    # Phase 3: Calculate the number of runs needed
    if estimate_ms > 0:
        n_runs = max(min_runs, int(target_time_ms / estimate_ms))
        n_runs = min(n_runs, max_runs)
    else:
        n_runs = min_runs

    # Phase 4: Timed runs
    times = np.array([run_once() for _ in range(n_runs)])

    result = BenchmarkResult(
        min_ms=float(np.min(times)),
        max_ms=float(np.max(times)),
        mean_ms=float(np.mean(times)),
        median_ms=float(np.median(times)),
        std_ms=float(np.std(times)),
        n_runs=n_runs,
        warmup_runs=warmup_runs,
    )

    # Optional: attach quantiles
    if quantiles is not None:
        result.quantiles = {
            q: float(np.percentile(times, q * 100))
            for q in quantiles
        }
    return result

27.7.4 Usage Examples
# Basic usage
result = do_bench(lambda: model(input))
print(f"Forward pass: {result.min_ms:.2f}ms")
# With cache clearing (for memory-bound kernels)
result = do_bench(
    lambda: torch.mm(A, B),
    clear_cache=True,
    target_time_ms=200,
)

# Get specific quantiles
result = do_bench(
    lambda: attention(q, k, v),
    quantiles=[0.5, 0.9, 0.99],
)
print(f"p50: {result.quantiles[0.5]:.2f}ms")
print(f"p99: {result.quantiles[0.99]:.2f}ms")
# Compare two implementations
result_a = do_bench(lambda: impl_a(x))
result_b = do_bench(lambda: impl_b(x))
speedup = result_b.min_ms / result_a.min_ms
print(f"A is {speedup:.2f}x faster than B")

27.7.5 When to Clear Cache
Clear cache between runs when:
✓ Testing memory-bound kernels
✓ Measuring cold-start behavior
✓ Comparing algorithms with different memory patterns
Don't clear cache when:
✗ Measuring steady-state throughput
✗ Simulating real workloads (cache is usually warm)
✗ Speed is critical (cache clearing adds overhead)
27.7.6 Time-Based vs. Count-Based Repetitions
Why target a time budget instead of a fixed iteration count?
Problem with fixed counts:
- Fast kernel (1μs): 100 runs = 0.1ms total → noisy
- Slow kernel (100ms): 100 runs = 10s total → wasteful
Time-based approach:
- Target 100ms total benchmarking time
- Fast kernel: ~100,000 runs → stable statistics
- Slow kernel: ~1 run → respects your time
The do_bench utility handles this automatically.
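The core calculation is small enough to sketch standalone (the `runs_for_budget` helper mirrors the run-count logic in do_bench above, with the same default budget and bounds):

```python
def runs_for_budget(estimate_ms, target_time_ms=100.0, min_runs=10, max_runs=1000):
    """Pick a repetition count from a quick per-run estimate and a time budget."""
    if estimate_ms <= 0:
        return min_runs
    # Fill the time budget, but stay within [min_runs, max_runs]
    return max(min_runs, min(max_runs, int(target_time_ms / estimate_ms)))

print(runs_for_budget(0.001))  # 1μs kernel → capped at 1000 runs
print(runs_for_budget(0.5))    # 0.5ms kernel → 200 runs
print(runs_for_budget(250.0))  # slow kernel → floor of 10 runs
```

The caps matter: without `max_runs`, a 1μs kernel would get 100,000 iterations, and without `min_runs`, a slow kernel would get a single, statistically useless run.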
27.7.7 Integration with Triton
When using Triton, you can use its built-in benchmarking:
import triton
# Triton's built-in benchmark
ms = triton.testing.do_bench(lambda: kernel[grid](a, b, c))
# With custom settings (the exact signature and return value vary across
# Triton releases; check triton.testing.do_bench in your installed version)
times = triton.testing.do_bench(
    lambda: kernel[grid](a, b, c),
    warmup=25,
    rep=100,
    return_mode='all',  # return all measured times rather than a single reduction
)
# For comparing multiple implementations
@triton.testing.perf_report(
triton.testing.Benchmark(
x_names=['M'],
x_vals=[128, 256, 512, 1024],
line_arg='provider',
line_vals=['triton', 'torch'],
line_names=['Triton', 'PyTorch'],
xlabel='Matrix Size',
ylabel='Time (ms)',
)
)
def benchmark(M, provider):
if provider == 'triton':
return lambda: triton_matmul(a, b)
else:
return lambda: torch.mm(a, b)27.8 Building a Measurement Discipline
27.8.1 1. Define What You’re Measuring
Before writing any code:
- What metric matters? (Throughput? Latency? Memory?)
- What workload is representative?
- What’s the baseline?
27.8.2 2. Make Measurements Reproducible
import sys

import numpy as np
import torch

def set_reproducible_environment():
    # Fix random seeds
    np.random.seed(42)
    torch.manual_seed(42)
    torch.cuda.manual_seed_all(42)
    # Disable nondeterminism
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    # Document system state
    print(f"Python: {sys.version}")
    print(f"PyTorch: {torch.__version__}")
    print(f"CUDA: {torch.version.cuda}")
    print(f"GPU: {torch.cuda.get_device_name()}")

27.8.3 3. Automate Benchmarking
# benchmark.py
import json
from datetime import datetime

def run_benchmark_suite():
    results = {
        'timestamp': datetime.now().isoformat(),
        'commit': get_git_commit(),
        'system': get_system_info(),
        'benchmarks': {},
    }
    for name, fn in BENCHMARKS.items():
        results['benchmarks'][name] = benchmark(fn)
    with open(f'benchmarks/{datetime.now():%Y%m%d_%H%M%S}.json', 'w') as f:
        json.dump(results, f, indent=2)
    return results

27.8.4 4. Track Over Time
Performance regression detection:
Commit A: forward pass = 45.2 ± 0.3 ms
Commit B: forward pass = 47.8 ± 0.4 ms ← 5.7% regression!
Commit C: forward pass = 32.1 ± 0.2 ms ← 28.9% improvement!
CI systems can run benchmarks on every commit and alert on regressions.
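A minimal regression check might combine a statistical gate (is the new mean outside the baseline’s noise band?) with a practical one (is the slowdown big enough to care about?). A sketch, with the 3σ and 2% thresholds as assumed defaults:

```python
def detect_regression(baseline_mean, baseline_std, new_mean, n_sigma=3.0, min_pct=0.02):
    """Flag a regression when the new mean is both statistically and practically worse."""
    threshold = baseline_mean + n_sigma * baseline_std  # edge of the noise band
    pct_change = (new_mean - baseline_mean) / baseline_mean
    return new_mean > threshold and pct_change > min_pct

print(detect_regression(45.2, 0.3, 47.8))  # True: 5.7% slower, far outside 3σ
print(detect_regression(45.2, 0.3, 45.5))  # False: within the noise band
```

Requiring both gates avoids two failure modes: a very stable benchmark flagging a 0.1% blip, and a noisy one hiding a real slowdown behind a wide σ.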
27.9 Key Takeaways
Know what you’re measuring: Different tools measure different things. Wall time ≠ CPU time ≠ GPU time.
Statistical validity matters: One run is not a measurement. Use proper statistics to compare implementations.
Match the level to the question: Microbenchmarks for hypotheses, macrobenchmarks for validation.
Use systematic methodologies: USE for resources, RED for services. Don’t miss anything.
Avoid common traps: Warmup, synchronization, dead code elimination, cache state.
The goal is not to measure something. It’s to measure the right thing, correctly, and draw valid conclusions.
27.10 Further Reading
- Gregg (2020). “Systems Performance: Enterprise and the Cloud” (2nd ed.)
- Gregg. “The USE Method”: https://www.brendangregg.com/usemethod.html
- Mytkowicz et al. (2009). “Producing Wrong Data Without Doing Anything Obviously Wrong!”
- Fleming & Wallace (1986). “How Not to Lie with Statistics: The Correct Way to Summarize Benchmark Results”