30 Profiling Tools: From Novice to Expert
Mastering the Art of Measurement
You can’t optimize what you can’t measure. But most engineers use profilers wrong.
They sample too infrequently, misinterpret results, chase noise, and miss the real bottlenecks. This chapter teaches profiling literacy—the foundation of all performance work.
30.1 The Profiler’s Mindset
Before diving into tools, understand what profilers can and cannot tell you:
Profilers can:
- Show where time is spent (hotspots)
- Identify whether code is CPU-, memory-, or I/O-bound
- Reveal unexpected behavior (cache misses, branch mispredictions)
- Compare before/after optimizations

Profilers cannot:
- Tell you WHY something is slow (you must hypothesize)
- Fix performance bugs (they only reveal them)
- Replace thinking with data
- Give exact numbers (sampling introduces variance)
The golden rule: Profile to form hypotheses, not to find answers.
30.2 CPU Profiling with perf
perf is the Swiss Army knife of Linux performance analysis. It is built on the kernel’s perf_events subsystem, low-overhead, and comprehensive.
30.2.1 Quick Start: Finding Hotspots
# Record performance data
perf record -g ./my_program
# View results
perf report

What you see:
# Overhead Command Shared Object Symbol
# ........ ......... ................. ......
45.23% my_program my_program [.] matrix_multiply
23.15% my_program libc-2.31.so [.] memcpy
12.45% my_program my_program [.] softmax_kernel
8.32% my_program libpython3.9.so [.] PyEval_EvalFrameDefault
Interpretation:
- 45% of CPU time in matrix_multiply - prime optimization target
- 23% in memcpy - might be unnecessary copies
- 12% in softmax_kernel - worth investigating if it’s on the critical path
- 8% in the Python interpreter - acceptable overhead
30.2.2 Understanding Sampling
perf works by sampling - interrupting the program periodically and recording state.
Default: 1000 samples/second (1ms intervals)
# Increase sampling frequency (more accurate, higher overhead)
perf record -F 4000 -g ./my_program # 4000 Hz
# Decrease for low-overhead production profiling
perf record -F 99 -g ./my_program   # 99 Hz

Important: Higher frequency ≠ always better. Overhead can distort results.
30.2.3 Hardware Performance Counters
Modern CPUs expose hundreds of hardware performance events. perf can count and sample them.
Cache misses:
perf stat -e cache-references,cache-misses ./my_program

Output:
Performance counter stats for './my_program':
1,234,567,890 cache-references
45,678,901 cache-misses # 3.7% of all cache refs
2.345678 seconds time elapsed
Interpretation: 3.7% cache miss rate. Good for sequential access; concerning for random access.
Branch mispredictions:
perf stat -e branches,branch-misses ./my_program

CPU cycles and instructions:

perf stat -e cycles,instructions ./my_program

IPC (Instructions Per Cycle): instructions / cycles
- IPC > 2: Good CPU utilization
- IPC < 1: Likely memory-bound or stalled
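A quick worked example of the IPC arithmetic (the counter values below are made up for illustration):

# Hypothetical counts as reported by `perf stat -e cycles,instructions`
cycles = 4_000_000_000
instructions = 6_200_000_000

ipc = instructions / cycles
print(f"IPC = {ipc:.2f}")   # 1.55: not stalled, but far from a well-fed wide core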
30.2.4 Finding Cache Hotspots
perf record -e cache-misses -g ./my_program
perf report

This shows which functions cause cache misses, not just the count.
30.2.5 Annotated Source
See exactly which lines are slow:
perf record ./my_program
perf annotate matrix_multiply

Output shows assembly with cycle counts:
0.00 │ mov %rdi,%rax
34.21 │ vmovups (%rsi,%rax,4),%ymm0 ← Hot! 34% of time
0.15 │ vfmadd231ps (%rdx,%rax,4),%ymm0,%ymm1
28.43 │ vmovups 0x20(%rsi,%rax,4),%ymm0
...

This memory load takes 34% of the time - likely a cache miss.
30.2.6 Off-CPU Profiling
CPU profiling only shows time ON the CPU. What about blocked threads?
# Requires bpftrace or similar
perf record -e 'sched:sched_switch' -g --call-graph dwarf ./my_program

Shows time spent waiting (I/O, locks, etc.).
30.3 GPU Profiling with Nsight
GPU profiling is fundamentally different from CPU profiling.
30.3.1 Nsight Systems: The Big Picture
Nsight Systems shows a timeline: what the GPU is doing over time.
nsys profile -o output.qdrep python train.py

Open output.qdrep in the Nsight Systems GUI.
What you see:
- CUDA API calls (cudaMemcpy, kernel launches)
- Kernel execution on the GPU
- CPU threads
- Gaps (idle time)
Common patterns to look for (a minimal PyTorch sketch of the fixes follows this list):

- GPU bubbles: Gaps between kernels
  - Cause: CPU launch overhead
  - Fix: Kernel fusion, larger batches
- CPU-GPU synchronization:
  - Cause: Unnecessary synchronize() calls
  - Fix: Remove or defer synchronization
- Memory copy dominance:
  - Cause: Too much H2D or D2H traffic
  - Fix: Keep data on GPU, fuse operations
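For the synchronization and memory-copy patterns, a minimal PyTorch sketch of what the fixes look like in code (illustrative only; the shapes and loop counts are made up):

import torch

x = torch.randn(4096, 4096, device="cuda")

# CPU-GPU synchronization: calling .item() inside the loop forces a device
# sync every iteration. Accumulate on the GPU and synchronize once at the end.
total = torch.zeros((), device="cuda")
for _ in range(100):
    total += (x @ x).sum()      # stays on the GPU, no implicit sync
loss_value = total.item()       # single synchronization point

# Memory copy dominance: do the H2D copy once, outside the hot loop, from
# pinned host memory so the copy can overlap with compute.
host_batch = torch.randn(1024, 1024).pin_memory()
device_batch = host_batch.to("cuda", non_blocking=True)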
30.3.2 Nsight Compute: Kernel-Level Analysis
Nsight Compute profiles individual kernels in detail.
ncu -o profile --set full python train.py

Open profile.ncu-rep in the Nsight Compute GUI.
Key metrics:
SM Efficiency:
- Percentage of time SMs are actively computing
- Target: >80%
- Low? Check occupancy or warp utilization.

Occupancy:
- Percentage of GPU resources utilized
- Target: >50% (higher is not always better)
- Low? Register or shared memory pressure.

Memory Throughput:
- Bytes/second vs. theoretical peak
- Target: >70% for memory-bound kernels
- Low? Memory access pattern issues (coalescing).
Example output:
Kernel: flash_attention_forward
Duration: 2.34 ms
SM Efficiency: 45.2% ← Could be better
Occupancy: 31.2% ← Low! Register pressure?
Memory Throughput: 1.2 TB/s (85% of peak) ← Good
Interpretation:
- High memory throughput (85%) - the memory access pattern is good
- Low occupancy (31%) - using too many registers?
- Low SM efficiency - might be waiting on memory despite good throughput
Action: Reduce register usage or shared memory usage to increase occupancy.
30.3.3 Roofline Analysis in Nsight
Nsight Compute can generate roofline plots:
ncu --metrics sm__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,\
sm__sass_average_data_bytes_per_sector_mem_global_op_st.pct \
python train.py

Shows the kernel's position on the roofline:
- Below the roofline: Optimization opportunity
- On the roofline: Near-optimal for its arithmetic intensity
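The same roofline bound can be computed by hand as a sanity check. A small sketch with assumed hardware peaks and made-up kernel counts (substitute numbers for your GPU and kernel):

# Assumed hardware peaks (roughly A100-class; adjust for your hardware)
peak_flops = 312e12        # FLOP/s (FP16 tensor cores)
peak_bw = 2.0e12           # bytes/s of DRAM bandwidth

# Made-up kernel measurements
kernel_flops = 8.6e12      # floating-point operations executed
kernel_bytes = 4.3e10      # bytes moved to/from DRAM

ai = kernel_flops / kernel_bytes          # arithmetic intensity, FLOP/byte
bound = min(peak_flops, ai * peak_bw)     # roofline ceiling at this intensity
print(f"AI = {ai:.0f} FLOP/B, attainable = {bound / 1e12:.0f} TFLOP/s")

If the kernel's measured throughput sits well below this bound, there is headroom; if it is close, the kernel is near-optimal for its arithmetic intensity.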
30.4 PyTorch Profiler
PyTorch has built-in profiling for Python and GPU.
import torch
from torch.profiler import profile, ProfilerActivity
model = MyModel().cuda()
inputs = torch.randn(32, 3, 224, 224).cuda()
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    with_stack=True
) as prof:
    model(inputs)
# Print summary
print(prof.key_averages().table(sort_by="cuda_time_total"))
# Export a Chrome trace (open in chrome://tracing or Perfetto)
prof.export_chrome_trace("trace.json")

Output:
--------------------------------- ------------ ------------ --------
Name CPU time CUDA time Calls
--------------------------------- ------------ ------------ --------
aten::conv2d 5.234ms 45.678ms 16
aten::batch_norm 2.123ms 12.345ms 16
aten::relu 0.456ms 3.456ms 16
TensorBoard visualization:
tensorboard --logdir=./logs

Shows a timeline similar to Nsight Systems but with PyTorch op granularity.
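The snippet above exports a Chrome trace; to land traces in ./logs where TensorBoard can pick them up, hand the profiler a trace handler. A hedged sketch, reusing model and inputs from above and assuming the torch-tb-profiler TensorBoard plugin is installed:

from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3),
    on_trace_ready=tensorboard_trace_handler("./logs"),
) as prof:
    for _ in range(5):
        model(inputs)
        prof.step()   # advance the wait/warmup/active schedule each iteration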
30.4.1 Finding Data Movement Issues
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    profile_memory=True
) as prof:
    model(inputs)

print(prof.key_averages().table(
    sort_by="self_cuda_memory_usage",
    row_limit=10
))

Shows which ops allocate the most memory.
30.5 Flame Graphs
Flame graphs visualize call stacks from profiling data.
30.5.1 Generating Flame Graphs
# Record with perf
perf record -F 99 -a -g -- sleep 30
# Convert to flame graph
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

Reading flame graphs:
- X-axis: Alphabetical order (NOT time!)
- Y-axis: Call stack depth
- Width: Time spent in that function
- Color: Random (for differentiation)
What to look for:
- Wide plateaus at the top = hotspots
- Tall towers = deep call stacks (recursion or abstraction layers)
Example interpretation:
┌─────────────────────────────────────────────┐
│ main │ 100%
├────────────┬──────────────┬─────────────────┤
│ forward() │ backward() │ optimizer.step()│
│ 40% │ 45% │ 15% │
├──────┬─────┼──────┬───────┼─────────────────┤
│matmul│norm │matmul│softmax│ │
│ 30% │ 10% │ 35% │ 10% │ │
Insight: matmul in backward (35%) is the biggest single hotspot.
30.6 Common Profiling Mistakes
30.6.1 Mistake 1: Profiling Debug Builds
# DON'T
gcc -g -O0 my_program.c && perf record ./a.out
# DO
gcc -O3 -g my_program.c && perf record ./a.out

Always profile optimized builds with debug symbols (-O3 -g).
30.6.2 Mistake 2: Too-Short Runs
# DON'T - noise dominates
for _ in range(10):
    model(input)

# DO - enough samples for significance
for _ in range(1000):
    model(input)

Profile long enough that variance is <5%.
30.6.3 Mistake 3: Profiling Different Code Paths
# DON'T - comparing apples to oranges
if use_optimized:
    result = fast_path(x)
else:
    result = slow_path(x)  # Different algorithm!

Profile the same algorithm before/after optimization.
30.6.4 Mistake 4: Ignoring Variance
# Single run - could be luck
perf stat ./my_program
# Better - 10 runs, report median and stddev
for i in {1..10}; do perf stat ./my_program 2>&1; done | grep "time elapsed"

Report distributions, not single numbers.
30.6.5 Mistake 5: Profile-Driven Development
Profiling should inform optimization, not drive it.
Bad workflow:
1. Profile
2. Optimize the hottest function
3. Repeat

Good workflow:
1. Understand what should be fast (theory)
2. Profile to see what actually is slow
3. Form a hypothesis about why
4. Optimize based on the hypothesis
5. Profile to verify
30.7 Memory Profiling
30.7.1 Cachegrind (Valgrind)
Simulates cache hierarchy and counts misses.
valgrind --tool=cachegrind ./my_program

Output:
==12345== I refs: 100,000,000
==12345== I1 misses: 1,234
==12345== LLi misses: 567
==12345== I1 miss rate: 0.00%
==12345== LLi miss rate: 0.00%
==12345==
==12345== D refs: 50,000,000
==12345== D1 misses: 1,234,567 (1,000,000 rd + 234,567 wr)
==12345== LLd misses: 234,567 ( 200,000 rd + 34,567 wr)
==12345== D1 miss rate: 2.5%
==12345== LLd miss rate: 0.5%
Warning: Valgrind is slow (10-50× overhead). Use for understanding, not benchmarking.
30.7.2 heaptrack (Heap Allocations)
Tracks memory allocations over time.
heaptrack ./my_program
heaptrack_gui heaptrack.my_program.12345.gz

Shows:
- Peak memory usage
- Allocation hotspots
- Leaks
30.8 Production Profiling
30.8.1 Continuous Profiling
Tools like pprof (Google), Pyroscope, or Datadog enable profiling in production.
Why important:
- Synthetic benchmarks miss real workloads
- Performance issues emerge over time
- Different code paths in production
Example with pprof (Go):
import _ "net/http/pprof"
go func() {
log.Println(http.ListenAndServe("localhost:6060", nil))
}()Access profiles at http://localhost:6060/debug/pprof/
For Python (py-spy):
# Profile running Python process
sudo py-spy record -o profile.svg --pid 12345 --duration 60

30.8.2 Sampling Strategy
Development: High frequency (1000 Hz+), short runs
Staging: Medium frequency (100-500 Hz), realistic workloads
Production: Low frequency (10-99 Hz), continuous
The “99 Hz rule”: sampling at 99 Hz rather than a round 100 Hz keeps samples out of lockstep with periodic activity (e.g. 100 Hz timer ticks), avoiding aliasing.
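A tiny simulation makes the aliasing concrete. Assuming a task that does something every 10 ms (a made-up period), sampling at exactly 100 Hz always lands on the same phase of that period, while 99 Hz drifts across it:

period_us = 10_000                 # hypothetical periodic activity: every 10 ms
for freq_hz in (100, 99):
    interval_us = round(1_000_000 / freq_hz)
    phases = {(i * interval_us) % period_us for i in range(1000)}
    print(f"{freq_hz} Hz: {len(phases)} distinct phase(s) observed")
# 100 Hz -> 1 phase     (aliased: every sample sees the same instant of the period)
#  99 Hz -> 1000 phases (samples spread across the whole period)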
30.9 Statistical Significance
30.9.1 Measuring Variance
# Run benchmark 30 times
for i in {1..30}; do
/usr/bin/time -f "%e" ./my_program 2>&1 | tail -1
done > times.txt
# Calculate mean and stddev
awk '{sum+=$1; sumsq+=$1*$1} END {
    mean=sum/NR;
    stddev=sqrt(sumsq/NR - mean*mean);
    printf "Mean: %.3f\nStddev: %.3f\nCoV: %.1f%%\n",
        mean, stddev, 100*stddev/mean
}' times.txt

Coefficient of Variation (CoV): stddev / mean
- CoV < 5%: Good, trustworthy
- CoV 5-10%: Moderate, run more samples
- CoV > 10%: High variance, investigate why
30.9.2 Comparing Optimizations
Is the speedup real or noise?
Use t-test or Mann-Whitney U test:
import numpy as np
from scipy import stats

baseline = [2.31, 2.34, 2.29, 2.33, 2.32]   # seconds
optimized = [1.98, 2.01, 1.99, 2.02, 2.00]

# Mann-Whitney U test (non-parametric)
statistic, p_value = stats.mannwhitneyu(baseline, optimized)

if p_value < 0.05:
    print(f"Speedup is statistically significant (p={p_value:.4f})")
    print(f"Median speedup: {np.median(baseline) / np.median(optimized):.2f}×")
else:
    print(f"No significant difference (p={p_value:.4f})")

Rule: Only claim a speedup if p < 0.05 (95% confidence).
30.10 Profiling Workflow
30.10.1 The Six-Step Process
1. Establish baseline

for i in {1..10}; do perf stat ./program 2>&1; done > baseline.txt

2. Profile hotspots

perf record -g ./program
perf report

3. Form a hypothesis: “Function X is slow because it misses the cache.”

4. Verify the hypothesis

perf stat -e cache-misses ./program

5. Optimize: change the code based on the hypothesis.

6. Measure impact (see the comparison sketch below)

for i in {1..10}; do perf stat ./program_optimized 2>&1; done > optimized.txt
# Compare against the baseline with a statistical test
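A hedged Python sketch of the final comparison, assuming the elapsed times from both runs have been extracted to one value per line (e.g. with grep "time elapsed" and awk '{print $1}' as in 30.9.1) into the hypothetical files baseline_times.txt and optimized_times.txt:

import numpy as np
from scipy import stats

baseline = np.loadtxt("baseline_times.txt")     # seconds per run, baseline build
optimized = np.loadtxt("optimized_times.txt")   # seconds per run, optimized build

_, p_value = stats.mannwhitneyu(baseline, optimized)
speedup = np.median(baseline) / np.median(optimized)
verdict = "significant" if p_value < 0.05 else "not significant"
print(f"Median speedup {speedup:.2f}x (p = {p_value:.4f}, {verdict})")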
30.11 Tools Cheat Sheet
| Task | CPU Tool | GPU Tool | Overhead |
|---|---|---|---|
| Hotspots | perf record -g | Nsight Systems | Low |
| Cache misses | perf stat -e cache-misses | - | Very low |
| Kernel analysis | - | Nsight Compute | High |
| Memory allocations | heaptrack | Nsight Systems | Medium |
| Timeline | perf record -e sched:* | Nsight Systems | Medium |
| Production | py-spy, pprof | - | Very low |
| Detailed assembly | perf annotate | Nsight Compute (SASS) | - |
30.12 Connections
Chapter 1 (Memory Hierarchy): Profilers reveal cache misses that validate the hierarchy model.
Chapter 2 (Bandwidth): Memory throughput metrics in Nsight confirm bandwidth bounds.
Chapter 15 (Measurement): This chapter provides the tools for the measurement methodology.
Chapter 16 (Hypothesis): Profiling data forms the basis for performance hypotheses.
30.13 Key Takeaways
Profile optimized code: Debug builds are misleading.
Sample wisely: 99-1000 Hz depending on environment. Higher isn’t always better.
Measure variance: Report median ± stddev, not single runs.
Use hardware counters: Cache misses, branch mispredicts, IPC reveal bottlenecks.
GPU timeline first, kernel details second: Nsight Systems before Nsight Compute.
Statistical significance matters: p < 0.05 or the speedup might be noise.
Profile informs, theory guides: Don’t optimize blindly based on hotspots.
Week 1: Master perf stat and perf record
Week 2: Learn the PyTorch profiler + TensorBoard
Week 3: Nsight Systems basics
Week 4: Nsight Compute for kernel optimization
Week 5: Statistical analysis of results
By week 5, you’ll be profiling like a pro.
30.14 Further Reading
- Brendan Gregg’s “Systems Performance” - comprehensive profiler reference
- perf documentation: https://perf.wiki.kernel.org/
- Nsight documentation: https://developer.nvidia.com/nsight-systems
- PyTorch profiler tutorial: https://pytorch.org/tutorials/recipes/recipes/profiler_recipe.html