The Extended Roofline
The roofline model transformed how we reason about single-device performance. For distributed systems, we extend it with a third ceiling that often dominates: network bandwidth.
The Question: You're training on 256 GPUs. Each step, every GPU computes gradients and you AllReduce them. Your per-GPU compute takes 100ms. Your AllReduce takes 200ms. You're spending ⅔ of your time on communication. How do you analyze this? How do you improve it?
The Original Roofline¶
Connection to The Algebra of Speed
The single-device roofline model is covered in depth in the companion book The Algebra of Speed. There, the focus is on memory-bandwidth and compute ceilings for individual kernels. Here we extend it with a third ceiling—network bandwidth—which often dominates at distributed scale.
The roofline model, introduced by Williams, Waterman, and Patterson (2009), bounds performance by two ceilings:
Where arithmetic intensity is:
Operations with high arithmetic intensity (matrix multiplication: \(O(n^3)\) FLOPs, \(O(n^2)\) memory) are compute-bound. Operations with low intensity (vector addition: \(O(n)\) FLOPs, \(O(n)\) memory) are memory-bound.
The Third Ceiling: Network¶
For distributed training, we add:
Where communication intensity is:
When plotting, interpret intensity relative to the resource you're comparing: \(I_{\text{mem}}\) for memory ceilings and \(I_{\text{net}}\) for network ceilings.
This third ceiling often dominates. Consider AllReduce of a 10GB gradient tensor:
- Communication: ~20GB per GPU (ReduceScatter + AllGather); total network volume scales with \(P\)
- At 400 Gbps InfiniBand: \(\frac{20 \times 8}{400} = 0.4\) seconds
- Compute for same step might be 0.1 seconds
We're 4× communication-bound.
Visualizing the Extended Roofline¶
The extended roofline model bounds performance by three ceilings. The chart below shows performance vs. arithmetic intensity on schematic axes:
xychart-beta
title "Extended Roofline Model"
x-axis "Arithmetic Intensity (FLOPs/byte)" [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
y-axis "Performance (TFLOP/s)" 0 --> 2000
line "Network Ceiling" [50, 100, 200, 400, 800, 989, 989, 989, 989, 989, 989]
line "Memory Ceiling" [100, 200, 400, 800, 989, 989, 989, 989, 989, 989, 989]
line "Compute Ceiling" [989, 989, 989, 989, 989, 989, 989, 989, 989, 989, 989]
How to read this:
- Sloped regions: Performance scales with intensity (bandwidth-limited)
- Flat region: Performance hits peak FLOP/s (compute-limited)
- Ridge points: Where slopes meet ceilings (transitions between regimes)
| Region | Ceiling | Intensity | Performance Limited By |
|---|---|---|---|
| Left slope | Network | Low \(I_{net}\) | Network bandwidth |
| Middle slope | Memory | Low \(I_{mem}\) | HBM bandwidth |
| Flat top | Compute | High | Peak FLOP/s |
Ridge points mark transitions between regimes. For H100 (~989 TFLOP/s dense FP16/BF16) with 400 Gbps InfiniBand:
The Three Regimes¶
Compute-Bound¶
When \(I_{\text{net}}\) and \(I_{\text{mem}}\) are high:
- GPU utilization is near peak
- Adding more GPUs can help linearly while communication and memory overheads stay subdominant
- Example: Large batch matrix multiplications
Memory-Bound¶
When \(I_{\text{mem}}\) is low, \(I_{\text{net}}\) is high:
- Limited by HBM bandwidth
- Techniques: Kernel fusion, recomputation
- Example: Element-wise operations, small batch attention
Communication-Bound¶
When \(I_{\text{net}}\) is low:
- Limited by network bandwidth or latency
- Techniques: Gradient compression, overlap, reduced precision communication
- Example: Gradient synchronization, parameter servers
Practice
If NCCL/communication is >30% of step time, estimate \(I_{\text{net}}\) and compare to the ridge point. For H100 + 400 Gbps InfiniBand, \(I_{\text{ridge}} \approx 20k\) FLOPs/byte. Below that, increase batch/accumulation or move low-intensity collectives to faster links.
Calculating Communication Intensity¶
For different parallelism strategies:
Data Parallelism¶
Let \(B\) be global batch size (sequences), \(S\) sequence length, \(P\) GPUs, and \(s\) bytes per parameter (e.g., \(s=2\) for FP16/BF16). Per step per GPU: \(6\Psi B S / P\) FLOPs (forward + backward on \(B/P\) samples), \(\approx 2\Psi s\) bytes AllReduced (for large \(P\))
As per-GPU token batch \(B \cdot S / P\) increases, communication intensity increases → less communication-bound.
Tensor Parallelism¶
Per layer with hidden dim \(H\), batch \(B\), sequence \(S\):
- FLOPs: \(O(BSH^2)\)
- Communication: \(O(BSH)\) (AllGather/ReduceScatter activations across tensor-parallel ranks)
Higher hidden dimension → higher intensity → more efficient tensor parallelism.
Pipeline Parallelism¶
Communication only at stage boundaries:
- FLOPs: Full model computation
- Communication: Activation tensors between stages
Generally the most communication-efficient strategy.
The Ridge Point¶
The ridge point is where two ceilings intersect. For distributed training:
For an H100 (989 TFLOP/s) with 400 Gbps (50 GB/s) InfiniBand:
Operations with communication intensity below this are network-bound.
Implications for System Design¶
- Maximize communication intensity: Increase batch size, use gradient accumulation
- Overlap communication with compute: Hide latency behind useful work
- Compress communication: Reduce bytes transferred
- Choose topology wisely: TP within nodes (high BW), PP across nodes (tolerates latency)
The extended roofline is our primary tool for analyzing distributed training bottlenecks.
Exercises¶
- Calculate the communication intensity of training a 7B parameter model with batch size 1M tokens across 64 GPUs using data parallelism.
Solution
Given:
- Parameters: \(\Psi = 7 \times 10^9\)
- Tokens per step: \(B_{\text{tok}} = 1 \times 10^6\)
- GPUs: \(P = 64\) (data parallel)
Per-GPU FLOPs per step:
Per-GPU bytes communicated (AllReduce gradients in FP16):
Communication intensity:
This is above the InfiniBand ridge point (~19,780 FLOPs/byte), so the workload is compute-bound—but just barely. Halving the batch size to 500K tokens would drop intensity to ~11,700 FLOPs/byte, making the workload communication-bound over InfiniBand.
- An H100 DGX system has 900 GB/s NVLink bandwidth within the node and 400 Gbps InfiniBand across nodes. What's the ratio of ridge points for intra-node vs inter-node communication?
Solution
Ridge point formula:
For H100 (989 TFLOP/s peak):
NVLink (900 GB/s):
InfiniBand (400 Gbps = 50 GB/s):
Ratio: $\(\frac{I_{ridge}^{IB}}{I_{ridge}^{NVLink}} = \frac{19,780}{1,099} \approx 18\times\)$
Implication: Operations need 18× higher arithmetic intensity to be compute-bound over InfiniBand vs NVLink. This is why tensor parallelism (low intensity) uses NVLink, while data parallelism (high intensity) can use InfiniBand.
- You observe 35% MFU (Model FLOP Utilization) on a training run. Using the extended roofline, what are the possible bottlenecks and how would you diagnose which one dominates?
Solution
Possible bottlenecks (65% of potential is lost):
- Network-bound: Communication not overlapped with compute
- Memory-bound: Low arithmetic intensity operations (attention, LayerNorm)
- Pipeline bubbles: Idle time at stage boundaries
- Load imbalance: Some GPUs finishing before others
- Kernel inefficiency: Suboptimal GPU kernels
Diagnostic approach:
| Symptom | Likely Cause | How to Check |
|---|---|---|
| High NCCL time in profile | Network-bound | Profile shows exposed AllReduce |
| Low SM occupancy | Memory-bound | NSight shows memory stalls |
| Periodic idle gaps | Pipeline bubbles | Timeline shows stage gaps |
| Uneven GPU utilization | Load imbalance | Per-GPU profiling |
Steps:
- Run profiler (PyTorch Profiler, NSight)
- Check communication/compute overlap in timeline
- Measure achieved bandwidth vs theoretical
- Check arithmetic intensity of dominant operations
- If pipeline parallel, calculate bubble fraction: \(\frac{p-1}{m+p-1}\)
Key Takeaways¶
- Distributed performance has three ceilings: compute, memory bandwidth, and network bandwidth.
- Communication intensity is the diagnostic lever: low \(I_{\text{net}}\) means you are network-bound.
- Topology choice follows the ridge point: TP prefers high-bandwidth links; DP tolerates slower networks when batches are large.