The Extended Roofline

The roofline model transformed how we reason about single-device performance. For distributed systems, we extend it with a third ceiling that often dominates: network bandwidth.

The Question: You're training on 256 GPUs. Each step, every GPU computes gradients and you AllReduce them. Your per-GPU compute takes 100ms. Your AllReduce takes 200ms. You're spending ⅔ of your time on communication. How do you analyze this? How do you improve it?

The Original Roofline¶

Connection to The Algebra of Speed

The single-device roofline model is covered in depth in the companion book The Algebra of Speed. There, the focus is on memory-bandwidth and compute ceilings for individual kernels. Here we extend it with a third ceiling—network bandwidth—which often dominates at distributed scale.

The roofline model, introduced by Williams, Waterman, and Patterson (2009), bounds performance by two ceilings:

\[\text{Performance} \leq \min\left(\text{Peak FLOP/s}, \quad \text{Memory BW} \times \text{Arithmetic Intensity}\right)\]

Where arithmetic intensity is:

\[I = \frac{\text{FLOPs}}{\text{Bytes accessed from memory}}\]

Operations with high arithmetic intensity (matrix multiplication: $O(n^3)$ FLOPs, $O(n^2)$ memory) are compute-bound. Operations with low intensity (vector addition: $O(n)$ FLOPs, $O(n)$ memory) are memory-bound.

The Third Ceiling: Network¶

For distributed training, we add:

\[\text{Performance} \leq \min\left(\text{Peak FLOP/s}, \quad \text{Mem BW} \times I_{\text{mem}}, \quad \text{Net BW} \times I_{\text{net}}\right)\]

Where communication intensity is:

\[I_{\text{net}} = \frac{\text{FLOPs}}{\text{Bytes communicated}}\]

When plotting, interpret intensity relative to the resource you're comparing: $I_{\text{mem}}$ for memory ceilings and $I_{\text{net}}$ for network ceilings.

This third ceiling often dominates. Consider AllReduce of a 10GB gradient tensor:

Communication: ~20GB per GPU (ReduceScatter + AllGather); total network volume scales with $P$
At 400 Gbps InfiniBand: $\frac{20 \times 8}{400} = 0.4$ seconds
Compute for same step might be 0.1 seconds

We're 4× communication-bound.

Visualizing the Extended Roofline¶

The extended roofline model bounds performance by three ceilings. The chart below shows performance vs. arithmetic intensity on schematic axes:

xychart-beta
    title "Extended Roofline Model"
    x-axis "Arithmetic Intensity (FLOPs/byte)" [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]
    y-axis "Performance (TFLOP/s)" 0 --> 2000
    line "Network Ceiling" [50, 100, 200, 400, 800, 989, 989, 989, 989, 989, 989]
    line "Memory Ceiling" [100, 200, 400, 800, 989, 989, 989, 989, 989, 989, 989]
    line "Compute Ceiling" [989, 989, 989, 989, 989, 989, 989, 989, 989, 989, 989]

How to read this:

Sloped regions: Performance scales with intensity (bandwidth-limited)
Flat region: Performance hits peak FLOP/s (compute-limited)
Ridge points: Where slopes meet ceilings (transitions between regimes)

Region	Ceiling	Intensity	Performance Limited By
Left slope	Network	Low $I_{net}$	Network bandwidth
Middle slope	Memory	Low $I_{mem}$	HBM bandwidth
Flat top	Compute	High	Peak FLOP/s

Ridge points mark transitions between regimes. For H100 (~989 TFLOP/s dense FP16/BF16) with 400 Gbps InfiniBand:

\[I_{ridge}^{net} = \frac{989 \times 10^{12} \text{ FLOP/s}}{50 \times 10^9 \text{ B/s}} \approx 19,780 \text{ FLOPs/byte}\]

The Three Regimes¶

Compute-Bound¶

When $I_{\text{net}}$ and $I_{\text{mem}}$ are high:

GPU utilization is near peak
Adding more GPUs can help linearly while communication and memory overheads stay subdominant
Example: Large batch matrix multiplications

Memory-Bound¶

When $I_{\text{mem}}$ is low, $I_{\text{net}}$ is high:

Limited by HBM bandwidth
Techniques: Kernel fusion, recomputation
Example: Element-wise operations, small batch attention

Communication-Bound¶

When $I_{\text{net}}$ is low:

Limited by network bandwidth or latency
Techniques: Gradient compression, overlap, reduced precision communication
Example: Gradient synchronization, parameter servers

Practice

If NCCL/communication is >30% of step time, estimate $I_{\text{net}}$ and compare to the ridge point. For H100 + 400 Gbps InfiniBand, $I_{\text{ridge}} \approx 20k$ FLOPs/byte. Below that, increase batch/accumulation or move low-intensity collectives to faster links.

Calculating Communication Intensity¶

For different parallelism strategies:

Data Parallelism¶

Let $B$ be global batch size (sequences), $S$ sequence length, $P$ GPUs, and $s$ bytes per parameter (e.g., $s=2$ for FP16/BF16). Per step per GPU: $6\Psi B S / P$ FLOPs (forward + backward on $B/P$ samples), $\approx 2\Psi s$ bytes AllReduced (for large $P$)

\[I_{\text{net}}^{\text{DP}} = \frac{6\Psi B S / P}{2\Psi s} = \frac{3BS}{Ps}\]

As per-GPU token batch $B \cdot S / P$ increases, communication intensity increases → less communication-bound.

Tensor Parallelism¶

Per layer with hidden dim $H$, batch $B$, sequence $S$:

FLOPs: $O(BSH^2)$
Communication: $O(BSH)$ (AllGather/ReduceScatter activations across tensor-parallel ranks)

\[I_{\text{net}}^{\text{TP}} = O(H)\]

Higher hidden dimension → higher intensity → more efficient tensor parallelism.

Pipeline Parallelism¶

Communication only at stage boundaries:

FLOPs: Full model computation
Communication: Activation tensors between stages

\[I_{\text{net}}^{\text{PP}} = \frac{\text{Total FLOPs}}{\text{Activation size} \times \text{num stages}}\]

Generally the most communication-efficient strategy.

The Ridge Point¶

The ridge point is where two ceilings intersect. For distributed training:

\[I_{\text{ridge}} = \frac{\text{Peak FLOP/s}}{\text{Network BW}}\]

For an H100 (989 TFLOP/s) with 400 Gbps (50 GB/s) InfiniBand:

\[I_{\text{ridge}} = \frac{989 \times 10^{12}}{50 \times 10^9} = 19,780 \text{ FLOPs/byte}\]

Operations with communication intensity below this are network-bound.

Implications for System Design¶

Maximize communication intensity: Increase batch size, use gradient accumulation
Overlap communication with compute: Hide latency behind useful work
Compress communication: Reduce bytes transferred
Choose topology wisely: TP within nodes (high BW), PP across nodes (tolerates latency)

The extended roofline is our primary tool for analyzing distributed training bottlenecks.

Exercises¶

Calculate the communication intensity of training a 7B parameter model with batch size 1M tokens across 64 GPUs using data parallelism.

Solution

Given:

Parameters: $\Psi = 7 \times 10^9$
Tokens per step: $B_{\text{tok}} = 1 \times 10^6$
GPUs: $P = 64$ (data parallel)

Per-GPU FLOPs per step:

\[\text{FLOPs}_{\text{per-GPU}} = \frac{6 \times \Psi \times B_{\text{tok}}}{P} = \frac{6 \times 7 \times 10^9 \times 10^6}{64} = 6.56 \times 10^{14}\]

Per-GPU bytes communicated (AllReduce gradients in FP16):

\[\text{Bytes} = 2 \times \frac{P-1}{P} \times \Psi \times 2 \approx 2 \times 7 \times 10^9 \times 2 = 2.8 \times 10^{10}\]

Communication intensity:

\[I_{\text{net}} = \frac{6.56 \times 10^{14}}{2.8 \times 10^{10}} \approx 23{,}400 \text{ FLOPs/byte}\]

This is above the InfiniBand ridge point (~19,780 FLOPs/byte), so the workload is compute-bound—but just barely. Halving the batch size to 500K tokens would drop intensity to ~11,700 FLOPs/byte, making the workload communication-bound over InfiniBand.

An H100 DGX system has 900 GB/s NVLink bandwidth within the node and 400 Gbps InfiniBand across nodes. What's the ratio of ridge points for intra-node vs inter-node communication?

Solution

Ridge point formula:

\[I_{ridge} = \frac{\text{Peak FLOPs}}{\text{Bandwidth}}\]

For H100 (989 TFLOP/s peak):

NVLink (900 GB/s):

\[I_{ridge}^{NVLink} = \frac{989 \times 10^{12}}{900 \times 10^9} = 1,099 \text{ FLOPs/byte}\]

InfiniBand (400 Gbps = 50 GB/s):

\[I_{ridge}^{IB} = \frac{989 \times 10^{12}}{50 \times 10^9} = 19,780 \text{ FLOPs/byte}\]

Ratio: $$\frac{I_{ridge}^{IB}}{I_{ridge}^{NVLink}} = \frac{19,780}{1,099} \approx 18\times$$

Implication: Operations need 18× higher arithmetic intensity to be compute-bound over InfiniBand vs NVLink. This is why tensor parallelism (low intensity) uses NVLink, while data parallelism (high intensity) can use InfiniBand.

You observe 35% MFU (Model FLOP Utilization) on a training run. Using the extended roofline, what are the possible bottlenecks and how would you diagnose which one dominates?

Solution

Possible bottlenecks (65% of potential is lost):

Network-bound: Communication not overlapped with compute
Memory-bound: Low arithmetic intensity operations (attention, LayerNorm)
Pipeline bubbles: Idle time at stage boundaries
Load imbalance: Some GPUs finishing before others
Kernel inefficiency: Suboptimal GPU kernels

Diagnostic approach:

Symptom	Likely Cause	How to Check
High NCCL time in profile	Network-bound	Profile shows exposed AllReduce
Low SM occupancy	Memory-bound	NSight shows memory stalls
Periodic idle gaps	Pipeline bubbles	Timeline shows stage gaps
Uneven GPU utilization	Load imbalance	Per-GPU profiling

Steps:

Run profiler (PyTorch Profiler, NSight)
Check communication/compute overlap in timeline
Measure achieved bandwidth vs theoretical
Check arithmetic intensity of dominant operations
If pipeline parallel, calculate bubble fraction: $\frac{p-1}{m+p-1}$

Key Takeaways¶

Distributed performance has three ceilings: compute, memory bandwidth, and network bandwidth.
Communication intensity is the diagnostic lever: low $I_{\text{net}}$ means you are network-bound.
Topology choice follows the ridge point: TP prefers high-bandwidth links; DP tolerates slower networks when batches are large.

Region	Ceiling	Intensity	Performance Limited By
Left slope	Network	Low \(I_{net}\)	Network bandwidth
Middle slope	Memory	Low \(I_{mem}\)	HBM bandwidth
Flat top	Compute	High	Peak FLOP/s