The Collective Cost Model
Armed with α-β analysis and algorithm knowledge, we can predict the cost of any collective operation before running it. This predictive capability is essential for capacity planning.
The Question: Your AllReduce takes 500ms. Is that good or bad? How do you know if you're achieving near-optimal performance? What's the gap between theoretical and achieved, and where does the gap come from?
Chapter Map
Prerequisites: Chapter 4 (α-β model fundamentals), Chapter 12 (ring vs tree algorithms)
Key insight: Every collective has a predictable cost formula. Compare your measured times against these formulas to identify inefficiencies—the gap reveals whether you're limited by latency, bandwidth, or implementation overhead.
The Complete Cost Formulas¶
Using the α-β model, we can predict communication time for any collective.
Theory
The formulas below are algorithmic lower bounds under idealized assumptions (perfect overlap, no contention).
Practice
Real systems typically achieve ~70–90% of these bounds depending on topology, kernel efficiency, and overlap.
Point-to-Point Operations¶
Send/Recv:
\(T = \alpha + \frac{n}{\beta}\)
This is the foundation. All collective costs derive from compositions of point-to-point operations.
Broadcast and Reduce¶
Broadcast (tree algorithm):
\(T = \log_2 P \cdot \left(\alpha + \frac{n}{\beta}\right)\)
Reduce (tree algorithm):
\(T = \log_2 P \cdot \left(\alpha + \frac{n}{\beta}\right)\)
Note: Both use tree algorithms because they have a single source/destination.
Scatter and Gather¶
Scatter (binomial tree):
\(T = \log_2 P \cdot \alpha + \frac{P-1}{P} \cdot \frac{n}{\beta}\)
The bandwidth term improves because at each level, the data size halves.
Gather (binomial tree):
\(T = \log_2 P \cdot \alpha + \frac{P-1}{P} \cdot \frac{n}{\beta}\)
AllReduce¶
Ring algorithm (optimal for large messages):
\(T = 2(P-1) \cdot \alpha + 2 \cdot \frac{P-1}{P} \cdot \frac{n}{\beta}\)
Tree algorithm (optimal for small messages):
\(T = 2\log_2 P \cdot \left(\alpha + \frac{n}{\beta}\right)\)
Recursive halving-doubling (best of both for power-of-2 P):
\(T = 2\log_2 P \cdot \alpha + 2 \cdot \frac{P-1}{P} \cdot \frac{n}{\beta}\)
Practice
Compare measured NCCL time to the formula above. If you achieve below ~70% of the model's prediction, you're likely latency-bound (too many small buckets) or bandwidth-bound (contention). Fuse into bigger buckets to amortize latency, or reduce overlap contention to recover bandwidth.
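The latency/bandwidth crossover in that Practice note can be computed directly from the ring formula: setting \(2(P-1)\alpha = 2\cdot\frac{P-1}{P}\cdot\frac{n}{\beta}\) and solving for \(n\) gives \(n^* = P\alpha\beta\). A minimal sketch (the helper name is ours, not NCCL's):

```python
def allreduce_crossover_bytes(P, alpha, beta):
    """Message size n* where the ring AllReduce latency term
    2(P-1)*alpha equals the bandwidth term 2(P-1)/P * n/beta.
    Solving gives n* = P * alpha * beta; below n* you are
    latency-bound, above it bandwidth-bound."""
    return P * alpha * beta

# 8 GPUs, alpha = 5 us, beta = 50 GB/s
n_star = allreduce_crossover_bytes(8, 5e-6, 50e9)
print(f"{n_star / 1e6:.0f} MB")  # 2 MB: buckets below this are latency-bound
```

Buckets well below this size waste time on latency; fusing them toward or beyond \(n^*\) recovers bandwidth.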
AllGather¶
Ring algorithm:
\(T = (P-1) \cdot \alpha + \frac{P-1}{P} \cdot \frac{n}{\beta}\)
Where \(n\) is the total output size (each process contributes \(n/P\)).
ReduceScatter¶
Ring algorithm:
\(T = (P-1) \cdot \alpha + \frac{P-1}{P} \cdot \frac{n}{\beta}\)
Where \(n\) is the total input size (each process outputs \(n/P\)).
All-to-All¶
Pairwise exchange:
\(T = (P-1) \cdot \alpha + \frac{P-1}{P} \cdot \frac{n}{\beta}\)
Where \(n\) is the total data per process (each sends \(n/P\) to every other process).
Summary Table¶
| Collective | Latency Term | Bandwidth Term | Algorithm |
|---|---|---|---|
| Broadcast | \(\log_2 P \cdot \alpha\) | \(\log_2 P \cdot \frac{n}{\beta}\) | Tree |
| Reduce | \(\log_2 P \cdot \alpha\) | \(\log_2 P \cdot \frac{n}{\beta}\) | Tree |
| Scatter | \(\log_2 P \cdot \alpha\) | \(\frac{P-1}{P} \cdot \frac{n}{\beta}\) | Binomial |
| Gather | \(\log_2 P \cdot \alpha\) | \(\frac{P-1}{P} \cdot \frac{n}{\beta}\) | Binomial |
| AllReduce | \(2(P-1) \cdot \alpha\) | \(2 \cdot \frac{P-1}{P} \cdot \frac{n}{\beta}\) | Ring |
| AllGather | \((P-1) \cdot \alpha\) | \(\frac{P-1}{P} \cdot \frac{n}{\beta}\) | Ring |
| ReduceScatter | \((P-1) \cdot \alpha\) | \(\frac{P-1}{P} \cdot \frac{n}{\beta}\) | Ring |
| All-to-All | \((P-1) \cdot \alpha\) | \(\frac{P-1}{P} \cdot \frac{n}{\beta}\) | Pairwise |
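The table rows translate directly into code. A sketch of the cost model as small Python helpers (function names are ours; times in seconds, \(n\) in bytes, \(\beta\) in bytes/s):

```python
from math import log2

def allreduce_ring(n, P, alpha, beta):
    """Ring AllReduce: 2(P-1) latency steps, 2(P-1)/P * n bytes per GPU."""
    return 2 * (P - 1) * alpha + 2 * (P - 1) / P * n / beta

def allgather_ring(n, P, alpha, beta):
    """Ring AllGather: (P-1) steps, (P-1)/P * n bytes per GPU."""
    return (P - 1) * alpha + (P - 1) / P * n / beta

def broadcast_tree(n, P, alpha, beta):
    """Tree broadcast: log2(P) hops, full message per hop."""
    return log2(P) * (alpha + n / beta)

# Example: 100 MB AllReduce on 8 GPUs, alpha = 5 us, beta = 50 GB/s
t = allreduce_ring(100e6, 8, 5e-6, 50e9)
print(f"{t * 1e3:.2f} ms")  # 3.57 ms
```

The same pattern extends to the remaining rows; ReduceScatter and All-to-All share the AllGather shape.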
Algorithmic Bandwidth¶
The algorithmic bandwidth (algbw) measures data movement relative to a naive baseline.
Definition¶
\(\text{algbw} = \frac{n}{T}\)
This is the "effective" bandwidth: how fast data "appears to move" from the application's perspective.
Bus Bandwidth¶
For NCCL, bus bandwidth (busbw) accounts for the fact that data must traverse multiple links:
\(\text{busbw} = \text{algbw} \times \text{correction factor}\)
The correction factor depends on the collective:
| Collective | Correction Factor |
|---|---|
| Broadcast | 1 |
| Reduce | 1 |
| AllReduce | \(\frac{2(P-1)}{P}\) |
| AllGather | \(\frac{P-1}{P}\) |
| ReduceScatter | \(\frac{P-1}{P}\) |
| All-to-All (AlltoAll) | \(\frac{P-1}{P}\) |
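The correction factors above can be folded into a small helper (a sketch; the function name is ours, the convention follows nccl-tests):

```python
def bus_bandwidth(collective, n, T, P):
    """busbw = algbw * correction factor, per the NCCL convention.
    n = message size in bytes, T = measured time in seconds."""
    factor = {
        "broadcast": 1.0,
        "reduce": 1.0,
        "allreduce": 2 * (P - 1) / P,
        "allgather": (P - 1) / P,
        "reducescatter": (P - 1) / P,
        "alltoall": (P - 1) / P,
    }[collective]
    return (n / T) * factor

# 1 GB AllReduce on 8 GPUs in 50 ms: algbw = 20 GB/s, busbw = 35 GB/s
print(f"{bus_bandwidth('allreduce', 1e9, 0.05, 8) / 1e9:.1f} GB/s")  # 35.0 GB/s
```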
Example Calculation¶
AllReduce of 1 GB across 8 GPUs takes 50ms.
Algorithmic bandwidth: \(\text{algbw} = \frac{10^9 \text{ bytes}}{0.05 \text{ s}} = 20\) GB/s
Bus bandwidth: \(\text{busbw} = 20 \times \frac{2(8-1)}{8} = 35\) GB/s
If your NIC is 400 Gbps = 50 GB/s, you're achieving \(35/50 = 70\%\) of peak—good performance!
Hierarchical Cost Analysis¶
Real clusters have network hierarchy. A DGX cluster might have:
- Intra-node: 8 GPUs connected via NVLink (900 GB/s per GPU)
- Inter-node: 8× 400 Gbps NICs (400 GB/s per node)
Two-Level AllReduce Model¶
For \(G\) GPUs per node, \(N\) nodes (\(P = GN\) total):
Phase 1: Intra-node ReduceScatter: \(T_1 = (G-1) \cdot \alpha_{\text{NV}} + \frac{G-1}{G} \cdot \frac{n}{\beta_{\text{NV}}}\)
Each GPU now holds \(n/G\) of the partial result.
Phase 2: Inter-node AllReduce: \(T_2 = 2(N-1) \cdot \alpha_{\text{net}} + 2 \cdot \frac{N-1}{N} \cdot \frac{n/G}{\beta_{\text{net}}}\)
Only \(n/G\) bytes cross the network per GPU!
Phase 3: Intra-node AllGather: \(T_3 = (G-1) \cdot \alpha_{\text{NV}} + \frac{G-1}{G} \cdot \frac{n}{\beta_{\text{NV}}}\)
Total: \(T_{\text{hier}} = T_1 + T_2 + T_3\)
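The three phases compose into one function. A sketch under the same assumptions (the function name is ours):

```python
def hierarchical_allreduce(n, G, N, a_nv, b_nv, a_net, b_net):
    """Three-phase hierarchical AllReduce: intra-node ReduceScatter,
    inter-node AllReduce on n/G bytes, intra-node AllGather."""
    t1 = (G - 1) * a_nv + (G - 1) / G * n / b_nv                   # ReduceScatter
    t2 = 2 * (N - 1) * a_net + 2 * (N - 1) / N * (n / G) / b_net   # AllReduce
    t3 = (G - 1) * a_nv + (G - 1) / G * n / b_nv                   # AllGather
    return t1 + t2 + t3

# 8 nodes x 8 GPUs, 2 GB, NVLink 300 GB/s, network 50 GB/s
t = hierarchical_allreduce(2e9, 8, 8, 1e-6, 300e9, 5e-6, 50e9)
print(f"{t * 1e3:.1f} ms")  # 20.5 ms
```

Plugging in the parameters of the numerical example below reproduces its total up to rounding.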
Numerical Example¶
Setup: 8 nodes × 8 GPUs = 64 GPUs, AllReduce 2 GB
Parameters:
- \(\alpha_{\text{NV}} = 1 \mu s\), \(\beta_{\text{NV}} = 300\) GB/s (NVSwitch)
- \(\alpha_{\text{net}} = 5 \mu s\), \(\beta_{\text{net}} = 50\) GB/s (400 GbE)
Phase 1 (intra-node RS): \(T_1 = 7 \times 10^{-6} + \frac{7}{8} \times \frac{2 \times 10^9}{3 \times 10^{11}} \approx 5.84\) ms
Phase 2 (inter-node AR of 250 MB per GPU): \(T_2 = 2 \times 7 \times 5 \times 10^{-6} + 2 \times \frac{7}{8} \times \frac{2.5 \times 10^8}{5 \times 10^{10}} \approx 8.82\) ms
Phase 3 (intra-node AG): \(T_3 = T_1 \approx 5.84\) ms
Total hierarchical: \(T_{\text{hier}} \approx 5.84 + 8.82 + 5.84 = 20.5\) ms
Compare to flat ring (64 GPUs, bottleneck is the network): \(T_{\text{flat}} = 2 \times 63 \times 5 \times 10^{-6} + 2 \times \frac{63}{64} \times \frac{2 \times 10^9}{5 \times 10^{10}} \approx 79.4\) ms
Hierarchical is 3.9× faster because only ⅛ of the data crosses the slow network!
The Efficiency Gap¶
In practice, achieved performance is less than theoretical. The gap comes from:
1. Software Overhead¶
NCCL, driver, and kernel-launch overhead add latency beyond \(\alpha\): the effective per-operation latency is \(\alpha + t_{\text{sw}}\). Typical software overhead: 10–50 μs per operation.
2. Protocol Overhead¶
Headers, checksums, and retransmissions consume link capacity, so the effective \(\beta\) is lower than the raw rate. Typical overhead: 5–15% of raw bandwidth.
3. Contention¶
Multiple collectives, or traffic from other jobs, contend for the same links. Shared clusters often see 30–50% bandwidth reduction.
4. Memory Copy¶
Staging data from GPU memory to NIC buffers adds copy time. For 1 GB over PCIe Gen5 (~64 GB/s): \(\frac{10^9}{64 \times 10^9} \approx 16\) ms
5. Synchronization¶
The BSP model requires all processes to wait for the slowest. Straggler effects typically add 5–20%.
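Taken together, these effects motivate de-rating the ideal prediction when planning. A trivial sketch (the helper name and the 75% default are our assumptions, in line with the typical 70–90% guidance above):

```python
def derated_time(t_model, efficiency=0.75):
    """Scale an ideal alpha-beta prediction by an empirical efficiency
    factor covering software, protocol, contention, and copy overhead."""
    return t_model / efficiency

# An ideal 15 ms AllReduce is better planned as ~20 ms
print(f"{derated_time(0.015) * 1e3:.0f} ms")  # 20 ms
```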
Measuring Parameters¶
Latency (\(\alpha\))¶
Measure round-trip time for tiny messages:
```python
import time

import torch
import torch.distributed as dist

def measure_alpha(iterations=1000):
    tensor = torch.zeros(1, device='cuda')
    # Warmup
    for _ in range(100):
        dist.all_reduce(tensor)
    # Measure
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # For ring AllReduce: T ≈ 2(P-1)α for tiny messages
    P = dist.get_world_size()
    alpha = elapsed / (iterations * 2 * (P - 1))
    return alpha  # seconds
```
Bandwidth (\(\beta\))¶
Measure throughput for large messages:
```python
def measure_beta(size_gb=1.0, iterations=10):
    size = int(size_gb * 1e9 / 4)  # float32 elements
    tensor = torch.zeros(size, device='cuda')
    # Warmup
    for _ in range(3):
        dist.all_reduce(tensor)
    # Measure
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iterations):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # For ring AllReduce: T ≈ 2(P-1)/P * n/β for large messages
    P = dist.get_world_size()
    n = size_gb * 1e9  # bytes
    factor = 2 * (P - 1) / P
    beta = factor * n * iterations / elapsed
    return beta  # bytes/second
```
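Rather than trusting two single-size measurements, a size sweep lets you fit both parameters at once by linear regression on \(T = 2(P-1)\alpha + \frac{2(P-1)}{P} \cdot \frac{n}{\beta}\). A sketch (the helper name is ours):

```python
import numpy as np

def fit_alpha_beta(sizes, times, P):
    """Recover alpha and beta from ring-AllReduce timings T(n) by fitting
    the line T = c0 + c1*n, where c0 = 2(P-1)*alpha and c1 = 2(P-1)/(P*beta)."""
    c1, c0 = np.polyfit(np.asarray(sizes, float), np.asarray(times, float), 1)
    alpha = c0 / (2 * (P - 1))
    beta = (2 * (P - 1) / P) / c1
    return alpha, beta
```

Sweeping \(n\) from ~1 KB to several GB and fitting validates both measurements against a single model, and the residuals reveal protocol switches or contention.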
NCCL-tests¶
The standard tool for collective benchmarking:
```shell
# Build
git clone https://github.com/NVIDIA/nccl-tests.git
cd nccl-tests && make

# Run AllReduce benchmark
./build/all_reduce_perf -b 1M -e 1G -f 2 -g 8

# Output interpretation:
#   busbw: bus bandwidth (accounting for algorithm)
#   algbw: algorithmic bandwidth (raw n/T)
```
Predicting Training Communication Time¶
For a training step, total communication includes:
Data Parallel AllReduce¶
\(T_{\text{DP}} = 2(P-1) \cdot \alpha + 2 \cdot \frac{P-1}{P} \cdot \frac{n_{\text{grad}}}{\beta}\)
Where \(n_{\text{grad}}\) = total gradient size (≈ model parameters × 4 bytes for FP32).
ZeRO-3 AllGather + ReduceScatter¶
Per layer, the parameter shard is AllGathered twice (forward + backward), and the gradients are ReduceScattered once:
\(T_{\text{ZeRO}} \approx \sum_{\text{layers}} \left[ 2\,T_{\text{AllGather}}(n_{\text{layer}}) + T_{\text{ReduceScatter}}(n_{\text{layer}}) \right]\)
Tensor Parallel AllReduce¶
Per transformer layer (2 AllReduce in forward, 2 in backward):
\(T_{\text{TP}} = 4 \left[ 2(P-1) \cdot \alpha + 2 \cdot \frac{P-1}{P} \cdot \frac{n_{\text{act}}}{\beta} \right]\)
Pipeline Parallel Send/Recv¶
Per micro-batch, per stage boundary:
\(T_{\text{PP}} = \alpha + \frac{n_{\text{act}}}{\beta}\)
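Combining the components gives a rough per-step estimator. A sketch with our own function name, using the ring formulas for the AllReduce terms and assuming TP runs over NVLink while DP and PP use the network:

```python
def step_comm_time(n_grad, n_act, layers, microbatches,
                   dp, tp, alpha_nv, beta_nv, alpha_net, beta_net):
    """Rough per-training-step communication estimate (seconds):
    DP gradient AllReduce over the network, TP activation AllReduce
    over NVLink (4 per layer), PP activation send/recv per micro-batch
    (forward + backward)."""
    ring = lambda n, P, a, b: 2 * (P - 1) * a + 2 * (P - 1) / P * n / b
    t_dp = ring(n_grad, dp, alpha_net, beta_net)
    t_tp = 4 * layers * ring(n_act, tp, alpha_nv, beta_nv)
    t_pp = 2 * microbatches * (alpha_net + n_act / beta_net)
    return t_dp, t_tp, t_pp

# Parameters matching the 70B worked example below
t_dp, t_tp, t_pp = step_comm_time(17.5e9, 64e6, 80, 8, 8, 8,
                                  1e-6, 300e9, 5e-6, 50e9)
print(f"DP {t_dp*1e3:.0f} ms, TP {t_tp*1e3:.0f} ms, PP {t_pp*1e3:.0f} ms")
```

Run with the worked-example parameters, it reproduces the hand-computed DP/TP/PP totals to within rounding.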
Worked Example: 70B Model¶
Setup: 70B parameters, FP16, 512 GPUs with 8-way TP, 8-way DP, 8-way PP
Parameters:
- Gradient size: 70B × 2 bytes = 140 GB (but sharded 8×, so 17.5 GB per DP group)
- TP activation: 8192 × 4096 × 2 = 64 MB per layer
- PP activation: 8192 × 4096 × 2 = 64 MB per micro-batch
- 80 layers, 8 micro-batches
TP communication (8 GPUs, NVLink β = 300 GB/s): per AllReduce \(2 \times \frac{7}{8} \times \frac{64 \times 10^6}{3 \times 10^{11}} \approx 0.37\) ms; with \(4 \times 80 = 320\) operations plus latency terms, ≈ 123 ms
DP communication (8 GPUs, network β = 50 GB/s): \(2 \times \frac{7}{8} \times \frac{17.5 \times 10^9}{5 \times 10^{10}} \approx 613\) ms
PP communication (8 stages, 8 μbatches): \(8 \times 2 \times \frac{64 \times 10^6}{5 \times 10^{10}} \approx 21\) ms
Total communication estimate: 123 + 613 + 21 = 757 ms
If compute time is 1500 ms, communication overhead is 50%—needs optimization!
Optimization Strategies¶
Given the cost model, we can reason about optimizations:
Overlap Communication and Compute¶
Hide \(T_{\text{comm}}\) behind \(T_{\text{compute}}\): instead of running backward, then AllReduce, then the optimizer step sequentially, launch each bucket's AllReduce as soon as its gradients are ready, so communication runs concurrently with the remaining backward compute.
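The benefit is easy to quantify: only the non-overlapped fraction of communication adds to the step time. A minimal sketch (the function name is ours; it assumes compute is long enough to hide the overlapped part):

```python
def step_time(t_compute, t_comm, overlap_frac):
    """Step time when overlap_frac of communication is hidden behind
    compute; only the exposed remainder extends the critical path."""
    exposed = (1 - overlap_frac) * t_comm
    return t_compute + exposed

# 2000 ms compute, 600 ms comm, 80% overlap -> 2120 ms instead of 2600 ms
print(f"{step_time(2.0, 0.6, 0.8):.2f} s")  # 2.12 s
```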
Reduce Message Size¶
- Gradient compression: Reduce \(n\) by 10-100×
- Mixed precision: FP16 vs FP32 halves \(n\)
- Gradient accumulation: Fewer AllReduce calls
Increase Effective Bandwidth¶
- Better algorithms: Ring vs tree selection
- Hierarchical collectives: Exploit topology
- Bucket fusion: Amortize latency
Reduce Parallelism Degree¶
If \(P\) is large and messages are small, the latency term \(2(P-1) \cdot \alpha\) dominates. Consider a smaller DP group and more TP/PP.
Validation: Predict vs Measure¶
Always validate your model against measurements.
Methodology¶
- Measure α and β on your specific cluster
- Predict time using formulas
- Measure actual time with profiler
- Compute error: \(\frac{|T_{\text{pred}} - T_{\text{actual}}|}{T_{\text{actual}}}\)
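Step 4 is one line of code, but worth standardizing so runs are comparable (the helper name is ours):

```python
def model_error(t_pred, t_actual):
    """Relative error of the alpha-beta prediction against measurement."""
    return abs(t_pred - t_actual) / t_actual

# Predicted 18 ms, measured 20 ms -> 10%: useful for planning
print(f"{model_error(0.018, 0.020):.0%}")  # 10%
```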
Acceptable Error¶
- < 10%: Excellent model fit
- 10–30%: Useful for planning
- > 30%: Model assumptions violated
Common Causes of Large Error¶
- Contention: Other jobs on shared network
- Wrong topology: α, β measured on different path
- Protocol effects: Small message protocol overhead
- GPU memory pressure: Thrashing affects copy bandwidth
Next
This chapter assumed a uniform network. The next chapter, Topology-Aware Collectives, extends the model to hierarchical topologies and contention-aware scheduling.
Exercises¶
1. Formula verification: For P=16, n=100 MB, α=10μs, β=100 GB/s, calculate:
   - AllReduce time (ring algorithm)
   - AllGather time
   - Speedup of hierarchical (4 nodes × 4 GPUs) vs flat
Solution
Given parameters:
- P = 16 GPUs
- n = 100 MB = 10⁸ bytes
- α = 10 μs = 10⁻⁵ s
- β = 100 GB/s = 10¹¹ bytes/s
Part 1: AllReduce time (ring algorithm)
Using the ring AllReduce formula: \(T = 2(P-1) \cdot \alpha + 2 \cdot \frac{P-1}{P} \cdot \frac{n}{\beta}\)
Latency term: \(2 \times 15 \times 10^{-5} = 0.3\) ms
Bandwidth term: \(2 \times \frac{15}{16} \times \frac{10^8}{10^{11}} = 1.875\) ms
Total: \(T = 0.3 + 1.875 = 2.175\) ms
Part 2: AllGather time
Using the ring AllGather formula: \(T = 15 \times 10^{-5} + \frac{15}{16} \times \frac{10^8}{10^{11}} = 0.15 + 0.9375 \approx 1.09\) ms
Part 3: Hierarchical (4×4) vs Flat speedup
Flat ring (P=16): Already calculated: T_flat = 2.175 ms
Hierarchical (4 nodes × 4 GPUs per node):
Assuming same α and β for simplicity (in practice, NVLink would be faster intra-node):
Phase 1: Intra-node ReduceScatter (G=4): \(T_1 = (4-1) \times 10^{-5} + \frac{3}{4} \times \frac{10^8}{10^{11}} = 0.03 + 0.75 = 0.78 \text{ ms}\)
Phase 2: Inter-node AllReduce (N=4, data = n/G = 25 MB): \(T_2 = 2(4-1) \times 10^{-5} + 2 \times \frac{3}{4} \times \frac{2.5 \times 10^7}{10^{11}} = 0.06 + 0.375 = 0.435 \text{ ms}\)
Phase 3: Intra-node AllGather (G=4): \(T_3 = (4-1) \times 10^{-5} + \frac{3}{4} \times \frac{10^8}{10^{11}} = 0.78 \text{ ms}\)
Total: \(T_{\text{hier}} = 0.78 + 0.435 + 0.78 = 1.995 \text{ ms}\)
Speedup: \(\text{Speedup} = \frac{T_{\text{flat}}}{T_{\text{hier}}} = \frac{2.175}{1.995} = \boxed{1.09\times}\)
Note: With uniform α and β, hierarchical shows modest speedup. The real benefit appears when network β is much slower than NVLink β (e.g., 50 GB/s vs 300 GB/s), giving speedups of 3-4×.
2. Efficiency analysis: Your 8-GPU AllReduce of 1 GB takes 80 ms. Network is 400 Gbps. Calculate:
   - Algorithmic bandwidth
   - Bus bandwidth
   - Efficiency vs theoretical peak
Solution
Given:
- P = 8 GPUs
- n = 1 GB = 10⁹ bytes
- T = 80 ms = 0.08 s
- Network = 400 Gbps = 50 GB/s
Part 1: Algorithmic bandwidth
\(\text{algbw} = \frac{n}{T} = \frac{10^9}{0.08} = 12.5 \text{ GB/s}\)
Part 2: Bus bandwidth
For AllReduce, the correction factor is \(\frac{2(P-1)}{P} = \frac{14}{8} = 1.75\):
\(\text{busbw} = 12.5 \times 1.75 = 21.875 \text{ GB/s}\)
Part 3: Efficiency vs theoretical peak
The theoretical peak is the network bandwidth:
\(\text{efficiency} = \frac{21.875}{50} = 43.75\%\)
Analysis:
| Metric | Value |
|---|---|
| Algorithmic bandwidth | 12.5 GB/s |
| Bus bandwidth | 21.875 GB/s |
| Peak bandwidth | 50 GB/s |
| Efficiency | 43.75% |
This efficiency is concerning. Possible causes:
- Software overhead: NCCL kernel launch, synchronization
- Protocol overhead: Headers, checksums reducing effective bandwidth
- Contention: Other traffic on shared network
- Memory copies: PCIe bottleneck between GPU and NIC
Expected efficiency for well-optimized systems: 70-85%
To investigate, run NCCL-tests with varying message sizes to see if the issue is latency-bound (small messages) or bandwidth-bound (large messages).
3. Training time prediction: A 13B model has:
   - 4B parameters in attention (TP)
   - 9B parameters in FFN (TP)
   - Sequence length 4096, hidden 5120, batch 2M tokens
   - 64 GPUs: 8 TP × 8 DP
   Estimate communication time per step. Which parallelism dominates?
Solution
Setup:
- 13B total parameters (4B attention + 9B FFN)
- TP = 8, DP = 8 (64 total GPUs)
- Sequence S = 4096, Hidden H = 5120
- Batch = 2M tokens → micro-batch per DP replica = 2M/8 = 250K tokens
- Per-GPU batch: B = 250K / 4096 ≈ 61 sequences
Assume:
- NVLink β = 300 GB/s (intra-node for TP)
- Network β = 50 GB/s (inter-node for DP)
- α_NV = 1 μs, α_net = 5 μs
Part 1: Tensor Parallel Communication
TP requires AllReduce of activations within each TP group.
Activation size per layer: \(250\text{K tokens} \times 5120 \times 2 \text{ bytes} \approx 2.56 \text{ GB}\)
Per transformer layer: 4 AllReduce operations (2 forward, 2 backward)
Each AllReduce (ring, P=8, NVLink): \(2 \times \frac{7}{8} \times \frac{2.56 \times 10^9}{3 \times 10^{11}} \approx 14.9 \text{ ms}\)
Assuming ~40 layers: \(4 \times 40 \times 14.9 \approx 2{,}384 \text{ ms}\)
Part 2: Data Parallel Communication
DP requires AllReduce of gradients across DP groups.
Gradient size: 13B × 2 bytes (FP16) = 26 GB total
But with TP=8, each TP group only has 26/8 = 3.25 GB of unique gradients to sync.
AllReduce across DP=8 groups (network): \(2 \times \frac{7}{8} \times \frac{3.25 \times 10^9}{5 \times 10^{10}} \approx 114 \text{ ms}\)
Summary:
| Parallelism | Communication Time | Percentage |
|---|---|---|
| Tensor Parallel | 2,384 ms | 95.4% |
| Data Parallel | 114 ms | 4.6% |
| Total | 2,498 ms | 100% |
Analysis:
TP dominates because:
1. Activation tensors are large (2.56 GB per AllReduce)
2. 4 AllReduce ops per layer × 40 layers = 160 operations
3. Even with fast NVLink, the sheer volume is massive
Optimizations to consider:
- Sequence parallelism: Reduce activation size per GPU
- Activation checkpointing: Trade compute for memory (doesn't help communication directly)
- Reduce TP degree: TP=4 instead of TP=8 would halve TP communication, but increase per-GPU memory
- Overlap TP AllReduce with compute: Use async collectives
4. Hierarchical benefit: You're choosing between:
   - 64 GPUs: 8 nodes × 8 GPUs
   - 64 GPUs: 16 nodes × 4 GPUs
   For a 4 GB AllReduce with NVLink 300 GB/s and network 50 GB/s, which is faster?
Solution
Given:
- n = 4 GB = 4 × 10⁹ bytes
- β_NV = 300 GB/s (NVLink, intra-node)
- β_net = 50 GB/s (network, inter-node)
- α_NV = 1 μs, α_net = 5 μs
Configuration A: 8 nodes × 8 GPUs (G=8, N=8)
Phase 1: Intra-node ReduceScatter (G=8): \(T_1 = (8-1) \times 10^{-6} + \frac{7}{8} \times \frac{4 \times 10^9}{3 \times 10^{11}} \approx 11.67 \text{ ms}\)
Phase 2: Inter-node AllReduce (N=8, data = 4 GB/8 = 500 MB per GPU): \(T_2 = 2(8-1) \times 5 \times 10^{-6} + 2 \times \frac{7}{8} \times \frac{5 \times 10^8}{5 \times 10^{10}} = 0.07 + 17.5 = 17.57 \text{ ms}\)
Phase 3: Intra-node AllGather (G=8): \(T_3 = T_1 \approx 11.67 \text{ ms}\)
Configuration B: 16 nodes × 4 GPUs (G=4, N=16)
Phase 1: Intra-node ReduceScatter (G=4): \(T_1 = (4-1) \times 10^{-6} + \frac{3}{4} \times \frac{4 \times 10^9}{3 \times 10^{11}} \approx 10 \text{ ms}\)
Phase 2: Inter-node AllReduce (N=16, data = 4 GB/4 = 1 GB per GPU): \(T_2 = 2(16-1) \times 5 \times 10^{-6} + 2 \times \frac{15}{16} \times \frac{10^9}{5 \times 10^{10}} = 0.15 + 37.5 = 37.65 \text{ ms}\)
Phase 3: Intra-node AllGather (G=4): \(T_3 \approx 10 \text{ ms}\)
Comparison:
| Configuration | Phase 1 | Phase 2 | Phase 3 | Total |
|---|---|---|---|---|
| 8×8 (A) | 11.67 ms | 17.57 ms | 11.67 ms | 40.91 ms |
| 16×4 (B) | 10 ms | 37.65 ms | 10 ms | 57.65 ms |
Why 8×8 wins:
1. More GPUs per node → less data crosses the slow network
   - 8×8: 500 MB per GPU crosses the network
   - 16×4: 1 GB per GPU crosses the network (2× more!)
2. Fewer nodes → fewer inter-node AllReduce steps
   - 8×8: N=8 → 2×7 = 14 latency steps
   - 16×4: N=16 → 2×15 = 30 latency steps
Key insight: Maximize GPUs per node to minimize network traffic. The network is the bottleneck, so keeping more computation intra-node pays off.
5. Overlap potential: Compute takes 2000 ms, communication takes 600 ms. If you can overlap 80% of communication, what's the speedup?
Solution
Given:
- \(T_{\text{compute}} = 2000\) ms
- \(T_{\text{comm}} = 600\) ms
- Overlap fraction = 80%
Without overlap (sequential execution): \(T = 2000 + 600 = 2600 \text{ ms}\)
With 80% overlap: the overlapped portion runs concurrently with compute; only the non-overlapped portion adds to total time: \(T = 2000 + 0.2 \times 600 = 2120 \text{ ms}\)
Speedup: \(\frac{2600}{2120} \approx 1.23\times\)
Alternative view, time saved: \(0.8 \times 600 = 480 \text{ ms}\)
Analysis:
| Scenario | Total Time | Speedup |
|---|---|---|
| No overlap (0%) | 2600 ms | 1.00× |
| 80% overlap | 2120 ms | 1.23× |
| 100% overlap | 2000 ms | 1.30× |
Theoretical maximum speedup (perfect overlap): \(\frac{2600}{2000} = 1.30\times\)
We achieve \(\frac{1.23 - 1.00}{1.30 - 1.00} = 77\%\) of the theoretical maximum benefit.
Practical considerations:
- Overlap techniques: Gradient bucketing, async AllReduce, pipelining
- Why not 100%?: Some operations require synchronization (e.g., first/last layers, optimizer step)
- Memory trade-off: Overlapping requires buffering gradients during communication
6. Parameter measurement: Design an experiment to separately measure:
   - GPU-GPU latency (same node)
   - GPU-GPU latency (different nodes)
   - NVLink bandwidth
   - Network bandwidth
Solution
Experiment Design for α-β Parameter Measurement
The key insight: use tiny messages to isolate latency (α) and large messages to isolate bandwidth (β).
Experiment 1: GPU-GPU Latency (Same Node) - α_NV
```python
import time

import torch
import torch.distributed as dist

def measure_intra_node_latency(iterations=10000):
    """Measure NVLink latency using tiny AllReduce."""
    # Tiny tensor - bandwidth term negligible
    tensor = torch.zeros(1, device='cuda')
    # Warmup
    for _ in range(100):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    # Measure
    start = time.perf_counter()
    for _ in range(iterations):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # For ring AllReduce: T ≈ 2(P-1)α when n→0
    P = dist.get_world_size()  # Should be GPUs on same node
    alpha_nv = elapsed / (iterations * 2 * (P - 1))
    return alpha_nv * 1e6  # Return in microseconds
```
Run configuration: Single node, all 8 GPUs, NCCL with NVLink only.
Expected result: α_NV ≈ 1-5 μs
Experiment 2: GPU-GPU Latency (Different Nodes) - α_net
```python
def measure_inter_node_latency(iterations=10000):
    """Measure network latency using tiny AllReduce across nodes."""
    # Use ONLY one GPU per node to isolate network latency
    tensor = torch.zeros(1, device='cuda')
    # Warmup
    for _ in range(100):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    # Measure
    start = time.perf_counter()
    for _ in range(iterations):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    N = dist.get_world_size()  # Number of nodes
    alpha_net = elapsed / (iterations * 2 * (N - 1))
    return alpha_net * 1e6  # microseconds
```
Run configuration: 1 GPU per node, multiple nodes, network only.
Expected result: α_net ≈ 5-20 μs (depends on network topology)
Experiment 3: NVLink Bandwidth - β_NV
```python
def measure_nvlink_bandwidth(size_gb=2.0, iterations=20):
    """Measure NVLink bandwidth using large AllReduce."""
    size = int(size_gb * 1e9 / 4)  # float32 elements
    tensor = torch.zeros(size, device='cuda')
    # Warmup
    for _ in range(3):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    # Measure
    start = time.perf_counter()
    for _ in range(iterations):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    # For large n: T ≈ 2(P-1)/P × n/β (latency negligible)
    P = dist.get_world_size()
    n = size_gb * 1e9  # bytes
    factor = 2 * (P - 1) / P
    # β = factor × n × iterations / elapsed
    beta_nv = factor * n * iterations / elapsed
    return beta_nv / 1e9  # Return in GB/s
```
Run configuration: Single node, all 8 GPUs, NVLink only.
Expected result: β_NV ≈ 200-300 GB/s per GPU (NVSwitch)
Experiment 4: Network Bandwidth - β_net
```python
def measure_network_bandwidth(size_gb=4.0, iterations=10):
    """Measure network bandwidth using large AllReduce across nodes."""
    size = int(size_gb * 1e9 / 4)
    tensor = torch.zeros(size, device='cuda')
    # Warmup
    for _ in range(3):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    # Measure
    start = time.perf_counter()
    for _ in range(iterations):
        dist.all_reduce(tensor)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    N = dist.get_world_size()
    n = size_gb * 1e9
    factor = 2 * (N - 1) / N
    beta_net = factor * n * iterations / elapsed
    return beta_net / 1e9  # GB/s
```
Run configuration: 1 GPU per node (to avoid NVLink contribution), multiple nodes.
Expected result: β_net ≈ 40-50 GB/s per GPU (400 GbE)
Complete Measurement Protocol
| Parameter | Configuration | Message Size | Iterations |
|---|---|---|---|
| α_NV | 8 GPUs, 1 node | 4 bytes | 10,000 |
| α_net | 1 GPU/node, N nodes | 4 bytes | 10,000 |
| β_NV | 8 GPUs, 1 node | 2 GB | 20 |
| β_net | 1 GPU/node, N nodes | 4 GB | 10 |
Validation steps:
- Consistency check: Run each experiment 5 times, report mean ± std
- Size sweep: Vary message size from 1KB to 10GB, plot T vs n
- Fit α-β model: Linear regression on T = α + n/β
- Compare to spec: NVLink should be ~6× network bandwidth
Example results table:
| Parameter | Measured | Expected | Status |
|---|---|---|---|
| α_NV | 2.3 μs | 1-5 μs | ✓ |
| α_net | 8.7 μs | 5-20 μs | ✓ |
| β_NV | 285 GB/s | 250-300 GB/s | ✓ |
| β_net | 47 GB/s | 40-50 GB/s | ✓ |
Key Takeaways¶
1. Formulas exist for all collectives: Use the α-β model with algorithm-specific factors.
2. Bus bandwidth normalizes comparisons: It accounts for algorithmic data amplification.
3. Hierarchy dramatically reduces network load: A factor-of-G (GPUs per node) reduction.
4. Real systems have overhead: Plan for 70-80% of theoretical peak.
5. Predict, then measure: Validate your model on actual infrastructure.
6. Communication often dominates at scale: Understanding costs enables optimization.