Topology-Aware Collectives
Collectives are algebraic. Networks are not. The same AllReduce can be optimal on one topology and terrible on another. Topology-aware collectives reconcile these two realities by choosing algorithms that respect the hardware hierarchy.
The Question: You have 8 GPUs per node, 512 nodes, and a 10GB gradient. Your inter-node links are 10x slower than NVLink. Which AllReduce should you run, and why does the optimal algorithm change with topology?
Chapter Map
Prerequisites: Chapter 13 (collective cost formulas)
Key insight: Algebraic properties give freedom to regroup reductions, but topology decides which regrouping is optimal. A topology-aware model replaces the single \(\alpha, \beta\) with per-level costs and a congestion factor.
The Non-Uniform Network
The cost model in the previous chapter assumed a single, uniform network. Real clusters are hierarchical:
- Intra-node: GPUs connected by NVLink/NVSwitch (very high bandwidth, low latency)
- Inter-node: NICs and switches (lower bandwidth, higher latency)
This hierarchy turns one logical collective into multiple physical collectives at different levels.
```mermaid
flowchart TB
    subgraph nodeA["Node A (8 GPUs)"]
        A1((GPU)) --- A2((GPU)) --- A3((GPU)) --- A4((GPU))
    end
    subgraph nodeB["Node B (8 GPUs)"]
        B1((GPU)) --- B2((GPU)) --- B3((GPU)) --- B4((GPU))
    end
    nodeA ---|NIC| spine((Network Spine))
    nodeB ---|NIC| spine
```
Algebra Gives Freedom, Topology Sets Cost
Associativity and commutativity let you regroup reductions:

\[
(g_1 \oplus g_2) \oplus (g_3 \oplus g_4) = ((g_1 \oplus g_2) \oplus g_3) \oplus g_4
\]

You can reduce within nodes first, then across nodes, or the reverse. Both are correct. Only one is fast.
A Topology-Aware Cost Model
Represent the cluster as a graph. Each edge \(e\) has:
- Bandwidth \(B_e\) (bytes/s)
- Latency \(\alpha_e\) (s)
For any collective schedule, define the edge load \(L_e\) as the total bytes sent over edge \(e\).
Bandwidth lower bound:

\[
T_{\text{bw}} \;\ge\; \max_e \frac{L_e}{B_e}
\]

Latency term (for a schedule with \(S\) sequential steps, each paying the latency of its slowest edge):

\[
T_{\text{lat}} \;\approx\; S \cdot \max_e \alpha_e
\]

This makes the key idea explicit: the slowest, most loaded edge dominates the runtime.
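The two bounds can be checked numerically. A minimal Python sketch, with illustrative names (`bandwidth_lower_bound`, the edge dictionaries) that are not from any library:

```python
# Sketch of the edge-load bounds above; all names and numbers are illustrative.

def bandwidth_lower_bound(edge_loads, edge_bandwidths):
    """T_bw >= max_e L_e / B_e, in seconds."""
    return max(load / edge_bandwidths[e] for e, load in edge_loads.items())

def latency_term(num_steps, edge_latencies):
    """S steps, each paying the latency of its slowest edge."""
    return num_steps * max(edge_latencies.values())

# Two-node toy cluster: the inter-node edge is far more loaded per unit bandwidth.
loads = {"inter": 6e9, "intra_a": 2e9, "intra_b": 2e9}      # bytes
bws = {"inter": 50e9, "intra_a": 600e9, "intra_b": 600e9}   # bytes/s
alphas = {"inter": 5e-6, "intra_a": 1e-6, "intra_b": 1e-6}  # seconds

print(bandwidth_lower_bound(loads, bws))  # 0.12 -- the inter-node edge dominates
print(latency_term(6, alphas))            # six steps at inter-node latency
```

Note that the intra-node edges carry a third of the inter-node load but contribute only 2/600 of a second per GB: the bound is set entirely by the slow edge.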
Contention Factor

If multiple collectives share the same edges, effective bandwidth drops. Model this with a contention factor \(C \ge 1\):

\[
B_e^{\text{eff}} = \frac{B_e}{C}
\]
When \(C\) grows, topology-aware algorithms that reduce load on slow edges become essential.
Deriving Hierarchical AllReduce
Associativity lets us reduce locally before going global. For \(G\) GPUs per node and \(N\) nodes (\(P = GN\)):
- Intra-node ReduceScatter
- Inter-node AllReduce
- Intra-node AllGather
With ring algorithms at each level:

\[
T_{\text{hier}} = \underbrace{2(G-1)\alpha_{\text{intra}} + 2\frac{G-1}{G}\frac{n}{\beta_{\text{intra}}}}_{\text{intra-node}}
\;+\; \underbrace{2(N-1)\alpha_{\text{inter}} + 2\frac{N-1}{N}\frac{n/G}{\beta_{\text{inter}}}}_{\text{inter-node}}
\]

The inter-node bandwidth term is smaller by a factor of \(G\), because only \(n/G\) bytes per GPU leave each node.
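The three-step schedule above translates directly into a cost function. A sketch in the α-β model; the function names are illustrative, not a real API:

```python
# Alpha-beta cost of a ring AllReduce and its hierarchical variant.

def ring_allreduce(p, n, alpha, beta):
    """Ring AllReduce of n bytes over p ranks (ReduceScatter + AllGather)."""
    return 2 * (p - 1) * alpha + 2 * (p - 1) / p * n / beta

def hier_allreduce(G, N, n, a_intra, b_intra, a_inter, b_inter):
    """Intra-node ReduceScatter, inter-node AllReduce on n/G, intra-node AllGather."""
    intra = 2 * (G - 1) * a_intra + 2 * (G - 1) / G * n / b_intra
    inter = ring_allreduce(N, n / G, a_inter, b_inter)  # only the reduced shard
    return intra + inter

# 8 GPUs/node, 8 nodes, 10 GB gradient, NVLink-class intra / NIC-class inter
t = hier_allreduce(8, 8, 10e9, 1e-6, 600e9, 5e-6, 50e9)
print(round(t * 1e3, 1))  # 73.0 (milliseconds)
```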
Generalizing to L Levels

If the topology has multiple levels (e.g., GPU, socket, node, rack), apply the same logic repeatedly. With fanout \(P_l\), latency \(\alpha_l\), and bandwidth \(\beta_l\) at level \(l\) (innermost first):

\[
T \;\approx\; \sum_{l=1}^{L} \left[ 2(P_l - 1)\alpha_l + 2\frac{P_l - 1}{P_l} \cdot \frac{n / \prod_{k<l} P_k}{\beta_l} \right]
\]

Each level reduces the data before the next, shrinking traffic on the slowest links.
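The repeated-application idea fits in a short loop. A sketch under assumed per-level numbers (the rack-level latency and bandwidth below are made up for illustration):

```python
# L-level AllReduce cost: ring at each level, message shrinking by each fanout.

def multilevel_allreduce(levels, n):
    """levels: list of (fanout, alpha, beta) tuples, innermost level first."""
    total, msg = 0.0, float(n)
    for p, alpha, beta in levels:
        total += 2 * (p - 1) * alpha + 2 * (p - 1) / p * msg / beta
        msg /= p  # only the reduced shard continues to the next level
    return total

# GPU -> node -> rack for a 10 GB payload (rack numbers are assumptions)
levels = [(8, 1e-6, 600e9), (4, 5e-6, 50e9), (4, 10e-6, 25e9)]
print(multilevel_allreduce(levels, 10e9))  # total seconds across three levels
```

With a two-entry `levels` list this reduces to the hierarchical formula above, which is a useful consistency check.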
Congestion Turns Into a Design Constraint

Oversubscribed networks create bottlenecks. Define the oversubscription ratio:

\[
\sigma = \frac{\text{aggregate injection bandwidth into a switch}}{\text{uplink bandwidth out of that switch}}
\]
When \(\sigma > 1\), any algorithm that sends full-volume traffic across the spine will be contention-bound. Hierarchical collectives avoid this by reducing data volume before it hits the bottleneck.
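A quick way to see the effect is to fold \(\sigma\) into the inter-node bandwidth. The NIC and uplink numbers below are assumptions for illustration, not any vendor's spec:

```python
# Oversubscription ratio and its worst-case effect on inter-node bandwidth.

def oversubscription(nic_bw, nodes_per_leaf, uplink_bw):
    """sigma = aggregate injection bandwidth / uplink bandwidth."""
    return nic_bw * nodes_per_leaf / uplink_bw

def effective_inter_bw(beta_inter, sigma):
    """Worst case when every node pushes through the spine at once."""
    return beta_inter / max(sigma, 1.0)

# 16 nodes with 25 GB/s NICs behind a 200 GB/s uplink
sigma = oversubscription(nic_bw=25e9, nodes_per_leaf=16, uplink_bw=200e9)
print(sigma)                          # 2.0
print(effective_inter_bw(50e9, sigma))  # spine bandwidth is halved
```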
Decision Procedure: Choosing Algorithms by Level
- Identify topology levels (GPU, node, rack) and their \((\alpha, \beta)\).
- Compute per-level message sizes (\(n\), then \(n/G\), then \(n/(G \cdot R)\), ...).
- Choose algorithm per level:
- Large messages: ring
- Small messages: tree
- Estimate edge loads and check for oversubscription.
- If contention dominates, reduce data before it crosses slow links.
This is the collective analog of the device mesh abstraction you will meet in Chapter 23.
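The "choose algorithm per level" step can be made concrete by solving for the ring/tree crossover in the α-β model. A sketch; the function names are illustrative:

```python
import math

# Pick ring vs tree at one level from the latency/bandwidth crossover.

def crossover_bytes(p, alpha, beta):
    """Message size below which a binary tree beats a ring at this level."""
    extra_ring_latency = 2 * (p - 1) - 2 * math.log2(p)   # ring pays more alpha
    extra_tree_bandwidth = 2 * math.log2(p) - 2 * (p - 1) / p  # tree pays more beta
    return alpha * beta * extra_ring_latency / extra_tree_bandwidth

def choose_algorithm(p, msg_bytes, alpha, beta):
    return "tree" if msg_bytes < crossover_bytes(p, alpha, beta) else "ring"

# Inter-node level: 8 nodes, 5 us latency, 50 GB/s
print(round(crossover_bytes(8, 5e-6, 50e9)))       # 470588 bytes, ~0.47 MB
print(choose_algorithm(8, 10e9 / 8, 5e-6, 50e9))   # ring (large shard)
print(choose_algorithm(8, 1e5, 5e-6, 50e9))        # tree (small message)
```

The same crossover appears in the exercises below, so the function doubles as a check on the hand derivation.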
Case Study: 8 Nodes x 8 GPUs, 10 GB Gradient

Assumptions:

- \(G = 8\), \(N = 8\), \(n = 10\) GB
- \(\alpha_{\text{intra}} = 1\) us, \(\beta_{\text{intra}} = 600\) GB/s
- \(\alpha_{\text{inter}} = 5\) us, \(\beta_{\text{inter}} = 50\) GB/s
Flat ring (64 GPUs, inter-node dominated):

\[
T_{\text{flat}} \approx 2\cdot\frac{63}{64}\cdot\frac{10\ \text{GB}}{50\ \text{GB/s}} \approx 394\ \text{ms}
\]

Hierarchical ring:

\[
T_{\text{hier}} \approx \underbrace{2\cdot\frac{7}{8}\cdot\frac{10}{600}}_{\approx 29\ \text{ms}} + \underbrace{2\cdot\frac{7}{8}\cdot\frac{10/8}{50}}_{\approx 44\ \text{ms}} \approx 73\ \text{ms}
\]

Result: hierarchical is about 5.4x faster. Latency terms are negligible at these message sizes.
Traffic check: the flat ring moves \(2\cdot\frac{63}{64}\cdot 10 \approx 19.7\) GB per GPU across inter-node links, while the hierarchical ring moves only \(2\cdot\frac{7}{8}\cdot\frac{10}{8} \approx 2.19\) GB. Inter-node traffic drops by 9x, which is why hierarchical AllReduce dominates on real clusters. If the inter-node fabric is oversubscribed, multiply the inter-node term by the contention factor \(C\).
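The case-study arithmetic fits in a few lines, which makes it easy to re-run with your own cluster's numbers (variable names here are just for this sketch):

```python
# Case study: flat vs hierarchical ring AllReduce, alpha-beta model.
G, N, n = 8, 8, 10e9              # GPUs/node, nodes, gradient bytes
a_intra, b_intra = 1e-6, 600e9    # NVLink-class
a_inter, b_inter = 5e-6, 50e9     # NIC-class
P = G * N

# Flat ring over all 64 GPUs, paced by the inter-node links.
flat = 2 * (P - 1) * a_inter + 2 * (P - 1) / P * n / b_inter

# Hierarchical: intra ReduceScatter + inter AllReduce on n/G + intra AllGather.
intra = 2 * (G - 1) * a_intra + 2 * (G - 1) / G * n / b_intra
inter = 2 * (N - 1) * a_inter + 2 * (N - 1) / N * (n / G) / b_inter
hier = intra + inter

print(round(flat * 1e3), round(hier * 1e3), round(flat / hier, 1))  # 394 73 5.4
```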
Traffic Heatmap (Inter-Node Load)
Relative inter-node load, normalized to flat ring:
| Algorithm | Inter-node bytes per GPU | Relative load |
|---|---|---|
| Flat ring | 19.7 GB | ########## (1.00) |
| Hier ring | 2.19 GB | # (0.11) |
Flat vs Hierarchical Paths
In a flat ring, every hop may traverse inter-node links. In a hierarchical ring, inter-node traffic happens only between nodes, after intra-node reduction.
```mermaid
flowchart TB
    subgraph flat["Flat ring (all GPUs)"]
        direction LR
        F1((GPU)) --> F2((GPU)) --> F3((GPU)) --> F4((GPU))
        F4 --> F1
    end
    subgraph hier["Hierarchical ring (node-level)"]
        direction TB
        subgraph n1["Node A (intra-node reduce)"]
            A1((GPU)) --- A2((GPU)) --- A3((GPU)) --- A4((GPU))
        end
        subgraph n2["Node B (intra-node reduce)"]
            B1((GPU)) --- B2((GPU)) --- B3((GPU)) --- B4((GPU))
        end
        subgraph n3["Node C (intra-node reduce)"]
            C1((GPU)) --- C2((GPU)) --- C3((GPU)) --- C4((GPU))
        end
        n1 ---|inter-node| n2
        n2 ---|inter-node| n3
        n3 ---|inter-node| n1
    end
```
Toy Example: 2 Nodes x 2 GPUs
Assume each GPU contributes \(n = 1\) GB to an AllReduce.
Flat ring (\(P=4\)): each GPU sends \(2\cdot\frac{P-1}{P}\cdot n = 2\cdot\frac{3}{4}\cdot 1 = 1.5\) GB, and in the worst placement all of it may traverse the inter-node link.

Hierarchical ring (\(G=2\), \(N=2\)): only the reduced shard crosses nodes, \(2\cdot\frac{N-1}{N}\cdot\frac{n}{G} = 2\cdot\frac{1}{2}\cdot\frac{1}{2} = 0.5\) GB per GPU.
| Metric | Flat ring | Hier ring |
|---|---|---|
| Inter-node bytes per GPU | 1.5 GB | 0.5 GB |
| Relative load | 1.00 | 0.33 |
Edge-Load Table (Toy Example)
Assume a two-node cluster with a single inter-node link between Node A and Node B.
| Algorithm | Inter-node bytes per GPU | Total bytes on inter-node link |
|---|---|---|
| Flat ring (P=4) | 1.5 GB | \(4 \times 1.5 = 6\) GB |
| Hier ring (G=2, N=2) | 0.5 GB | \(4 \times 0.5 = 2\) GB |
This table shows the same 3x reduction in inter-node load, now expressed as total traffic over the single bottleneck link. It ignores full-duplex overlap and protocol overhead, so treat the numbers as lower-bound guidance.
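The toy-example rows can be verified with the same per-GPU traffic formulas (a sketch, using the simplified all-traffic-crosses assumption from the table):

```python
# Per-GPU ring traffic vs the reduced shard, for the 2x2 toy example.
P, G, N, n = 4, 2, 2, 1e9

flat_per_gpu = 2 * (P - 1) / P * n        # worst case: all may cross nodes
hier_per_gpu = 2 * (N - 1) / N * (n / G)  # only the reduced shard crosses

print(flat_per_gpu / 1e9, hier_per_gpu / 1e9)  # 1.5 0.5
print(round(hier_per_gpu / flat_per_gpu, 2))   # 0.33 relative load
```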
Exercises
- For \(G=8\) and \(N=16\), derive the total time for hierarchical AllReduce using ring at each level.
Solution
1. Hierarchical ring (\(G=8\), \(N=16\)):

\[
T_{\text{hier}} = 2(G-1)\alpha_{\text{intra}} + 2\frac{G-1}{G}\frac{n}{\beta_{\text{intra}}} + 2(N-1)\alpha_{\text{inter}} + 2\frac{N-1}{N}\frac{n/G}{\beta_{\text{inter}}}
\]

Plugging in \(G=8\), \(N=16\):

\[
T_{\text{hier}} = 14\,\alpha_{\text{intra}} + \frac{7}{4}\frac{n}{\beta_{\text{intra}}} + 30\,\alpha_{\text{inter}} + \frac{15}{8}\cdot\frac{n/8}{\beta_{\text{inter}}}
\]
- Given \(\alpha_{\text{inter}} = 5\times 10^{-6}\) s, \(\beta_{\text{inter}} = 50\) GB/s, and \(N=8\), compute the inter-node message size \(n'\) below which a tree beats a ring.
Solution
2. Tree vs ring crossover (inter-node):
Ring: \(T_{\text{ring}} = 2(N-1)\alpha + 2\frac{N-1}{N}\frac{n'}{\beta}\)

Tree: \(T_{\text{tree}} = 2\log_2 N \cdot \alpha + 2\log_2 N \cdot \frac{n'}{\beta}\)
Solve \(T_{\text{tree}} < T_{\text{ring}}\):

\[
n' < \alpha\beta \cdot \frac{2(N-1) - 2\log_2 N}{2\log_2 N - 2\frac{N-1}{N}}
\]

For \(N=8\), \(\alpha = 5\times 10^{-6}\) s, \(\beta = 50\) GB/s:

\[
n' < \frac{(5\times 10^{-6})(50\times 10^{9})(14 - 6)}{6 - 7/4} \approx 0.47\ \text{MB}
\]
For messages smaller than \(\sim 0.47\) MB per GPU on the inter-node step, a tree is faster.
- If oversubscription is \(\sigma = 3\), how does it change your algorithm choice?
Solution
3. Oversubscription impact:
Effective inter-node bandwidth becomes:

\[
\beta_{\text{eff}} = \frac{\beta_{\text{inter}}}{\sigma} = \frac{50}{3} \approx 16.7\ \text{GB/s}
\]
So the inter-node term is multiplied by \(\sigma\). This makes it even more important to reduce data before it crosses the slow links, favoring hierarchical collectives and smaller inter-node groups.
- Using the case study numbers, compute the new \(T_{\text{hier}}\) when \(\sigma = 3\).
Solution
4. Case study with \(\sigma = 3\):
Only the inter-node bandwidth term scales:

\[
T_{\text{hier}} \approx 2\cdot\frac{7}{8}\cdot\frac{10}{600} + \sigma \cdot 2\cdot\frac{7}{8}\cdot\frac{10/8}{50} \approx 29\ \text{ms} + 131\ \text{ms} \approx 160\ \text{ms}
\]

Still much faster than the flat ring baseline (~394 ms).
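The exercise-4 arithmetic, as a quick check (a sketch reusing the case-study numbers; only the inter-node bandwidth term is scaled by \(\sigma\)):

```python
# Case-study hierarchical AllReduce under sigma = 3 oversubscription.
G, N, n, sigma = 8, 8, 10e9, 3.0
b_intra, b_inter = 600e9, 50e9  # latency terms omitted: negligible at 10 GB

intra = 2 * (G - 1) / G * n / b_intra
inter = sigma * 2 * (N - 1) / N * (n / G) / b_inter  # bandwidth term scales
print(round((intra + inter) * 1e3))  # 160 ms, vs ~394 ms for the flat ring
```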