Hardware Assumptions and Units
This book uses back-of-envelope estimates. To keep calculations consistent, we use the following defaults unless stated otherwise. Most examples target H100 SXM; the table below provides reference values for other common accelerators so readers can adapt calculations to their own hardware.
Time-sensitive hardware references
Hardware specifications and interconnect offerings evolve quickly and can vary by SKU, cloud provider, and firmware. Last verified: 2026-02-12. Consult current vendor datasheets and instance documentation before relying on these numbers for production planning.
Accelerator Reference Table
| Spec | A100 80 GB SXM | H100 80 GB SXM | H200 141 GB SXM | AMD MI300X 192 GB |
|---|---|---|---|---|
| HBM capacity | 80 GB (HBM2e) | 80 GB (HBM3) | 141 GB (HBM3e) | 192 GB (HBM3) |
| HBM bandwidth | ~2.0 TB/s | ~3.35 TB/s | ~4.8 TB/s | ~5.3 TB/s |
| Dense BF16/FP16 peak | ~312 TFLOP/s | ~989 TFLOP/s | ~989 TFLOP/s | ~1,307 TFLOP/s |
| Dense FP8 peak | N/A | ~1,979 TFLOP/s | ~1,979 TFLOP/s | ~2,615 TFLOP/s |
| TDP | 400 W | 700 W | 700 W | 750 W |
| Intra-node interconnect | NVLink 3.0 | NVLink 4.0 | NVLink 4.0 | Infinity Fabric |
| Intra-node BW (per GPU) | ~600 GB/s | ~900 GB/s | ~900 GB/s | ~896 GB/s |
Notes: TFLOP/s values are dense (without structured sparsity). The H200 shares the H100 compute die but pairs it with more and faster HBM. MI300X FLOP/s figures are AMD's published peaks; achieved rates depend on ROCm software maturity. Interconnect bandwidth figures are aggregate bidirectional; HBM bandwidth is total memory bandwidth.
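To make these reference values easy to reuse in later estimates, the snippet below encodes them in a plain Python dict and runs one example calculation (the time to stream all of HBM once). The dict layout and field names are our own shorthand, not vendor terminology.

```python
# Reference values from the table above (decimal units: 1 TB/s = 1e12 B/s).
# Field names like "hbm_bw_tbs" are our own shorthand, not vendor terminology.
ACCELERATORS = {
    "A100-80GB-SXM":  {"hbm_gb": 80,  "hbm_bw_tbs": 2.0,  "bf16_tflops": 312,  "fp8_tflops": None},
    "H100-80GB-SXM":  {"hbm_gb": 80,  "hbm_bw_tbs": 3.35, "bf16_tflops": 989,  "fp8_tflops": 1979},
    "H200-141GB-SXM": {"hbm_gb": 141, "hbm_bw_tbs": 4.8,  "bf16_tflops": 989,  "fp8_tflops": 1979},
    "MI300X-192GB":   {"hbm_gb": 192, "hbm_bw_tbs": 5.3,  "bf16_tflops": 1307, "fp8_tflops": 2615},
}

# Example: lower bound on the time to read every byte of HBM once
# (e.g., streaming all weights during a decode step that touches the full device memory).
spec = ACCELERATORS["H100-80GB-SXM"]
seconds = (spec["hbm_gb"] * 1e9) / (spec["hbm_bw_tbs"] * 1e12)
print(f"Full HBM sweep on H100: {seconds * 1e3:.1f} ms")  # ~23.9 ms
```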
Default Values Used in Examples
Throughout the book, unless a section explicitly states otherwise:
- GPU: NVIDIA H100 SXM 80 GB
- Dense BF16/FP16 peak: ~989 TFLOP/s (non-sparse)
- Dense FP8 peak: ~1,979 TFLOP/s (non-sparse)
- MFU is computed relative to whichever peak is stated in the surrounding context (see the sketch below)
If a section assumes a different precision (e.g., FP8) or structured sparsity, it says so explicitly.
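As a concrete illustration of that MFU convention, here is a minimal sketch; the per-step FLOP count and step time are hypothetical placeholders, and the peak is the BF16 default above.

```python
# MFU = achieved FLOP/s divided by the peak stated in the local context.
# The measured values below are placeholders, not benchmark results.
PEAK_BF16_FLOPS = 989e12   # H100 SXM dense BF16 peak (non-sparse)

flops_per_step = 2.5e15    # hypothetical: FLOPs executed per GPU in one training step
step_time_s = 6.0          # hypothetical: measured step time in seconds

achieved_flops = flops_per_step / step_time_s
mfu = achieved_flops / PEAK_BF16_FLOPS
print(f"Achieved: {achieved_flops / 1e12:.0f} TFLOP/s, MFU: {mfu:.1%}")  # ~417 TFLOP/s, ~42.1%
```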
Interconnect Bandwidth
| Link | Bandwidth (per GPU) | Typical Latency | Notes |
|---|---|---|---|
| NVLink 3.0 (A100) | ~600 GB/s | ~1 μs | 12 links × 50 GB/s |
| NVLink 4.0 (H100/H200) | ~900 GB/s | ~1 μs | 18 links × 50 GB/s |
| InfiniBand NDR 400 | ~50 GB/s | ~1–2 μs | Per NIC, per direction |
| InfiniBand HDR 200 | ~25 GB/s | ~1–2 μs | Common in older clusters |
| 100 GbE (RoCE v2) | ~12.5 GB/s | ~5–10 μs | Some cloud providers |
| PCIe Gen5 x16 | ~64 GB/s | ~1 μs | CPU↔GPU, per direction (~128 GB/s bidirectional) |
When we compute ridge points (the arithmetic intensity, in FLOPs per byte, at which a workload stops being bandwidth-bound), we divide peak FLOP/s by the stated link bandwidth. If a section uses per-direction rather than aggregate bandwidth, it calls that out explicitly.
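As a quick sanity check of that convention, the sketch below computes H100 ridge points against both HBM and NVLink, using the values from the tables above.

```python
# Ridge point = peak FLOP/s / bandwidth (bytes/s): the arithmetic intensity
# at which a kernel or collective stops being bandwidth-bound.
PEAK_BF16_FLOPS = 989e12   # H100 dense BF16 peak
HBM_BW = 3.35e12           # H100 HBM bandwidth, bytes/s
NVLINK_BW = 900e9          # NVLink 4.0 aggregate bidirectional, bytes/s per GPU

print(f"HBM ridge point:    {PEAK_BF16_FLOPS / HBM_BW:,.0f} FLOPs/byte")     # ~295
print(f"NVLink ridge point: {PEAK_BF16_FLOPS / NVLINK_BW:,.0f} FLOPs/byte")  # ~1,099
```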
Typical DGX/SuperPOD Configurations
| Configuration | GPUs/Node | Intra-node | Inter-node | Example |
|---|---|---|---|---|
| DGX A100 | 8 × A100 | NVLink 3.0 (600 GB/s) | 8 × HDR 200 IB | Many cloud instances |
| DGX H100 | 8 × H100 | NVLink 4.0 (900 GB/s) | 8 × NDR 400 IB | Meta Grand Teton |
| DGX H200 | 8 × H200 | NVLink 4.0 (900 GB/s) | 8 × NDR 400 IB | Drop-in upgrade for DGX H100 systems |
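To make the intra-/inter-node gap in these configurations concrete, here is a minimal sketch comparing the two links on a like-for-like (per-direction) basis; the one-NIC-per-GPU assumption follows from the 8 × NDR 400 configuration above.

```python
# DGX H100 node from the table: 8 GPUs on NVLink 4.0, 8 x NDR 400 NICs (one per GPU).
nvlink_bidir_per_gpu = 900e9                       # NVLink 4.0, aggregate bidirectional, bytes/s
nvlink_per_dir_per_gpu = nvlink_bidir_per_gpu / 2  # ~450 GB/s per direction
ib_per_dir_per_gpu = 50e9                          # NDR 400, per direction, one NIC per GPU

ratio = nvlink_per_dir_per_gpu / ib_per_dir_per_gpu
print(f"NVLink per direction: {nvlink_per_dir_per_gpu / 1e9:.0f} GB/s per GPU")  # 450 GB/s
print(f"IB per direction:     {ib_per_dir_per_gpu / 1e9:.0f} GB/s per GPU")      # 50 GB/s
print(f"Intra- vs inter-node gap: ~{ratio:.0f}x")                                 # ~9x
```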
Units
- GB and TB denote decimal units (1 GB = 10^9 bytes)
- GiB and TiB denote binary units (1 GiB = 2^30 bytes)
- Tokens are counted in raw tokens unless otherwise stated
If you need exact provisioning, convert these estimates to your platform's reporting units (often GiB).
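A minimal conversion helper, since cloud and OS tooling often report GiB:

```python
# Decimal vs. binary units: 1 GB = 1e9 bytes, 1 GiB = 2**30 bytes.
def gb_to_gib(gb: float) -> float:
    return gb * 1e9 / 2**30

# Example: an "80 GB" HBM capacity as a platform reporting GiB would show it.
print(f"80 GB = {gb_to_gib(80):.1f} GiB")  # ~74.5 GiB
```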