Estimation as Discipline
Before running expensive experiments, estimate. A capacity engineer should predict training time, memory usage, and communication costs within 2× of actual values using only basic arithmetic.
The Question: You're planning to train a 30B parameter model on 128 H100s. Without running anything, estimate: (1) memory per GPU, (2) time per step, (3) tokens per second. Your estimates should guide hardware selection before spending a dollar.
The Estimation Mindset
Good estimates are:
- Fast: 5 minutes of calculation, not 5 days of profiling
- Approximate: Within 2× is usually sufficient for planning
- Conservative: Overestimate costs, underestimate throughput
The goal is to identify infeasible configurations quickly and focus experimentation on viable options.
Practice
Estimates here use idealized peak values. Always sanity-check against profiler measurements on your actual stack.
See Hardware Assumptions and Units for the default throughput and bandwidth values used in examples.
Memory Estimation

Model Memory
For a transformer with:
- \(L\) layers
- \(H\) hidden dimension
- \(V\) vocabulary size
- \(A\) attention heads
Total parameters (assuming a 4H-wide MLP and untied embedding/unembedding):

\[\Psi \approx 12LH^2 + 2VH\]

Memory for parameters (mixed precision, BF16):

\[M_{\text{params}} = 2\Psi \text{ bytes}\]

Memory for optimizer (Adam, FP32 master weights + momentum + variance):

\[M_{\text{opt}} = 4\Psi + 4\Psi + 4\Psi = 12\Psi \text{ bytes}\]

Total static memory (adding BF16 gradients, \(2\Psi\)):

\[M_{\text{static}} = 2\Psi + 2\Psi + 12\Psi = 16\Psi \text{ bytes}\]
Activation Memory

Per-layer activation memory (without checkpointing):

\[M_{\text{act}}^{\text{layer}} \approx BSH\left(34 + 5\frac{AS}{H}\right) \text{ bytes}\]

Where \(B\) = batch, \(S\) = sequence, \(H\) = hidden, \(A\) = heads. This coefficient is a rough heuristic based on counting common activation tensors.

Total activation memory:

\[M_{\text{act}} = L \cdot M_{\text{act}}^{\text{layer}}\]
With activation checkpointing (recompute every \(k\) layers), peak memory has two components: (1) checkpoint storage at \(L/k\) boundaries (just the hidden-state input, \(\approx 2BSH\) bytes each), and (2) full per-layer activations for up to \(k\) layers during recomputation:

\[M_{\text{act}}^{\text{ckpt}} \approx \frac{L}{k} \cdot 2BSH + k \cdot M_{\text{act}}^{\text{layer}}\]

The optimal \(k = \sqrt{L}\) minimizes total memory. As a simpler (more conservative) approximation used throughout this chapter:

\[M_{\text{act}}^{\text{ckpt}} \approx \left(\frac{L}{k} + k\right) M_{\text{act}}^{\text{layer}}\]

This overestimates the checkpoint term (using full \(M_{\text{act}}^{\text{layer}}\) instead of just \(2BSH\)) but is easier to compute.
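The memory formulas above are easy to script. A minimal sketch in Python, using the chapter's \(16\Psi\) static rule and the conservative checkpointing approximation (all constants as defined above):

```python
# Memory estimation helpers following the formulas in this section.
# Units: bytes throughout; GB means 1e9 bytes, as in the chapter's tables.

def static_memory_bytes(psi: float) -> float:
    """Params (2*psi) + gradients (2*psi) + Adam states (12*psi) = 16*psi."""
    return 16 * psi

def activation_layer_bytes(B: int, S: int, H: int, A: int) -> float:
    """Per-layer activation memory without checkpointing: BSH*(34 + 5AS/H)."""
    return B * S * H * (34 + 5 * A * S / H)

def activation_ckpt_bytes(L: int, k: int, per_layer: float) -> float:
    """Conservative checkpointing estimate: (L/k + k) * per-layer memory."""
    return (L / k + k) * per_layer

# 70B static memory: 16 * 70e9 bytes = 1.12 TB, so at least 14 x 80 GB H100s
static = static_memory_bytes(70e9)
print(f"70B static: {static / 1e12:.2f} TB, min H100s: {static / 80e9:.0f}")
```

The same three functions reproduce the 13B exercise later in this section by plugging in B=8, S=4096, H=5120, A=40, L=40, k=4.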
Quick Estimation Table
| Component | Formula | 70B Model |
|---|---|---|
| Parameters | \(2\Psi\) | 140 GB |
| Gradients | \(2\Psi\) | 140 GB |
| Optimizer | \(12\Psi\) | 840 GB |
| Total Static | \(16\Psi\) | 1.12 TB |
This immediately tells us: 70B requires ≥14 H100s just for static memory.
Compute Estimation

FLOPs per Token

Forward pass:

\[F_{\text{fwd}} \approx 2\Psi \text{ FLOPs/token}\]

Backward pass:

\[F_{\text{bwd}} \approx 4\Psi \text{ FLOPs/token}\]

Total per token:

\[F_{\text{token}} \approx 6\Psi \text{ FLOPs/token}\]
Time per Step

\[t_{\text{step}} = \frac{F_{\text{step}}}{P \cdot F_{\text{peak}} \cdot \text{MFU}}\]

Where:
- \(F_{\text{step}} = 6\Psi \cdot B \cdot S\) (batch × sequence tokens)
- MFU ≈ 0.40-0.50 for well-optimized training
- \(P\) = number of GPUs
- \(F_{\text{peak}}\) = peak FLOPs/s per GPU (≈ \(0.989 \times 10^{15}\) for H100 BF16, as in Hardware Assumptions and Units)

Example: 70B model, batch=1M tokens, 128 H100s, 45% MFU:

\[t_{\text{step}} = \frac{6 \times 70 \times 10^9 \times 10^6}{128 \times 0.989 \times 10^{15} \times 0.45} \approx \frac{4.2 \times 10^{17}}{5.7 \times 10^{16}} \approx 7.4 \text{ s}\]
Tokens per Second

\[\text{tokens/s} = \frac{B \cdot S}{t_{\text{step}}} \approx \frac{10^6}{7.4} \approx 135\text{K}\]

Training Time

For 2T tokens:

\[t_{\text{train}} = \frac{2 \times 10^{12}}{1.35 \times 10^5} \approx 1.5 \times 10^7 \text{ s} \approx 170 \text{ days}\]
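The compute arithmetic above can be wrapped in a few helpers. A sketch assuming the chapter's H100 BF16 dense peak of \(0.989 \times 10^{15}\) FLOP/s:

```python
# Step-time, throughput, and training-time estimates from this section.
PEAK_FLOPS = 0.989e15  # assumed H100 BF16 dense peak, FLOP/s per GPU

def step_time_s(psi: float, tokens_per_batch: float, n_gpus: int, mfu: float) -> float:
    """t_step = 6*psi*tokens / (P * peak * MFU)."""
    return 6 * psi * tokens_per_batch / (n_gpus * PEAK_FLOPS * mfu)

def tokens_per_s(tokens_per_batch: float, t_step: float) -> float:
    return tokens_per_batch / t_step

def training_days(total_tokens: float, tps: float) -> float:
    return total_tokens / tps / 86400

# 70B example: batch = 1M tokens, 128 H100s, 45% MFU
t = step_time_s(70e9, 1e6, 128, 0.45)   # ~7.4 s per step
tps = tokens_per_s(1e6, t)              # ~135K tokens/s
days = training_days(2e12, tps)         # ~170 days for 2T tokens
print(f"{t:.1f} s/step, {tps / 1e3:.0f}K tok/s, {days:.0f} days")
```

Swapping in a different `PEAK_FLOPS` (e.g., for other accelerators) reuses the same estimates unchanged.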
Communication Estimation

Data Parallelism

AllReduce volume per step: \(2\Psi\) bytes (BF16 gradients)

AllReduce time (ring, P GPUs, bandwidth \(\beta\)):

\[t_{\text{AR}} = \frac{2(P-1)}{P} \cdot \frac{2\Psi}{\beta} \approx \frac{4\Psi}{\beta} \text{ for large } P\]

For 70B across 128 GPUs at 50 GB/s:

\[t_{\text{AR}} \approx \frac{4 \times 70 \times 10^9}{50 \times 10^9} \approx 5.6 \text{ s}\]

This is comparable to the compute time per step, which is why overlapping gradient AllReduce with the backward pass is essential at this scale.
Tensor Parallelism

AllReduce per layer: \(2 \times B \times S \times H\) bytes per AllReduce (BF16 activations)

For 8-way TP, B=4, S=4096, H=8192:

\[2 \times 4 \times 4096 \times 8192 \approx 268 \text{ MB per AllReduce} \approx 0.3 \text{ ms at } \sim 900 \text{ GB/s NVLink}\]

Total per step (80 layers, 2 AllReduce each): ~48 ms
Pipeline Parallelism

Bubble fraction:

\[\text{Bubble} = \frac{p-1}{m+p-1}\]

Where \(p\) = pipeline stages, \(m\) = micro-batches.

For p=8, m=32: Bubble = 7/39 ≈ 18%
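The communication estimates above reduce to two one-line formulas. A sketch, where bandwidth `beta` is in bytes/s and the 50 GB/s figure is the assumption used in this section:

```python
# Ring AllReduce time and pipeline bubble fraction, as estimated above.

def ring_allreduce_s(volume_bytes: float, n_gpus: int, beta: float) -> float:
    """Ring AllReduce: each GPU transfers 2*(P-1)/P of the volume at bandwidth beta."""
    return 2 * (n_gpus - 1) / n_gpus * volume_bytes / beta

def bubble_fraction(stages: int, microbatches: int) -> float:
    """Pipeline bubble: (p-1) / (m + p - 1)."""
    return (stages - 1) / (microbatches + stages - 1)

# 70B DP gradients (2*psi = 140 GB in BF16) across 128 GPUs at 50 GB/s: ~5.6 s
print(f"DP AllReduce: {ring_allreduce_s(2 * 70e9, 128, 50e9):.1f} s")
# p=8 stages, m=32 micro-batches: 7/39, about 18%
print(f"Bubble: {bubble_fraction(8, 32):.0%}")
```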
The Estimation Workflow
- Memory check: Does the model fit? How much parallelism is required?
- Compute estimate: What's the theoretical throughput?
- Communication estimate: What fraction of time is communication?
- Bottleneck identification: Which ceiling dominates?
- Sanity check: Compare to similar published runs
Common Estimation Errors
| Error | Consequence | Fix |
|---|---|---|
| Forgetting optimizer states | 3× memory underestimate | Always include 12Ψ |
| Ignoring activations | OOM during training | Account for batch × seq × hidden |
| Assuming 100% MFU | 2× time underestimate | Use 40-50% MFU |
| Ignoring communication | Works in theory, fails in practice | Add AllReduce/AllGather time |
Exercises
- Estimate the memory per GPU for a 13B model with TP=4, ZeRO-3 across 32 GPUs. Assume batch=8, sequence=4096, hidden=5120, layers=40.
Solution
Parallelism configuration:
- Total GPUs: 32
- TP = 4 (tensor parallel groups)
- DP = 32/4 = 8 (data parallel replicas)
- ZeRO-3 shards across DP dimension
Static memory (TP=4 combined with ZeRO-3 over DP=8 effectively shards static state across all 32 GPUs):
| Component | Formula | Per-GPU |
|---|---|---|
| Parameters | \(\frac{2\Psi}{32}\) | \(\frac{2 \times 13 \times 10^9}{32} = 0.81\) GB |
| Gradients | \(\frac{2\Psi}{32}\) | 0.81 GB |
| Optimizer | \(\frac{12\Psi}{32}\) | 4.87 GB |
| Total Static | \(\frac{16\Psi}{32}\) | 6.49 GB |
Activation memory:

Using formula: \(M_{\text{act}}^{\text{layer}} \approx BSH \times (34 + 5\frac{AS}{H})\)

Assuming 40 attention heads (\(A = 40\)):

\[M_{\text{act}}^{\text{layer}} \approx 8 \times 4096 \times 5120 \times \left(34 + 5 \times \frac{40 \times 4096}{5120}\right) = 1.68 \times 10^8 \times 194 \approx 32.5 \text{ GB}\]

With TP=4: Activations are distributed, reducing per-GPU cost by ~4×:

\[M_{\text{act}}^{\text{layer, GPU}} \approx 8.1 \text{ GB}\]

With activation checkpointing (checkpoint every 4 layers, conservative formula):

\[M_{\text{act}}^{\text{ckpt}} \approx \left(\frac{40}{4} + 4\right) \times 8.1 \approx 113 \text{ GB}\]
This is still too large, indicating that this batch/sequence combination is unrealistic without further parallelism, smaller batches, or reduced sequence length.
Total memory per GPU:
| Component | Memory |
|---|---|
| Static (ZeRO-3) | 6.5 GB |
| Activations (TP=4, checkpointing) | ~113 GB |
| Temporary buffers | ~5 GB |
| Total | ~125 GB |
This does not fit in 80 GB. Even with ZeRO-3 nearly eliminating static memory, the activation memory at this batch/sequence combination dominates. You'd need more TP/PP, smaller batch/sequence, or a tighter recompute estimate. For example, using the exact checkpoint formula (each checkpoint stores only \(2BSH/4 \approx 84\) MB, leaving roughly \(4 \times 8.1 \approx 32\) GB of live activations) brings the total to ~44 GB, which does fit.
- A training run achieves 150K tokens/s on 64 H100s for a 7B model. Calculate the MFU.
Solution
FLOPs per token:

\[F_{\text{token}} = 6\Psi = 6 \times 7 \times 10^9 = 4.2 \times 10^{10}\]

Achieved FLOPs/s:

\[1.5 \times 10^5 \text{ tokens/s} \times 4.2 \times 10^{10} \text{ FLOPs/token} = 6.3 \times 10^{15}\]

Peak FLOPs/s (64 H100s):

\[64 \times 0.989 \times 10^{15} \approx 6.3 \times 10^{16}\]

MFU:

\[\text{MFU} = \frac{6.3 \times 10^{15}}{6.3 \times 10^{16}} \approx 10\%\]
Analysis: This is a very low MFU, indicating significant inefficiency. Possible causes:
| Issue | Likely Impact |
|---|---|
| Small batch size | Underutilized compute |
| Excessive pipeline bubbles | Idle time between stages |
| Unoptimized kernels | Low SM utilization |
| Communication not overlapped | GPUs waiting for network |
Expected MFU for well-optimized 7B training: 40-50%
At 45% MFU, expected throughput would be:

\[\frac{0.45 \times 6.3 \times 10^{16} \text{ FLOP/s}}{4.2 \times 10^{10} \text{ FLOPs/token}} \approx 680\text{K tokens/s}\]

The observed 150K tokens/s is only ~22% of this potential.
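As a sanity check, the MFU arithmetic in this solution can be scripted (again assuming the chapter's \(0.989 \times 10^{15}\) FLOP/s H100 peak):

```python
# MFU from observed throughput: achieved 6*psi FLOPs/token rate over aggregate peak.
PEAK_FLOPS = 0.989e15  # assumed H100 BF16 dense peak, FLOP/s per GPU

def mfu(tokens_per_sec: float, psi: float, n_gpus: int) -> float:
    """Model FLOPs utilization = 6*psi*throughput / (P * peak)."""
    return 6 * psi * tokens_per_sec / (n_gpus * PEAK_FLOPS)

# 7B model, 150K tokens/s on 64 H100s: about 10% MFU
print(f"MFU: {mfu(150e3, 7e9, 64):.1%}")
```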
- You need to train a 30B model in 30 days on a budget of USD 2M. Estimate the minimum number of H100s required (at USD 4/hr) and the required MFU to meet the timeline.
Solution
Budget constraint:

\[\text{GPU-hours} = \frac{\$2{,}000{,}000}{\$4/\text{hr}} = 500{,}000\]

Time constraint:

\[30 \text{ days} = 720 \text{ hours}\]

GPU constraint from budget:

\[P \leq \frac{500{,}000}{720} \approx 694 \text{ GPUs}\]

Round to practical value: P = 640 GPUs (or 512 for power-of-2)

Training FLOPs (assuming Chinchilla-optimal 600B tokens):

\[F = 6\Psi D = 6 \times 30 \times 10^9 \times 600 \times 10^9 = 1.08 \times 10^{23}\]

Required throughput:

\[\frac{1.08 \times 10^{23}}{720 \times 3600 \text{ s}} \approx 4.2 \times 10^{16} \text{ FLOP/s}\]

Peak throughput with 640 H100s:

\[640 \times 0.989 \times 10^{15} \approx 6.3 \times 10^{17} \text{ FLOP/s}\]

Required MFU:

\[\frac{4.2 \times 10^{16}}{6.3 \times 10^{17}} \approx 6.6\%\]
Conclusion: This is easily achievable—well-optimized runs achieve 40-50% MFU.
Alternative: Train for more tokens (2T):

\[F = 6 \times 30 \times 10^9 \times 2 \times 10^{12} = 3.6 \times 10^{23} \quad\Rightarrow\quad \text{required MFU} \approx \frac{3.6 \times 10^{23} / (720 \times 3600 \text{ s})}{6.3 \times 10^{17} \text{ FLOP/s}} \approx 22\%\]

Still very achievable with standard optimization.
Summary:
| Scenario | Tokens | GPUs | Required MFU |
|---|---|---|---|
| Chinchilla (600B) | 600B | 640 | 6.6% |
| Extended (2T) | 2T | 640 | 22% |
| Budget-limited | 2T | 512 | ~28% |
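All three rows of the summary table follow from one function. A sketch under the same assumptions (6ΨD training FLOPs, \(0.989 \times 10^{15}\) FLOP/s per-GPU peak):

```python
# Required-MFU solver for the budget/time trade-off in this exercise.
PEAK_FLOPS = 0.989e15  # assumed H100 BF16 dense peak, FLOP/s per GPU

def required_mfu(psi: float, tokens: float, n_gpus: int, days: float) -> float:
    """MFU needed to train `tokens` tokens on `n_gpus` GPUs within `days` days."""
    total_flops = 6 * psi * tokens
    required_rate = total_flops / (days * 86400)
    return required_rate / (n_gpus * PEAK_FLOPS)

# The three scenarios from the summary table (30B model, 30 days)
for tokens, gpus in [(600e9, 640), (2e12, 640), (2e12, 512)]:
    print(f"{tokens:.0e} tokens on {gpus} GPUs: "
          f"MFU {required_mfu(30e9, tokens, gpus, 30):.1%}")
```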
Key Takeaways
- Estimation is a first-class skill: rough numbers prevent impossible plans.
- Budget, time, and MFU are interchangeable: any two determine the third.
- Always sanity-check against hardware limits: peak FLOPs and memory ceilings bound every plan.