36 Hardware-Specific Optimization
Getting Maximum Performance from Different Accelerators
Code that’s optimal on an A100 may be suboptimal on an H100. What’s fast on NVIDIA may be slow on AMD.
Hardware diversity is increasing. Understanding hardware-specific optimization is essential.
36.1 The Hardware Landscape
The accelerator market is more diverse than ever:
Major ML Accelerators (2024-2025):
NVIDIA:
- B200/B100 (Blackwell): 192GB HBM3e, 8 TB/s, FP4 support
- H100/H200 (Hopper): 80-141GB HBM3/HBM3e, 3.35-4.8 TB/s, FP8 support
- A100 (Ampere): 40/80GB HBM2e, 2 TB/s
- L4/L40S: Inference-focused, lower power
AMD:
- MI300X: 192GB HBM3, 5.3 TB/s
- MI325X: 256GB HBM3e, 6 TB/s (2025)
Intel:
- Gaudi 2/3: HBM-based accelerators
- Data Center Max (Ponte Vecchio)
Apple:
- M3/M4 Max/Ultra: Unified memory, Metal API
Google:
- TPU v5e/v5p: Custom ASICs, XLA-optimized
- TPU v6 (Trillium): roughly 4.7× the per-chip compute of v5e
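Before reaching for architecture-specific tricks, confirm what you are actually running on. A minimal PyTorch sketch (NVIDIA and AMD GPUs both report through torch.cuda; Apple shows up via the MPS backend):

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"Device:       {props.name}")
    print(f"Memory:       {props.total_memory / 1e9:.0f} GB")
    print(f"SM count:     {props.multi_processor_count}")
    print(f"Compute cap.: {props.major}.{props.minor}")
elif torch.backends.mps.is_available():
    print("Apple Silicon GPU via the MPS backend")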
36.2 NVIDIA: A100 vs H100
36.2.1 Architectural Differences
                 A100             H100
─────────────────────────────────────────────
SM count         108              132
FP16 TFLOPS      312              989
FP8 TFLOPS       N/A              1,979
Memory           80 GB HBM2e      80 GB HBM3
Bandwidth        2.0 TB/s         3.35 TB/s
NVLink           600 GB/s         900 GB/s
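One way to read this table: dividing peak compute by memory bandwidth gives the arithmetic intensity (FLOPs per byte) a kernel needs before it stops being memory-bound. The H100 raises that bar, which is one reason A100-tuned kernels can underperform on it. A quick illustrative calculation:

# FLOPs per byte needed to keep the tensor cores busy (roofline "ridge point")
for name, tflops, tbps in [('A100 FP16', 312, 2.0), ('H100 FP16', 989, 3.35)]:
    print(f"{name}: {tflops / tbps:.0f} FLOPs/byte")
# A100: ~156 FLOPs/byte, H100: ~295 FLOPs/byte
# → kernels that were compute-bound on A100 can become memory-bound on H100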
Key H100 additions:
- TMA (Tensor Memory Accelerator)
- FP8 Tensor Cores
- Thread Block Clusters
- Asynchronous execution barriers
36.2.2 H100-Specific Optimizations
1. Tensor Memory Accelerator (TMA)
TMA enables asynchronous, hardware-managed memory transfers:
# A100: software explicitly stages data into shared memory
@triton.jit
def a100_kernel(x_ptr, ...):
    # Software orchestrates the load and must wait for it to complete
    x = tl.load(x_ptr + offsets, mask=mask)
    # Compute...

# H100: TMA handles loads automatically
@triton.jit
def h100_kernel(x_ptr, ...):
    # A TMA descriptor specifies the transfer pattern;
    # hardware performs the async copy and synchronization,
    # so compute can overlap with loads
    ...

Practical impact: FlashAttention-3 uses TMA (among other Hopper features) for roughly 1.5-2× speedup over FlashAttention-2 on H100.
2. FP8 Compute
# H100 native FP8 via Transformer Engine
import torch
import transformer_engine.pytorch as te

# FP8 linear layer
layer = te.Linear(4096, 4096)

x = torch.randn(16, 4096, device='cuda')
with te.fp8_autocast():
    output = layer(x)   # Uses FP8 tensor cores

# Peak dense throughput on H100:
#   FP16: 989 TFLOPS
#   FP8:  1,979 TFLOPS (2× faster)
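For finer control over the FP8 recipe (scaling format, amax history), Transformer Engine exposes a recipe object; a sketch assuming a recent Transformer Engine release (check the docs for exact arguments):

import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# E4M3 for forward activations/weights, E5M2 for gradients
recipe = DelayedScaling(fp8_format=Format.HYBRID,
                        amax_history_len=16,
                        amax_compute_algo='max')

layer = te.Linear(4096, 4096)
x = torch.randn(16, 4096, device='cuda')
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)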
3. Thread Block Clusters
H100 groups thread blocks into clusters for better cooperation:

# H100: up to 16 thread blocks per cluster
# Shared memory is accessible across the cluster (distributed shared memory)

# Practical use: larger tile sizes in matmul
#   A100 tile: limited by a single SM's shared memory (164 KB)
#   H100 tile: can span multiple SMs via a cluster
36.2.3 Optimizing for Both
import torch

def get_optimal_config():
    """Select a config based on GPU architecture."""
    props = torch.cuda.get_device_properties(0)
    compute_cap = (props.major, props.minor)
    if compute_cap >= (9, 0):    # Hopper (H100)
        return {
            'use_fp8': True,
            'tile_size': 256,    # Larger tiles with TMA
            'use_flash_attention_3': True,
        }
    elif compute_cap >= (8, 0):  # Ampere (A100)
        return {
            'use_fp8': False,
            'tile_size': 128,
            'use_flash_attention_2': True,
        }
    else:
        return {
            'use_fp8': False,
            'tile_size': 64,
            'use_flash_attention_1': True,
        }
36.3 NVIDIA Blackwell: B100 and B200
36.3.1 The Next Generation (2024-2025)
Blackwell represents another generational leap from Hopper:
                 H100            B100             B200
─────────────────────────────────────────────────────────────
Memory           80 GB HBM3      192 GB HBM3e     192 GB HBM3e
Bandwidth        3.35 TB/s       8 TB/s           8 TB/s
FP16 TFLOPS      989             ~1,800           ~2,250
FP8 TFLOPS       1,979           ~3,500           ~4,500
FP4 TFLOPS       N/A             ~7,000           ~9,000
NVLink           900 GB/s        1.8 TB/s         1.8 TB/s
TDP              700W            700W             1000W
Key Blackwell additions:
- 5th-gen Tensor Cores (2× Hopper)
- FP4 support for inference
- Enhanced TMA (larger tiles)
- NVLink 5 (2× bandwidth)
- Decompression engine for compressed models
36.3.2 FP4: The New Frontier
Blackwell introduces native FP4 (4-bit floating point):
# FP4 (E2M1) has very limited range: ±6, with only 1 mantissa bit
# Requires careful scaling
def fp4_linear(x, weight, scale_x, scale_w):
    """FP4 matrix multiply with per-tensor scaling."""
    # Quantize to FP4
    x_fp4 = quantize_fp4(x / scale_x)
    w_fp4 = quantize_fp4(weight / scale_w)
    # Compute in FP4 (uses Blackwell tensor cores)
    y_fp4 = x_fp4 @ w_fp4.T   # 2× faster than FP8
    # Dequantize
    return y_fp4.float() * scale_x * scale_w

# Peak throughput on B200:
#   FP4:  ~9,000 TFLOPS
#   FP8:  ~4,500 TFLOPS
#   FP16: ~2,250 TFLOPS
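The quantize_fp4 helper above is left undefined; here is a minimal, purely illustrative PyTorch sketch that snaps values onto the FP4 E2M1 grid (real Blackwell FP4 happens inside the tensor cores, not in Python):

import torch

# The 8 non-negative values representable in FP4 E2M1 (1 sign, 2 exponent, 1 mantissa bit)
FP4_E2M1_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x: torch.Tensor) -> torch.Tensor:
    """Snap each element to the nearest representable FP4 (E2M1) value."""
    grid = FP4_E2M1_GRID.to(x.device)
    sign = torch.sign(x)
    mag = x.abs().clamp(max=6.0)
    # Nearest-neighbor lookup against the E2M1 grid
    idx = (mag.unsqueeze(-1) - grid).abs().argmin(dim=-1)
    return sign * grid[idx]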
36.3.3 Decompression Engine
Blackwell includes hardware for on-the-fly decompression of compressed data and model weights:

Traditional flow:
  Storage → [CPU decompress] → GPU memory → Compute
Blackwell flow:
  Storage → GPU memory (compressed) → [HW decompress] → Compute

Benefits:
- Roughly 2× effective capacity for compressible data
- Less data moved over storage and interconnect
- Faster model loading

Supported formats include LZ4, Snappy, and Deflate.
36.3.4 Programming for Blackwell
import torch

def get_optimal_config():
    """Select a config based on GPU architecture."""
    props = torch.cuda.get_device_properties(0)
    compute_cap = (props.major, props.minor)
    if compute_cap >= (10, 0):   # Blackwell (B100/B200)
        return {
            'use_fp4': True,             # For inference
            'use_fp8': True,             # For training
            'tile_size': 512,            # Even larger tiles
            'use_flash_attention_3': True,   # check Blackwell support status
            'enable_compression': True,
        }
    elif compute_cap >= (9, 0):  # Hopper (H100)
        return {
            'use_fp8': True,
            'tile_size': 256,
            'use_flash_attention_3': True,
        }
    elif compute_cap >= (8, 0):  # Ampere (A100)
        return {
            'use_fp8': False,
            'tile_size': 128,
            'use_flash_attention_2': True,
        }
    else:
        return {
            'use_fp8': False,
            'tile_size': 64,
            'use_flash_attention_1': True,
        }
36.3.5 Blackwell Inference Optimization
# Transformer Engine on Blackwell
import transformer_engine.pytorch as te

# FP4 inference (Blackwell only).
# Note: the fp4_autocast API below is illustrative; check the current
# Transformer Engine docs for the state of Blackwell FP4 support.
with te.fp4_autocast():
    # Weight compression/quantization happens automatically inside TE modules
    output = model(input)

# Micro-block scaling for FP4:
# each 16-element group carries its own scale factor,
# which maintains accuracy despite FP4's limited range
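To make micro-block scaling concrete, a hedged PyTorch sketch of per-16-element scaling, reusing the illustrative quantize_fp4 from earlier (this is not a Transformer Engine API):

import torch

def quantize_fp4_blockwise(x: torch.Tensor, block: int = 16):
    """Quantize a tensor in groups of `block` elements, one scale per group."""
    assert x.numel() % block == 0
    groups = x.reshape(-1, block)
    # Scale each group so its largest magnitude maps to FP4's max value (6.0)
    scales = groups.abs().amax(dim=1, keepdim=True) / 6.0
    scales = scales.clamp(min=1e-12)          # avoid division by zero
    q = quantize_fp4(groups / scales)         # reuse the E2M1 sketch above
    return q, scales

def dequantize_blockwise(q, scales, shape):
    return (q * scales).reshape(shape)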
36.3.6 Grace-Blackwell Superchip
The GB200 superchip pairs a Grace CPU with two Blackwell GPUs (one shown below; per-GPU figures):
Grace-Blackwell Architecture:
┌─────────────────────────────────────────────────────┐
│ Grace CPU (72 Arm cores) │
│ - 480 GB LPDDR5X │
│ - 546 GB/s memory bandwidth │
│ - NVLink-C2C to GPU (900 GB/s) │
├─────────────────────────────────────────────────────┤
│ Blackwell GPU │
│ - 192 GB HBM3e │
│ - 8 TB/s memory bandwidth │
│ - NVLink to other GPUs │
└─────────────────────────────────────────────────────┘
Key benefit: NVLink between CPU and GPU
- No PCIe bottleneck
- Unified memory addressing
- 900 GB/s CPU-GPU bandwidth (vs 64 GB/s PCIe 5.0)
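A rough sense of what that bandwidth gap means for staging model weights from CPU to GPU memory (illustrative arithmetic only):

# Time to move 140 GB of FP16 weights (roughly a 70B-parameter model)
weights_gb = 140

pcie5_gbps = 64        # PCIe 5.0 x16, one direction
nvlink_c2c_gbps = 900  # Grace-Blackwell NVLink-C2C

print(f"PCIe 5.0:   {weights_gb / pcie5_gbps:.2f} s")       # ≈ 2.19 s
print(f"NVLink-C2C: {weights_gb / nvlink_c2c_gbps:.2f} s")   # ≈ 0.16 s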
36.3.7 When to Choose Blackwell
Blackwell B200 best for:
- Maximum inference throughput (FP4)
- Very large models (192 GB HBM)
- Memory-bound training
- Multi-GPU scaling (NVLink 5)
H100 still good for:
- Existing deployments
- FP8 training without FP4 requirements
- Lower power budget (700W vs 1000W)
- Cost-sensitive applications
36.4 AMD MI300X
36.4.1 ROCm vs CUDA
AMD’s software stack differs from NVIDIA’s:
CUDA Concept      →   ROCm Equivalent
──────────────────────────────────────
CUDA                  HIP
cuBLAS                rocBLAS
cuDNN                 MIOpen
NCCL                  RCCL
Tensor Cores          Matrix Cores
36.4.2 HIP: Write Once, Run on Both
// HIP code runs on both AMD and NVIDIA
#include <hip/hip_runtime.h>

__global__ void add_kernel(float* a, float* b, float* c, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        c[idx] = a[idx] + b[idx];
    }
}

// Compile for AMD:    hipcc -o add add.cpp
// Compile for NVIDIA: hipcc -o add add.cpp (uses the CUDA backend)
36.4.3 PyTorch on AMD
# PyTorch with ROCm
import torch

# Check for an AMD GPU (ROCm builds report through the CUDA API)
if torch.cuda.is_available():
    device = torch.device('cuda')
    # Works the same as NVIDIA!
    # But check the device name for architecture-specific optimizations
    props = torch.cuda.get_device_properties(0)
    if 'MI300' in props.name:
        # AMD MI300-specific optimizations
        pass
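Because ROCm builds of PyTorch surface through the same torch.cuda API, it can also help to check which backend the install was built against; a small sketch:

import torch

def backend_name() -> str:
    """Distinguish ROCm (HIP) builds of PyTorch from CUDA builds."""
    if torch.version.hip is not None:
        return 'rocm'
    if torch.version.cuda is not None:
        return 'cuda'
    return 'cpu'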
36.4.4 MI300X Advantages
MI300X vs H100:
  Memory:        192 GB   vs 80 GB       (2.4× more)
  Bandwidth:     5.3 TB/s vs 3.35 TB/s   (1.6× more)
  FP16 TFLOPS:   1,307    vs 989         (1.3× more)
Best for:
- Large models that need memory (70B+ dense)
- Memory-bound workloads
- Batch inference
36.4.5 Challenges
MI300X challenges:
- Smaller software ecosystem
- Less optimized libraries (catching up)
- Fewer pre-built Docker images
- FlashAttention ports may lag
Mitigation:
- Use PyTorch's vendor-agnostic APIs when possible
- Test thoroughly on target hardware
- Use AMD's optimized containers
36.5 Apple Silicon
36.5.1 Unified Memory Architecture
Apple M3 Max (Ultra-class parts scale memory and bandwidth further):
Memory: Up to 128GB unified (CPU + GPU share)
Bandwidth: 400 GB/s
Neural Engine: 18 TOPS
Key difference: No discrete GPU memory
- No CPU→GPU transfer overhead
- But lower raw bandwidth than HBM
36.5.2 Metal Performance Shaders
# PyTorch with the MPS backend
import torch

if torch.backends.mps.is_available():
    device = torch.device('mps')
    model = model.to(device)

# Or use MLX (Apple's native framework)
import mlx.core as mx
import mlx.nn as nn

# MLX is designed for Apple Silicon
x = mx.array([1, 2, 3])
y = mx.exp(x)   # Runs on the GPU automatically
36.5.3 MLX for Native Performance
# MLX: Apple's ML framework optimized for Apple Silicon
import mlx.core as mx
import mlx.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, out_dim)

    def __call__(self, x):
        x = mx.maximum(self.fc1(x), 0)   # ReLU
        return self.fc2(x)

# Lazy evaluation + unified memory = efficient
model = MLP(784, 256, 10)
x = mx.random.normal((32, 784))
y = model(x)
mx.eval(y)   # Actually compute
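A minimal training-step sketch in the same style, assuming current MLX APIs (the loss, labels, and hyperparameters are placeholders):

import mlx.core as mx
import mlx.nn as nn
import mlx.optimizers as optim

def loss_fn(model, x, y):
    return nn.losses.cross_entropy(model(x), y, reduction='mean')

model = MLP(784, 256, 10)                  # from the example above
optimizer = optim.SGD(learning_rate=0.01)
loss_and_grad = nn.value_and_grad(model, loss_fn)

x = mx.random.normal((32, 784))
y = mx.zeros((32,), dtype=mx.int32)        # dummy labels
loss, grads = loss_and_grad(model, x, y)
optimizer.update(model, grads)
mx.eval(model.parameters(), optimizer.state)   # force the lazy computation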
36.5.4 Apple Silicon Best Practices
# 1. Use unified memory wisely
#    No explicit transfers needed, but be mindful of memory pressure

# 2. Batch appropriately
#    Lower bandwidth means smaller optimal batch sizes

# 3. Use Metal-optimized operations
#    Core ML, MPS, and MLX are optimized; pure Python is not

# 4. Quantization helps a lot
#    Memory bandwidth is the bottleneck, so INT4 models run faster than FP16
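To see why bandwidth dominates on Apple Silicon, a rough single-stream decode estimate (each generated token reads every weight once; the figures are illustrative):

# Rough upper bound on decode speed: bandwidth / bytes read per token
bandwidth_gbs = 400            # M3 Max unified memory bandwidth
params_b = 7e9                 # 7B-parameter model

for name, bytes_per_param in [('FP16', 2), ('INT4', 0.5)]:
    weight_gb = params_b * bytes_per_param / 1e9
    tokens_per_s = bandwidth_gbs / weight_gb
    print(f"{name}: ~{tokens_per_s:.0f} tokens/s upper bound")
# FP16: ~29 tokens/s, INT4: ~114 tokens/s — quantization directly buys speed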
36.6 Google TPU
36.6.1 TPU Architecture
TPU v5p:
MXUs (Matrix Multiply Units): Main compute
HBM: 95 GB per chip
ICI (Inter-Chip Interconnect): High-speed mesh
Key differences from GPU:
- 128×128 systolic arrays (vs GPU's tensor cores)
- Designed for matrix operations
- XLA compilation required
- Different memory model
36.6.2 JAX and XLA
import jax
import jax.numpy as jnp
from flax import linen as nn
class Transformer(nn.Module):
@nn.compact
def __call__(self, x):
x = nn.Dense(512)(x)
x = nn.relu(x)
x = nn.Dense(512)(x)
return x
# JIT compile with XLA
@jax.jit
def forward(params, x):
return model.apply(params, x)
# Run on TPU
x = jnp.ones((32, 128))
y = forward(params, x)36.6.3 TPU Optimization Tips
# 1. Prefer batch/feature dims that are multiples of 8 (ideally 128) for the 128×128 MXUs
batch_size = 128   # Good
batch_size = 100   # Pad to 104 or 128

# 2. Use bfloat16 (TPU-native format)
x = x.astype(jnp.bfloat16)

# 3. Shard across TPU chips
from jax.sharding import PartitionSpec as P

# Model parallel across 4 chips
sharding = jax.sharding.NamedSharding(
    mesh=jax.sharding.Mesh(jax.devices(), ('x',)),
    spec=P('x', None),
)

# 4. Use pjit for multi-chip parallelism
# (recent JAX folds pjit into jax.jit with in_shardings/out_shardings)
from jax.experimental import pjit

@pjit.pjit
def train_step(params, batch):
    ...
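A small end-to-end sketch of sharded execution with NamedSharding; it assumes multiple visible devices but degrades gracefully to one (names and shapes here are illustrative):

import jax
import jax.numpy as jnp
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

devices = jax.devices()
mesh = Mesh(devices, ('data',))
sharding = NamedSharding(mesh, P('data', None))   # split the batch dimension

x = jax.device_put(jnp.ones((128, 512)), sharding)
w = jnp.ones((512, 512))                          # replicated weight

@jax.jit
def forward(x, w):
    return jnp.tanh(x @ w)

y = forward(x, w)      # XLA propagates the sharding through the matmul
print(y.sharding)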
36.7 Cross-Platform Strategies
36.7.1 Abstraction Layers
# Strategy 1: Framework abstraction (PyTorch)
import torch

device = torch.device('cuda' if torch.cuda.is_available()
                      else 'mps' if torch.backends.mps.is_available()
                      else 'cpu')

# Strategy 2: Hardware detection
def get_accelerator():
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        if 'H100' in props.name:
            return 'h100'
        elif 'A100' in props.name:
            return 'a100'
        elif 'MI' in props.name:   # e.g. "AMD Instinct MI300X"
            return 'amd'
    elif torch.backends.mps.is_available():
        return 'apple'
    return 'cpu'

# Strategy 3: Configuration-based optimization
HARDWARE_CONFIGS = {
    'h100': {
        'precision': 'fp8',
        'batch_size': 64,
        'use_flash_attention': 3,
    },
    'a100': {
        'precision': 'bf16',
        'batch_size': 32,
        'use_flash_attention': 2,
    },
    'amd': {
        'precision': 'bf16',
        'batch_size': 48,
        'use_flash_attention': 2,
    },
    'apple': {
        'precision': 'fp16',
        'batch_size': 16,
        'use_flash_attention': False,
    },
    # Conservative fallback so unknown hardware doesn't raise a KeyError
    'cpu': {
        'precision': 'fp32',
        'batch_size': 8,
        'use_flash_attention': False,
    },
}

config = HARDWARE_CONFIGS[get_accelerator()]
36.7.2 Portable Performance Code
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class PortableModel(nn.Module):
    """Model that adapts to the hardware it runs on."""
    def __init__(self, dim=512):
        super().__init__()
        self.accelerator = get_accelerator()
        self.qkv_proj = nn.Linear(dim, 3 * dim)
        # Choose the attention implementation based on hardware
        if self.accelerator in ['h100', 'a100']:
            # flash_attn_func expects (batch, seqlen, nheads, headdim) fp16/bf16 inputs
            from flash_attn import flash_attn_func
            self.attention = flash_attn_func
        else:
            self.attention = self.naive_attention

    def naive_attention(self, q, k, v):
        """Fallback for non-NVIDIA hardware."""
        scores = torch.matmul(q, k.transpose(-2, -1))
        scores = scores / math.sqrt(k.shape[-1])
        attn = F.softmax(scores, dim=-1)
        return torch.matmul(attn, v)

    def forward(self, x):
        q, k, v = self.qkv_proj(x).chunk(3, dim=-1)
        return self.attention(q, k, v)
36.8 Benchmarking Across Hardware
36.8.1 Fair Comparisons
import time
import numpy as np
import torch

def benchmark_hardware(model, input_shape, num_iterations=100):
    """Benchmark with proper methodology."""
    x = torch.randn(input_shape, device='cuda')

    # Warmup (critical for accurate timing)
    for _ in range(10):
        _ = model(x)
    torch.cuda.synchronize()

    # Timed runs
    times = []
    for _ in range(num_iterations):
        torch.cuda.synchronize()
        start = time.perf_counter()
        _ = model(x)
        torch.cuda.synchronize()
        times.append(time.perf_counter() - start)

    return {
        'median_ms': np.median(times) * 1000,
        'std_ms': np.std(times) * 1000,
        'throughput': input_shape[0] / np.median(times),   # samples/s
    }
36.8.2 Hardware-Specific Metrics
def get_hardware_metrics():
    """Collect hardware-specific utilization metrics."""
    if torch.cuda.is_available():
        return {
            'gpu_util': get_gpu_utilization(),                   # placeholder; see sketch below
            'memory_used': torch.cuda.memory_allocated(),
            'memory_bandwidth_util': measure_bandwidth_util(),   # placeholder helper
            'tensor_core_util': get_tensor_core_util(),          # placeholder; NVIDIA only
        }
    elif torch.backends.mps.is_available():
        return {
            'memory_used': torch.mps.current_allocated_memory(),
            # MPS exposes far fewer metrics
        }
    return {}
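The helpers referenced above are placeholders; as one example, get_gpu_utilization can be implemented with NVML via the pynvml package (a sketch; NVIDIA GPUs only):

import pynvml

def get_gpu_utilization(device_index: int = 0) -> float:
    """Return GPU compute utilization in percent, as reported by NVML."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        return float(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
    finally:
        pynvml.nvmlShutdown()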
36.9 Key Takeaways
Blackwell > H100 > A100: Each generation adds features (FP4, enhanced TMA, NVLink 5) for significant speedups.
H100 > A100: TMA, FP8, clusters provide 2x+ speedup on optimized code.
AMD is competitive: MI300X/MI325X have more memory and bandwidth; software is catching up.
Apple Silicon is different: Unified memory changes optimization strategies.
TPU needs XLA: Can’t just port CUDA code; requires framework changes.
Abstract hardware differences: Use config-based optimization, not hardcoded paths.
Benchmark on target: Performance doesn’t transfer across hardware.
Watch the ecosystem: Hardware-specific libraries (FA3, Transformer Engine) matter.