39 Production Case Studies
War Stories from Real Systems
Theory teaches you what’s possible. Production teaches you what actually goes wrong. This chapter presents real performance investigations—the kind that start with “the system is slow” and end with root cause and fix.
39.1 Why Case Studies Matter
Every previous chapter presented clean examples. Production is messier:
- Symptoms don’t match causes
- Multiple issues interact
- Fixes have trade-offs
- Time pressure changes decisions
These case studies follow real investigation patterns, showing not just the fix but the process of finding it.
39.2 Case Study 1: The 10× Training Slowdown
39.2.1 The Symptom
A research team reports their training run is 10× slower than last week. Same model, same data, same code (they claim).
39.2.2 The Investigation
Step 1: Verify the claim
# Check training throughput
grep "samples/sec" training.log | tail -20
# Last week: 1200 samples/sec
# This week: 120 samples/sec
The 10× slowdown is real.
Step 2: Rule out the obvious
# Check GPU utilization
nvidia-smi --query-gpu=utilization.gpu --format=csv -l 1
# Result: 15% (should be 90%+)
GPU is mostly idle. The problem isn't compute; it's whatever is feeding the GPU.
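The same check can be done from inside Python, which is handy for logging utilization alongside training metrics. A minimal sketch, assuming the nvidia-ml-py package (which provides pynvml) is installed:
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU util: {util.gpu}%  |  Memory util: {util.memory}%")
pynvml.nvmlShutdown()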
Step 3: Profile the training loop
import torch.profiler
with torch.profiler.profile(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
with_stack=True,
) as prof:
for i, batch in enumerate(train_loader):
if i >= 10:
break
# ... training step ...
prof.export_chrome_trace("trace.json")
The trace reveals large gaps between GPU kernels:
Timeline:
CPU: [DataLoader]................[DataLoader]................[DataLoader]
GPU: [Forward][Backward] [Forward][Backward]
^--- only 15% of time ^--- GPU starved
Step 4: Isolate data loading
# Measure data loading time
import time
times = []
for i, batch in enumerate(train_loader):
start = time.time()
_ = batch # Just iterate
times.append(time.time() - start)
if i >= 100:
break
print(f"Mean batch time: {sum(times)/len(times)*1000:.1f}ms")
print(f"Max batch time: {max(times)*1000:.1f}ms")
# Result: Mean: 450ms, Max: 2100ms
# Expected: ~50ms
Data loading is roughly 9× slower than expected.
Step 5: Find the data loading bottleneck
# Profile the data loading code
import cProfile
import pstats
with cProfile.Profile() as pr:
for i, batch in enumerate(train_loader):
if i >= 10:
break
stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(20)
Top result:
ncalls tottime cumtime filename:lineno(function)
10 4.200 4.200 dataset.py:45(load_and_preprocess)
↑ 420ms per call!
Step 6: Examine the suspicious code
# dataset.py:45
def load_and_preprocess(self, path):
# Load image
img = Image.open(path)
# Resize (expensive but expected)
img = img.resize((224, 224))
# ← NEW: Added this week for "augmentation"
# Random augmentation with heavy transforms
if self.augment:
img = self.heavy_augment(img)
return self.to_tensor(img)
def heavy_augment(self, img):
# Applies 10 sequential random transforms
for transform in self.random_transforms:
img = transform(img) # Each one is expensive!
return img
The Root Cause
Someone added heavy augmentation without testing performance. Each transform:
- Converts PIL image to numpy array
- Applies transform
- Converts back to PIL
Ten transforms, each doing a full PIL→NumPy→PIL round trip, means about 20 format conversions per image where two would do.
The Fix
# Option 1: Batch the conversions
import numpy as np

def heavy_augment_fixed(self, img):
# Convert once
arr = np.array(img)
# Apply all transforms in numpy
for transform in self.random_transforms:
arr = transform(arr)
# Convert back once
return Image.fromarray(arr)
# Option 2: Use GPU augmentation
import kornia
# Move augmentation to GPU, apply to batch
class GPUAugment:
def __init__(self):
self.transforms = kornia.augmentation.AugmentationSequential(...)
def __call__(self, batch_tensor):
# All augmentations on GPU
return self.transforms(batch_tensor)
Result
- Data loading: 450ms → 50ms per batch
- Training throughput: 120 → 1150 samples/sec
- Fix time: 2 hours of investigation, 20 lines of code
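A complementary mitigation for any data-loading stall, independent of the augmentation fix, is to overlap loading with compute in the DataLoader. A minimal sketch; the worker and prefetch counts are illustrative and should be tuned for your storage and CPU budget:
from torch.utils.data import DataLoader

train_loader = DataLoader(
    dataset,                  # your Dataset instance
    batch_size=256,           # illustrative
    num_workers=8,            # load and augment in parallel CPU processes
    pin_memory=True,          # faster host-to-device copies
    prefetch_factor=4,        # batches prefetched per worker
    persistent_workers=True,  # avoid re-spawning workers each epoch
)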
39.2.3 Lessons Learned
- GPU utilization is the first check: Low GPU util means you’re feeding it wrong
- Profile before debugging: The trace pointed directly at data loading
- Code changes are the usual suspect: “Same code” is often not true
- Conversions are expensive: Format changes (PIL ↔ NumPy ↔ tensor) add up (see the timing sketch below)
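A quick way to see the conversion cost for yourself is a micro-benchmark of the PIL ↔ NumPy round trip (a toy sketch; absolute numbers depend on image size and hardware):
import time
import numpy as np
from PIL import Image

img = Image.new("RGB", (224, 224))  # stand-in for a real sample

start = time.perf_counter()
for _ in range(1000):
    arr = np.array(img)           # PIL -> NumPy
    img2 = Image.fromarray(arr)   # NumPy -> PIL
elapsed = time.perf_counter() - start
print(f"{elapsed / 1000 * 1e6:.0f} us per round trip")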
39.3 Case Study 2: Inference Cost Reduction
39.3.1 The Symptom
An inference service costs $50K/month in GPU compute. Target: reduce to $25K/month without significant latency increase.
39.3.2 The Baseline
# Current setup
Model: LLaMA-7B
Hardware: 8x A100 40GB
Serving: vLLM
Throughput: 50 requests/sec
P99 latency: 800ms
Cost: $50K/month
39.3.3 Investigation Path 1: Quantization
Hypothesis: INT8 quantization could double throughput.
from transformers import AutoModelForCausalLM
import torch
# Load with INT8 quantization
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b",
load_in_8bit=True,
device_map="auto"
)
Results:
INT8 quantization:
- Throughput: 50 → 85 requests/sec (+70%)
- P99 latency: 800ms → 750ms (improved!)
- Quality: 0.1% perplexity increase (acceptable)
Promising, but not 2× yet.
39.3.4 Investigation Path 2: Batching Strategy
Hypothesis: Larger batch sizes could improve throughput.
# Profile batch size vs latency/throughput
for batch_size in [1, 2, 4, 8, 16, 32]:
latency, throughput = benchmark(model, batch_size)
print(f"Batch {batch_size}: {throughput:.1f} req/s, {latency:.0f}ms P99")Results:
Batch 1: 85 req/s, 750ms P99
Batch 2: 140 req/s, 850ms P99
Batch 4: 210 req/s, 950ms P99
Batch 8: 280 req/s, 1100ms P99 ← exceeds latency budget
Batch 16: 320 req/s, 1400ms P99
At batch size 4: 4.2× throughput improvement over baseline, within latency budget.
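The benchmark helper in the loop above is left abstract. One way such a helper could look, as a sketch: the signature differs from the call above because the tokenizer and prompt pool are passed explicitly here, and the token budget is an assumption; a real harness would replay production traffic through the serving stack.
import time
import torch

def benchmark(model, tokenizer, prompts, batch_size, iters=50, max_new_tokens=64):
    """Return (P99 latency in ms, throughput in requests/sec) for batched generation."""
    # Assumes tokenizer.pad_token is set (e.g. to the EOS token) so padding works
    latencies = []
    for i in range(iters):
        # Cycle through the prompt pool to build each batch
        batch = [prompts[(i * batch_size + j) % len(prompts)] for j in range(batch_size)]
        inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
        torch.cuda.synchronize()
        start = time.perf_counter()
        with torch.no_grad():
            model.generate(**inputs, max_new_tokens=max_new_tokens)
        torch.cuda.synchronize()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p99_ms = latencies[min(len(latencies) - 1, int(0.99 * len(latencies)))] * 1000
    throughput = batch_size * iters / sum(latencies)
    return p99_ms, throughput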
39.3.5 Investigation Path 3: Continuous Batching
Hypothesis: Dynamic batching wastes less compute than static batching.
# vLLM already uses continuous batching, but we can tune it
from vllm import LLM, SamplingParams
llm = LLM(
model="meta-llama/Llama-2-7b",
quantization="awq", # Even better quantization
max_num_batched_tokens=8192, # Tune for our latency budget
max_num_seqs=256, # Max concurrent sequences
)
Results with AWQ + tuned continuous batching:
Throughput: 50 → 350 requests/sec (7×!)
P99 latency: 800ms → 920ms (within 1s budget)
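For reference, sending a request through the tuned engine is a single call (a minimal usage sketch; the prompt and sampling settings are placeholders):
params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Summarize the quarterly report:"], params)
for out in outputs:
    print(out.outputs[0].text)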
39.3.6 Investigation Path 4: Right-sizing Hardware
Hypothesis: We might be over-provisioned.
Current: 8× A100 40GB at $3.50/GPU/hour = $20,160/month
With 7× throughput, we need:
Old: 50 req/s ÷ 50 req/s/8GPUs = 8 GPUs
New: 50 req/s ÷ 350 req/s = 0.14 of our capacity
# But keep some headroom for spikes
Required: 2 GPUs (comfortable headroom for spikes and redundancy)
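The right-sizing arithmetic is simple enough to keep in a script next to the capacity plan. A sketch, where the hourly rate, headroom factor, and two-GPU minimum for redundancy are assumptions:
import math

gpu_hourly = 3.50              # $/GPU/hour (assumed on-demand A100 40GB rate)
hours_per_month = 24 * 30

required_rps = 50              # sustained load
per_gpu_rps = 175              # measured after optimization
headroom = 2.0                 # capacity margin for spikes

gpus_needed = max(2, math.ceil(required_rps * headroom / per_gpu_rps))  # keep >= 2 for redundancy
monthly_cost = gpus_needed * gpu_hourly * hours_per_month
print(f"{gpus_needed} GPUs = ${monthly_cost:,.0f}/month")   # 2 GPUs = $5,040/month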
Final Configuration:
Hardware: 2× A100 40GB (down from 8)
Model: LLaMA-7B with AWQ quantization
Serving: vLLM with tuned batching
Throughput: 175 req/s per GPU
Cost: $5,040/month (90% reduction!)
39.3.7 Summary
| Change | Throughput Gain | Cumulative Throughput |
|---|---|---|
| Baseline | 1× | 50 req/s |
| INT8 quantization | 1.7× | 85 req/s |
| Batch size 4 | 2.5× | 210 req/s |
| AWQ + continuous batching | 1.7× | 350 req/s |
| Total improvement | 7× | 350 req/s |
Cost reduction: $50K → $5K/month (90% savings).
39.3.8 Lessons Learned
- Quantization is free throughput: INT8/AWQ often improves speed with minimal quality loss
- Batching transforms economics: Amortizing fixed costs over more requests is powerful
- Right-size after optimizing: Optimize first, then reduce hardware
- Measure P99, not average: Average latency hides user-facing problems
39.4 Case Study 3: The Memory Leak
39.4.1 The Symptom
Training crashes with OOM after 12 hours. Worked fine for months.
39.4.2 The Investigation
Step 1: Monitor memory over time
import torch
import gc
def log_memory():
allocated = torch.cuda.memory_allocated() / 1e9
reserved = torch.cuda.memory_reserved() / 1e9
print(f"Allocated: {allocated:.2f}GB, Reserved: {reserved:.2f}GB")
# Log every 100 steps
for step in range(num_steps):
train_step()
if step % 100 == 0:
log_memory()
Results:
Step 0: Allocated: 8.50GB, Reserved: 10.00GB
Step 100: Allocated: 8.52GB, Reserved: 10.00GB
Step 500: Allocated: 8.80GB, Reserved: 11.00GB ← growing
Step 1000: Allocated: 9.20GB, Reserved: 12.00GB
Step 5000: Allocated: 12.50GB, Reserved: 14.00GB
Step 10000: OOM
Memory grows by roughly a megabyte per step. Compounded over tens of thousands of steps, that is more than enough to exhaust GPU memory.
Step 2: Identify what’s growing
# Use torch's memory snapshot
import pickle
torch.cuda.memory._record_memory_history()
# After some steps
snapshot = torch.cuda.memory._snapshot()
# Export for visualization
with open("memory_snapshot.pickle", "wb") as f:
pickle.dump(snapshot, f)
# Visualize by dropping the pickle file into https://pytorch.org/memory_viz
The snapshot shows a growing number of gradient tensors.
Step 3: Find the gradient accumulation
# Check if tensors are being retained
def check_grad_graph():
for name, param in model.named_parameters():
if param.grad is not None:
if param.grad.grad_fn is not None:
print(f"LEAK: {name} has grad with history")
# Run after backward
loss.backward()
check_grad_graph()
Result:
LEAK: transformer.layer.0.attention.query.weight has grad with history
LEAK: transformer.layer.0.attention.key.weight has grad with history
...
Gradients are retaining computation history.
Step 4: Find the culprit
Recent code change search:
# Someone added gradient clipping like this:
for param in model.parameters():
if param.grad is not None:
param.grad = torch.clamp(param.grad, -1, 1)  # BUG!
The Problem
torch.clamp is a differentiable operation. When you assign its output to param.grad, you create a new tensor with computation history. That history keeps old gradients alive.
The Fix
# Use in-place clipping
for param in model.parameters():
if param.grad is not None:
param.grad.clamp_(-1, 1) # In-place: no new tensor
# Or use the built-in function
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=1.0)
Step 5: Verify the fix
# After fix:
Step 0: Allocated: 8.50GB
Step 1000: Allocated: 8.50GB
Step 10000: Allocated: 8.50GB ✓ Stable
39.4.3 Lessons Learned
- Memory should be constant: Training memory shouldn’t grow over time
- In-place operations don't create history: Use the underscore-suffixed methods (e.g. clamp_) when modifying gradients (toy sketch after this list)
- Record memory history for debugging: PyTorch's memory tools are powerful
- Review gradient-touching code carefully: It’s easy to accidentally retain graphs
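A toy sketch of the difference between the two clipping styles, outside any training loop:
import torch

w = torch.randn(4, requires_grad=True)
loss = (w ** 2).sum()
loss.backward()

# Out-of-place: allocates a new tensor and rebinds .grad to it
w.grad = torch.clamp(w.grad, -1, 1)

# In-place: modifies the existing gradient buffer, no new allocation
w.grad.clamp_(-1, 1)

# The built-in helper does the in-place clip for every parameter at once
torch.nn.utils.clip_grad_value_([w], clip_value=1.0)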
39.5 Case Study 4: Distributed Training Scaling
39.5.1 The Symptom
Training on 8 GPUs is only 4× faster than 1 GPU, not 8×.
39.5.2 The Investigation
Step 1: Measure scaling
# Benchmark throughput at different scales
for num_gpus in [1, 2, 4, 8]:
throughput = run_benchmark(num_gpus)
efficiency = throughput / (num_gpus * single_gpu_throughput)
print(f"{num_gpus} GPUs: {throughput:.0f} samples/s ({efficiency:.1%} efficiency)")Results:
1 GPU: 1000 samples/s (100% efficiency)
2 GPUs: 1800 samples/s (90% efficiency)
4 GPUs: 3200 samples/s (80% efficiency)
8 GPUs: 4000 samples/s (50% efficiency) ← Problem here
Significant efficiency drop at 8 GPUs.
Step 2: Profile communication
# Use PyTorch profiler with NCCL tracing
with torch.profiler.profile(
activities=[
torch.profiler.ProfilerActivity.CPU,
torch.profiler.ProfilerActivity.CUDA,
],
record_shapes=True,
) as prof:
train_step()
# Look for collective operations
for event in prof.key_averages():
if "nccl" in event.key.lower():
print(f"{event.key}: {event.cuda_time_total/1000:.1f}ms")Results:
ncclAllReduce: 450ms per step ← 45% of step time!
Compute: 550ms per step
Communication is the bottleneck.
Step 3: Analyze communication pattern
# Check gradient sizes
total_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total_params:,}")
print(f"Gradient size: {total_params * 4 / 1e9:.2f} GB")
# Per-step communication (all-reduce sends 2× gradient size)
print(f"Data per step: {total_params * 4 * 2 / 1e9:.2f} GB")Results:
Total parameters: 1,000,000,000
Gradient size: 4.0 GB
Data per step: 8.0 GB
Sending 8GB per step across 8 GPUs with limited interconnect bandwidth.
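A back-of-the-envelope check confirms this traffic alone can explain an all-reduce in the hundreds of milliseconds. The effective bus bandwidth below is an assumption; measure your own links:
# Ring all-reduce moves about 2*(N-1)/N times the gradient size per GPU
grad_bytes = 4e9        # 1B parameters in fp32
n_gpus = 8
bus_bandwidth = 20e9    # ~20 GB/s effective over PCIe (assumed)

traffic = 2 * (n_gpus - 1) / n_gpus * grad_bytes
print(f"Estimated all-reduce time: {traffic / bus_bandwidth * 1000:.0f} ms")
# ~350 ms, the same ballpark as the 450 ms we measured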
Step 4: Check interconnect
nvidia-smi topo -m
Result:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7
GPU0 X NV4 NV4 NV4 SYS SYS SYS SYS
GPU1 NV4 X NV4 NV4 SYS SYS SYS SYS
...
GPUs 0-3 have NVLink (fast), but GPUs 4-7 communicate via PCIe (slow).
Step 5: Apply fixes
Fix 1: Gradient compression
# Use PowerSGD for gradient compression
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.algorithms.ddp_comm_hooks import powerSGD_hook
model = DDP(model)
model.register_comm_hook(
state=powerSGD_hook.PowerSGDState(
process_group=dist.group.WORLD,
matrix_approximation_rank=32, # Compress to rank 32
),
hook=powerSGD_hook.powerSGD_hook,
)
Fix 2: Overlap computation and communication
# Already default in DDP, but verify it's enabled
model = DDP(
model,
gradient_as_bucket_view=True,
static_graph=True, # Enable additional optimizations
)
Fix 3: Use topology-aware placement
# Group processes by NVLink connectivity
# Train two data-parallel groups of 4 GPUs each
# Each group has full NVLink connectivity
Results after fixes:
Before fixes:
8 GPUs: 4000 samples/s (50% efficiency)
AllReduce: 450ms/step
After fixes:
8 GPUs: 6800 samples/s (85% efficiency)
AllReduce: 180ms/step (60% reduction)
39.5.3 Lessons Learned
- Communication often limits scaling: Profile communication separately from compute (see the timing sketch after this list)
- Topology matters: NVLink vs PCIe is a 10× bandwidth difference
- Compression helps: PowerSGD and similar techniques reduce communication
- Overlap is essential: Never let GPUs wait for communication if avoidable
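A small helper like the sketch below makes the "profile communication separately" lesson concrete: it times a bare all-reduce of one gradient's worth of data, with no model in the way. It assumes the process group is already initialized (for example, under torchrun):
import time
import torch
import torch.distributed as dist

def time_allreduce(numel, iters=20, warmup=5):
    """Average time (seconds) for one all-reduce of `numel` fp32 elements."""
    buf = torch.randn(numel, device="cuda")
    for _ in range(warmup):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        dist.all_reduce(buf)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# e.g. time_allreduce(1_000_000_000)  # one fp32 copy of a 1B-parameter gradient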
39.6 Setting Up Continuous Performance Monitoring
39.6.1 The Problem with Ad-Hoc Profiling
Profiling when things are slow misses the regression point. You need continuous monitoring.
39.6.2 Basic Throughput Tracking
import time
import torch
import wandb  # or your logging system
class PerformanceTracker:
def __init__(self, log_every_n_steps=100):
self.log_every = log_every_n_steps
self.step_times = []
self.step_count = 0
def step_start(self):
self.start_time = time.perf_counter()
def step_end(self, batch_size):
elapsed = time.perf_counter() - self.start_time
self.step_times.append(elapsed)
self.step_count += 1
if self.step_count % self.log_every == 0:
recent_times = self.step_times[-self.log_every:]
avg_time = sum(recent_times) / len(recent_times)
throughput = batch_size / avg_time
wandb.log({
"perf/step_time_ms": avg_time * 1000,
"perf/throughput_samples_sec": throughput,
"perf/gpu_memory_gb": torch.cuda.max_memory_allocated() / 1e9,
})
39.6.3 Detecting Regressions
class RegressionDetector:
def __init__(self, baseline_throughput, threshold=0.2):
self.baseline = baseline_throughput
self.threshold = threshold # 20% regression triggers alert
def check(self, current_throughput):
regression = (self.baseline - current_throughput) / self.baseline
if regression > self.threshold:
self.alert(f"Performance regression: {regression:.1%} slowdown")
def alert(self, message):
# Send to Slack, PagerDuty, etc.
print(f"ALERT: {message}")39.6.4 Integration with CI/CD
# .github/workflows/performance.yml
name: Performance Tests
on: [push]
jobs:
benchmark:
runs-on: [self-hosted, gpu]
steps:
- uses: actions/checkout@v3
- name: Run benchmarks
run: python benchmark.py --output results.json
- name: Check for regressions
run: |
python check_regression.py \
--current results.json \
--baseline baseline.json \
--threshold 0.1
39.7 Key Takeaways
- Start with utilization metrics: Low GPU util points to feeding problems
- Profile before optimizing: Find the bottleneck, don’t guess
- Check recent changes: Most regressions come from recent code
- Memory should be constant: Growing memory means a leak
- Communication scales poorly: Design for minimal cross-device data movement
- Monitor continuously: Catch regressions early with automated tracking
39.8 Connection to Other Chapters
- Chapter 19 (Profiling Tools): The specific tools used in these investigations
- Chapter 13 (Distributed Training): Theoretical background for scaling issues
- Chapter 15 (Measurement): The scientific method applied to debugging
39.9 Further Reading
- PyTorch Performance Tuning Guide
- NVIDIA Deep Learning Performance Guide
- vLLM: High-throughput LLM Serving - Production inference optimization