6 When Parallelism Pays
Understanding the True Cost of Coordination
Parallelism promises: “More cores, more speed.”
Parallelism delivers: “More cores, more coordination overhead, sometimes more speed.”
Understanding when parallelism helps—and when it hurts—is essential for making good performance decisions.
6.1 The Third Lie
In the first two chapters, we dismantled two comforting illusions:
- Memory is uniform (it’s not—there’s a 200× latency hierarchy)
- FLOPS measure performance (they don’t—bandwidth often dominates)
Now we confront a third: parallel speedup is free.
The naive model: if you have 8 cores, your code runs 8× faster. If you have 64 cores, 64× faster. GPUs have thousands of cores, so they’re thousands of times faster than CPUs.
This model is wrong. Sometimes catastrophically wrong.
6.2 Amdahl’s Law: The Famous Limit
Gene Amdahl’s 1967 observation is the most-cited result in parallel computing:
\[\text{Speedup} = \frac{1}{(1-p) + \frac{p}{n}}\]
Where:
- \(p\) is the fraction of execution time that can be parallelized
- \(n\) is the number of processors
- \((1-p)\) is the serial fraction
The insight: if 10% of your code is serial, you can never get more than 10× speedup, no matter how many cores you add.
| Serial Fraction | Max Speedup (∞ cores) |
|---|---|
| 1% | 100× |
| 5% | 20× |
| 10% | 10× |
| 25% | 4× |
| 50% | 2× |
This is taught in every parallel computing course. But it’s the optimistic bound.
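The formula is easy to sanity-check in code. A minimal sketch (the function names are mine) that reproduces the table and shows how far short a real core count falls of the ideal limit:

```python
# Amdahl's Law: speedup for parallel fraction p on n processors.
def amdahl_speedup(p, n):
    return 1.0 / ((1 - p) + p / n)

# The limiting case as n grows without bound: 1 / serial fraction.
def max_speedup(serial_fraction):
    return 1.0 / serial_fraction

# 10% serial code caps at 10x -- and even 64 cores fall well short of it:
print(max_speedup(0.10))                   # 10.0
print(round(amdahl_speedup(0.90, 64), 2))  # 8.77
```

Note that the ceiling is approached slowly: with 10% serial code, 64 cores deliver under 9×, and going to 640 cores buys less than one additional ×.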
6.4 Investigation: Where Does Parallelism Actually Help?
Let’s measure. We’ll compare two workloads:
Compute-bound: Square root computation (lots of math per byte)
Memory-bound: Array summation (minimal math per byte)
```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import time

def benchmark_scaling(func, data):
    """Measure speedup as we add workers."""
    times = {}
    # Baseline: single worker
    start = time.perf_counter()
    func(data, num_workers=1)
    times[1] = time.perf_counter() - start
    for n in [2, 4, 8, 12, 16]:
        start = time.perf_counter()
        func(data, num_workers=n)
        times[n] = time.perf_counter() - start
    return {n: times[1] / t for n, t in times.items()}

def heavy_math(chunk):
    # Defined at module level: ProcessPoolExecutor must pickle the function
    # it sends to worker processes, and lambdas can't be pickled
    return np.sum(np.sqrt(np.abs(chunk) + 1))

# Compute-bound: many operations per element
# Uses ProcessPoolExecutor to bypass Python's GIL (see note below)
def compute_bound_work(data, num_workers):
    chunk_size = len(data) // num_workers
    chunks = [data[i*chunk_size:(i+1)*chunk_size] for i in range(num_workers)]
    with ProcessPoolExecutor(max_workers=num_workers) as ex:
        # Each process does heavy vectorized math independently
        results = list(ex.map(heavy_math, chunks))
    return sum(results)

# Memory-bound: minimal operations per element
# Uses ThreadPoolExecutor — NumPy releases the GIL during .sum()
def memory_bound_work(data, num_workers):
    def worker(chunk):
        return chunk.sum()  # Just read and add
    chunk_size = len(data) // num_workers
    chunks = [data[i*chunk_size:(i+1)*chunk_size] for i in range(num_workers)]
    with ThreadPoolExecutor(max_workers=num_workers) as ex:
        results = list(ex.map(worker, chunks))
    return sum(results)

# Run the investigation. The guard matters: worker processes re-import this
# module, and must not re-run the benchmark when they do.
if __name__ == "__main__":
    data = np.random.randn(10_000_000)

    print("Compute-bound scaling:")
    compute_scaling = benchmark_scaling(compute_bound_work, data)
    for n, speedup in compute_scaling.items():
        print(f"  {n:2d} workers: {speedup:.2f}x")

    print("\nMemory-bound scaling:")
    memory_scaling = benchmark_scaling(memory_bound_work, data)
    for n, speedup in memory_scaling.items():
        print(f"  {n:2d} workers: {speedup:.2f}x")
```

CPython's GIL prevents multiple threads from executing Python bytecode simultaneously. This means `threading.Thread` and `ThreadPoolExecutor` cannot achieve true CPU-parallel speedup for pure Python computation.
Two workarounds:
- Use `ProcessPoolExecutor` (separate processes, each with its own GIL) for compute-bound work, as shown above
- Use `ThreadPoolExecutor` with libraries that release the GIL: NumPy, SciPy, and PyTorch release it during C-level operations like `.sum()`, `np.sqrt()`, and `torch.matmul()`
The memory-bound example above uses threading with NumPy (which releases the GIL), while the compute-bound example uses multiprocessing. This distinction matters: if you use ThreadPoolExecutor for CPU-bound Python loops, you’ll see no speedup at all.
Typical results on a 16-core machine:
Compute-bound scaling (processes):
1 workers: 1.00x
2 workers: 1.85x
4 workers: 3.52x
8 workers: 6.41x
12 workers: 8.80x
16 workers: 10.20x
Memory-bound scaling (threads + NumPy):
1 workers: 1.00x
2 workers: 1.89x
4 workers: 3.21x
8 workers: 3.45x <-- Saturated!
12 workers: 3.52x
16 workers: 3.48x
Note: The compute-bound scaling is less than linear because ProcessPoolExecutor incurs inter-process communication overhead (serializing data, IPC). A C++ implementation with threads would achieve near-linear scaling.
The compute-bound workload keeps scaling as workers are added, reaching roughly 10× on 16 cores. The memory-bound workload hits a wall at ~4 threads: once memory bandwidth is saturated, additional threads just queue for the same bus.
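The ~4-thread wall is predictable from bandwidth arithmetic. A back-of-envelope sketch; the bandwidth figures below are illustrative assumptions, not measurements, so substitute numbers for your own machine:

```python
# Assumed (machine-dependent) numbers: a single core can stream roughly
# 12 GB/s from DRAM, while the socket's memory controllers top out at
# roughly 45 GB/s in aggregate.
per_core_bw_gbs = 12.0
socket_bw_gbs = 45.0

# Threads needed to saturate the socket's bandwidth: beyond this,
# additional threads contribute no extra throughput.
saturation_threads = socket_bw_gbs / per_core_bw_gbs
print(saturation_threads)  # 3.75 -- consistent with the ~4-thread wall above
```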
6.5 The GPU Paradox
GPUs have thousands of cores. They should be thousands of times faster than CPUs, right?
Sometimes yes. Often no.
GPUs achieve massive parallelism through a specific model:
- SIMT (Single Instruction, Multiple Thread): Many threads execute the same instruction on different data
- Memory coalescing: Threads must access adjacent memory addresses
- Occupancy: Many threads must be active to hide memory latency
When code fits this model, GPUs dominate. When it doesn’t, they struggle.
6.5.1 When GPUs Win
Operation: Matrix Multiply (4096×4096)
CPU (32 cores): ~2000 ms
GPU (A100): ~5 ms
Speedup: 400×
Matrix multiply is perfect for GPUs:
- Massively parallel (each output element independent)
- Regular access patterns (coalesced memory access)
- High arithmetic intensity (lots of work per byte)
6.5.2 When GPUs Struggle
Operation: Linked List Traversal
CPU (1 core): ~50 ms
GPU (A100): ~500 ms
"Speedup": 0.1× (10× slower!)
Linked list traversal is terrible for GPUs:
- Sequential dependencies (can’t parallelize)
- Random memory access (can’t coalesce)
- Low arithmetic intensity (one read per step)
6.5.3 The Warp Divergence Problem
GPUs execute threads in groups called “warps” (32 threads on NVIDIA). All threads in a warp must execute the same instruction.
What happens with conditionals?
```python
# Pseudocode for a GPU kernel
def kernel(data, output):
    idx = get_thread_id()
    if data[idx] > 0:
        output[idx] = expensive_computation_A(data[idx])
    else:
        output[idx] = expensive_computation_B(data[idx])
```

If half the threads take the `if` branch and half take the `else`, the warp must execute both paths. Threads not on the current branch are masked (they wait, doing nothing).
Worst case: 32× slowdown from divergence.
Warp Execution with Divergence:
Time →
Thread 0: [████████████████]................ (compute A)
Thread 1: ................[████████████████] (compute B)
Thread 2: [████████████████]................ (compute A)
Thread 3: ................[████████████████] (compute B)
...
Thread 31: [████████████████]................ (compute A)
Both branches take full time. No speedup from parallelism.
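One common mitigation is to make the selection branchless: evaluate both paths for every element, then select per element. The sketch below illustrates the idea with NumPy on the CPU; `compute_A` and `compute_B` are hypothetical stand-ins for the expensive branches above:

```python
import numpy as np

def compute_A(x):
    return np.sqrt(x)        # stand-in for expensive_computation_A

def compute_B(x):
    return x * x             # stand-in for expensive_computation_B

def branchless_kernel(data):
    # Evaluate both paths for all elements, then select element-wise.
    # This mirrors what a diverged warp does implicitly -- but done
    # explicitly, every lane stays busy on vectorized work.
    return np.where(data > 0, compute_A(np.abs(data)), compute_B(data))

print(branchless_kernel(np.array([4.0, -3.0, 9.0, -2.0])))  # [2. 9. 3. 4.]
```

This trades wasted work for uniformity: it only pays off when the cost of masking exceeds the cost of the duplicated computation.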
6.6 The Real Limit: Coordination Grows
Here’s what Amdahl doesn’t capture: coordination overhead often grows with parallelism.
6.6.1 Tree Reduction
Summing N numbers with P processors:
- Each processor sums N/P elements locally: O(N/P)
- Combine partial sums in a tree: O(log P) steps
The coordination (tree reduction) grows logarithmically. For a fixed problem size, the O(N/P) local work shrinks as P grows while the O(log P) coordination grows, so at large enough P the coordination dominates.
Tree Reduction with 8 processors:
Step 1: [P0]──┬──[P1]   [P2]──┬──[P3]   [P4]──┬──[P5]   [P6]──┬──[P7]
              │               │               │               │
Step 2:     [S01]────────[S23]           [S45]────────[S67]
                  │                            │
Step 3:       [S0123]──────────────────[S4567]
                           │
                        [FINAL]
3 steps of coordination for 8 processors.
With 1024 processors: 10 steps.
With 1M processors: 20 steps.
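The diagram condenses to a few lines of code. A minimal sketch (the function name is mine) that combines partial sums pairwise and counts the coordination steps:

```python
# Tree reduction: combine adjacent partial sums pairwise until one remains.
def tree_reduce(partials):
    vals = list(partials)
    steps = 0
    while len(vals) > 1:
        pairs = [vals[i] + vals[i + 1] for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:            # odd element passes through unchanged
            pairs.append(vals[-1])
        vals = pairs
        steps += 1
    return vals[0], steps

print(tree_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # (36, 3)
print(tree_reduce(list(range(1024)))[1])      # 10
```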
6.6.2 All-to-All Communication
Some algorithms require every processor to communicate with every other:
- Gradient synchronization in distributed training
- Global transpose in FFT
- Certain graph algorithms
Cost: O(P²) messages or O(P) with optimized algorithms. Either way, it grows.
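These growth rates are easy to tabulate. A sketch comparing naive pairwise exchange with ring all-reduce, a standard O(P) scheme used for gradient synchronization; the formulas are the textbook counts:

```python
# Naive all-to-all: each of P workers sends a message to every other worker.
def naive_messages(P):
    return P * (P - 1)

# Ring all-reduce: a reduce-scatter phase followed by an all-gather phase,
# each taking P - 1 steps around the ring.
def ring_steps(P):
    return 2 * (P - 1)

for P in (8, 64, 1024):
    print(f"P={P:5d}  naive={naive_messages(P):8d}  ring={ring_steps(P):5d}")
```

Either way the cost grows with P; the ring merely turns quadratic message traffic into linear step count (at the price of latency proportional to P).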
6.7 When Parallelism Helps: A Checklist
Before parallelizing, ask:
- Is the workload compute-bound?
- If memory-bound, you’ll saturate bandwidth quickly
- Measure arithmetic intensity first
- Is the work divisible?
- Can you split into independent chunks?
- Are there dependencies between iterations?
- Is the granularity coarse enough?
- Fine-grained parallelism → coordination overhead dominates
- Rule of thumb: each parallel unit should do 10,000+ operations
- Will threads fight over resources?
- Shared data structures with locks
- False sharing (threads accessing nearby cache lines)
- Memory bandwidth
- Does the algorithm fit the hardware?
- GPUs: regular, data-parallel, high arithmetic intensity
- CPUs: irregular, task-parallel, lower latency tolerance
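The granularity rule is easy to feel directly: run the same total work once as many tiny tasks and once as a few coarse chunks. A sketch, with the caveat that exact timings vary by machine; the point is that the tiny-task version pays per-task scheduling overhead tens of thousands of times:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def tiny(x):
    return x * x                       # one operation per task

def coarse(chunk):
    return sum(x * x for x in chunk)   # thousands of operations per task

N = 50_000
data = list(range(N))

with ThreadPoolExecutor(max_workers=4) as ex:
    t0 = time.perf_counter()
    fine_total = sum(ex.map(tiny, data))        # 50,000 tiny tasks
    t_fine = time.perf_counter() - t0

    chunks = [data[i:i + 12_500] for i in range(0, N, 12_500)]
    t0 = time.perf_counter()
    coarse_total = sum(ex.map(coarse, chunks))  # 4 coarse tasks
    t_coarse = time.perf_counter() - t0

print(fine_total == coarse_total)  # True: same answer, very different cost
print(f"fine: {t_fine:.3f}s  coarse: {t_coarse:.3f}s")
```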
6.8 The Strategic Insight
Parallelism isn’t free performance. It’s a trade:
- You trade simplicity for potential speedup
- You trade predictability for potential throughput
- You trade debugging ease for potential scalability
The question isn’t “how do I parallelize this?” but “should I parallelize this, and where?”
The answer often comes from the previous chapters:
- Memory hierarchy (from Memory Hierarchy): If your algorithm is memory-bound, more parallelism won’t help after you saturate bandwidth
- Arithmetic intensity (from Bandwidth): High intensity → parallelism helps. Low intensity → fix the algorithm first
The Decision Tree:
Is it slow?
│
├─ No → Don't parallelize
│
└─ Yes → Is it compute-bound?
│
├─ No → Fix memory access patterns first (Ch 1-2)
│
└─ Yes → Is the work divisible?
│
├─ No → Different algorithm needed
│
└─ Yes → Can you avoid coordination?
│
├─ No → Minimize coordination
│
└─ Yes → Parallelize!
6.9 Key Takeaways
- Amdahl is optimistic: Real parallel overhead often exceeds the serial fraction
- Memory saturates fast: Memory-bound workloads hit bandwidth limits at low thread counts
- Coordination grows: Synchronization, communication, and contention scale with parallelism
- GPUs are specialized: They excel at regular, data-parallel, high-intensity work—and struggle with everything else
- Measure first: Profile to find the bottleneck. If it’s not compute-bound, parallelism isn’t the answer
- Hidden variables abound: Parallelism introduces hidden variables—NUMA effects, cache contention, scheduler interference—that your mental model may not include. When scaling disappoints, use differential diagnosis (from Hypothesis) to isolate which hidden variable is limiting you
6.10 Connection to Mathematical Properties
In the coming chapters, we’ll see how mathematical structure determines parallelizability:
- Associativity (from Associativity): Operations that associate can be reduced in any order—enabling parallel reduction
- Commutativity: Order-invariant operations can be reordered for better locality
- Independence: Separable computations can run in parallel without coordination
The math isn’t abstract—it’s a guide to what can be parallelized without changing the answer.
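One caveat worth demonstrating: integer addition is exactly associative, so any reduction tree gives the identical answer, while floating-point addition is only approximately associative, so a parallel reduction can differ from a serial one in the low bits. A small sketch:

```python
import numpy as np

# Integers: any grouping gives the same answer, bit for bit.
ints = np.arange(1_000_000, dtype=np.int64)
print(ints.sum() == sum(ints[i::4].sum() for i in range(4)))  # True

# Floats: regrouping the reduction can change the low bits.
floats = np.random.default_rng(0).standard_normal(1_000_000)
serial = float(floats.sum())
chunked = float(sum(floats[i::4].sum() for i in range(4)))
print(abs(serial - chunked))  # tiny (often nonzero) difference
```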
6.11 Further Reading
- Amdahl, G. (1967). “Validity of the single processor approach to achieving large scale computing capabilities”
- Gustafson, J. (1988). “Reevaluating Amdahl’s Law” (the optimistic counterpoint)
- NVIDIA CUDA Programming Guide - Warp divergence and occupancy