6 When Parallelism Pays
Understanding the True Cost of Coordination
Parallelism promises: “More cores, more speed.”
Parallelism delivers: “More cores, more coordination overhead, sometimes more speed.”
Understanding when parallelism helps—and when it hurts—is essential for making good performance decisions.
6.1 The Third Lie
In the first two chapters, we dismantled two comforting illusions:
- Memory is uniform (it’s not—there’s a 200× latency hierarchy)
- FLOPS measure performance (they don’t—bandwidth often dominates)
Now we confront a third: parallel speedup is free.
The naive model: if you have 8 cores, your code runs 8× faster. If you have 64 cores, 64× faster. GPUs have thousands of cores, so they’re thousands of times faster than CPUs.
This model is wrong. Sometimes catastrophically wrong.
6.2 Amdahl’s Law: The Famous Limit
Gene Amdahl’s 1967 observation is among the most-cited results in parallel computing:
\[\text{Speedup} = \frac{1}{(1-p) + \frac{p}{n}}\]
Where:
- \(p\) is the fraction of code that can be parallelized
- \(n\) is the number of processors
- \((1-p)\) is the serial fraction
The insight: if 10% of your code is serial, you can never get more than 10× speedup, no matter how many cores you add.
| Serial Fraction | Max Speedup (∞ cores) |
|---|---|
| 1% | 100× |
| 5% | 20× |
| 10% | 10× |
| 25% | 4× |
| 50% | 2× |
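The formula is easy to check numerically. Below is a minimal sketch that evaluates it, reproducing the infinite-core limits in the table and, for comparison, showing what you would actually get with 64 cores.

```python
def amdahl_speedup(p, n):
    """Amdahl's law: speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

# The infinite-core limit is 1 / (1 - p), i.e. 1 / serial_fraction.
for serial_fraction in (0.01, 0.05, 0.10, 0.25, 0.50):
    p = 1.0 - serial_fraction
    print(f"serial {serial_fraction:>4.0%}: "
          f"64 cores -> {amdahl_speedup(p, 64):5.1f}x, "
          f"infinite cores -> {1.0 / serial_fraction:5.0f}x")
```

Even with a 1% serial fraction, 64 cores deliver only about 39× instead of 64×.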
This is taught in every parallel computing course. But it’s the optimistic bound.
6.4 Investigation: Where Does Parallelism Actually Help?
Let’s measure. We’ll compare two workloads:
Compute-bound: Square root computation (lots of math per byte)
Memory-bound: Array summation (minimal math per byte)
```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def benchmark_scaling(func, data, max_threads=16):
    """Measure speedup relative to a single thread as threads are added."""
    times = {}

    # Baseline: single thread
    start = time.perf_counter()
    func(data, num_threads=1)
    times[1] = time.perf_counter() - start

    for n in (2, 4, 8, 12, 16):
        if n > max_threads:
            break
        start = time.perf_counter()
        func(data, num_threads=n)
        times[n] = time.perf_counter() - start

    return {n: times[1] / t for n, t in times.items()}


# Compute-bound: many operations per element. The per-chunk work uses
# whole-array NumPy ufuncs, which release the GIL, so the worker threads
# genuinely run in parallel.
def compute_bound_work(data, num_threads):
    def worker(chunk):
        acc = np.abs(chunk) + 1.0
        # 100 passes of expensive math over the chunk, in place
        for _ in range(100):
            np.sqrt(acc, out=acc)
            acc += 1.0
        return acc.sum()

    chunks = np.array_split(data, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        return sum(ex.map(worker, chunks))


# Memory-bound: minimal operations per element
def memory_bound_work(data, num_threads):
    def worker(chunk):
        return chunk.sum()  # Just read and add

    chunks = np.array_split(data, num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as ex:
        return sum(ex.map(worker, chunks))


# Run the investigation
data = np.random.randn(10_000_000)

print("Compute-bound scaling:")
compute_scaling = benchmark_scaling(compute_bound_work, data)
for n, speedup in compute_scaling.items():
    print(f"  {n:2d} threads: {speedup:.2f}x")

print("\nMemory-bound scaling:")
memory_scaling = benchmark_scaling(memory_bound_work, data)
for n, speedup in memory_scaling.items():
    print(f"  {n:2d} threads: {speedup:.2f}x")
```

Typical results on a 16-core machine:
```
Compute-bound scaling:
   1 threads: 1.00x
   2 threads: 1.95x
   4 threads: 3.82x
   8 threads: 7.41x
  12 threads: 10.8x
  16 threads: 13.2x

Memory-bound scaling:
   1 threads: 1.00x
   2 threads: 1.89x
   4 threads: 3.21x
   8 threads: 3.45x   <-- Saturated!
  12 threads: 3.52x
  16 threads: 3.48x
```
The compute-bound workload scales nearly linearly. The memory-bound workload hits a wall at ~4 threads.
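The point at which the memory-bound workload saturates is just arithmetic. A rough model, with illustrative bandwidth numbers (the real per-core and per-socket figures are machine-specific and worth measuring):

```python
array_bytes = 10_000_000 * 8   # 10M float64 values: 80 MB read per pass
per_core_bw = 20e9             # assumed: ~20 GB/s streamable by one core
socket_bw = 80e9               # assumed: ~80 GB/s total memory bandwidth

for threads in (1, 2, 4, 8, 16):
    effective_bw = min(threads * per_core_bw, socket_bw)
    print(f"{threads:2d} threads: {array_bytes / effective_bw * 1e3:5.1f} ms "
          f"at {effective_bw / 1e9:3.0f} GB/s")
```

Once the threads collectively reach the socket’s bandwidth (here at 4 threads), adding more changes nothing, which matches the wall in the measured results above.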
6.5 The GPU Paradox
GPUs have thousands of cores. They should be thousands of times faster than CPUs, right?
Sometimes yes. Often no.
GPUs achieve massive parallelism through a specific model:
- SIMT (Single Instruction, Multiple Thread): Many threads execute the same instruction on different data
- Memory coalescing: Threads must access adjacent memory addresses
- Occupancy: Many threads must be active to hide memory latency
When code fits this model, GPUs dominate. When it doesn’t, they struggle.
6.5.1 When GPUs Win
```
Operation: Matrix Multiply (4096×4096)
CPU (32 cores): ~2000 ms
GPU (A100):        ~5 ms
Speedup:            400×
```
Matrix multiply is perfect for GPUs:
- Massively parallel (each output element independent)
- Regular access patterns (coalesced memory access)
- High arithmetic intensity (lots of work per byte)
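The last point is worth a back-of-envelope check. The sketch below assumes fp32 operands and counts only the minimum data movement, each matrix crossing memory once, so it is a best-case figure; real kernels tile and re-use data to get close to it.

```python
N = 4096
flops = 2 * N**3                 # one multiply + one add per inner-product term
bytes_moved = 3 * N * N * 4      # read A and B, write C; 4 bytes per fp32 value
intensity = flops / bytes_moved

print(f"FLOPs:     {flops:.3e}")                # ~1.4e11
print(f"Bytes:     {bytes_moved:.3e}")          # ~2.0e8
print(f"Intensity: {intensity:.0f} FLOP/byte")  # ~683
```

Hundreds of floating-point operations per byte moved is precisely the regime GPUs are built for.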
6.5.2 When GPUs Struggle
```
Operation: Linked List Traversal
CPU (1 core):  ~50 ms
GPU (A100):   ~500 ms
"Speedup":     0.1× (10× slower!)
```
Linked list traversal is terrible for GPUs:
- Sequential dependencies (can’t parallelize)
- Random memory access (can’t coalesce)
- Low arithmetic intensity (one read per step)
6.5.3 The Warp Divergence Problem
GPUs execute threads in groups called “warps” (32 threads on NVIDIA). All threads in a warp must execute the same instruction.
What happens with conditionals?
```python
# Pseudocode for a GPU kernel
def kernel(data, output):
    idx = get_thread_id()
    if data[idx] > 0:
        output[idx] = expensive_computation_A(data[idx])
    else:
        output[idx] = expensive_computation_B(data[idx])
```

If half the threads in a warp take the if branch and half take the else branch, the warp must execute both paths. Threads not on the current branch are masked: they wait, doing nothing.

Worst case: if all 32 threads take different paths, the warp serializes completely, a 32× slowdown from divergence.
Warp Execution with Divergence:
```
Time →
Thread 0:  [████████████████]................  (compute A)
Thread 1:  ................[████████████████]  (compute B)
Thread 2:  [████████████████]................  (compute A)
Thread 3:  ................[████████████████]  (compute B)
...
Thread 31: [████████████████]................  (compute A)
```
Both branches take full time. No speedup from parallelism.
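The cost model behind that picture is simple enough to write down. A minimal sketch, assuming the serialized-branches model described above and equal-cost paths: the warp pays once for every distinct path any of its threads takes.

```python
def divergence_slowdown(branch_ids):
    """Slowdown for one 32-thread warp: the number of distinct paths taken."""
    assert len(branch_ids) == 32
    return len(set(branch_ids))

print(divergence_slowdown([0] * 32))             # 1  -- all threads agree
print(divergence_slowdown([0] * 16 + [1] * 16))  # 2  -- both branches execute
print(divergence_slowdown(list(range(32))))      # 32 -- fully serialized
```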
6.6 The Real Limit: Coordination Grows
Here’s what Amdahl doesn’t capture: coordination overhead often grows with parallelism.
6.6.1 Tree Reduction
Summing N numbers with P processors:
- Each processor sums N/P elements locally: O(N/P)
- Combine partial sums in a tree: O(log P) steps
The local work shrinks as N/P while the coordination (the tree reduction) grows as log P; for large enough P, the coordination dominates.
Tree Reduction with 8 processors:

```
Step 1: [P0]──┬──[P1]   [P2]──┬──[P3]   [P4]──┬──[P5]   [P6]──┬──[P7]
              │               │               │               │
            [S01]           [S23]           [S45]           [S67]

Step 2:     [S01]──────┬──────[S23]         [S45]──────┬──────[S67]
                       │                               │
                    [S0123]                         [S4567]

Step 3:             [S0123]───────────┬───────────[S4567]
                                      │
                                   [FINAL]
```

3 steps of coordination for 8 processors (log₂ 8 = 3).
With 1024 processors: 10 steps.
With 1M processors: 20 steps.
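A toy model of the combining phase makes the logarithm concrete (the local N/P sums are assumed to be already done; only the coordination rounds are counted):

```python
def tree_reduce(partial_sums):
    """Combine partial sums pairwise; return the total and the number of rounds."""
    rounds = 0
    while len(partial_sums) > 1:
        partial_sums = [sum(partial_sums[i:i + 2])
                        for i in range(0, len(partial_sums), 2)]
        rounds += 1
    return partial_sums[0], rounds

for p in (8, 1024, 2**20):
    total, rounds = tree_reduce([1.0] * p)
    print(f"{p:>8} processors: {rounds:2d} combining steps")  # 3, 10, 20
```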
6.6.2 All-to-All Communication
Some algorithms require every processor to communicate with every other:
- Gradient synchronization in distributed training
- Global transpose in FFT
- Certain graph algorithms
Cost: O(P²) total messages done naively, or O(P) communication steps with optimized algorithms such as ring all-reduce. Either way, the cost grows with P.
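To see the difference in growth, here is a quick count, taking a fully connected exchange as the naive case and a ring all-reduce (a common choice for gradient synchronization) as the optimized one:

```python
def naive_all_to_all_messages(p):
    """Every processor sends one message to every other processor."""
    return p * (p - 1)

def ring_all_reduce_steps(p):
    """Ring all-reduce: (P-1) reduce-scatter steps + (P-1) all-gather steps."""
    return 2 * (p - 1)

for p in (8, 64, 1024):
    print(f"P = {p:4d}: naive messages = {naive_all_to_all_messages(p):8d}, "
          f"ring steps = {ring_all_reduce_steps(p):5d}")
```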
6.7 When Parallelism Helps: A Checklist
Before parallelizing, ask:
- Is the workload compute-bound?
  - If memory-bound, you’ll saturate bandwidth quickly
  - Measure arithmetic intensity first
- Is the work divisible?
  - Can you split it into independent chunks?
  - Are there dependencies between iterations?
- Is the granularity coarse enough?
  - Fine-grained parallelism → coordination overhead dominates
  - Rule of thumb: each parallel unit should do 10,000+ operations (see the sketch after this checklist)
- Will threads fight over resources?
  - Shared data structures with locks
  - False sharing (threads accessing nearby cache lines)
  - Memory bandwidth
- Does the algorithm fit the hardware?
  - GPUs: regular, data-parallel, high arithmetic intensity
  - CPUs: irregular, task-parallel, lower latency tolerance
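The granularity point is easy to feel directly. The sketch below submits the same total work to a thread pool either as a hundred thousand tiny tasks or as eight coarse chunks; the absolute times are machine-dependent, but the gap is what matters.

```python
import time
from concurrent.futures import ThreadPoolExecutor

import numpy as np

arr = np.random.randn(1_000_000)

def timed(label, tasks):
    """Sum the array via the given task decomposition and report the time."""
    with ThreadPoolExecutor(max_workers=8) as ex:
        start = time.perf_counter()
        total = sum(ex.map(np.sum, tasks))
        elapsed = time.perf_counter() - start
    print(f"{label:>14}: {elapsed * 1e3:8.1f} ms  (sum = {total:.3f})")

# Fine-grained: 100,000 tasks of ~10 elements each; scheduling dominates
timed("fine-grained", np.array_split(arr, 100_000))

# Coarse-grained: 8 tasks of 125,000 elements each; the work dominates
timed("coarse-grained", np.array_split(arr, 8))
```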
6.8 The Strategic Insight
Parallelism isn’t free performance. It’s a trade:
- You trade simplicity for potential speedup
- You trade predictability for potential throughput
- You trade debugging ease for potential scalability
The question isn’t “how do I parallelize this?” but “should I parallelize this, and where?”
The answer often comes from the previous chapters:
- Memory hierarchy (Chapter 1): If your algorithm is memory-bound, more parallelism won’t help after you saturate bandwidth
- Arithmetic intensity (Chapter 2): High intensity → parallelism helps. Low intensity → fix the algorithm first
The Decision Tree:
```
Is it slow?
│
├─ No  → Don't parallelize
│
└─ Yes → Is it compute-bound?
         │
         ├─ No  → Fix memory access patterns first (Ch 1-2)
         │
         └─ Yes → Is the work divisible?
                  │
                  ├─ No  → Different algorithm needed
                  │
                  └─ Yes → Can you avoid coordination?
                           │
                           ├─ No  → Minimize coordination
                           │
                           └─ Yes → Parallelize!
```
6.9 Key Takeaways
- Amdahl is optimistic: Real parallel overhead often exceeds what the serial fraction alone predicts
- Memory saturates fast: Memory-bound workloads hit bandwidth limits at low thread counts
- Coordination grows: Synchronization, communication, and contention scale with parallelism
- GPUs are specialized: They excel at regular, data-parallel, high-intensity work, and struggle with everything else
- Measure first: Profile to find the bottleneck. If it’s not compute-bound, parallelism isn’t the answer
6.10 Connection to Mathematical Properties
In the coming chapters, we’ll see how mathematical structure determines parallelizability:
- Associativity (Chapter 4): Operations that associate can be reduced in any order—enabling parallel reduction
- Commutativity: Order-invariant operations can be reordered for better locality
- Independence: Separable computations can run in parallel without coordination
The math isn’t abstract—it’s a guide to what can be parallelized without changing the answer.
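As a small preview of the associativity point: grouping a sum into chunks and then combining the chunk results gives the same answer as a straight left-to-right reduction. The check below uses integers, where addition is exact.

```python
from functools import reduce
from operator import add

values = list(range(1, 1001))

sequential = reduce(add, values)  # ((v1 + v2) + v3) + ... in one serial chain
chunk_sums = [sum(values[i:i + 250]) for i in range(0, len(values), 250)]
chunked = reduce(add, chunk_sums)  # four independent chunks, then one combine

assert sequential == chunked == 500500
```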
6.11 Further Reading
- Amdahl, G. M. (1967). “Validity of the Single Processor Approach to Achieving Large Scale Computing Capabilities.” AFIPS Spring Joint Computer Conference.
- Gustafson, J. L. (1988). “Reevaluating Amdahl’s Law.” Communications of the ACM, 31(5). (The optimistic counterpoint.)
- NVIDIA CUDA Programming Guide: warp divergence and occupancy.