15 Investigation: Matrix Multiply
Why This Operation Rules Modern Computing
Two matrix multiply implementations. Same algorithm. Same hardware. Same inputs.
One takes 8 seconds. The other takes 40 milliseconds.
That’s a 200× difference. How?
15.1 The Setup
We have two 2048×2048 matrices of 32-bit floats. We want to compute their product.
import numpy as np
M = K = N = 2048
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

Data sizes:
- A: 2048 × 2048 × 4 bytes = 16 MB
- B: 2048 × 2048 × 4 bytes = 16 MB
- C: 2048 × 2048 × 4 bytes = 16 MB
- Total: 48 MB
Operations required:
- Each of 2048² output elements requires 2048 multiply-adds
- Total: 2048³ × 2 = 17.2 billion FLOPs
15.2 Implementation 1: The Naive Approach
def matmul_naive(A, B, C):
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

Let’s trace what happens for each output element:
Computing C[0, 0]:
for k in range(2048):
    C[0, 0] += A[0, k] * B[k, 0]
A accesses: A[0,0], A[0,1], A[0,2], ... A[0,2047]
→ Sequential along row. Good!
B accesses: B[0,0], B[1,0], B[2,0], ... B[2047,0]
→ Strided by 2048 elements (8 KB). Terrible!
The problem is B’s access pattern. In row-major storage:
B in memory:
[B[0,0] B[0,1] B[0,2] ... B[0,2047]] [B[1,0] B[1,1] ... B[1,2047]] ...
Accessing B[:,0] means jumping 2048 elements (8 KB) between accesses.
Each access is a cache miss. We touch 2048 cache lines to read one column.
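A quick way to see the layout effect is to inspect NumPy’s strides (a small sketch; the array here stands in for B):

import numpy as np

B = np.random.randn(2048, 2048).astype(np.float32)

# Row-major (C-order) layout: moving along a row steps 4 bytes,
# moving down a column steps a whole row (2048 × 4 bytes = 8 KB).
print(B.strides)        # (8192, 4)
print(B[0, :].strides)  # (4,)    -> contiguous: many elements per cache line
print(B[:, 0].strides)  # (8192,) -> one element per 8 KB jump: a miss per access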
15.2.1 Benchmark
Naive Python: ~800 seconds (too slow to be usable; we benchmark a Numba-compiled version instead)
Naive Numba: ~8 seconds
Expected: 2048³ × 2 / (10 GFLOPS) ≈ 1.7 seconds
We’re getting only about 2 GFLOPS, far below what even a single modern core can reach with SIMD (tens of GFLOPS).
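For reference, here is a minimal sketch of how the Numba number above might be measured (exact timings depend on your CPU; it reuses matmul_naive from above and assumes numba is installed):

import time
import numpy as np
from numba import njit

matmul_naive_jit = njit(cache=True)(matmul_naive)  # compile the naive kernel

N = 2048
A = np.random.randn(N, N).astype(np.float32)
B = np.random.randn(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)

# Warm-up call on a small problem to trigger JIT compilation
matmul_naive_jit(A[:64, :64].copy(), B[:64, :64].copy(), np.zeros((64, 64), np.float32))

t0 = time.perf_counter()
matmul_naive_jit(A, B, C)
elapsed = time.perf_counter() - t0
print(f"{elapsed:.1f} s  ->  {2 * N**3 / elapsed / 1e9:.1f} GFLOPS")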
15.3 Implementation 2: Loop Reordering
Before tiling, let’s try a simpler fix: reorder the loops.
def matmul_reordered(A, B, C):
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for k in range(K):
            for j in range(N):
                C[i, j] += A[i, k] * B[k, j]
    return C

Wait—we just swapped the j and k loops. Same computation, right?
Mathematically, yes. For hardware, completely different:
Computing partial results for row 0:

for k in range(2048):
    a_val = A[0, k]                 # Scalar, stays in register
    for j in range(2048):
        C[0, j] += a_val * B[k, j]
A accesses: A[0,0], A[0,1], ... (one per outer iteration)
→ Sequential. Good!
B accesses: B[0,0], B[0,1], B[0,2], ... B[0,2047]
then B[1,0], B[1,1], B[1,2], ... B[1,2047]
→ Sequential along rows. Good!
C accesses: C[0,0], C[0,1], ... C[0,2047] (each outer iteration)
→ Sequential along row, repeated. Good!
15.3.1 Benchmark
Reordered Numba: ~2 seconds
Naive Numba: ~8 seconds
Speedup: 4×
Just by reordering loops—no algorithmic change—we get 4× speedup.
But we’re still at only 8 GFLOPS. The hardware can do much more.
15.4 Why We’re Still Slow
Let’s compute the arithmetic intensity:
# Data movement
bytes_read = (M * K + K * N) * 4 # Read A and B once
bytes_written = M * N * 4 # Write C once
total_bytes = bytes_read + bytes_written
# = (16 + 16 + 16) MB = 48 MB
# Computation
flops = M * K * N * 2 # multiply-add = 2 ops
# = 2048³ × 2 = 17.2 billion FLOPs
# Arithmetic intensity
intensity = flops / total_bytes
# = 17.2e9 / 48e6 = 358 FLOPs/byte

Wait—358 FLOPs/byte is extremely high. We should be compute-bound.
But that calculation assumes we read each input element only once. In our naive loop:
A[i, k] is read once for each j.
Total A reads: M × K × N = 2048³ = 8.6 billion
Unique A elements: M × K = 4 million
B[k, j] is read once for each i.
Total B reads: M × K × N = 2048³ = 8.6 billion
Unique B elements: K × N = 4 million
Reuse factor: 2048× for each element
We’re reading the same data 2048 times from memory because it doesn’t fit in cache.
The effective arithmetic intensity:
Effective bytes moved ≈ 2 × M × K × N × 4 = 68 GB
Effective intensity = 17.2e9 / 68e9 = 0.25 FLOPs/byte
At 0.25 FLOPs/byte, we’re memory-bound. The roofline model says:
Peak memory-bound performance = bandwidth × intensity
= 50 GB/s × 0.25 FLOPs/byte
= 12.5 GFLOPS
Our 8 GFLOPS is close to this limit. We can’t go faster without improving data reuse.
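In code, the roofline bound is just a minimum of two ceilings (a sketch; the 50 GB/s bandwidth and the 200 GFLOPS compute peak are illustrative assumptions, not measured values):

def roofline_gflops(intensity_flops_per_byte, bw_gb_s=50.0, peak_gflops=200.0):
    """Attainable GFLOPS = min(compute peak, bandwidth × arithmetic intensity)."""
    return min(peak_gflops, bw_gb_s * intensity_flops_per_byte)

print(roofline_gflops(0.25))  # 12.5  -> memory-bound, matches the estimate above
print(roofline_gflops(358))   # 200.0 -> compute-bound at the ideal intensity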
15.5 Implementation 3: Tiling
The solution: process in cache-sized blocks.
def matmul_tiled(A, B, C, tile_size=64):
    M, K = A.shape
    _, N = B.shape
    for i0 in range(0, M, tile_size):
        for j0 in range(0, N, tile_size):
            for k0 in range(0, K, tile_size):
                # Tile bounds (handles edge tiles); the tiles themselves
                # stay resident in cache while we work on them
                i1 = min(i0 + tile_size, M)
                j1 = min(j0 + tile_size, N)
                k1 = min(k0 + tile_size, K)
                # Compute this tile's contribution to C
                for i in range(i0, i1):
                    for k in range(k0, k1):
                        a_val = A[i, k]
                        for j in range(j0, j1):
                            C[i, j] += a_val * B[k, j]
    return C

Why does this work?
Tile size = 64
Each tile: 64 × 64 × 4 bytes = 16 KB
Per outer iteration:
- A tile: 16 KB
- B tile: 16 KB
- C tile: 16 KB (read-modify-write)
- Total: 48 KB
L1 cache: 32-64 KB
L2 cache: 256-512 KB
Working set fits in L2, often in L1!
Within each tile, we access the same 16 KB of A, B, C for 64³ = 262,144 operations.
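A handy back-of-envelope check, assuming three square float32 tiles (A, B, C) must be resident at once (a sketch; max_tile_size is an illustrative helper, not a library function):

import math

def max_tile_size(cache_bytes, elem_bytes=4, n_tiles=3):
    """Largest T such that n_tiles T×T tiles of elem_bytes each fit in cache_bytes."""
    return int(math.sqrt(cache_bytes / (n_tiles * elem_bytes)))

print(max_tile_size(32 * 1024))   # ~52  -> T=32 or 48 fits comfortably in a 32 KB L1
print(max_tile_size(256 * 1024))  # ~147 -> T=128 fits comfortably in a 256 KB L2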
15.5.1 Effective Reuse Analysis
Without tiling:
A[i,k] is fetched from memory N = 2048 times (once per use) → reuse per fetch = 1
B[k,j] is fetched from memory M = 2048 times (once per use) → reuse per fetch = 1
With tiling (tile size T = 64):
A[i,k] is fetched N/T = 32 times (once per j-tile) → each fetch is reused T = 64 times
B[k,j] is fetched M/T = 32 times (once per i-tile) → each fetch is reused T = 64 times
Effective data movement drops by a factor of ~T = 64. We’re no longer memory-bound.
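The same conclusion in numbers (a sketch, using the simple model that every fetch outside the resident tiles goes to DRAM):

M = K = N = 2048
T = 64
elem = 4  # bytes per float32

naive_bytes = 2 * M * K * N * elem                          # every A and B access misses
tiled_bytes = (M * K * (N // T) + K * N * (M // T)) * elem  # each element refetched once per tile pass

print(naive_bytes / 1e9, "GB")    # ~68.7 GB
print(tiled_bytes / 1e9, "GB")    # ~1.1 GB
print(naive_bytes / tiled_bytes)  # ~64× less traffic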
Tip (Interactive): Tiled Matrix Multiply Visualizer
Explore how tiling enables data reuse. Adjust matrix and tile sizes to see how working set and reuse factor change.
15.5.2 Benchmark
Tiled Numba (T=64): ~400 ms
Reordered Numba: ~2 seconds
Speedup: 5×
Now at 43 GFLOPS—approaching single-core limits. But BLAS does even better.
15.6 Implementation 4: What BLAS Does
Libraries like OpenBLAS, MKL, and cuBLAS achieve near-peak performance. Here’s how:
15.6.1 1. Multi-level Tiling
BLAS uses multiple tile sizes for multiple cache levels:
L1 tile: 32 × 32 (fits in L1)
L2 tile: 128 × 128 (fits in L2)
L3 tile: 512 × 512 (fits in L3)
Data flows: L3 → L2 → L1 → registers
Each level maximizes reuse before accessing slower memory.
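Here is a minimal sketch of the idea in Python with two tile levels (the tile sizes and the function name are illustrative; a real BLAS does this in carefully tuned C and assembly, and the inner block product here simply delegates to NumPy to keep the sketch short):

def matmul_multilevel(A, B, C, L2_tile=128, L1_tile=32):
    """Two-level tiling: L2-sized outer blocks, L1-sized inner blocks."""
    M, K = A.shape
    _, N = B.shape
    for i2 in range(0, M, L2_tile):
        for k2 in range(0, K, L2_tile):
            for j2 in range(0, N, L2_tile):
                # Inner blocks reuse data already resident in L2
                for i1 in range(i2, min(i2 + L2_tile, M), L1_tile):
                    for k1 in range(k2, min(k2 + L2_tile, K), L1_tile):
                        for j1 in range(j2, min(j2 + L2_tile, N), L1_tile):
                            C[i1:i1+L1_tile, j1:j1+L1_tile] += (
                                A[i1:i1+L1_tile, k1:k1+L1_tile]
                                @ B[k1:k1+L1_tile, j1:j1+L1_tile]
                            )
    return C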
15.6.2 2. Register Tiling
The innermost kernel keeps data in CPU registers:
Inner kernel: 8 × 8 micro-tile
- 8 values from A in registers
- 8 values from B in registers
- 64 accumulators for C in registers
All 64 multiply-adds use register operands.
No memory access in inner loop.
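Written out in Python purely for illustration (in real BLAS this loop is hand-written assembly and acc, a, b live entirely in registers; the function name is ours):

import numpy as np

def microkernel_8x8(A_panel, B_panel, C_block):
    """C_block (8×8) += A_panel (8×K) @ B_panel (K×8), one rank-1 update per k."""
    K = A_panel.shape[1]
    acc = np.zeros((8, 8), dtype=np.float32)  # 64 accumulators ("registers")
    for k in range(K):
        a = A_panel[:, k]                     # 8 values of A
        b = B_panel[k, :]                     # 8 values of B
        acc += np.outer(a, b)                 # 64 multiply-adds; no memory traffic in the real kernel
    C_block += acc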
15.6.3 3. SIMD Vectorization
Modern CPUs process multiple floats per instruction:
AVX-512: 16 floats per instruction
AVX2: 8 floats per instruction
SSE: 4 floats per instruction
Peak: 32+ FLOPs per cycle instead of 2
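The peak numbers come from simple arithmetic (a sketch; the 2 FMA units per core and the 3 GHz clock are assumptions for illustration):

fma_units = 2        # assumed FMA pipes per core
clock_ghz = 3.0      # assumed clock

for isa, lanes in [("SSE", 4), ("AVX2", 8), ("AVX-512", 16)]:
    flops_per_cycle = lanes * 2 * fma_units   # 2 FLOPs per FMA per lane
    print(f"{isa}: {flops_per_cycle} FLOPs/cycle ≈ {flops_per_cycle * clock_ghz:.0f} GFLOPS per core")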
15.6.4 4. Cache-Friendly Packing
Before the main loop, BLAS repacks matrices into contiguous blocks:
def pack_A(A, tile_size):
    """Rearrange A so each tile is contiguous in memory."""
    M, K = A.shape
    packed = []
    for i0 in range(0, M, tile_size):
        for k0 in range(0, K, tile_size):
            tile = A[i0:i0+tile_size, k0:k0+tile_size]
            packed.append(tile.ravel())  # Flatten tile into a contiguous run
    return np.concatenate(packed)

This ensures:
- Sequential memory access within tiles
- No cache conflicts between tiles
- Predictable, prefetch-friendly access
15.6.5 5. Prefetching
BLAS explicitly prefetches the next tile while computing the current one:
Time: |--compute tile 0--|--compute tile 1--|--compute tile 2--|
Data: |--load tile 0-----|--load tile 1-----|--load tile 2-----|
↑ ↑
Prefetch tile 1 Prefetch tile 2
Memory latency is hidden by overlapping with computation.
15.6.6 BLAS Benchmark
NumPy (calls BLAS): ~40 ms
Our best tiled: ~400 ms
Speedup: 10×
BLAS GFLOPS: 430 GFLOPS (multi-core)
Our GFLOPS: 43 GFLOPS (single-core)
15.7 GPU: A Different Beast
GPUs change the game entirely. Let’s see why.
15.7.1 GPU Architecture Recap
GPU (e.g., A100):
- 108 Streaming Multiprocessors (SMs)
- Each SM: 64 CUDA cores
- Total: 6912 cores
- Memory bandwidth: 2 TB/s
- Peak FP32: 19.5 TFLOPS
- SRAM per SM: 192 KB
- HBM (main memory): 80 GB
The key insight: GPUs are massively parallel but have the same memory hierarchy challenge.
15.7.2 Naive GPU Matmul
# One thread per output element
@cuda.jit
def matmul_naive_gpu(A, B, C):
i, j = cuda.grid(2)
if i < C.shape[0] and j < C.shape[1]:
acc = 0.0
for k in range(A.shape[1]):
acc += A[i, k] * B[k, j]
C[i, j] = accThis is slow because: 1. Each thread independently loads A[i,:] and B[:,j] from global memory 2. No data sharing between threads 3. Memory bandwidth saturated, compute idle
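15.7.3 Tiled GPU Matmul with Shared Memory
The fix mirrors the CPU story: each thread block stages a tile of A and a tile of B in shared memory (the GPU’s software-managed cache) so every loaded element is reused by the whole block. A minimal numba.cuda sketch of the idea, assuming square matrices whose size is a multiple of the tile edge (boundary checks omitted; the kernel name and TILE constant are ours):

from numba import cuda, float32

TILE = 16  # threads per block = TILE × TILE

@cuda.jit
def matmul_tiled_gpu(A, B, C):
    # Shared-memory tiles, visible to every thread in the block
    sA = cuda.shared.array(shape=(TILE, TILE), dtype=float32)
    sB = cuda.shared.array(shape=(TILE, TILE), dtype=float32)

    i, j = cuda.grid(2)
    ti, tj = cuda.threadIdx.x, cuda.threadIdx.y

    acc = float32(0.0)
    for t in range(A.shape[1] // TILE):
        # Each thread loads one element of each tile from global memory
        sA[ti, tj] = A[i, t * TILE + tj]
        sB[ti, tj] = B[t * TILE + ti, j]
        cuda.syncthreads()            # wait for the whole tile to arrive
        for k in range(TILE):
            acc += sA[ti, k] * sB[k, tj]
        cuda.syncthreads()            # wait before the tile is overwritten
    C[i, j] = acc

Each element of A and B now comes from global memory once per block tile instead of once per thread, which is why the benchmark below shows roughly a 10× jump over the naive kernel.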
15.7.4 Tensor Cores
Modern GPUs have dedicated matrix multiply units:
Tensor Core operation:
D = A × B + C
where A, B are 16×16 matrices (FP16)
C, D are 16×16 matrices (FP32)
One Tensor Core: 256 FLOPs per cycle
A100 has 432 Tensor Cores
Peak: 312 TFLOPS (FP16) or 156 TFLOPS (TF32)
Tensor Cores are purpose-built for matrix multiply. They achieve 10-20× higher throughput than CUDA cores.
15.7.5 GPU Benchmark
Size: 2048×2048
Naive GPU: ~50 ms
Tiled GPU (shared mem): ~5 ms
cuBLAS (Tensor Cores): ~0.5 ms
CPU (single-core): ~400 ms (our best)
CPU (NumPy/BLAS): ~40 ms
GPU (cuBLAS): ~0.5 ms
GPU vs CPU speedup: 80×
15.8 Why Neural Networks Love GEMM
Matrix multiplication (GEMM = General Matrix Multiply) underlies almost every neural network operation:
15.8.1 Fully Connected Layers
# y = Wx + b
def linear(x, W, b):
    return x @ W.T + b

# For batch_size=32, input=768, output=3072:
# x: [32, 768], W: [3072, 768]
# This is matmul: [32, 768] × [768, 3072] = [32, 3072]

15.8.2 Attention
# Q, K, V projections
Q = input @ W_Q # [seq, dim] × [dim, dim]
K = input @ W_K
V = input @ W_V
# Attention scores
scores = Q @ K.T # [seq, dim] × [dim, seq] = [seq, seq]
# Output
output = softmax(scores) @ V  # [seq, seq] × [seq, dim]

Five matrix multiplications per attention layer.
15.8.3 Convolutions
# im2col transformation converts convolution to matmul
# Input: [batch, channels, H, W]
# Kernel: [out_channels, in_channels, kH, kW]
# im2col extracts patches: [batch * H_out * W_out, in_channels * kH * kW]
# Kernel reshaped: [out_channels, in_channels * kH * kW]
# Convolution becomes matmul:
# [batch * H_out * W_out, in_channels * kH * kW] × [in_channels * kH * kW, out_channels]

This is why GPUs (and TPUs) are optimized for matrix multiply. A 10× improvement in GEMM is a 10× improvement in neural network training.
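To make the transformation concrete, here is a minimal im2col sketch in NumPy (no padding; the im2col_conv2d name and the stride handling are ours, and real frameworks use heavily optimized variants of this idea):

import numpy as np

def im2col_conv2d(x, w, stride=1):
    """Convolution as a single GEMM via im2col (no padding)."""
    n, c_in, H, W = x.shape            # [batch, channels, H, W]
    c_out, _, kH, kW = w.shape         # [out_channels, in_channels, kH, kW]
    H_out = (H - kH) // stride + 1
    W_out = (W - kW) // stride + 1

    # Extract every kH×kW patch into a row: [n * H_out * W_out, c_in * kH * kW]
    cols = np.empty((n, H_out, W_out, c_in * kH * kW), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, :, i*stride:i*stride+kH, j*stride:j*stride+kW]
            cols[:, i, j, :] = patch.reshape(n, -1)
    cols = cols.reshape(n * H_out * W_out, c_in * kH * kW)

    # The convolution is now one matmul against the reshaped kernel
    out = cols @ w.reshape(c_out, -1).T          # [n*H_out*W_out, c_out]
    return out.reshape(n, H_out, W_out, c_out).transpose(0, 3, 1, 2)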
15.9 The Properties at Work
Let’s identify which properties make fast matrix multiply possible:
15.9.1 Associativity
C[i,j] = Σ_k A[i,k] × B[k,j]
This sum is associative: we can compute in any order.
- Tile-by-tile: Σ over tiles, then Σ within tiles
- Parallel: Different threads compute different partial sums
Associativity enables tiling and parallelism.
15.9.2 Locality
Tiling keeps working set small:
- Process 64×64 tiles
- 48 KB working set fits in L2
- Each C element is updated 2048 times (once per k) while its tile stays resident
Locality enables cache efficiency.
15.9.3 Separability
The fundamental identity:
(A × B)[i,j] = Σ_k A[i,k] × B[k,j]
= A[i,:] · B[:,j]
Output is the *inner product* of row i of A with column j of B.
The computation separates by output element—each can be computed independently.
This enables parallelism (each thread computes different outputs).
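A tiny sketch of that independence, splitting the output into row blocks across threads (with a BLAS-backed NumPy, matmul releases the GIL so the threads genuinely overlap; the block size and worker count are arbitrary):

import numpy as np
from concurrent.futures import ThreadPoolExecutor

A = np.random.randn(2048, 2048).astype(np.float32)
B = np.random.randn(2048, 2048).astype(np.float32)

def row_block(i0, block=512):
    return A[i0:i0 + block] @ B        # an independent slab of C

with ThreadPoolExecutor(max_workers=4) as pool:
    blocks = list(pool.map(row_block, range(0, 2048, 512)))
C = np.vstack(blocks)                  # matches A @ B up to float rounding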
15.9.4 Hardware Co-Design
Tensor Cores are built for matrix multiply:
Input: Two 16×16 matrices
Output: One 16×16 matrix
Latency: A few cycles
This isn't general-purpose hardware accelerating matmul.
This is matmul-shaped hardware.
The algorithm (tiled GEMM) and the hardware (tensor cores, shared memory hierarchy) evolved together.
15.10 Key Takeaways
200× speedup is possible: From naive loops to an optimized BLAS, the same algorithm runs 200× faster on the same CPU (and cuBLAS on a GPU goes further still). Almost all of it comes from exploiting the memory hierarchy.
Memory access pattern dominates: Loop reordering alone gives 4× speedup. The algorithm is the same; only the access pattern changes.
Tiling transforms memory-bound to compute-bound: By keeping working sets in cache, we achieve the theoretical arithmetic intensity.
GPU shared memory is like CPU cache: Tiling on GPU uses shared memory (fast SRAM) to enable data reuse across threads.
Hardware is specialized: Tensor Cores exist because matrix multiply is that important. Modern ML is built on GEMM.
All four properties contribute:
- Associativity → parallelism and tiling
- Locality → cache efficiency
- Separability → independent computation per output
- (Sparsity → structured sparse GEMM, Chapter 6)
15.11 The Meta-Lesson
Matrix multiply is the Drosophila of performance engineering. It’s simple enough to understand completely, yet rich enough to demonstrate every optimization technique.
When you see the 200× CPU-side speedup, you’re seeing roughly:
- 4× from access patterns (loop reordering)
- 5× from tiling (cache reuse)
- 10× from vectorization, register blocking, and multiple cores (BLAS)
The GPU then adds another ~80× on top through massive parallelism and Tensor Cores.
These factors are multiplicative. Miss any one, and you leave that factor on the table.
The best algorithms aren’t just correct. They’re aligned with hardware.
15.12 Further Reading
- Goto & Van De Geijn (2008). “Anatomy of High-Performance Matrix Multiplication”
- Low et al. (2016). “Analytical Modeling Is Enough for High-Performance BLIS”
- Jia et al. (2018). “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking”
- NVIDIA CUTLASS: https://github.com/NVIDIA/cutlass