15  Investigation: Matrix Multiply

Why This Operation Rules Modern Computing

Two matrix multiply implementations. Same algorithm. Same hardware. Same inputs.

One takes 8 seconds. The other takes 40 milliseconds.

That’s a 200× difference. How?

15.1 The Setup

We have two 2048×2048 matrices of 32-bit floats. We want to compute their product.

import numpy as np

M = K = N = 2048
A = np.random.randn(M, K).astype(np.float32)
B = np.random.randn(K, N).astype(np.float32)
C = np.zeros((M, N), dtype=np.float32)

Data sizes: - A: 2048 × 2048 × 4 bytes = 16 MB - B: 2048 × 2048 × 4 bytes = 16 MB - C: 2048 × 2048 × 4 bytes = 16 MB - Total: 48 MB

Operations required: - Each of 2048² output elements requires 2048 multiply-adds - Total: 2048³ × 2 = 17.2 billion FLOPs

15.2 Implementation 1: The Naive Approach

def matmul_naive(A, B, C):
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

Let’s trace what happens for each output element:

Computing C[0, 0]:
  for k in range(2048):
      C[0, 0] += A[0, k] * B[k, 0]

  A accesses: A[0,0], A[0,1], A[0,2], ... A[0,2047]
              → Sequential along row. Good!

  B accesses: B[0,0], B[1,0], B[2,0], ... B[2047,0]
              → Strided by 2048 elements (8 KB). Terrible!

The problem is B’s access pattern. In row-major storage:

B in memory:
[B[0,0] B[0,1] B[0,2] ... B[0,2047]] [B[1,0] B[1,1] ... B[1,2047]] ...

Accessing B[:,0] means jumping 2048 elements (8 KB) between accesses.
Each access is a cache miss. We touch 2048 cache lines to read one column.
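You can measure this directly. Below is a small timing sketch (hypothetical harness; exact ratios depend on your cache hierarchy) that touches all of B once by rows and once by columns. On a typical machine the column version is several times slower:

import time
import numpy as np

B = np.random.randn(2048, 2048).astype(np.float32)

def sum_by_rows(B):
    total = 0.0
    for i in range(B.shape[0]):
        total += B[i, :].sum()   # contiguous 8 KB slice per call
    return total

def sum_by_cols(B):
    total = 0.0
    for j in range(B.shape[1]):
        total += B[:, j].sum()   # 2048 loads, each 8 KB apart
    return total

for fn in (sum_by_rows, sum_by_cols):
    t0 = time.perf_counter()
    fn(B)
    print(fn.__name__, f"{time.perf_counter() - t0:.3f} s")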

15.2.1 Benchmark

Naive Python:     ~800 seconds (estimated; too slow to run in full, so we benchmark a Numba-compiled version)
Naive Numba:      ~8 seconds
Expected:         2048³ × 2 / (10 GFLOPS) ≈ 1.7 seconds

We’re getting only ~2 GFLOPS—far below even the single-core scalar peak (~15 GFLOPS on a modern CPU, before SIMD).
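For reference, a minimal sketch of how the Numba numbers above can be produced, using matmul_naive, A, B, C, M, K, N from the setup (assumes numba is installed; timings vary widely by machine):

import time
import numpy as np
from numba import njit

matmul_naive_jit = njit(matmul_naive)   # JIT-compile the same triple loop

# Trigger compilation on a tiny input so compile time isn't measured
tiny = np.zeros((8, 8), dtype=np.float32)
matmul_naive_jit(tiny, tiny, np.zeros((8, 8), dtype=np.float32))

C[:] = 0
t0 = time.perf_counter()
matmul_naive_jit(A, B, C)
elapsed = time.perf_counter() - t0

flops = 2 * M * K * N
print(f"{elapsed:.2f} s, {flops / elapsed / 1e9:.1f} GFLOPS")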

15.3 Implementation 2: Loop Reordering

Before tiling, let’s try a simpler fix: reorder the loops.

def matmul_reordered(A, B, C):
    M, K = A.shape
    _, N = B.shape
    for i in range(M):
        for k in range(K):
            for j in range(N):
                C[i, j] += A[i, k] * B[k, j]
    return C

Wait—we just swapped the j and k loops. Same computation, right?

Mathematically, yes. For hardware, completely different:

Computing partial results for row 0:
  for k in range(2048):
      a_val = A[0, k]  # Scalar, stays in register
      for j in range(2048):
          C[0, j] += a_val * B[k, j]

  A accesses: A[0,0], A[0,1], ... (one per outer iteration)
              → Sequential. Good!

  B accesses: B[0,0], B[0,1], B[0,2], ... B[0,2047]
              then B[1,0], B[1,1], B[1,2], ... B[1,2047]
              → Sequential along rows. Good!

  C accesses: C[0,0], C[0,1], ... C[0,2047] (each outer iteration)
              → Sequential along row, repeated. Good!

15.3.1 Benchmark

Reordered Numba:  ~2 seconds
Naive Numba:      ~8 seconds
Speedup:          4×

Just by reordering loops—no algorithmic change—we get 4× speedup.

But we’re still at only 8 GFLOPS. The hardware can do much more.

15.4 Why We’re Still Slow

Let’s compute the arithmetic intensity:

# Data movement
bytes_read = (M * K + K * N) * 4  # Read A and B once
bytes_written = M * N * 4         # Write C once
total_bytes = bytes_read + bytes_written
# = (16 + 16 + 16) MB = 48 MB

# Computation
flops = M * K * N * 2  # multiply-add = 2 ops
# = 2048³ × 2 = 17.2 billion FLOPs

# Arithmetic intensity
intensity = flops / total_bytes
# = 17.2e9 / 48e6 = 358 FLOPs/byte

Wait—358 FLOPs/byte is extremely high. We should be compute-bound.

But that calculation assumes we read each input element only once. In our naive loop:

A[i, k] is read once for each j.
  Total A reads: M × K × N = 2048³ = 8.6 billion
  Unique A elements: M × K = 4 million

B[k, j] is read once for each i.
  Total B reads: M × K × N = 2048³ = 8.6 billion
  Unique B elements: K × N = 4 million

Reuse factor: 2048× for each element

We’re reading the same data 2048 times from memory because it doesn’t fit in cache.

The effective arithmetic intensity:

Effective bytes moved ≈ 2 × M × K × N × 4 = 68 GB
Effective intensity = 17.2e9 / 68e9 = 0.25 FLOPs/byte

At 0.25 FLOPs/byte, we’re memory-bound. The roofline model says:

Peak memory-bound performance = bandwidth × intensity
                              = 50 GB/s × 0.25 FLOPs/byte
                              = 12.5 GFLOPS

Our 8 GFLOPS is close to this limit. We can’t go faster without improving data reuse.
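The same reasoning packs into a one-line roofline helper (a sketch; the 50 GB/s bandwidth and the 15 GFLOPS compute roof are illustrative numbers, not measurements of a specific CPU):

def roofline_gflops(intensity, peak_gflops=15.0, bandwidth_gb_s=50.0):
    """Attainable GFLOPS = min(compute roof, bandwidth x arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * intensity)

print(roofline_gflops(0.25))   # 12.5 -> memory-bound, close to what we observe
print(roofline_gflops(358))    # 15.0 -> compute-bound (hits the compute roof)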

15.5 Implementation 3: Tiling

The solution: process in cache-sized blocks.

def matmul_tiled(A, B, C, tile_size=64):
    M, K = A.shape
    _, N = B.shape

    for i0 in range(0, M, tile_size):
        for j0 in range(0, N, tile_size):
            for k0 in range(0, K, tile_size):
                # Load tiles into (implicit) cache
                i1 = min(i0 + tile_size, M)
                j1 = min(j0 + tile_size, N)
                k1 = min(k0 + tile_size, K)

                # Compute tile contribution to C
                for i in range(i0, i1):
                    for k in range(k0, k1):
                        a_val = A[i, k]
                        for j in range(j0, j1):
                            C[i, j] += a_val * B[k, j]

    return C
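Before analyzing why this helps, a quick sanity check against NumPy on a small case (the pure-Python version is far too slow at 2048³; for timing, wrap the function in numba.njit as before):

m = k = n = 128
a = np.random.randn(m, k).astype(np.float32)
b = np.random.randn(k, n).astype(np.float32)
c = matmul_tiled(a, b, np.zeros((m, n), dtype=np.float32), tile_size=64)
print(np.allclose(c, a @ b, rtol=1e-3, atol=1e-3))   # True (up to FP32 rounding)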

Why does this work?

Tile size = 64
Each tile: 64 × 64 × 4 bytes = 16 KB

Per outer iteration:
  - A tile: 16 KB
  - B tile: 16 KB
  - C tile: 16 KB (read-modify-write)
  - Total: 48 KB

L1 cache: 32-64 KB
L2 cache: 256-512 KB

Working set fits in L2, often in L1!

Within each tile step, we keep reusing the same three 16 KB tiles of A, B, and C across 64³ = 262,144 multiply-adds.

15.5.1 Effective Reuse Analysis

Without tiling:

Each A[i,k] is needed N times and refetched from memory on (nearly) every use → reuse per fetch ≈ 1
Each B[k,j] is needed M times and refetched from memory on (nearly) every use → reuse per fetch ≈ 1

With tiling (tile size T = 64):

Each A[i,k] is fetched only N/T = 32 times → each fetch is reused T = 64 times
Each B[k,j] is fetched only M/T = 32 times → each fetch is reused T = 64 times

A and B traffic drop by T = 64×; counting the read-modify-write traffic for the C tiles, effective data movement drops by roughly 32×. We’re no longer memory-bound.

To explore how tiling enables data reuse, the sketch below gives a rough model (it ignores cache-line granularity and conflict misses): adjust the matrix and tile sizes to see how the working set and the traffic reduction change.
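def tiling_stats(M, N, K, T, bytes_per_elem=4):
    """Rough working-set / DRAM-traffic model for square tiles (a sketch)."""
    working_set = 3 * T * T * bytes_per_elem           # one T x T tile each of A, B, C
    naive_traffic = 2 * M * N * K * bytes_per_elem     # A and B refetched on every use
    tiled_traffic = (M * K * (N // T)                  # A tiles reloaded once per j-block
                     + K * N * (M // T)                # B tiles reloaded once per i-block
                     + 2 * M * N * (K // T)) * bytes_per_elem  # C tiles read+written per k-block
    return working_set, naive_traffic / tiled_traffic

ws, reduction = tiling_stats(2048, 2048, 2048, 64)
print(f"working set: {ws / 1024:.0f} KB, traffic reduction vs naive: {reduction:.0f}x")
# working set: 48 KB, traffic reduction vs naive: 32x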

15.5.2 Benchmark

Tiled Numba (T=64):  ~400 ms
Reordered Numba:     ~2 seconds
Speedup:             5×

Now at 43 GFLOPS—approaching single-core limits. But BLAS does even better.

15.6 Implementation 4: What BLAS Does

Libraries like OpenBLAS, MKL, and cuBLAS achieve near-peak performance. Here’s how:

15.6.1 1. Multi-level Tiling

BLAS uses multiple tile sizes for multiple cache levels:

L1 tile:  32 × 32 (fits in L1)
L2 tile:  128 × 128 (fits in L2)
L3 tile:  512 × 512 (fits in L3)

Data flows: L3 → L2 → L1 → registers
Each level maximizes reuse before accessing slower memory.

15.6.2 2. Register Tiling

The innermost kernel keeps data in CPU registers:

Inner kernel: 8 × 8 micro-tile
  - 8 values from A in registers
  - 8 values from B in registers
  - 64 accumulators for C in registers

  All 64 multiply-adds use register operands.
  No memory access in inner loop.
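Python has no direct notion of registers, but a JIT-compiled micro-kernel with scalar accumulators gives the flavor. A toy 2×2 sketch (real BLAS micro-kernels use larger blocks, such as 8×8 or 6×16, written in assembly or intrinsics):

from numba import njit

@njit(fastmath=True)
def microkernel_2x2(A, B, C, i, j):
    # Four scalar accumulators; after compilation they typically live in
    # registers, so the k-loop only streams A and B and never touches C's memory.
    c00 = c01 = c10 = c11 = 0.0
    for k in range(A.shape[1]):
        a0 = A[i, k]
        a1 = A[i + 1, k]
        b0 = B[k, j]
        b1 = B[k, j + 1]
        c00 += a0 * b0
        c01 += a0 * b1
        c10 += a1 * b0
        c11 += a1 * b1
    C[i, j] += c00
    C[i, j + 1] += c01
    C[i + 1, j] += c10
    C[i + 1, j + 1] += c11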

15.6.3 3. SIMD Vectorization

Modern CPUs process multiple floats per instruction:

AVX-512:  16 floats per instruction
AVX2:     8 floats per instruction
SSE:      4 floats per instruction

Peak: 32+ FLOPs per cycle instead of 2

15.6.4 4. Cache-Friendly Packing

Before the main loop, BLAS repacks matrices into contiguous blocks:

def pack_A(A, tile_size):
    """Rearrange A so each tile is contiguous"""
    M, K = A.shape
    packed = []
    for i0 in range(0, M, tile_size):
        for k0 in range(0, K, tile_size):
            tile = A[i0:i0+tile_size, k0:k0+tile_size]
            packed.append(tile.ravel())  # Flatten tile
    return np.concatenate(packed)

This ensures: - Sequential memory access within tiles - No cache conflicts between tiles - Predictable, prefetcher-friendly access patterns

Note: Zero-copy transpose (row-major vs. column-major)

BLAS/cuBLAS expects column-major layout. Most C/C++/Python arrays are row-major. The key identity is that a row-major matrix has the same byte layout as its transpose in column-major. For \(C = AB\):

\[A_{\text{row}} \equiv (A^T)_{\text{col}}, \quad B_{\text{row}} \equiv (B^T)_{\text{col}}\] \[C^T = B^T A^T\]

So you can swap operands and flip transpose flags to get the same result, with no explicit transpose or data movement. This is why frameworks “just work” with cuBLAS: they pass leading dimensions and op flags, not extra copy kernels.

Caveat: This is free only for contiguous (or simple stride) layouts. For non-contiguous views, you may need a real copy or a different kernel.
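A short NumPy check makes the zero-copy claim concrete (fresh example arrays X and Y, since the identity is about layout, not our 2048² matrices specifically):

import numpy as np

X = np.random.randn(512, 256).astype(np.float32)   # row-major (C order)
Y = np.random.randn(256, 128).astype(np.float32)

# X.T is a zero-copy view: same bytes, now column-major (Fortran order)
assert X.T.base is X and X.T.flags['F_CONTIGUOUS']

# C^T = B^T A^T: swap operands instead of moving data
C1 = X @ Y
C2 = (Y.T @ X.T).T
print(np.allclose(C1, C2, rtol=1e-4))   # True (up to FP32 rounding)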

15.6.5 5. Prefetching

BLAS explicitly prefetches the next tile while computing the current one:

Time: |--compute tile 0--|--compute tile 1--|--compute tile 2--|
Data: |--load tile 0-----|--load tile 1-----|--load tile 2-----|
                     ↑                  ↑
                     Prefetch tile 1    Prefetch tile 2

Memory latency is hidden by overlapping with computation.

15.6.6 BLAS Benchmark

NumPy (calls BLAS):   ~40 ms
Our best tiled:       ~400 ms
Speedup:              10×

BLAS GFLOPS:          430 GFLOPS (multi-core)
Our GFLOPS:           43 GFLOPS (single-core)
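To see which BLAS your NumPy dispatches to (OpenBLAS, MKL, Accelerate, ...):

import numpy as np
np.show_config()   # lists the BLAS/LAPACK libraries NumPy was built against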

15.7 GPU: A Different Beast

GPUs change the game entirely. Let’s see why.

15.7.1 GPU Architecture Recap

GPU (e.g., A100):
  - 108 Streaming Multiprocessors (SMs)
  - Each SM: 64 CUDA cores
  - Total: 6912 cores
  - Memory bandwidth: 2 TB/s
  - Peak FP32: 19.5 TFLOPS
  - SRAM per SM: 192 KB
  - HBM (main memory): 80 GB

The key insight: GPUs are massively parallel but have the same memory hierarchy challenge.

15.7.2 Naive GPU Matmul

from numba import cuda

# One thread per output element
@cuda.jit
def matmul_naive_gpu(A, B, C):
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        acc = 0.0
        for k in range(A.shape[1]):
            acc += A[i, k] * B[k, j]
        C[i, j] = acc
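Launching it requires a grid that covers the 2048×2048 output. A sketch with Numba CUDA, reusing A and B from the setup (requires a CUDA-capable GPU; the 16×16 block shape is a tunable choice, not a requirement):

import math
import numpy as np
from numba import cuda

d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.to_device(np.zeros((M, N), dtype=np.float32))

threads_per_block = (16, 16)                         # 256 threads per block
blocks_per_grid = (math.ceil(M / 16), math.ceil(N / 16))

matmul_naive_gpu[blocks_per_grid, threads_per_block](d_A, d_B, d_C)
cuda.synchronize()
C_gpu = d_C.copy_to_host()
print(np.allclose(C_gpu, A @ B, rtol=1e-3, atol=1e-2))   # True (up to FP32 rounding)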

This is slow because: 1. Each thread independently loads A[i,:] and B[:,j] from global memory 2. No data sharing between threads 3. Memory bandwidth saturated, compute idle

15.7.3 Tiled GPU Matmul (Shared Memory)

from numba import cuda, float32

TILE_SIZE = 32

@cuda.jit
def matmul_tiled_gpu(A, B, C):
    # Shared memory for tiles (fast on-chip SRAM, shared by the thread block)
    sA = cuda.shared.array((TILE_SIZE, TILE_SIZE), dtype=float32)
    sB = cuda.shared.array((TILE_SIZE, TILE_SIZE), dtype=float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y

    row = by * TILE_SIZE + ty
    col = bx * TILE_SIZE + tx

    acc = 0.0

    for k0 in range(0, A.shape[1], TILE_SIZE):
        # Cooperatively load tiles into shared memory,
        # zero-filling out-of-range entries so stale values are never reused
        if row < A.shape[0] and k0 + tx < A.shape[1]:
            sA[ty, tx] = A[row, k0 + tx]
        else:
            sA[ty, tx] = 0.0
        if k0 + ty < B.shape[0] and col < B.shape[1]:
            sB[ty, tx] = B[k0 + ty, col]
        else:
            sB[ty, tx] = 0.0

        cuda.syncthreads()

        # Compute partial result using shared memory
        for k in range(TILE_SIZE):
            acc += sA[ty, k] * sB[k, tx]

        cuda.syncthreads()

    if row < C.shape[0] and col < C.shape[1]:
        C[row, col] = acc

What’s happening: 1. Each thread block loads a tile of A and B into shared memory (fast SRAM) 2. All 32×32 = 1024 threads in the block share these tiles 3. Each thread computes one output element using the shared tiles 4. Tiles are loaded once, used 32× (by each thread in a row/column)

15.7.4 Tensor Cores

Modern GPUs have dedicated matrix multiply units:

Tensor Core operation:
  D = A × B + C
  where A, B are 16×16 matrices (FP16)
        C, D are 16×16 matrices (FP32)

  One Tensor Core: 256 FLOPs per cycle
  A100 has 432 Tensor Cores
  Peak (theoretical, A100 SXM): 312 TFLOPS (FP16) or 156 TFLOPS (TF32)

Tensor Cores are purpose-built for matrix multiply. They achieve 10-20× higher throughput than CUDA cores.

15.7.5 GPU Benchmark

Size: 2048×2048

Naive GPU:              ~50 ms
Tiled GPU (shared mem): ~5 ms
cuBLAS (Tensor Cores):  ~0.5 ms

CPU (single-core):      ~400 ms (our best)
CPU (NumPy/BLAS):       ~40 ms
GPU (cuBLAS):           ~0.5 ms

GPU vs CPU speedup:     80×

15.8 Why Neural Networks Love GEMM

Matrix multiplication (GEMM = General Matrix Multiply) underlies almost every neural network operation:

15.8.1 Fully Connected Layers

# y = Wx + b
def linear(x, W, b):
    return x @ W.T + b

# For batch_size=32, input=768, output=3072:
# x: [32, 768], W: [3072, 768]
# This is matmul: [32, 768] × [768, 3072] = [32, 3072]

15.8.2 Attention

# Q, K, V projections
Q = input @ W_Q  # [seq, dim] × [dim, dim]
K = input @ W_K
V = input @ W_V

# Attention scores
scores = Q @ K.T  # [seq, dim] × [dim, seq] = [seq, seq]

# Output
output = softmax(scores) @ V  # [seq, seq] × [seq, dim]

Five matrix multiplications per attention layer.

15.8.3 Convolutions

# im2col transformation converts convolution to matmul
# Input: [batch, channels, H, W]
# Kernel: [out_channels, in_channels, kH, kW]

# im2col extracts patches: [batch * H_out * W_out, in_channels * kH * kW]
# Kernel reshaped: [out_channels, in_channels * kH * kW]

# Convolution becomes matmul:
# [batch * H_out * W_out, in_channels * kH * kW] × [in_channels * kH * kW, out_channels]
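A compact way to see this in NumPy (stride 1, no padding; sliding_window_view requires NumPy ≥ 1.20; this is a sketch, production libraries use fused implicit-GEMM kernels instead of materializing the patch matrix):

import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

batch, C_in, H, W = 2, 3, 8, 8
C_out, kH, kW = 4, 3, 3
x = np.random.randn(batch, C_in, H, W).astype(np.float32)
w = np.random.randn(C_out, C_in, kH, kW).astype(np.float32)

# im2col: extract all kH x kW patches
patches = sliding_window_view(x, (kH, kW), axis=(2, 3))   # [b, C_in, H_out, W_out, kH, kW]
patches = patches.transpose(0, 2, 3, 1, 4, 5)             # [b, H_out, W_out, C_in, kH, kW]
H_out, W_out = patches.shape[1], patches.shape[2]
cols = patches.reshape(batch * H_out * W_out, C_in * kH * kW)

# Convolution (cross-correlation, as in DL frameworks) as a single GEMM
out = cols @ w.reshape(C_out, -1).T                       # [b*H_out*W_out, C_out]
out = out.reshape(batch, H_out, W_out, C_out).transpose(0, 3, 1, 2)
print(out.shape)   # (2, 4, 6, 6)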

This is why GPUs (and TPUs) are optimized for matrix multiply. A 10× improvement in GEMM is a 10× improvement in neural network training.

15.9 The Properties at Work

Let’s identify which properties make fast matrix multiply possible:

15.9.1 Associativity

C[i,j] = Σ_k A[i,k] × B[k,j]

This sum is associative: we can compute in any order.
- Tile-by-tile: Σ over tiles, then Σ within tiles
- Parallel: Different threads compute different partial sums

Associativity enables tiling and parallelism.
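A two-line check of this reordering freedom (split the k-sum in half and recombine, exactly what tiled and "split-K" kernels do; tiny differences are FP32 rounding under a different summation order):

import numpy as np

X = np.random.randn(256, 512).astype(np.float32)
Y = np.random.randn(512, 128).astype(np.float32)

full = X @ Y
split_k = X[:, :256] @ Y[:256, :] + X[:, 256:] @ Y[256:, :]   # two partial sums over k
print(np.allclose(full, split_k, rtol=1e-4, atol=1e-5))        # True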

15.9.2 Locality

Tiling keeps working set small:
  - Process 64×64 tiles
  - 48 KB working set fits in L2
  - Each loaded element is reused ~64× before eviction

Locality enables cache efficiency.

15.9.3 Separability

The fundamental identity:

(A × B)[i,j] = Σ_k A[i,k] × B[k,j]
            = A[i,:] · B[:,j]

Output is the *inner product* of row i of A with column j of B.

The computation separates by output element—each can be computed independently.

This enables parallelism (each thread computes different outputs).

15.9.4 Hardware Co-Design

Tensor Cores are built for matrix multiply:

Input: Two 16×16 matrices
Output: One 16×16 matrix
Latency: A few cycles

This isn't general-purpose hardware accelerating matmul.
This is matmul-shaped hardware.

The algorithm (tiled GEMM) and the hardware (tensor cores, shared memory hierarchy) evolved together.

15.10 Property Audit

Property        Applies?   How it’s exploited
Associativity   Yes        Matmul can be decomposed into independent sub-problems (tiles) and recombined; the result is the same regardless of computation order
Separability    Partial    Each output element is an inner product (separable across the K dimension). Strassen-type algorithms exploit deeper factorizations
Locality        Primary    Tiling keeps working sets in cache/SRAM. The 200× speedup is almost entirely from improving data locality
Sparsity        No         Dense matmul; no zeros to skip
Redundancy      Partial    FP16/INT8 tensor cores exploit that full FP32 precision is often unnecessary
Symmetry        No         General matrices have no symmetry to exploit (symmetric/Hermitian matrices do)

Dominant property: Locality — the entire optimization journey from naive to cuBLAS is about keeping data in fast memory.

15.11 Key Takeaways

  1. 200× speedup is possible: From naive loops to optimized BLAS/cuBLAS, the same algorithm runs 200× faster. All from exploiting memory hierarchy.

  2. Memory access pattern dominates: Loop reordering alone gives 4× speedup. The algorithm is the same; only the access pattern changes.

  3. Tiling transforms memory-bound to compute-bound: By keeping working sets in cache, we achieve the theoretical arithmetic intensity.

  4. GPU shared memory is like CPU cache: Tiling on GPU uses shared memory (fast SRAM) to enable data reuse across threads.

  5. Hardware is specialized: Tensor Cores exist because matrix multiply is that important. Modern ML is built on GEMM.

  6. All four properties contribute:

    • Associativity → parallelism and tiling
    • Locality → cache efficiency
    • Separability → independent computation per output
    • (Sparsity → structured-sparse GEMM, when the hardware can skip zeros)

15.12 The Meta-Lesson

Matrix multiply is the Drosophila of performance engineering. It’s simple enough to understand completely, yet rich enough to demonstrate every optimization technique.

When you see the 200× CPU speedup, you’re seeing: - 4× from access patterns (loop order) - 5× from tiling (cache reuse) - 10× from what BLAS adds (SIMD vectorization, register blocking, multi-threading). Moving to the GPU and Tensor Cores multiplies in roughly another 80×.

These factors are multiplicative. Miss any one, and you leave 4-10× on the table.

The best algorithms aren’t just correct. They’re aligned with hardware.

Note: Try It Yourself

The accompanying notebook lets you:

  • Implement and benchmark naive, reordered, and tiled matmul
  • Visualize cache behavior with different tile sizes
  • Compare CPU vs GPU performance
  • Profile memory bandwidth vs compute utilization


15.13 Further Reading

  • Goto & Van De Geijn (2008). “Anatomy of High-Performance Matrix Multiplication”
  • Low et al. (2016). “Analytical Modeling Is Enough for High-Performance BLIS”
  • Jia et al. (2018). “Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking”
  • NVIDIA CUTLASS: https://github.com/NVIDIA/cutlass