4  The Memory Hierarchy

Understanding the True Cost of Data Access


In 1945, John von Neumann described a computer architecture where memory is uniform. Every address takes the same time to access. Every byte is equally close to the processor.

Modern hardware is nothing like this. Memory is a hierarchy—and understanding that hierarchy is the first step to understanding performance.

4.1 The Lie

Computer science education teaches a beautiful abstraction: the Random Access Machine. In this model:

  • Memory is a flat array of cells
  • Accessing any cell takes constant time: O(1)
  • All memory operations are equivalent

This abstraction is elegant. It’s simple. It lets us analyze algorithms without getting lost in hardware details.

And on modern machines, it’s a dangerous lie.

4.2 The Reality

On my laptop, here’s what memory access actually costs:

Location      Latency          Relative to Register
Register      ~0.5 ns          1×
L1 Cache      ~1 ns            2×
L2 Cache      ~4 ns            8×
L3 Cache      ~12 ns           24×
DRAM          ~100 ns          200×
SSD           ~100,000 ns      200,000×

That’s more than five orders of magnitude from fastest to slowest. Scaled up: if a register access took one second, an SSD read would take more than two days.

The RAM model says these are all “O(1).” The machine disagrees.
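
To make the spread concrete, the snippet below rescales the table so that a register access corresponds to one second. The nanosecond figures are the illustrative values from the table above, not measurements of your machine.

latencies_ns = {
    "Register": 0.5,
    "L1 Cache": 1,
    "L2 Cache": 4,
    "L3 Cache": 12,
    "DRAM": 100,
    "SSD": 100_000,
}

register_ns = latencies_ns["Register"]
for name, ns in latencies_ns.items():
    scaled = ns / register_ns  # "seconds", if a register access took one second
    print(f"{name:8s}  {ns:>9,.1f} ns  ->  {scaled:>11,.1f} s  (~{scaled / 86_400:.2f} days)")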

┌─────────────────────────────────────────────────────────────────────────────┐
│                        THE MEMORY HIERARCHY                                 │
│                                                                             │
│    ┌─────────┐                                                              │
│    │Registers│  ~0.5 ns    Fastest, smallest (~1 KB)                       │
│    │ (1 KB)  │             Compiler-managed                                │
│    └────┬────┘                                                              │
│         │                                                                   │
│    ┌────▼────┐                                                              │
│    │L1 Cache │  ~1 ns      Per-core, split I/D (~64 KB)                    │
│    │ (64 KB) │             Hardware-managed                                │
│    └────┬────┘                                                              │
│         │                                                                   │
│    ┌────▼────┐                                                              │
│    │L2 Cache │  ~4 ns      Per-core (~256 KB - 1 MB)                       │
│    │(256 KB) │             Hardware-managed                                │
│    └────┬────┘                                                              │
│         │                                                                   │
│    ┌────▼────┐                                                              │
│    │L3 Cache │  ~12 ns     Shared across cores (~32 MB)                    │
│    │ (32 MB) │             Hardware-managed                                │
│    └────┬────┘                                                              │
│         │                                                                   │
│    ┌────▼────┐                                                              │
│    │  DRAM   │  ~100 ns    Main memory (~16-256 GB)                        │
│    │ (32 GB) │             OS-managed                                      │
│    └────┬────┘                                                              │
│         │                                                                   │
│    ┌────▼────┐                                                              │
│    │  SSD    │  ~100 μs    Persistent storage (~1-8 TB)                    │
│    │ (1 TB)  │             OS/filesystem-managed                           │
│    └─────────┘                                                              │
│                                                                             │
│    Each level: ~10× slower, ~10× larger                                    │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
Interactive: Memory Hierarchy Explorer

Adjust the data size to see where it fits in the memory hierarchy, and understand the performance implications.

4.3 Why the Hierarchy Exists

The memory hierarchy isn’t an accident. It’s a response to a fundamental physical constraint: the speed of light.

In one nanosecond, light travels about 30 centimeters. Electrical signals in copper move slower, roughly 20 cm/ns. DRAM sits several centimeters from the CPU, and the chips themselves need tens of nanoseconds to locate and drive out the requested bits, so a round trip to main memory simply cannot complete in the fraction of a nanosecond that a register or L1 access takes.

Additionally, there’s a fundamental tradeoff between speed and capacity:

  • Fast memory requires expensive, power-hungry circuitry (SRAM: six transistors per bit)
  • Large memory requires dense, cheap circuitry (DRAM: one transistor and a capacitor per bit)
  • You can’t have both: physics won’t allow it

The hierarchy is the engineering compromise. Keep frequently-used data close (fast, small). Push infrequently-used data far (slow, large).

4.4 The Investigation: Sequential vs. Random Access

Let’s make this concrete with an investigation.

The setup: We have an array of 100 million integers. We want to sum them.

Implementation 1: Sequential access

def sum_sequential(arr):
    total = 0
    for i in range(len(arr)):
        total += arr[i]  # Access arr[0], arr[1], arr[2], ...
    return total

Implementation 2: Random access

import random

def sum_random(arr, indices):
    total = 0
    for i in indices:  # Visit the same elements, but in a random order
        total += arr[i]
    return total

# indices is a random permutation of [0, 1, 2, ..., n-1], e.g.:
#   indices = list(range(len(arr)))
#   random.shuffle(indices)

Both do the same arithmetic. Both access the same elements. Both are O(n).

The prediction (RAM model): Same performance.

The reality: Let’s measure.

import numpy as np
import time

n = 100_000_000
arr = np.arange(n, dtype=np.int64)
indices = np.random.permutation(n)

# Sequential
start = time.perf_counter()
total_seq = arr.sum()  # NumPy uses sequential access internally
time_seq = time.perf_counter() - start

# Random
start = time.perf_counter()
total_rand = arr[indices].sum()  # Forces random access pattern
time_rand = time.perf_counter() - start

print(f"Sequential: {time_seq:.3f}s")
print(f"Random:     {time_rand:.3f}s")
print(f"Ratio:      {time_rand/time_seq:.1f}×")

On my machine, the results:

Sequential: 0.05s
Random:     0.89s
Ratio:      17.8×

Same algorithm. Same elements. 18× slower.

The RAM model has no explanation for this. The memory hierarchy does.

4.5 Why Sequential Access Wins

When you access arr[0], the CPU doesn’t just fetch that one element. It fetches an entire cache line—typically 64 bytes.

For 8-byte integers, that’s 8 elements. So when you access arr[0], you get arr[0] through arr[7] for free.

┌─────────────────────────────────────────────────────────────────────────────┐
│                        CACHE LINE FETCHING                                  │
│                                                                             │
│   Request: arr[0]                                                           │
│                                                                             │
│   What happens:                                                             │
│   ┌─────────────────────────────────────────────────────────────────────┐   │
│   │ Memory │ 0 │ 1 │ 2 │ 3 │ 4 │ 5 │ 6 │ 7 │ 8 │ 9 │ ...              │   │
│   └────────┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴──────────────────┘   │
│              └───────────────────────────────┘                              │
│                   Entire cache line fetched                                 │
│                       (64 bytes = 8 int64s)                                 │
│                                                                             │
│   Sequential access:                                                        │
│   - Request arr[0] → fetch arr[0:8] → MISS                                 │
│   - Request arr[1] → already in cache → HIT                                │
│   - Request arr[2] → already in cache → HIT                                │
│   - ...                                                                     │
│   - Request arr[7] → already in cache → HIT                                │
│   - Request arr[8] → fetch arr[8:16] → MISS                                │
│                                                                             │
│   Hit rate: 7/8 = 87.5%                                                    │
│                                                                             │
│   Random access:                                                            │
│   - Request arr[47382] → fetch line → MISS                                 │
│   - Request arr[91024] → different line → MISS                             │
│   - Request arr[3847]  → different line → MISS                             │
│   - ...                                                                     │
│                                                                             │
│   Hit rate: ~0% (each access to a different line)                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘

Sequential access: 87.5% cache hit rate. Most accesses are free.

Random access: ~0% cache hit rate. Every access pays the full DRAM latency.

The 18× slowdown is the cost of cache misses.
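
You can see the cache line directly with a strided variant of the same experiment. With 8-byte integers and 64-byte lines, summing every 8th element does one eighth of the arithmetic but still touches every cache line in the array, so it cannot be anywhere near 8× faster. A minimal sketch, assuming a 64-byte line and enough RAM for an 800 MB array (the timed helper and the exact ratio are illustrative):

import numpy as np
import time

n = 100_000_000
arr = np.arange(n, dtype=np.int64)  # 800 MB: far larger than any cache

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

t_full   = timed(lambda: arr.sum())       # stride 1: uses all 8 ints in each 64-byte line
t_stride = timed(lambda: arr[::8].sum())  # stride 8: uses 1 int per line, but still fetches the whole line

# The RAM model predicts ~8x for the strided sum (1/8 of the additions).
# Both versions stream the same cache lines from DRAM, so the observed
# speedup is typically far below 8x -- sometimes barely above 1x.
print(f"stride 1: {t_full:.3f}s  stride 8: {t_stride:.3f}s  "
      f"speedup: {t_full / t_stride:.1f}x (RAM model predicts ~8x)")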

4.6 The Prefetcher: Hardware Helping You

Modern CPUs have prefetchers—hardware that predicts future accesses and loads data before you ask.

For sequential access, the pattern is obvious: if you accessed addresses 0, 64, 128, the prefetcher guesses you’ll want 192 next. It starts the DRAM fetch in advance, hiding the latency.

For random access, there’s no pattern to predict. The prefetcher is useless. Every access stalls waiting for DRAM.

┌─────────────────────────────────────────────────────────────────────────────┐
│                           PREFETCHING                                       │
│                                                                             │
│   SEQUENTIAL ACCESS:                                                        │
│   Time →                                                                    │
│   ├──────┼──────┼──────┼──────┼──────┼──────┤                              │
│   │Fetch │Fetch │Fetch │Fetch │Fetch │Fetch │                              │
│   │Line 0│Line 1│Line 2│Line 3│Line 4│Line 5│                              │
│   └──────┴──────┴──────┴──────┴──────┴──────┘                              │
│   ├──────┼──────┼──────┼──────┼──────┼──────┤                              │
│   │Use   │Use   │Use   │Use   │Use   │Use   │                              │
│   │Line 0│Line 1│Line 2│Line 3│Line 4│Line 5│                              │
│   └──────┴──────┴──────┴──────┴──────┴──────┘                              │
│                                                                             │
│   Prefetcher fetches ahead. Use and fetch overlap. No stalls.              │
│                                                                             │
│   RANDOM ACCESS:                                                            │
│   Time →                                                                    │
│   ├──────┼─┼──────┼─┼──────┼─┼──────┼─┤                                    │
│   │Fetch │U│Fetch │U│Fetch │U│Fetch │U│                                    │
│   │Line A│ │Line B│ │Line C│ │Line D│ │                                    │
│   └──────┴─┴──────┴─┴──────┴─┴──────┴─┘                                    │
│            ↑        ↑        ↑        ↑                                     │
│          Stall    Stall    Stall    Stall                                   │
│                                                                             │
│   Prefetcher can't predict. CPU stalls waiting for each fetch.             │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
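
One way to isolate the access pattern itself is to repeat the gather from section 4.4 with the same fancy-indexing operation and the same set of indices, once in ascending order and once shuffled. Both runs read the same elements and write the same amount of output; only the order of reads from arr differs, so the gap comes from spatial locality plus the prefetcher. A sketch under the same assumptions as before (an 800 MB array; ratios vary by machine):

import numpy as np
import time

n = 100_000_000
arr = np.arange(n, dtype=np.int64)
shuffled = np.random.permutation(n)
ordered = np.sort(shuffled)  # same index values, ascending order

def timed(fn):
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start

# Both gathers touch exactly the same n elements; only the order differs.
t_ordered  = timed(lambda: arr[ordered].sum())   # ascending: a pattern the prefetcher can follow
t_shuffled = timed(lambda: arr[shuffled].sum())  # shuffled: no pattern for the prefetcher to follow

print(f"ordered: {t_ordered:.3f}s  shuffled: {t_shuffled:.3f}s  "
      f"ratio: {t_shuffled / t_ordered:.1f}x")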

4.7 The Deeper Lesson: Locality

The memory hierarchy rewards a property called locality:

Temporal locality: If you access an address, you’ll likely access it again soon. Keep it in cache.

Spatial locality: If you access an address, you’ll likely access nearby addresses soon. Fetch them together.

Sequential array access has perfect spatial locality. Random access has none.
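
The same idea shows up in two dimensions. A NumPy array is stored row by row (row-major order by default), so summing row by row walks memory sequentially, while summing column by column jumps roughly 64 KB between consecutive elements. A rough sketch; the shape and the measured ratio are illustrative:

import numpy as np
import time

a = np.random.rand(8_000, 8_000)  # ~512 MB, row-major (C order)

start = time.perf_counter()
row_total = sum(a[i, :].sum() for i in range(a.shape[0]))  # each row is contiguous in memory
t_rows = time.perf_counter() - start

start = time.perf_counter()
col_total = sum(a[:, j].sum() for j in range(a.shape[1]))  # each column strides ~64 KB per element
t_cols = time.perf_counter() - start

print(f"row-wise: {t_rows:.3f}s  column-wise: {t_cols:.3f}s  ratio: {t_cols / t_rows:.1f}x")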

This insight generalizes far beyond array traversal:

  • Matrix multiplication can be tiled to keep working sets in cache (sketched below)
  • Tree traversals can be cache-oblivious with careful layout
  • Hash tables have poor locality (random access by design)—but open addressing with linear probing keeps each probe sequence within a cache line or two
  • Linked lists have terrible locality (nodes scattered in memory)—this is why arrays usually beat them

The pattern: Algorithms that respect locality beat algorithms that don’t, even if the RAM model says they’re equivalent.
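
To make the first bullet concrete, here is a minimal sketch of blocked (tiled) matrix multiplication. The tile size of 256 is an illustrative choice, and NumPy's @ operator already does this (and much more) inside its BLAS library, so the point is the loop structure, not the speed:

import numpy as np

def matmul_tiled(A, B, tile=256):
    # Multiply in tile x tile blocks so the three active blocks
    # (~1.5 MB total at tile=256, float64) stay cache-resident while reused.
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m), dtype=A.dtype)
    for i in range(0, n, tile):
        for p in range(0, k, tile):
            for j in range(0, m, tile):
                # Slicing clamps at the edges, so non-multiple sizes work too.
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A = np.random.rand(1_024, 1_024)
B = np.random.rand(1_024, 1_024)
assert np.allclose(matmul_tiled(A, B), A @ B)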

4.8 What This Means for You

The RAM model isn’t useless—it’s a useful first approximation. But for performance-critical code, you need to see deeper.

When analyzing an algorithm, ask:

  1. What’s the access pattern? Sequential? Strided? Random?
  2. What’s the working set size? Does it fit in L1? L2? L3? (see the sizing sketch below)
  3. Can I restructure to improve locality? Tiling? Layout changes? Blocking?

The answers to these questions often matter more than asymptotic complexity. An O(n²) algorithm with good locality can beat an O(n log n) algorithm with poor locality for practical n.
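
For question 2, a back-of-the-envelope check is often enough. The helper below uses the illustrative cache sizes from the diagram earlier in this chapter (64 KB L1, 256 KB L2, 32 MB L3); your CPU's real sizes differ, and on Linux lscpu will report them.

# Rough "where does my working set live?" check, using the illustrative
# cache sizes from the hierarchy diagram above (real sizes vary by CPU).
CACHE_LEVELS = [("L1", 64 * 1024), ("L2", 256 * 1024), ("L3", 32 * 1024 * 1024)]

def working_set_level(n_elements, bytes_per_element=8):
    size = n_elements * bytes_per_element
    for name, capacity in CACHE_LEVELS:
        if size <= capacity:
            return f"{size / 1024:,.0f} KB -> fits in {name}"
    return f"{size / 2**20:,.0f} MB -> spills to DRAM"

print(working_set_level(4_000))         # 31 KB  -> fits in L1
print(working_set_level(100_000))       # 781 KB -> fits in L3
print(working_set_level(100_000_000))   # 763 MB -> spills to DRAM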

4.9 Looking Ahead

The memory hierarchy is just the first crack in the RAM model. The next chapter investigates another: the illusion that computation is the bottleneck.

On modern hardware, moving data costs more than computing on it. This leads to a different way of thinking about performance—one measured not in FLOPS, but in bytes.


Try It Yourself

The notebook for this chapter lets you:

  • Measure sequential vs. random access on your machine
  • Visualize cache miss rates with different access patterns
  • Explore how working set size affects performance


4.10 Key Takeaways

  1. The RAM model is a lie. Memory access time varies by 200× depending on where data lives.

  2. The memory hierarchy exists because of physics. Fast memory is small. Large memory is slow. You can’t have both.

  3. Sequential access beats random access. By 10-20× on typical hardware, due to caching and prefetching.

  4. Locality is the property that matters. Temporal locality (reuse data) and spatial locality (access neighbors) are rewarded by hardware.

  5. Access pattern often matters more than algorithm. An O(n²) algorithm with good locality can beat O(n log n) with poor locality.


4.11 Further Reading


Next: The Tyranny of Bandwidth →