8  Associativity

How Regrouping Enables Parallelism and Streaming

You have a billion numbers. You need their average.

The naive approach: load all billion into memory, sum them, divide.

The problem: a billion float64s is 8 GB. What if you only have 1 GB of RAM?

Historical Note: Egyptian Multiplication (2000 BCE)

The exploitation of associativity for efficiency dates back four millennia. Egyptian scribes multiplied by repeated doubling, using O(log n) doublings instead of O(n) repeated additions. Their insight: to compute, say, 13 × 24, write 13 = 8 + 4 + 1 and split the product into 8×24 + 4×24 + 1×24; because addition is associative, each term can be built by doubling 24 (and doubling the result again) rather than adding 24 over and over.

This is the same principle behind parallel scan, distributed reduction, and FlashAttention’s streaming softmax. The algorithm is ancient; the scale is modern.

8.1 The Property That Enables Everything

Some operations can be computed in any order:

\[a + (b + c) = (a + b) + c\]

This is associativity. It seems abstract, almost trivial. It’s neither.

Associativity is the license to:

  • Chunk: Process data in pieces
  • Parallelize: Combine partial results from multiple workers
  • Stream: Process data as it arrives, without storing everything
  • Checkpoint: Save intermediate state and resume later

Without associativity, you must process everything at once. With it, you can process anything, no matter how large.

8.2 From Addition to Architecture

Let’s trace how this abstract property becomes concrete performance.

8.2.1 Summing a Billion Numbers

The naive sum:

def naive_sum(numbers):
    total = 0
    for x in numbers:
        total += x
    return total

This works, but it requires all numbers in memory. Can we do better?

Associativity says: the grouping doesn’t matter.

# These are mathematically identical:
(a + b + c + d) + (e + f + g + h)  # Two chunks
((a + b) + (c + d)) + ((e + f) + (g + h))  # Four chunks

So we can process chunks:

from itertools import islice

def chunked_sum(numbers, chunk_size=1_000_000):
    total = 0
    it = iter(numbers)
    while chunk := list(islice(it, chunk_size)):
        total += sum(chunk)  # Process one chunk, then discard it
    return total

Same answer. But now we only need chunk_size numbers in memory, not all of them.

8.2.2 The Combinable State

The key insight: we can represent the “state” of a partial computation as something that combines with more data.

For summation, the state is just the running sum. But this pattern generalizes.

Operation   State                   Combine Rule
────────────────────────────────────────────────────────────────
Sum         sum                     sum₁ + sum₂
Count       count                   count₁ + count₂
Average     (sum, count)            (sum₁ + sum₂, count₁ + count₂)
Max         max                     max(max₁, max₂)
Variance    (sum, sum_sq, count)    combine pairwise (see 8.5.1)

The pattern: find the state that makes your operation associative.
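
As a minimal sketch of the pattern (the helper names here are illustrative, not from the table), the average reduces each chunk to a (sum, count) pair, and pairs combine by element-wise addition:

def partial_mean_state(chunk):
    """Reduce one chunk to a combinable (sum, count) state."""
    return (sum(chunk), len(chunk))

def combine_mean_states(s1, s2):
    """Merge two partial states; the grouping of the merges doesn't matter."""
    return (s1[0] + s2[0], s1[1] + s2[1])

# Same mean whether computed in one pass or chunk by chunk
data = [2.0, 4.0, 6.0, 8.0]
state = combine_mean_states(partial_mean_state(data[:2]), partial_mean_state(data[2:]))
assert state[0] / state[1] == sum(data) / len(data)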

8.3 Investigation: The Softmax Challenge

Softmax seems sequential. It needs the global maximum for numerical stability:

\[\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Naive implementation:

import numpy as np

def naive_softmax(x):
    # Shift for numerical stability (subtract max)
    x_max = x.max()  # Need ALL of x
    exp_x = np.exp(x - x_max)
    return exp_x / exp_x.sum()  # Need ALL of exp_x

This requires two passes over all the data:

  1. Find the maximum
  2. Compute the exponentials and their sum

Can we do it in one pass? Can we stream it?

8.3.1 The Insight: Tracking the Right State

The trick is recognizing that softmax has hidden associative structure.

Consider computing \(\sum_i e^{x_i}\) with numerical stability:

\[\sum_i e^{x_i} = e^{m} \sum_i e^{x_i - m}\]

where \(m = \max_i x_i\).

If we’re streaming and see new elements, the max might change. When it does:

\[e^{m_{old}} \cdot s_{old} = e^{m_{new}} \cdot s_{new}\]

So:

\[s_{new} = s_{old} \cdot e^{m_{old} - m_{new}} + e^{x_{new} - m_{new}}\]

The state is (max, scaled_sum). Here’s the one-pass algorithm:

def streaming_softmax_sum(stream):
    """Compute sum(exp(x)) in one pass, numerically stable."""
    m = float('-inf')  # Running max
    s = 0.0            # Running sum (scaled by current max)

    for x in stream:
        if x > m:
            # Max changed! Rescale the sum.
            s = s * np.exp(m - x) + 1.0
            m = x
        else:
            s = s + np.exp(x - m)

    return m, s  # Final sum is s * exp(m)

Let’s verify this works:

# Test
x = np.array([1.0, 3.0, 2.0, 5.0, 4.0])

# Naive (two-pass)
m_naive = x.max()
s_naive = np.exp(x - m_naive).sum()

# Streaming (one-pass)
m_stream, s_stream = streaming_softmax_sum(x)

print(f"Naive:     max={m_naive}, sum={s_naive:.6f}")
print(f"Streaming: max={m_stream}, sum={s_stream:.6f}")
# Both give: max=5.0, sum=1.571317

8.3.2 The Combine Operation

Now the crucial question: can two partial results be combined?

If we have two chunks with states \((m_1, s_1)\) and \((m_2, s_2)\):

def combine_softmax_states(state1, state2):
    m1, s1 = state1
    m2, s2 = state2
    m = max(m1, m2)

    # Rescale both sums to the new max
    s = s1 * np.exp(m1 - m) + s2 * np.exp(m2 - m)
    return (m, s)

Let’s verify:

# Split the array and combine
x1, x2 = x[:3], x[3:]

# Process each chunk
state1 = streaming_softmax_sum(x1)
state2 = streaming_softmax_sum(x2)

# Combine
m_combined, s_combined = combine_softmax_states(state1, state2)

print(f"Combined:  max={m_combined}, sum={s_combined:.6f}")
# Same result: max=5.0, sum=1.571317

The softmax denominator is associative. Not obviously, but once you find the right state, it combines.

8.4 Preview: From Softmax to FlashAttention

This streaming softmax is the mathematical foundation of FlashAttention.

The key insight: we can extend the (max, sum) state to include the output accumulator. Standard attention needs O(n²) memory to store the attention matrix; FlashAttention uses the streaming approach to reduce this to O(n).

Standard attention memory:
  S = Q @ K.T:  O(n²)  ← The killer
  P (softmax):  O(n²)  ← Also killer

FlashAttention memory:
  State (max, sum, output): O(n × d)
  Per-block intermediates:  O(block_size × d)
  Total:                    O(n × d)

For n = 32,768, d = 128:
  Standard: 4 GB per attention layer
  Flash:    16 MB per attention layer
  Reduction: 256×
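
To preview the shape of that extension (a hedged sketch under simplified assumptions, not the actual FlashAttention kernel; the function name is made up for illustration), each query row carries a state (m, s, o), where o is the value-weighted sum accumulated under the current max, and two blocks combine much like the softmax states above:

import numpy as np

def combine_attention_states(state1, state2):
    """Merge two partial per-row attention states (max, scaled_sum, scaled_output)."""
    m1, s1, o1 = state1
    m2, s2, o2 = state2
    m = max(m1, m2)
    c1, c2 = np.exp(m1 - m), np.exp(m2 - m)  # rescale both blocks to the new max
    s = s1 * c1 + s2 * c2                    # softmax denominator
    o = o1 * c1 + o2 * c2                    # unnormalized output (a length-d vector)
    return (m, s, o)                         # final row output = o / s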

Chapter 10 derives FlashAttention in full detail, including:

  • The complete tiled algorithm
  • Why failed approaches (sparse attention, gradient checkpointing) don’t solve this problem
  • The backward pass with recomputation
  • Hardware-specific tuning

The core insight—streaming softmax via (max, sum) state—is what we’ve developed here.

8.5 The General Pattern

Finding associative structure follows a pattern:

  1. Identify what you need at the end: For softmax, you need the normalized probabilities

  2. Ask what state enables incremental update: For softmax, it’s (max, scaled_sum). For attention, it’s (max, scaled_sum, scaled_output)

  3. Derive the correction factor: When state changes (max increases), how do you update? Usually involves a multiplicative correction

  4. Verify the combine operation: Can you merge two partial states? If yes, you have associativity
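
Concretely, the whole pattern fits one reusable skeleton (a sketch with illustrative names, not a library API): reduce each chunk to a state, fold the states together with an associative combine, then finalize.

from functools import reduce

def chunked_reduce(data_chunks, init, update, combine, finalize):
    """Reduce each chunk to a state, merge the states pairwise, then finalize."""
    def reduce_chunk(chunk):
        state = init
        for x in chunk:
            state = update(state, x)
        return state
    return finalize(reduce(combine, (reduce_chunk(c) for c in data_chunks)))

# Example: a global max, where the state, update, and combine are all just `max`
print(chunked_reduce(
    data_chunks=[[1, 7, 3], [9, 2], [4, 4]],
    init=float('-inf'),
    update=max,
    combine=max,
    finalize=lambda s: s,
))  # 9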

8.5.1 Examples Beyond Softmax

Online Variance (Welford’s Algorithm)

The naive variance computation needs two passes:

  1. Compute the mean
  2. Compute the squared deviations from the mean

But there’s a one-pass algorithm with state (count, mean, M2):

def welford_update(state, x):
    count, mean, M2 = state
    count += 1
    delta = x - mean
    mean += delta / count
    delta2 = x - mean
    M2 += delta * delta2
    return (count, mean, M2)

def welford_combine(state1, state2):
    """Combine two partial variance computations."""
    n1, mean1, M2_1 = state1
    n2, mean2, M2_2 = state2

    n = n1 + n2
    delta = mean2 - mean1
    mean = mean1 + delta * n2 / n
    M2 = M2_1 + M2_2 + delta * delta * n1 * n2 / n

    return (n, mean, M2)

# Variance is M2 / count
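
A quick check (this snippet is illustrative; the rng seed and chunk split are arbitrary) that the chunked combination matches a direct computation:

import numpy as np

def welford_state(xs):
    """Fold a chunk of values into a (count, mean, M2) state."""
    state = (0, 0.0, 0.0)
    for x in xs:
        state = welford_update(state, x)
    return state

data = np.random.default_rng(0).normal(size=1000)
n, mean, M2 = welford_combine(welford_state(data[:400]), welford_state(data[400:]))

assert np.isclose(mean, data.mean())
assert np.isclose(M2 / n, data.var())  # population variance (ddof=0)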

Parallel Prefix Sum

Given [a, b, c, d, e, f, g, h], compute running sums [a, a+b, a+b+c, …].

Seems sequential. But associativity enables a parallel algorithm:

Step 1: Pairwise sums
  [a,   b,   c,   d,   e,   f,   g,   h]
   └─+──┘    └─+──┘    └─+──┘    └─+──┘
  [a, a+b,   c, c+d,   e, e+f,   g, g+h]

Step 2: Sums of pairs
  [a, a+b,   c, c+d,   e, e+f,   g, g+h]
       └──────+──┘         └──────+──┘
  [a, a+b,   c, a..d,  e, e+f,   g, e..h]

Step 3: Continue pattern...

This is the basis of GPU parallel scan, enabling O(log n) depth with O(n) work.

An interactive visualization in the online version steps through how parallel scan (prefix sum) uses associativity to achieve O(log n) parallel depth.

Key insight: Because addition is associative, we can reorder the computation. Instead of n-1 sequential adds, we use O(log n) parallel rounds. The up-sweep computes partial sums; the down-sweep distributes them—all enabled by the freedom to regroup.
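
To make the idea concrete, here is a minimal sketch of a Blelloch-style scan (it computes the exclusive prefix sum, i.e. the running sums shifted by one position, and assumes the length is a power of two):

def blelloch_scan(values):
    """Exclusive prefix sum via an up-sweep and a down-sweep (length must be a power of two)."""
    x = list(values)
    n = len(x)

    # Up-sweep: build a tree of partial sums in place
    step = 1
    while step < n:
        for i in range(2 * step - 1, n, 2 * step):
            x[i] += x[i - step]
        step *= 2

    # Down-sweep: push the prefixes back down the tree
    x[n - 1] = 0
    step = n // 2
    while step >= 1:
        for i in range(2 * step - 1, n, 2 * step):
            x[i - step], x[i] = x[i], x[i] + x[i - step]
        step //= 2
    return x

print(blelloch_scan([1, 2, 3, 4, 5, 6, 7, 8]))
# [0, 1, 3, 6, 10, 15, 21, 28]

Here the loops run sequentially; on a GPU, each pass over the indices becomes one parallel round, which is where the O(log n) depth comes from.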

8.6 When Associativity Breaks

Not everything associates. Recognize these cases:

Median: No associative structure. The median of medians is not the global median.

# Counterexample: same chunk medians, different global medians
chunk1 = [1, 2, 3]      # median = 2
chunk2 = [4, 5, 6]      # median = 5
# combined: [1, 2, 3, 4, 5, 6]    → global median = 3.5

other1 = [2, 2, 100]    # median = 2
other2 = [4, 5, 6]      # median = 5
# combined: [2, 2, 4, 5, 6, 100]  → global median = 4.5

# The chunk medians (2, 5) are identical in both cases, so no combine
# rule built from them alone can recover the global median.

Mode: The mode of modes is not the global mode.

Percentiles: Generally not associative (though there are approximate streaming algorithms).

Floating-Point Gotchas: Mathematically, addition is associative. On computers:

>>> (1e-16 + 1.0) - 1.0
0.0
>>> 1e-16 + (1.0 - 1.0)
1e-16

Floating-point addition is not associative due to rounding. For most purposes, we pretend it is—but be aware when precision matters.
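
A small experiment (illustrative; the seed and chunk size are arbitrary) shows that regrouping a float sum changes the result slightly:

import math
import random

random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(1_000_000)]

naive = sum(xs)                                                      # left-to-right
chunked = sum(sum(xs[i:i + 1000]) for i in range(0, len(xs), 1000))  # regrouped
exact = math.fsum(xs)                                                # correctly rounded reference

print(naive - exact, chunked - exact)
# Both differences are tiny, but typically nonzero and not equal to each other.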

8.7 The Hardware Connection

Associativity’s value comes from how it interacts with hardware:

Memory Hierarchy (Chapter 1): Chunking lets you fit working sets in cache. FlashAttention’s blocks are sized to fit in GPU SRAM.

Bandwidth (Chapter 2): Streaming algorithms read data once rather than multiple times. FlashAttention reduces memory traffic by 4-8× beyond just reducing memory footprint.

Parallelism (Chapter 3): Associative operations enable tree reduction, the fundamental parallel primitive.

The Connection:

Mathematical Property     Hardware Constraint      Exploitation
───────────────────────────────────────────────────────────────
Associativity        →    Limited SRAM         →  Tiled/blocked algorithms
                     →    Memory bandwidth     →  Single-pass streaming
                     →    Parallel cores       →  Tree reduction

8.8 Key Takeaways

  1. Associativity is a license: It permits chunking, streaming, and parallelization

  2. The challenge is finding state: The operation itself might not look associative, but there may be hidden structure. Softmax’s (max, sum) is the canonical example

  3. Correction factors are the key: When state changes (max shifts), the correction factor lets you update without recomputing

  4. The pattern is learnable: Ask “what state would let me combine partial results?” This question guides discovery

  5. Hardware rewards associativity: The property aligns with every level of the memory hierarchy and parallel execution

Try It Yourself

The accompanying notebook lets you:

  • Implement and verify streaming softmax
  • Explore the combine operation
  • Build a simplified FlashAttention from scratch
  • Measure memory savings


8.9 Further Reading

  • Milakov & Gimelshein (2018). “Online normalizer calculation for softmax” - The mathematical foundation
  • Dao et al. (2022). “FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness”
  • Welford (1962). “Note on a method for calculating corrected sums of squares and products” - The classic streaming variance