3 The Algebraic Framework
A Unified Theory of Performance Optimization
Every performance optimization exploits a mathematical property. Understanding these properties transforms optimization from art to science.
This chapter establishes the framework that unifies the entire book: a vocabulary of six properties that enable performance transformations.
The previous chapter provided the mental model: thinking in arrays exposes structure that loops hide. Together, they form a complete foundation—you need the mental model to see structure, and the vocabulary to name and exploit it.
3.1 The Central Claim
This book makes a bold claim:
Every significant performance optimization can be traced to a mathematical property of the computation.
Not “some optimizations.” Not “many optimizations.” Every significant one.
This isn’t mysticism—it’s recognition that algorithms operate on mathematical structures, and those structures have properties that either enable or preclude certain transformations.
3.2 The Optimization Equation
Performance optimization solves a fundamental equation:
\[\text{Same Result} + \text{Fewer Resources} = \text{Optimization}\]
The challenge: How do we know a transformation preserves the result?
Answer: Mathematical properties guarantee it.
When we know a computation has a particular property, we know certain transformations are legal—they preserve correctness by mathematical necessity, not by accident.
3.3 The Six Fundamental Properties
Through extensive study of performance optimizations across domains, six properties emerge as fundamental:
THE PERIODIC TABLE OF OPTIMIZATION

| Property | Definition | Enables | Examples |
|---|---|---|---|
| Associativity | (a ⊕ b) ⊕ c = a ⊕ (b ⊕ c) | Chunking, streaming, parallelization, reduction trees | FlashAttention, parallel prefix, online softmax |
| Separability | A = UV^T | Factorization, low-rank approximation, decomposition, separation of variables | LoRA, depthwise convolution, SVD compression |
| Locality | f(x) depends only on nearby x | Tiling, blocking, caching, reordering | Blocked matmul, tiled convolution, cache-oblivious algorithms |
| Sparsity | Most elements are zero or negligible | Skipping zeros, conditional compute, attention sparsity, expert selection | MoE routing, top-k attention, pruning |
| Redundancy | Information content < representation size | Quantization, compression, caching, approximation | INT8/INT4 quantization, KV cache, prefix caching |
| Symmetry | f(x) = f(g(x)) for some transformation g | Weight sharing, equivariant networks, convolution, symmetry reduction | CNN weight sharing, group equivariance, Fourier methods |
3.3.1 Property 1: Associativity
Definition: An operation \(\oplus\) is associative if \((a \oplus b) \oplus c = a \oplus (b \oplus c)\).
Why it matters: Associativity means we can regroup without changing the result. This unlocks:
- Chunking: Process in pieces, combine results
- Streaming: Process incrementally, never holding everything in memory
- Parallelization: Different processors handle different groups, merge at end
- Pipelining: Start on next chunk before finishing current
The FlashAttention insight:
Standard softmax appears non-associative: \[\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\]
The denominator requires knowing all values. How can we chunk?
The insight: Softmax is associative with the right state representation: \[\text{state} = (\max, \sum e^{x-\max}, \sum e^{x-\max} \cdot v)\]
This (max, sum, weighted_sum) triple forms a monoid—it has associative combination. That’s why FlashAttention works.
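To see the associativity concretely, here is a minimal scalar-valued sketch of the chunk state and its merge; the names `chunk_state` and `combine` are illustrative, not taken from any FlashAttention implementation:

```python
import math

def chunk_state(xs, vs):
    """Softmax state of one chunk: (max, sum of exponentials, exp-weighted sum of values)."""
    m = max(xs)
    l = sum(math.exp(x - m) for x in xs)
    o = sum(math.exp(x - m) * v for x, v in zip(xs, vs))
    return (m, l, o)

def combine(s1, s2):
    """Associative merge: rescale each chunk by exp(old_max - new_max), then add."""
    (m1, l1, o1), (m2, l2, o2) = s1, s2
    m = max(m1, m2)
    a1, a2 = math.exp(m1 - m), math.exp(m2 - m)
    return (m, l1 * a1 + l2 * a2, o1 * a1 + o2 * a2)

# Chunked processing gives the same softmax-weighted average (o / l) as one full pass.
xs, vs = [0.3, 1.7, -0.5, 2.2], [1.0, 2.0, 3.0, 4.0]
full = chunk_state(xs, vs)
merged = combine(chunk_state(xs[:2], vs[:2]), chunk_state(xs[2:], vs[2:]))
assert abs(full[2] / full[1] - merged[2] / merged[1]) < 1e-12
```

The rescaling by exp(old max minus new max) re-bases both partial sums onto the shared maximum, which is what makes the merge order-independent.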
Recognizing associativity:
Ask: “If I split this computation in half, can I combine the halves?”
- Sum: Yes → (a+b) + (c+d) = a+b+c+d ✓
- Max: Yes → max(max(a,b), max(c,d)) = max(a,b,c,d) ✓
- Mean: Yes, with count → (sum₁, n₁) ⊕ (sum₂, n₂) = (sum₁+sum₂, n₁+n₂) ✓
- Median: No → median of medians ≠ median ✗
- Softmax: Yes, with (max, sum, output) state ✓
3.3.2 Property 2: Separability
Definition: A function is separable if \(f(x, y) = g(x) \cdot h(y)\) for some \(g\), \(h\).
A matrix is separable (rank-1) if \(A = uv^T\). More generally, low-rank if \(A = UV^T\) with small inner dimension.
Why it matters: Separable computations decompose into independent parts:
- Factorization: Store/compute factors instead of full matrix
- Decomposition: Split complex operations into simpler ones
- Independence: Compute factors in parallel
The LoRA insight:
Fine-tuning updates the weight matrix: \(W' = W + \Delta W\).
Key observation: \(\Delta W\) is often low-rank. If \(\Delta W \approx BA\) where \(B \in \mathbb{R}^{d \times r}\), \(A \in \mathbb{R}^{r \times k}\):
- Full update: \(d \times k\) parameters (millions)
- LoRA update: \(r \times (d + k)\) parameters (thousands when \(r \ll d, k\))
This works because fine-tuning doesn’t need the full expressivity of the weight space—it moves in a low-dimensional subspace.
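A minimal NumPy sketch of the factored update (the dimensions, rank, and initialization are illustrative, not any particular library's LoRA implementation):

```python
import numpy as np

d, k, r = 4096, 4096, 8            # hidden dims and LoRA rank (illustrative)
W = np.random.randn(d, k)          # frozen pretrained weight
B = np.zeros((d, r))               # factored update: delta_W ≈ B @ A, with B starting at zero
A = np.random.randn(r, k) * 0.01

x = np.random.randn(k)
y = W @ x + B @ (A @ x)            # the d×k update matrix is never materialized

full_params = d * k                # ~16.8M parameters for a dense update
lora_params = r * (d + k)          # 65,536 parameters for the factored update
print(f"trainable fraction: {lora_params / full_params:.4f}")
```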
Recognizing separability:
Ask: “Does this matrix/function have structure I’m not exploiting?”
- Weight updates in fine-tuning → Low-rank (LoRA)
- Convolution kernels → Separable filters (depthwise separable)
- Attention patterns → Low-rank attention (Linformer)
- Covariance matrices → Often low effective rank
3.3.3 Property 3: Locality
Definition: A function has locality if \(f(x_i)\) depends only on \(x_j\) where \(|i - j| < k\) for some small \(k\).
More generally: nearby inputs affect nearby outputs more than distant ones.
Why it matters: Locality means we can tile—process local regions independently:
- Blocking: Work on cache-sized blocks
- Tiling: Reuse data while it’s in fast memory
- Reordering: Process in cache-friendly order
- Distribution: Assign spatial regions to different processors
The tiled matrix multiply insight:
Naive matmul: For each output element, access a full row of A and column of B. \[C_{ij} = \sum_k A_{ik} \cdot B_{kj}\]
Problem: Rows of A and columns of B don’t fit in cache. Every access goes to DRAM.
Tiled matmul: Process in blocks that fit in cache.
For each block (i,j) of C:
    For each block k:
        Load A_block[i,k] into cache    # Fits!
        Load B_block[k,j] into cache    # Fits!
        C_block[i,j] += A_block @ B_block
The mathematical property: matrix multiplication has locality in its block structure. Each block of C is a sum of products of corresponding blocks of A and B, so the computation can proceed one cache-sized block at a time.
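A runnable NumPy sketch of the blocking idea (a real kernel does this in registers and shared memory rather than Python loops; the tile size is arbitrary):

```python
import numpy as np

def tiled_matmul(A, B, tile=64):
    """Blocked matrix multiply: each tile of A and B is reused while it is 'hot',
    which is what keeps the working set inside the cache."""
    n, k, m = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, m))
    for i in range(0, n, tile):
        for j in range(0, m, tile):
            for p in range(0, k, tile):
                C[i:i+tile, j:j+tile] += A[i:i+tile, p:p+tile] @ B[p:p+tile, j:j+tile]
    return C

A, B = np.random.randn(256, 256), np.random.randn(256, 256)
assert np.allclose(tiled_matmul(A, B), A @ B)   # blocking changes the order, not the result
```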
Recognizing locality:
Ask: “What’s the dependency radius of this computation?”
- Convolution → Kernel-sized locality ✓
- Matmul → Block locality (with reordering) ✓
- Global attention → No locality (everything depends on everything) ✗
- Local attention → Window-sized locality ✓
- FFT → Recursive locality (butterfly structure) ✓
3.3.4 Property 4: Sparsity
Definition: A structure is sparse if most of its elements are zero or negligible.
Why it matters: Sparsity means we can skip—avoid computation on zeros:
- Zero skipping: Don’t multiply by zero
- Conditional computation: Only activate relevant parts
- Pruning: Remove unimportant weights
- Selection: Choose subset of elements
The MoE insight:
Standard dense layer: Every input activates every parameter.
\[y = Wx \quad \text{(all parameters used)}\]
MoE observation: Different inputs need different “experts.”
\[y = \sum_{i \in \text{top-}k} g_i \cdot E_i(x) \quad \text{(only k experts used)}\]
The sparsity is in activation: most experts are inactive for any given input. This allows scaling parameters without scaling compute.
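A toy sketch of top-k routing; the gating scheme and the names `gate_W` and `experts` are illustrative rather than any specific MoE library's API:

```python
import numpy as np

def moe_forward(x, experts, gate_W, k=2):
    """Route x to the top-k experts only; the remaining experts are never evaluated."""
    scores = gate_W @ x                        # one routing score per expert
    top = np.argsort(scores)[-k:]              # indices of the k highest-scoring experts
    weights = np.exp(scores[top])
    weights /= weights.sum()                   # softmax over the selected experts only
    return sum(w * experts[i](x) for w, i in zip(weights, top))

d, n_experts = 16, 8
experts = [(lambda W: (lambda x: W @ x))(np.random.randn(d, d)) for _ in range(n_experts)]
gate_W = np.random.randn(n_experts, d)
y = moe_forward(np.random.randn(d), experts, gate_W, k=2)   # only 2 of 8 experts run
```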
Recognizing sparsity:
Ask: “Is the effective computation smaller than the apparent computation?”
- Attention matrices → Often sparse in practice (local patterns)
- Network activations → ReLU creates sparsity
- Expert routing → Only k of n experts active
- Weight matrices → Can be pruned to 90%+ sparsity
- Gradient updates → Often sparse (only some parameters matter)
3.3.5 Property 5: Redundancy
Definition: A representation has redundancy if it uses more bits than necessary to encode the information.
Why it matters: Redundancy means we can compress—use fewer bits without losing information:
- Quantization: Fewer bits per value
- Caching: Don’t recompute what’s already computed
- Deduplication: Store shared structure once
- Approximation: Discard redundant precision
The quantization insight:
Neural network weights are stored as FP32 (32 bits). But do they need 32 bits of precision?
Observation: Weights are highly redundant.
- They cluster around certain values
- The extra bits of precision rarely matter for accuracy
- Relative relationships matter more than absolute values
This is why INT8 and even INT4 quantization works: the information content of weights is far less than 32 bits per weight.
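A minimal sketch of symmetric per-tensor INT8 quantization (production schemes are usually per-channel or per-group and calibrate the scale more carefully):

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: 8-bit codes plus a single FP32 scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(4096).astype(np.float32)
q, scale = quantize_int8(w)
max_err = np.abs(w - dequantize(q, scale)).max()
print(f"4x smaller storage, max abs error ≈ {max_err:.4f}")
```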
The KV cache insight:
In autoregressive generation, we recompute attention over all previous tokens.
Observation: Keys and values for previous tokens don’t change.
This is pure redundancy—recomputing the same values. KV caching eliminates this redundancy by storing previously computed values.
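A minimal decode-loop sketch of the idea, with random vectors standing in for the real key/value projections; only the newest token's key and value are computed at each step:

```python
import numpy as np

def attend(q, K, V):
    """Single-query attention over all cached keys and values."""
    s = K @ q
    w = np.exp(s - s.max())
    w /= w.sum()
    return V.T @ w

d = 64
K_cache, V_cache = [], []
for step in range(8):                                        # autoregressive decoding
    k_new, v_new = np.random.randn(d), np.random.randn(d)    # computed from the new token only
    K_cache.append(k_new)                                    # older keys/values are reused,
    V_cache.append(v_new)                                    # never recomputed
    q = np.random.randn(d)
    out = attend(q, np.stack(K_cache), np.stack(V_cache))
```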
Recognizing redundancy:
Ask: “Am I representing or computing more than necessary?”
- Repeated computation → Cache it (KV cache, prefix cache)
- High-precision storage → Quantize it (INT8, INT4)
- Repeated prefixes → Share them (RadixAttention)
- Similar vectors → Cluster them (vocabulary quantization)
3.3.6 Property 6: Symmetry
Definition: A function has symmetry if \(f(g(x)) = h(f(x))\) for some transformations \(g\), \(h\).
The most common case: \(f(g(x)) = f(x)\) (invariance) or \(f(g(x)) = g(f(x))\) (equivariance).
Why it matters: Symmetry means we can share—use the same computation for transformed inputs:
- Weight sharing: Same weights for shifted inputs (convolution)
- Equivariant architectures: Exploit known symmetries
- Symmetry reduction: Reduce problem size by factoring out symmetry
The convolution insight:
Images have translation symmetry: a cat in the corner is the same cat as in the center.
Standard approach: Learn separate detectors for every position.
Convolutional approach: Learn one detector, apply everywhere.
\[\text{Parameters: } O(H \times W \times C^2) \rightarrow O(K^2 \times C^2)\]
The weight sharing exploits translation symmetry.
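A one-dimensional sketch of that sharing: one small kernel applied at every position, so the parameter count does not grow with input size, and shifting the input shifts the output (kernel size and values are arbitrary):

```python
import numpy as np

def conv1d(x, kernel):
    """Apply one shared kernel at every position (valid convolution, no padding)."""
    k = len(kernel)
    return np.array([x[i:i+k] @ kernel for i in range(len(x) - k + 1)])

x = np.random.randn(1000)
kernel = np.random.randn(5)                  # 5 parameters, regardless of input length
y = conv1d(x, kernel)
y_shifted = conv1d(np.roll(x, 3), kernel)    # shift the input by 3 positions
assert np.allclose(y[:-3], y_shifted[3:])    # ...and the output shifts by 3 (equivariance)
```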
Recognizing symmetry:
Ask: “What transformations leave this problem essentially unchanged?”
- Image classification → Translation, (some) rotation
- Sequence modeling → Time shift
- Graph neural networks → Permutation of nodes
- Physics simulation → Physical symmetries (conservation laws)
3.4 The Mathematical Heritage
These properties aren’t new discoveries. They’re ancient insights, refined over millennia:
TIMELINE OF ALGORITHMIC INSIGHT
2000 BCE │ Egyptian multiplication: O(log n) via doubling
│ → First use of associativity for efficiency
│
300 BCE │ Euclid's algorithm: GCD via iterative reduction
│ → First use of structure-preserving transformation
│
1801 │ Gauss's Disquisitiones Arithmeticae
│ → Modular arithmetic formalized, enabling cryptography
│
1920s │ Emmy Noether's abstract algebra
│ → Groups, rings, structures independent of elements
│
1990s │ Stepanov's STL and generic programming
│ → "The algorithm isn't about numbers—it's about the property"
│
2022+ │ FlashAttention, parallel scan for ML
│ → Ancient principles at unprecedented scale
3.4.1 The Egyptian Insight
Four thousand years ago, Egyptian scribes computed multiplication using only doubling and addition:
To compute 13 × 24:
Power of 2 │ Doubled value │ Include?
──────────────────────────────────────────
1 │ 24 │ ✓ (13 = 8+4+1)
2 │ 48 │
4 │ 96 │ ✓
8 │ 192 │ ✓
──────────────────────────────────────────
Result: 24 + 96 + 192 = 312
This is the double-and-add pattern of binary exponentiation: O(log n) operations instead of O(n).
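A direct transcription of the scribes' procedure: double one factor, and add the doublings selected by the binary expansion of the other.

```python
def egyptian_multiply(a, b):
    """Multiply by doubling and adding: O(log b) steps instead of b - 1 repeated additions."""
    result = 0
    while b > 0:
        if b & 1:          # this power of two appears in b's binary expansion
            result += a
        a += a             # double
        b >>= 1
    return result

assert egyptian_multiply(24, 13) == 312   # the scribe's 13 × 24
```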
The Egyptians understood something profound: because addition is associative, you can regroup the computation. The same insight underlies parallel scan, FlashAttention, and distributed training.
3.4.2 Noether’s Revolution
Emmy Noether (1882-1935) transformed mathematics by focusing on structure over elements. Her insight: properties like associativity aren’t incidental—they define algebraic structures (groups, rings, fields) that recur across domains.
This abstraction is precisely what makes our framework powerful. When we say “FlashAttention exploits associativity,” we’re not making an analogy—we’re identifying the literal mathematical structure that permits the optimization.
3.4.3 Stepanov’s Legacy
Alexander Stepanov, architect of the C++ Standard Template Library, crystallized this in code. His book “From Mathematics to Generic Programming” traces how the same algorithm applies to any type with the right properties.
// Stepanov's insight: power() works for ANY associative operation
template <typename T, typename N, typename Op>
T power(T x, N n, Op op) {
    // Works for: integers (+), matrices (*), strings (concat)
    // The algorithm is the same; the operation varies
    if (n == 0) return identity_element(op);   // identity of op: 0 for +, 1 for *, "" for concat
    while ((n & 1) == 0) { x = op(x, x); n >>= 1; }   // strip trailing zero bits by squaring
    T result = x;
    n >>= 1;
    while (n != 0) {
        x = op(x, x);
        if ((n & 1) != 0) result = op(result, x);
        n >>= 1;
    }
    return result;
}

This function computes x^n for any associative operation. Apply it to:
- Integers with addition → multiplication in O(log n)
- Matrices with multiplication → matrix power in O(log n)
- Recurrence states → Mamba's parallel scan
The same algorithm, four thousand years of applications.
3.4.4 Monoids: Why Identity Matters
A subtle but critical point: for practical parallel algorithms, we need not just associativity but also an identity element.
from functools import reduce
from operator import add

# Semigroup (associativity only):
reduce(add, [])        # TypeError: what's the sum of nothing?

# Monoid (associativity + identity):
reduce(add, [], 0)     # the sum of nothing is 0

Every parallel reduction needs a starting point. The identity element provides it. This is why the mathematical term monoid shows up in discussions of FlashAttention: it names the precise structure the chunked computation requires.
The six properties are our inheritance from millennia of mathematical discovery. We’re not inventing new ideas; we’re applying ancient wisdom at unprecedented scale.
3.5 The Hardware Amplifier
Mathematical properties are necessary but not sufficient. They tell us what’s legal. Hardware tells us what’s profitable.
| Property | Hardware Amplifier | Effective Speedup |
|---|---|---|
| Associativity | Parallel cores | Parallel scaling |
| Separability | Register/cache reuse | Memory reduction |
| Locality | Memory hierarchy | Cache efficiency |
| Sparsity | Conditional execution | Compute reduction |
| Redundancy | Tensor cores (low-bit) | Throughput increase |
| Symmetry | Weight sharing (memory) | Capacity increase |
Example: FlashAttention
Property: Associativity of softmax (with proper state)
Hardware amplifier: GPU shared memory (SRAM) is 10-100× faster than HBM
Combined effect:
- Associativity allows tiling (mathematical enabler)
- Tiling allows SRAM residence (hardware amplifier)
- Result: 2-4× speedup
Without associativity, we couldn’t tile. Without fast SRAM, tiling wouldn’t help. Both are necessary.
3.6 The Recognition Process
How do you look at a computation and see optimization opportunities?
3.6.1 Step 1: Identify the Bottleneck
Is the computation:
- Compute-bound? → Look for mathematical shortcuts
- Memory-bound? → Look for locality and caching opportunities
- Communication-bound? → Look for parallelization and reduction
3.6.2 Step 2: Examine the Structure
For each property, ask:
| Property | Question |
|---|---|
| Associativity | Can I split and merge this computation? |
| Separability | Does this matrix/function have hidden structure? |
| Locality | Do nearby outputs depend on nearby inputs? |
| Sparsity | Are most elements zero or negligible? |
| Redundancy | Am I computing or storing the same thing twice? |
| Symmetry | Does the problem look the same under transformation? |
3.6.3 Step 3: Match to Hardware
For each property you find, ask:
- What hardware feature amplifies this?
- Is that hardware feature available and fast enough?
3.6.4 Step 4: Verify Correctness
Mathematical properties guarantee correctness when properly applied:
- Associativity requires proper state representation
- Separability requires accuracy bounds
- Locality requires boundary handling
- Sparsity requires threshold selection
- Redundancy requires cache invalidation
- Symmetry requires equivariant operations
3.7 Case Study: Deriving FlashAttention
Let’s apply the framework to derive FlashAttention from first principles.
The problem: Standard attention materializes an \(N \times N\) matrix, using \(O(N^2)\) memory.
Step 1: Identify bottleneck. Memory-bound. The \(N \times N\) matrix doesn’t fit in SRAM for large \(N\).
Step 2: Examine structure.
Associativity? Softmax: \(\frac{e^{x_i}}{\sum_j e^{x_j}}\). Requires global sum—not obviously associative.
But wait. What if we track running (max, sum)?
\[\text{state} = (m, \ell) \text{ where } m = \max(x), \ell = \sum e^{x - m}\]
For two chunks: \[m_{new} = \max(m_1, m_2)\] \[\ell_{new} = \ell_1 \cdot e^{m_1 - m_{new}} + \ell_2 \cdot e^{m_2 - m_{new}}\]
This is associative! The state forms a monoid.
Locality? Attention has no inherent locality—every output depends on all inputs. But the associative structure creates a kind of “state locality”: we only need the running state, not all inputs.
Step 3: Match to hardware.
- GPU has fast SRAM (shared memory): 164KB, ~8TB/s
- GPU has slow HBM: 80GB, ~2TB/s
- Associativity allows processing in chunks that fit in SRAM
Step 4: Verify correctness. The monoid structure mathematically guarantees that chunked processing equals full processing.
Result: FlashAttention—\(O(N^2)\) compute, \(O(N)\) memory, 2-4× speedup.
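A single-query NumPy sketch of the derivation (pure Python over K/V blocks, not the actual kernel; block size and shapes are illustrative):

```python
import numpy as np

def attention_chunked(q, K, V, block=64):
    """Process K/V in blocks with a running (max, normalizer, output) state;
    the N×N score matrix is never materialized."""
    m, l, o = -np.inf, 0.0, np.zeros(V.shape[1])
    for start in range(0, K.shape[0], block):
        s = K[start:start+block] @ q              # scores for this block only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                 # re-base the old state onto the new max
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        o = o * scale + p @ V[start:start+block]
        m = m_new
    return o / l

N, d = 512, 64
q, K, V = np.random.randn(d), np.random.randn(N, d), np.random.randn(N, d)
scores = K @ q
p = np.exp(scores - scores.max())
reference = (p / p.sum()) @ V                     # full softmax attention in one pass
assert np.allclose(attention_chunked(q, K, V), reference)
```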
3.8 The Framework Applied to This Book
Every major topic in this book exploits these properties:
| Chapter/Topic | Primary Property | Secondary Property |
|---|---|---|
| FlashAttention | Associativity | Locality (tiling) |
| LoRA | Separability | Redundancy (fine-tune) |
| Matrix Multiply | Locality | Associativity |
| Quantization | Redundancy | Sparsity (outliers) |
| MoE | Sparsity | Locality (experts) |
| KV Caching | Redundancy | Associativity (streaming) |
| Distributed Training | Associativity | Separability (FSDP) |
| State Space Models | Associativity | Separability (kernel) |
| Continuous Batching | Associativity | Sparsity (variable load) |
| Prefix Caching | Redundancy | Locality (LRU) |
3.9 Exercises
1. Identify the property: PagedAttention (vLLM) enables efficient memory management. Which property does it exploit? (Hint: It's about avoiding waste.)
2. Find new applications: Ring attention exploits associativity to distribute attention across GPUs. What other operations could use ring-style parallelism?
3. Property combination: FlashAttention uses associativity + locality. Find an optimization that uses separability + sparsity. (Hint: Consider structured pruning.)
4. Counterexample: Find a computation where none of the six properties apply. What does this imply about optimization potential?
5. Novel optimization: Consider beam search in language models. What properties might enable more efficient implementations?
3.10 Key Takeaways
Properties explain optimizations. Every speedup traces to mathematical structure.
Six properties cover most cases: Associativity, separability, locality, sparsity, redundancy, symmetry.
Properties determine legality, hardware determines profit. Both are necessary.
Recognition is the skill. Learn to see properties in unfamiliar computations.
Properties compose. Most powerful optimizations exploit multiple properties.
The framework is generative. It doesn’t just explain known optimizations—it suggests new ones.
3.11 Connections
This chapter provides the theoretical foundation for the entire book:
Part I (Hardware): Establishes why properties matter—what hardware constraints they address.
Part II (Algebra): Deep dives into individual properties and their applications.
Part III (Investigations): Extended case studies applying multiple properties to real problems.
Part IV (Method): The process of recognizing and applying properties in practice.
Every subsequent chapter can be read through this lens: “What property enables this optimization?”
The framework’s true value is generative. When you encounter a new problem:
- Profile to find the bottleneck
- Examine the computation for properties
- Match properties to available hardware
- Derive the optimization
You’re not memorizing tricks—you’re learning to see structure.