The Algebra of Speed

Mathematical Foundations of Computational Performance

Author: Taras Tsugrii

Published: February 13, 2026

Preface

FlashAttention delivers 2-4× speedups. LoRA fine-tunes 65-billion-parameter models on a single GPU. Mixtral matches models with 5× its active compute budget.

Ask practitioners why these techniques work, and you get implementation details: tiling, low-rank adapters, routing functions.

But those are how, not why.

The why is mathematical. FlashAttention works because softmax has associative structure—a property that licenses chunking. LoRA works because fine-tuning is low-rank—a property that licenses factorization. Mixtral works because different inputs need different parameters—a property that licenses conditional computation.

Properties explain optimizations.
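To make the first of these concrete: softmax over a long vector can be computed chunk by chunk, because each chunk's summary (its max and its shifted exponential sum) merges associatively with any other chunk's summary. A minimal sketch in plain Python (the function names are illustrative, not from FlashAttention's implementation):

```python
import math

def softmax_stats(chunk):
    # Per-chunk summary: (local max, sum of exponentials shifted by that max).
    m = max(chunk)
    s = sum(math.exp(x - m) for x in chunk)
    return m, s

def combine(a, b):
    # Associative merge: re-shift each chunk's sum to the shared max.
    (m1, s1), (m2, s2) = a, b
    m = max(m1, m2)
    return m, s1 * math.exp(m1 - m) + s2 * math.exp(m2 - m)

xs = [0.5, 2.0, -1.0, 3.0, 1.5, 0.0]

# One pass over the full vector...
m_full, s_full = softmax_stats(xs)

# ...equals merging summaries of arbitrary chunks, in any grouping.
m_chunked, s_chunked = combine(
    softmax_stats(xs[:2]),
    combine(softmax_stats(xs[2:4]), softmax_stats(xs[4:])),
)

assert math.isclose(m_full, m_chunked)
assert math.isclose(s_full, s_chunked)
```

Because the merge is associative, chunks can be processed in whatever order the memory hierarchy prefers; that freedom is exactly what tiling exploits.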

This book is about those properties. About recognizing them. About knowing when to apply them. About developing the problem-solver’s eye that sees not “here’s a trick that worked” but “here’s a structure that enables a class of tricks.”


0.1 What This Book Is

This is not a recipe book. It won’t tell you “to optimize X, do Y.”

This is a book about understanding. Each chapter is an investigation:

  1. We start with something puzzling—a phenomenon that demands explanation
  2. We form hypotheses and test them
  3. We’re sometimes wrong—and the wrongness is instructive
  4. We reach understanding and extract something general

The structure mirrors how performance understanding actually develops. It’s messy. It’s iterative. The answer isn’t obvious from the start. You develop intuition by being wrong and correcting.


0.2 The Three Pillars

Performance lives at the intersection of three domains:

          MATHEMATICS
              │
              │ What structures make
              │ computation tractable?
              │
              ▼
    ┌─────────────────────┐
    │                     │
    │   PERFORMANCE       │
    │                     │
    └─────────────────────┘
              ▲
             ╱ ╲
            ╱   ╲
           ╱     ╲
          ╱       ╲
   HARDWARE        METHODOLOGY
       │                │
       │                │
  How does the     How do we find
  machine reward   the right
  structure?       structure?

Mathematics provides the properties: associativity, locality, separability, sparsity. These determine what transformations are legal.

Hardware provides the constraints: memory hierarchies, parallelism, bandwidth limits. These determine what transformations are profitable.

Methodology provides the process: measurement, hypothesis, analogy, verification. This is how we discover which properties apply to our problem.

Most performance resources cover one pillar. This book weaves all three together, because understanding requires seeing across levels.


0.3 Who This Book Is For

Primary audience: Engineers and researchers who work on performance-critical code, especially ML systems.

Assumed background:

  • Comfortable with code (Python, C/C++, or similar)
  • Basic understanding of computer architecture (caches, cores, memory)
  • Some math (linear algebra, basic calculus)
  • Curiosity about why things work, not just what works

Not for:

  • Complete beginners (need programming foundations first)
  • Readers seeking quick tips without understanding
  • Those who just want to copy-paste optimizations

0.4 How to Read This Book

The book is designed for multiple reading patterns:

Linear reading: Parts build on each other. The Algebraic Framework establishes the theory. Part I covers hardware. Part II introduces properties. Parts III-IV apply them to algorithms and systems. Part V covers methodology, Part VI provides practical tools, and Part VII synthesizes.

Investigation hopping: Each investigation in Parts III-IV is somewhat self-contained. If you’re curious about FlashAttention specifically, you can start there—with occasional references back to earlier material.

Interactive exploration: Many chapters include embedded visualizations and linked notebooks. The investigations come alive when you do them, not just read them.


0.5 Learning Paths

Different readers have different goals. Here are recommended paths through the material:

0.5.1 Path A: “I want to understand the theory”

For researchers and those seeking deep understanding of why optimizations work.

The Algebraic Framework -> The Memory Hierarchy -> Thinking in Bandwidth -> When Parallelism Pays -> GPU Architecture and Programming Model -> Associativity -> Separability -> Sparsity -> Reversibility -> Locality -> Redundancy -> Symmetry -> Investigation: FlashAttention -> Investigation: LoRA -> State-Space Models

Focus on: Mathematical derivations, property recognition, first-principles reasoning.

Skip: Hardware reference appendix, tool-specific chapters (can revisit later).

0.5.2 Path B: “I want to optimize my ML system”

For practitioners building production systems who need practical speedups.

The Algebraic Framework (skim) -> The Memory Hierarchy -> GPU Architecture and Programming Model -> Inference Optimization -> Advanced LLM Serving -> Mastering GPU Memory -> Investigation: Quantization -> Profiling Tools: From Novice to Expert -> torch.compile Deep Dive

Focus on: Bottleneck identification, configuration tuning, practical patterns.

Skip: Mathematical derivations (can revisit for deeper understanding).

0.5.3 Path C: “I want to write custom kernels”

For engineers who need to implement novel operations or optimize beyond libraries.

GPU Architecture and Programming Model -> Investigation: Matrix Multiply -> Investigation: FlashAttention -> Writing Fast Kernels with Triton -> Modern GPU Kernel Frameworks -> Profiling Tools: From Novice to Expert

Focus on: Memory hierarchy on GPU, tiling patterns, Triton/CUDA programming.

Skip: High-level systems chapters (serving, distributed) initially.

0.5.4 Path D: “I want to understand a specific topic”

Each major topic is approachable with minimal prerequisites:

Topic                | Prerequisites                                                  | Chapter(s)
FlashAttention       | The Algebraic Framework, The Memory Hierarchy                  | Investigation: FlashAttention
LoRA                 | The Algebraic Framework, Separability                          | Investigation: LoRA
Quantization         | The Memory Hierarchy, Thinking in Bandwidth                    | Investigation: Quantization
Distributed Training | When Parallelism Pays, GPU Architecture and Programming Model  | Distributed Training
LLM Serving          | Inference Optimization, Mastering GPU Memory                   | Advanced LLM Serving
MoE                  | Sparsity, Distributed Training                                 | Mixture of Experts

0.6 The Interactive Elements

This book has three layers of interactivity:

0.6.1 Embedded Visualizations

Interactive diagrams run directly in your browser. Adjust parameters, see effects. No installation required.

0.6.2 Quick Experiments (JupyterLite)

Python notebooks that run entirely in your browser via WebAssembly. Good for understanding algorithms. Not suitable for performance measurement (WASM is slow).

0.6.3 GPU Experiments (Colab/Kaggle)

Real notebooks with real GPUs. This is where you measure actual speedups. Free tier is sufficient for all examples.

Tip: A Note on Performance Numbers

Cloud notebooks have variable performance. Your results may differ from the book’s.

Focus on relative speedups (2× faster) rather than absolute times (23ms). Relative speedups are more stable across hardware.

For serious benchmarking, run locally on controlled hardware.
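The tip above can be made concrete with a small timing pattern. The two workloads here are placeholders for illustration (a Python loop versus the closed-form sum of squares), not benchmarks from this book; the point is reporting the ratio, not the absolute times:

```python
import time

def bench(fn, repeats=5):
    # Best-of-N timing: the minimum is the least noise-contaminated sample.
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

N = 100_000
baseline = lambda: sum(i * i for i in range(N))       # O(N) interpreted loop
optimized = lambda: (N - 1) * N * (2 * N - 1) // 6    # closed-form sum of squares

assert baseline() == optimized()  # verify correctness before comparing speed

t_base, t_opt = bench(baseline), bench(optimized)
print(f"{t_base / t_opt:.1f}x faster")  # the ratio travels across machines better than the times
```

The same pattern (verify first, take the best of several runs, report the ratio) applies whether the workloads are toy functions or CUDA kernels.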

Note: Benchmark Provenance

Unless otherwise noted, performance numbers in this book were measured on:

  • GPU: NVIDIA A100 80GB SXM (for GPU benchmarks)
  • CPU: AMD EPYC 7763 64-core (for CPU benchmarks)
  • Software: PyTorch 2.2+, CUDA 12.1, Python 3.10
  • Date: 2024-2025

Where specific chapters use different hardware (e.g., H100 numbers in the bandwidth chapter), this is noted inline. Hardware specifications cited for GPUs not benchmarked (e.g., Blackwell) are based on published specs at the time of writing and may differ from final silicon.

Performance numbers are illustrative — they demonstrate relative behaviors and orders of magnitude, not absolute guarantees. Always benchmark on your own hardware for production decisions.


0.7 The Thesis

If there’s one idea this book wants to convey, it’s this:

The algebra isn’t abstract. It’s why modern machine learning is computationally tractable at all.

FlashAttention isn’t magic—it’s associativity. LoRA isn’t magic—it’s separability. Quantization isn’t magic—it’s redundancy.
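The separability claim is, at bottom, parameter arithmetic: a dense update to a d×k weight matrix is replaced by two low-rank factors B (d×r) and A (r×k). The sizes below are illustrative, not taken from any specific model:

```python
# LoRA's trainable-parameter count: dense update vs. low-rank factors.
d, k, r = 4096, 4096, 16   # illustrative layer size and rank

dense_update = d * k             # full-rank update: every entry trainable
lora_update = d * r + r * k      # only the factors B (d×r) and A (r×k) train

print(f"reduction: {dense_update / lora_update:.0f}x")  # → reduction: 128x
```

When r is much smaller than d and k, the reduction is roughly d·k / (r·(d + k)); here that is 128×, which is why a 65B-parameter model's fine-tuning state fits on one GPU.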

Once you see the mathematical structure, you can derive the technique. And you can recognize when the same structure appears in a new problem, waiting to be exploited.

That’s the skill this book aims to develop: seeing the properties that enable performance, not just memorizing the tricks that result from them.


0.8 Acknowledgments

This book stands on the shoulders of giants:

  • Alexander Stepanov and Daniel Rose, whose From Mathematics to Generic Programming showed that abstract algebra explains practical programming
  • Brendan Gregg, whose methodologies brought rigor to systems performance
  • George Polya, whose How to Solve It taught generations how to think about problems
  • The explorable explanations community, for showing that interactivity enhances understanding
  • The countless researchers whose work this book attempts to explain and connect

0.9 Let’s Begin

Before diving into hardware specifics or optimization techniques, we need to establish the right mental model.

The next chapter introduces Thinking in Arrays—the cognitive shift from loop-based programming to array-oriented computation. This isn’t just syntax; it’s a fundamentally different way of seeing programs that makes mathematical structure visible.

Following that, The Algebraic Framework provides the vocabulary—the six fundamental properties that enable all significant optimizations. Together, these two chapters form the conceptual foundation for everything that follows.

Let’s begin.



0.10 More from the Author

0.10.1 The First Principles Trilogy

This book is part of a series teaching ML fundamentals from first principles:

📘 Building LLMs from First Principles Learn how transformers work by building them from scratch—full math derivations, working code, and comprehensive test suites. From Markov chains to GPT.

🔬 Mechanistic Interpretability from First Principles Reverse-engineer neural networks to understand their internal algorithms. Features, superposition, circuits, and sparse autoencoders explained from the ground up.

The Algebra of Speed (You are here) Mathematical foundations of computational performance. Why FlashAttention, LoRA, and quantization work—and how to recognize when similar optimizations apply to your problems.

🌐 Distributed Training from First Principles Deep dive into distributed training—data parallelism, tensor parallelism, pipeline parallelism, and beyond. The natural continuation of this book’s distributed chapter.

0.10.2 Blog

✍️ Software Bits — Short, focused essays on performance, ML, and computer science fundamentals. Subscribe for updates.

💻 GitHub: perf-bits — Blog posts with full code and interactive demos.