The Algebra of Speed

Mathematical Foundations of Computational Performance

Author: Taras Tsugrii

Published: January 15, 2026

Preface

FlashAttention delivers 2-4× speedups. LoRA fine-tunes 65-billion-parameter models on a single GPU. Mixtral matches models that use roughly 5× its compute budget.

Ask practitioners why these techniques work, and you get implementation details: tiling, low-rank adapters, routing functions.

But those are how, not why.

The why is mathematical. FlashAttention works because softmax has associative structure—a property that licenses chunking. LoRA works because fine-tuning is low-rank—a property that licenses factorization. Mixtral works because different inputs need different parameters—a property that licenses conditional computation.

Properties explain optimizations.
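
To make the first of these concrete, here is a minimal sketch of the associativity claim in plain NumPy (illustrative only; the real FlashAttention kernel adds tiling, blocking, and careful numerics). A softmax-weighted sum can be computed chunk by chunk, and the per-chunk summaries merged:

    import numpy as np

    def chunk_stats(scores, values):
        # Summarize one chunk as (running max, normalizer, weighted sum).
        m = scores.max()
        w = np.exp(scores - m)
        return m, w.sum(), w @ values

    def merge(left, right):
        # Associative merge: rescale both summaries to a shared max, then add.
        (m1, d1, n1), (m2, d2, n2) = left, right
        m = max(m1, m2)
        return (m,
                d1 * np.exp(m1 - m) + d2 * np.exp(m2 - m),
                n1 * np.exp(m1 - m) + n2 * np.exp(m2 - m))

    rng = np.random.default_rng(0)
    scores, values = rng.normal(size=8), rng.normal(size=(8, 4))

    # Reference: materialize the full softmax, then take the weighted sum.
    e = np.exp(scores - scores.max())
    reference = (e / e.sum()) @ values

    # Chunked: summarize halves independently, then merge the summaries.
    m, d, n = merge(chunk_stats(scores[:4], values[:4]),
                    chunk_stats(scores[4:], values[4:]))
    assert np.allclose(n / d, reference)

Because the merge is associative, chunks can be combined in any grouping, which is what lets an attention kernel keep only small tiles in fast memory.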

This book is about those properties. About recognizing them. About knowing when to apply them. About developing the problem-solver’s eye that sees not “here’s a trick that worked” but “here’s a structure that enables a class of tricks.”


0.1 What This Book Is

This is not a recipe book. It won’t tell you “to optimize X, do Y.”

This is a book about understanding. Each chapter is an investigation:

  1. We start with something puzzling—a phenomenon that demands explanation
  2. We form hypotheses and test them
  3. We’re sometimes wrong—and the wrongness is instructive
  4. We reach understanding and extract something general

The structure mirrors how performance understanding actually develops. It’s messy. It’s iterative. The answer isn’t obvious from the start. You develop intuition by being wrong and correcting.


0.2 The Three Pillars

Performance lives at the intersection of three domains:

          MATHEMATICS
              │
              │ What structures make
              │ computation tractable?
              │
              ▼
    ┌─────────────────────┐
    │                     │
    │   PERFORMANCE       │
    │                     │
    └─────────────────────┘
              ▲
             ╱ ╲
            ╱   ╲
           ╱     ╲
          ╱       ╲
   HARDWARE        METHODOLOGY
       │                │
       │                │
  How does the     How do we find
  machine reward   the right
  structure?       structure?

Mathematics provides the properties: associativity, locality, separability, sparsity. These determine what transformations are legal.

Hardware provides the constraints: memory hierarchies, parallelism, bandwidth limits. These determine what transformations are profitable.

Methodology provides the process: measurement, hypothesis, analogy, verification. This is how we discover which properties apply to our problem.

Most performance resources cover one pillar. This book weaves all three together, because understanding requires seeing across levels.


0.3 Who This Book Is For

Primary audience: Engineers and researchers who work on performance-critical code, especially ML systems.

Assumed background:

  • Comfortable with code (Python, C/C++, or similar)
  • Basic understanding of computer architecture (caches, cores, memory)
  • Some math (linear algebra, basic calculus)
  • Curiosity about why things work, not just what works

Not for:

  • Complete beginners (need programming foundations first)
  • Readers seeking quick tips without understanding
  • Those who just want to copy-paste optimizations

0.4 How to Read This Book

The book is designed for multiple reading patterns:

Linear reading: Parts build on each other. The Algebraic Framework establishes the theory. Part I covers hardware. Part II introduces properties. Parts III-IV apply them to algorithms and systems. Part V provides practical tools.

Investigation hopping: Each investigation in Parts III-IV is somewhat self-contained. If you’re curious about FlashAttention specifically, you can start there—with occasional references back to earlier material.

Interactive exploration: Many chapters include embedded visualizations and linked notebooks. The investigations come alive when you do them, not just read them.


0.5 Learning Paths

Different readers have different goals. Here are recommended paths through the material:

0.5.1 Path A: “I want to understand the theory”

For researchers and those seeking deep understanding of why optimizations work.

The Algebraic Framework → Part I (Hardware) → Part II (Algebra)
    → FlashAttention → LoRA → State Space Models

Focus on: Mathematical derivations, property recognition, first-principles reasoning.

Skip: Hardware reference appendix, tool-specific chapters (can revisit later).

0.5.2 Path B: “I want to optimize my ML system”

For practitioners building production systems who need practical speedups.

The Algebraic Framework (skim) → Memory Hierarchy → GPU Architecture
    → Inference → Advanced Serving → GPU Memory → Quantization
    → Profiling Tools → torch.compile

Focus on: Bottleneck identification, configuration tuning, practical patterns.

Skip: Mathematical derivations (can revisit for deeper understanding).

0.5.3 Path C: “I want to write custom kernels”

For engineers who need to implement novel operations or optimize beyond libraries.

GPU Architecture → Matrix Multiply → FlashAttention
    → Triton → GPU Kernel Frameworks → Profiling Tools

Focus on: Memory hierarchy on GPU, tiling patterns, Triton/CUDA programming.

Skip: High-level systems chapters (serving, distributed) initially.

0.5.4 Path D: “I want to understand a specific topic”

Each major topic is approachable with minimal prerequisites:

Topic                  Prerequisites                            Chapter(s)
FlashAttention         Algebraic Framework, Memory Hierarchy    FlashAttention
LoRA                   Algebraic Framework, Factoring           LoRA
Quantization           Memory Hierarchy, Bandwidth              Quantization
Distributed Training   Parallelism, GPU Architecture            Distributed
LLM Serving            Inference, GPU Memory                    Advanced Serving
MoE                    Skipping, Distributed                    Mixture of Experts

0.6 The Interactive Elements

This book has three layers of interactivity:

0.6.1 Embedded Visualizations

Interactive diagrams run directly in your browser. Adjust parameters, see effects. No installation required.

0.6.2 Quick Experiments (JupyterLite)

Python notebooks that run entirely in your browser via WebAssembly. Good for understanding algorithms. Not suitable for performance measurement (WASM is slow).

0.6.3 GPU Experiments (Colab/Kaggle)

Real notebooks with real GPUs. This is where you measure actual speedups. The free tier is sufficient for all examples.

Tip: A Note on Performance Numbers

Cloud notebooks have variable performance. Your results may differ from the book’s.

Focus on relative speedups (2× faster) rather than absolute times (23ms). Relative speedups are more stable across hardware.

For serious benchmarking, run locally on controlled hardware.
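
As a minimal illustration of measuring a relative speedup (plain Python and NumPy; the workload here is just a stand-in for whatever you are comparing):

    import time
    import numpy as np

    def median_time(fn, *args, repeats=10):
        # Median wall-clock time over several repeats.
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*args)
            times.append(time.perf_counter() - start)
        return np.median(times)

    x = np.random.default_rng(0).normal(size=1_000_000)

    slow = median_time(lambda a: sum(float(v) for v in a), x)  # interpreted loop
    fast = median_time(np.sum, x)                              # vectorized sum

    # Report the ratio rather than the raw milliseconds; the ratio travels.
    print(f"{slow * 1e3:.1f} ms vs {fast * 1e3:.1f} ms -> {slow / fast:.0f}x speedup")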


0.7 The Thesis

If there’s one idea this book wants to convey, it’s this:

The algebra isn’t abstract. It’s why modern machine learning is computationally tractable at all.

FlashAttention isn’t magic—it’s associativity. LoRA isn’t magic—it’s separability. Quantization isn’t magic—it’s redundancy.

Once you see the mathematical structure, you can derive the technique. And you can recognize when the same structure appears in a new problem, waiting to be exploited.
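
For example, here is the low-rank structure behind LoRA in a few lines of NumPy (a sketch of the structure only; the actual method trains the two small matrices inside a model):

    import numpy as np

    d, r = 4096, 8                      # layer width and adapter rank (illustrative)
    rng = np.random.default_rng(0)

    W = rng.normal(size=(d, d))         # frozen pretrained weight
    A = rng.normal(size=(r, d)) * 0.01  # trainable down-projection
    B = np.zeros((d, r))                # trainable up-projection, starts at zero

    x = rng.normal(size=d)

    # Apply the adapted layer without ever materializing the d x d update B @ A.
    y = W @ x + B @ (A @ x)

    print(f"parameters a full update would touch: {d * d:,}")      # 16,777,216
    print(f"parameters the low-rank update holds: {2 * d * r:,}")  # 65,536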

That’s the skill this book aims to develop: seeing the properties that enable performance, not just memorizing the tricks that result from them.


0.8 Acknowledgments

This book stands on the shoulders of giants:

  • Alexander Stepanov and Daniel Rose, whose From Mathematics to Generic Programming showed that abstract algebra explains practical programming
  • Brendan Gregg, whose methodologies brought rigor to systems performance
  • George Polya, whose How to Solve It taught generations how to think about problems
  • The explorable explanations community, for showing that interactivity enhances understanding
  • The countless researchers whose work this book attempts to explain and connect

0.9 Let’s Begin

Before diving into hardware specifics or optimization techniques, we need to establish the right mental model.

The next chapter introduces Thinking in Arrays—the cognitive shift from loop-based programming to array-oriented computation. This isn’t just syntax; it’s a fundamentally different way of seeing programs that makes mathematical structure visible.

Following that, The Algebraic Framework provides the vocabulary—the six fundamental properties that enable all significant optimizations. Together, these two chapters form the conceptual foundation for everything that follows.

Let’s begin.



0.10 More from the Author

0.10.1 The First Principles Trilogy

This book is part of a series teaching ML fundamentals from first principles:

📘 Building LLMs from First Principles
Learn how transformers work by building them from scratch—full math derivations, working code, and comprehensive test suites. From Markov chains to GPT.

🔬 Mechanistic Interpretability from First Principles
Reverse-engineer neural networks to understand their internal algorithms. Features, superposition, circuits, and sparse autoencoders explained from the ground up.

The Algebra of Speed (You are here)
Mathematical foundations of computational performance. Why FlashAttention, LoRA, and quantization work—and how to recognize when similar optimizations apply to your problems.

0.10.2 Blog

✍️ Software Bits — Short, focused essays on performance, ML, and computer science fundamentals. Subscribe for updates.

💻 GitHub: perf-bits — Blog posts with full code and interactive demos.