11 Sparsity
The Art of Not Computing
The most powerful optimization: don’t do the work at all.
A matrix is 90% zeros. Shouldn’t we be 10× faster?
Usually not. And understanding why reveals deep truths about hardware.
The systematic study of sparse computation began with solving large systems of linear equations arising from physical simulations. Engineers noticed that most coefficients were zero—the matrix structure mirrored the physical locality of interactions.
The key insight: exploiting sparsity requires structure, not just zeros. Random sparsity doesn’t help; structured sparsity (banded, block-diagonal, graph-based) enables specialized algorithms. This lesson applies directly to modern ML: structured sparsity (e.g., 2:4 patterns for tensor cores) wins where random sparsity fails.
11.1 The Sparsity Paradox
Sparsity is everywhere:
- Neural network weights after pruning
- Attention patterns (most tokens don’t attend to most other tokens)
- Graph adjacency matrices
- Recommender systems (users rate few items)
- Gradient updates (many parameters barely change)
Mathematically, 90% sparsity means 90% fewer operations. But in practice:
| Sparsity | Theoretical Speedup | Actual Speedup |
|---|---|---|
| 50% | 2× | 1.1× |
| 90% | 10× | 1.8× |
| 95% | 20× | 2.3× |
| 99% | 100× | 3.5× |
The gap is enormous. Where does the performance go?
11.2 Why Hardware Hates Random Sparsity
Modern hardware is built for predictable computation. Sparse computation is unpredictable.
11.2.1 The Memory Access Problem
Dense matrix multiply:
A[i,j] = B[i,:] @ C[:,j]
       = B[i,0]*C[0,j] + B[i,1]*C[1,j] + ... + B[i,K-1]*C[K-1,j]
Memory access pattern: sequential along rows and columns. Predictable. Prefetchable.
Sparse matrix multiply:
A[i,j] = sum(B[i,k] * C[k,j] for k where B[i,k] != 0)
Memory access pattern: wherever the non-zeros happen to be. Random. Unprefetchable.
Dense access pattern: Sparse access pattern:
┌─────────────────────┐ ┌─────────────────────┐
│█ █ █ █ █ █ █ █ █ █ │ │█ █ █ │
│█ █ █ █ █ █ █ █ █ █ │ │ █ █ │
│█ █ █ █ █ █ █ █ █ █ │ │ █ █ │
│█ █ █ █ █ █ █ █ █ █ │ │ █ █ │
└─────────────────────┘ └─────────────────────┘
Sequential → Fast Random → Slow
11.2.2 The Indexing Overhead
Dense operations don’t need indices. Element (i, j) is at base + i*stride + j.
Sparse formats need indices:
# CSR format (Compressed Sparse Row)
class CSRMatrix:
    def __init__(self):
        self.data = [...]     # Non-zero values
        self.indices = [...]  # Column index of each value
        self.indptr = [...]   # Offset in `data` where each row starts

# For a 1000×1000 matrix with 1% non-zeros:
# Dense:  1,000,000 values
# Sparse: 10,000 values + 10,000 column indices + 1,001 row pointers
#       = 21,001 numbers (but we also have indexing overhead)

The indexing itself requires memory accesses and compute.
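To make that overhead concrete, here is a minimal sketch of a CSR matrix–vector product (plain Python for clarity): every non-zero costs an extra load of its column index plus an indirect read of the dense vector.

def csr_matvec(data, indices, indptr, x):
    """y = A @ x for A stored in CSR form. Note the indirect loads via `indices`."""
    n_rows = len(indptr) - 1
    y = [0.0] * n_rows
    for i in range(n_rows):
        # Walk the non-zeros of row i
        for k in range(indptr[i], indptr[i + 1]):
            j = indices[k]            # extra memory access: load the column index
            y[i] += data[k] * x[j]    # gather: x[j] lands wherever j points
    return y

# The 2×3 matrix [[5, 0, 2], [0, 3, 0]] in CSR form:
data, indices, indptr = [5, 2, 3], [0, 2, 1], [0, 2, 3]
print(csr_matvec(data, indices, indptr, [1.0, 1.0, 1.0]))   # [7.0, 3.0]

Contrast this with the dense inner loop, where x is read sequentially and no index array exists at all.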
11.2.3 The Parallelism Problem
GPUs execute in lockstep (SIMT). In a warp of 32 threads:
Dense: All threads do the same operation on different data
Thread 0: A[0,0] * B[0,0]
Thread 1: A[0,1] * B[1,0]
...
Thread 31: A[0,31] * B[31,0]
→ All threads active
Sparse: Each thread might have different work (or no work)
Thread 0: A[0,5] * B[5,0] (if A[0,5] != 0)
Thread 1: skip (if A[0,1] == 0)
Thread 2: A[0,7] * B[7,0] (if A[0,7] != 0)
...
→ Many threads idle
Sparse patterns cause warp divergence and load imbalance.
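A quick way to see the load-imbalance half of this problem (the sizes and the 90% density are made up for illustration): count non-zeros per row of a randomly sparse matrix, treating each row as the work handed to one thread.

import torch

torch.manual_seed(0)
mask = torch.rand(1024, 1024) > 0.9      # ~10% of entries non-zero, at random positions
nnz_per_row = mask.sum(dim=1)
print(f"mean nnz/row: {nnz_per_row.float().mean().item():.1f}, "
      f"min: {nnz_per_row.min().item()}, max: {nnz_per_row.max().item()}")
# The spread between min and max is the imbalance: lanes with light rows finish
# early and idle, because a warp runs at the pace of its slowest lane.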
11.3 Structured Sparsity: The Compromise
Random sparsity is hard. Structured sparsity is tractable.
The key insight: constrain where zeros can appear, so hardware can predict and optimize.
11.3.1 Block Sparsity
Require zeros to occur in blocks:
Random sparsity: Block sparsity (4×4 blocks):
┌─────────────────────┐ ┌─────────────────────┐
│█ █ █ │ │████ ████ │
│ █ █ │ │████ ████ │
│ █ █ │ │████ ████ │
│ █ █ │ │████ ████ │
│ █ █ │ │ │
│ █ █ │ │ │
│ █ █ │ │ ████████ │
│ █ █ │ │ ████████ │
└─────────────────────┘ └─────────────────────┘
With block sparsity:
- Memory access is contiguous within blocks
- No indexing within blocks
- Load balancing is easier (work per block is uniform)
import torch

def block_sparse_matmul(A_blocks, A_indices, B, block_size=64):
    """
    A is stored as a list of dense blocks plus their (block-row, block-col) positions.
    Assumes every block-row contains at least one non-zero block.
    """
    num_block_rows = max(i for i, _ in A_indices) + 1
    result = torch.zeros(num_block_rows * block_size, B.shape[1],
                         dtype=B.dtype, device=B.device)
    for block, (i, j) in zip(A_blocks, A_indices):
        # Each block is a dense tile
        row_start = i * block_size
        row_end = row_start + block_size
        col_start = j * block_size
        col_end = col_start + block_size
        # Dense matmul for this block, accumulated into the corresponding output rows
        result[row_start:row_end] += block @ B[col_start:col_end]
    return result

Block sparsity achieves ~80% of the theoretical speedup (vs. ~20% for random sparsity at the same density).
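A quick usage sketch of the function above; the block placements and shapes are made up for illustration.

block_size = 64
# Two non-zero 64×64 blocks of A, at block positions (0, 0) and (1, 2)
A_indices = [(0, 0), (1, 2)]
A_blocks = [torch.randn(block_size, block_size) for _ in A_indices]
B = torch.randn(4 * block_size, 128)     # A is logically (2·64) × (4·64)
out = block_sparse_matmul(A_blocks, A_indices, B, block_size=block_size)
print(out.shape)                         # torch.Size([128, 128])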
11.3.2 2:4 Structured Sparsity
NVIDIA’s Ampere GPUs introduced hardware support for “2:4” sparsity:
Constraint: In every group of 4 consecutive elements, exactly 2 must be zero.
Valid 2:4 patterns (█ = nonzero, ░ = zero):
██░░ █░█░ █░░█ ░██░ ░█░█ ░░██
Invalid:
███░ (only 1 zero)
█░░░ (3 zeros)
░░░░ (4 zeros)
This gives exactly 50% sparsity, with hardware acceleration:
Standard FP16: Tensor Core throughput: 312 TFLOPS
2:4 Sparse FP16: Tensor Core throughput: 624 TFLOPS (2× faster)
The hardware knows the pattern: 2 non-zeros per 4 elements. It can skip the zeros efficiently.
# Recent PyTorch (2.1+) ships prototype support for 2:4 semi-structured sparsity
import torch
from torch.sparse import to_sparse_semi_structured

# The dense tensor must already follow the 2:4 pattern (2 zeros per group of 4).
# Here we impose a trivial pattern with a mask; in practice you would prune by magnitude.
dense = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
mask = torch.tensor([1, 1, 0, 0], device='cuda', dtype=torch.float16).tile(1024, 256)
sparse_24 = to_sparse_semi_structured(dense * mask)

# Matmul dispatches to accelerated 2:4 kernels
other_matrix = torch.randn(1024, 1024, device='cuda', dtype=torch.float16)
result = torch.mm(sparse_24, other_matrix)  # ~2× faster on Ampere and later

11.3.3 The Accuracy Trade-off
Forcing structure into sparsity costs accuracy:
| Sparsity Type | Achievable Sparsity | Accuracy Drop (ResNet-50) |
|---|---|---|
| Unstructured | 90% | 1-2% |
| Block (16×16) | 75% | 1-2% |
| Block (64×64) | 50% | 1-2% |
| 2:4 Structured | 50% | <1% |
Unstructured sparsity is most flexible but least accelerable. 2:4 is most accelerable but constrains sparsity to exactly 50%.
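As a concrete illustration of how a weight matrix gets forced into the 2:4 pattern, here is a minimal magnitude-based pruning sketch (an illustration of the constraint, not the exact procedure any particular library uses): keep the two largest-magnitude values in each group of four.

import torch

def prune_to_2_4(weight):
    """Keep the 2 largest-|value| entries in every group of 4 along the last dim."""
    rows, cols = weight.shape
    assert cols % 4 == 0, "last dimension must be a multiple of 4"
    groups = weight.reshape(-1, 4)                    # one group of 4 per row
    keep = groups.abs().topk(2, dim=-1).indices       # indices of the 2 survivors
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask[torch.arange(groups.shape[0])[:, None], keep] = True
    return (groups * mask).reshape(rows, cols)

w = torch.randn(128, 128)
w_24 = prune_to_2_4(w)
print((w_24 != 0).float().mean().item())              # 0.5: exactly 2 of every 4 survive

The accuracy question is then whether the discarded half of the weights mattered; magnitude-based selection is the usual first approximation.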
11.4 Investigation: Mixture of Experts
What if, instead of making the weights themselves sparse, we made their use conditional, activating different weights for different inputs?
Mixture of Experts (MoE): Different inputs use different parts of the network.
11.4.1 The Architecture
Standard feed-forward:
Input → [FFN] → Output
Every token uses the full FFN
MoE feed-forward:
Input → [Router] → selects Expert(s)
→ [Expert 1] ─┐
→ [Expert 2] ├→ Weighted sum → Output
→ [Expert 3] │
→ ... │
→ [Expert N] ─┘
The router is a small network that decides which expert(s) to use for each input.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([
            # Each expert is a small FFN: expand to 4*dim, then project back to dim
            nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        # x: [batch, seq, dim]
        batch, seq, dim = x.shape

        # Compute routing probabilities
        router_logits = self.gate(x)               # [batch, seq, num_experts]
        router_probs = F.softmax(router_logits, dim=-1)

        # Select top-k experts per token
        top_k_probs, top_k_indices = router_probs.topk(self.top_k, dim=-1)
        # Renormalize so the selected experts' weights sum to 1
        top_k_probs = top_k_probs / top_k_probs.sum(dim=-1, keepdim=True)

        # Compute output (simplified; real implementations batch tokens by expert)
        output = torch.zeros_like(x)
        for k in range(self.top_k):
            expert_idx = top_k_indices[:, :, k]    # [batch, seq]
            weight = top_k_probs[:, :, k:k+1]      # [batch, seq, 1]
            for e in range(len(self.experts)):
                mask = (expert_idx == e)           # tokens routed to expert e
                if mask.any():
                    expert_out = self.experts[e](x[mask])
                    output[mask] += weight[mask] * expert_out
        return output

11.4.2 The Efficiency Win
Mixtral 8x7B:
- 8 experts, top-2 used per token
- 46.7B total parameters
- 12.9B active parameters per token (the 2 selected experts plus the shared non-expert layers)
- Performs like a ~70B dense model
Parameter efficiency:
Dense 70B: 70B params active, 70B params total
Mixtral: 12.9B params active, 46.7B params total
Active params: 5.4× fewer
Total params: 0.67× as many
Performance: Similar to dense 70B
MoE is a form of conditional computation: the model dynamically decides what work to do.
11.4.3 The Challenges
MoE isn’t free:
Load balancing: If all tokens route to the same expert, that expert becomes a bottleneck. Needs auxiliary losses to encourage balance.
Memory: All experts must be loaded, even if only 2 are active per token. 46.7B parameters in memory for 12.9B active.
Communication: In distributed training, experts live on different GPUs. Routing requires all-to-all communication.
Inference latency: For single-token generation, only 2 experts are used—but all must be resident. Memory bandwidth, not compute, often dominates.
The MoE memory paradox:
Throughput (batch processing): MoE wins
- Tokens spread across experts
- Effective compute scales with batch
Latency (single token): Dense wins
- All params in memory, few active
- Memory-bound: loading 46.7B to use 12.9B
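A back-of-envelope sketch of the latency side (the bandwidth figure is an assumption, roughly in line with a modern HBM accelerator):

# Rough, illustrative arithmetic for one single-token decode step
active_params   = 12.9e9     # parameters actually read per token (2 experts + shared layers)
total_params    = 46.7e9     # parameters that must stay resident in memory
bytes_per_param = 2          # fp16
hbm_bandwidth   = 2e12       # ~2 TB/s, an assumed figure

t_active = active_params * bytes_per_param / hbm_bandwidth
print(f"~{t_active * 1e3:.1f} ms per token just to stream the active weights")   # ≈ 12.9 ms
# A dense ~13B model streams roughly the same bytes per token, but the MoE needs
# 46.7B parameters resident (and, across a batch, ends up touching most of them).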
11.5 Sparsity in Attention
Attention is naturally sparse. Most tokens don’t strongly attend to most other tokens.
11.5.1 Empirical Attention Patterns
Analysis of trained transformers shows:
Typical attention pattern (one head):
Key position →
1 2 3 4 5 6 7 8 9 10 11 12
┌────────────────────────────────────┐
1 │ █ │
2 │ █ █ │
3 │ █ ░ █ │
4 │ █ ░ ░ █ │ █ = high attention
5 │ █ ░ ░ ░ █ │ ░ = low attention
6 │ █ ░ ░ ░ ░ █ │ (blank = ~zero)
7 │ █ ░ ░ ░ ░ ░ █ │
8 │ █ ░ ░ ░ ░ ░ ░ █ │
9 │ █ ░ ░ ░ ░ ░ ░ ░ █ │
10│ █ ░ ░ ░ ░ ░ ░ ░ ░ █ │
11│ █ ░ ░ ░ ░ ░ ░ ░ ░ ░ █ │
12│ █ ░ ░ ░ ░ ░ ░ ░ ░ ░ ░ █ │
└────────────────────────────────────┘
(Rows: query position, columns: key position)
Pattern: Attend to (1) self, (2) first token, (3) nearby tokens
Most of the matrix is near-zero.
This suggests we could skip most attention computations.
11.5.2 Sparse Attention Variants
Sliding window: Only attend within a local window.
import torch
import torch.nn.functional as F

def sliding_window_attention(Q, K, V, window_size=256):
    # Q, K, V: [seq_len, d]
    seq_len, d = Q.shape
    output = torch.zeros_like(Q)
    for i in range(seq_len):
        start = max(0, i - window_size)
        end = i + 1                    # Causal: only attend to the past (and self)
        q = Q[i:i+1]                   # [1, d]
        k = K[start:end]               # [window, d]
        v = V[start:end]               # [window, d]
        attn = F.softmax(q @ k.T / d ** 0.5, dim=-1)
        output[i] = attn @ v
    return output

Complexity: O(n × window) instead of O(n²)
Sparse patterns: Combine local + global + random.
Longformer/BigBird patterns:
- Local window attention (nearby tokens)
- Global tokens (attend to/from everywhere)
- Random attention (a few random connections)
Sparse attention pattern:
Global Local Random
↓ ↓ ↓
████████████████████████ ← Global tokens (attend everywhere)
████████████████████████
██████ ← Local window
██████
██████
██████ █ ← Random connections
██████
██████ █
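A minimal sketch of building such a combined mask; the window size, number of global tokens, and random-link count are illustrative choices, not Longformer's or BigBird's exact settings.

import torch

def sparse_attention_mask(seq_len, window=4, num_global=2, num_random=2):
    """Boolean [seq_len, seq_len] mask: True = this (query, key) pair is computed."""
    i = torch.arange(seq_len)
    # Local window: each query attends to keys within `window` positions
    local = (i[:, None] - i[None, :]).abs() <= window
    # Global tokens: the first `num_global` positions attend everywhere and are attended to
    glob = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    glob[:num_global, :] = True
    glob[:, :num_global] = True
    # Random links: a few extra key positions per query
    rand = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    cols = torch.randint(0, seq_len, (seq_len, num_random))
    rand[torch.arange(seq_len)[:, None], cols] = True
    return local | glob | rand

mask = sparse_attention_mask(512)
print(f"fraction of pairs computed: {mask.float().mean().item():.3f}")   # far below 1.0

Because the mask is fixed ahead of time, the kernel's memory access pattern is known in advance, which is exactly what the hardware wants.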
11.5.3 The FlashAttention Insight
FlashAttention (Chapters 4 and 10) takes a different approach: don't make attention sparse; compute it exactly, but with better memory access.
For many applications, FlashAttention + dense attention beats sparse attention:
- No approximation error
- Simpler implementation
- Hardware-friendly access patterns
Sparse attention shines for very long sequences (>16K tokens) where even FlashAttention’s O(n²) compute is prohibitive.
11.6 The Sparsity-Hardware Co-Design
The lesson from sparsity: algorithms and hardware must be designed together.
| Algorithm Wants | Hardware Needs | Compromise |
|---|---|---|
| Skip arbitrary zeros | Predictable patterns | Block or 2:4 sparsity |
| Prune anywhere | Load balance | Structured pruning |
| Different paths/token | Same instruction/warp | MoE with batched routing |
| Sparse attention | Dense tensor ops | Fixed sparse patterns |
The pattern: constrain flexibility for predictability.
11.7 Key Takeaways
Random sparsity doesn’t help: Hardware can’t exploit unpredictable skip patterns. 90% sparsity often gives only 2× speedup.
Structure enables acceleration: Block sparsity, 2:4 sparsity, and MoE all trade flexibility for predictability—and hardware rewards that.
MoE is conditional computation: Instead of sparse weights, choose which weights to use. Scales capacity without scaling compute.
Attention is empirically sparse: But exploiting it is hard. Dense FlashAttention often beats sparse approximations.
Co-design is essential: The best sparsity pattern depends on your hardware. NVIDIA 2:4 only matters on Ampere+.
11.8 The Meta-Lesson
All three optimization properties we’ve seen—associativity (chunking), separability (factoring), and sparsity (skipping)—share a theme:
Find structure that aligns with hardware.
- Associativity aligns with parallel reduction
- Separability aligns with GEMM efficiency
- Structured sparsity aligns with predictable memory access
The property tells you what’s mathematically possible. The hardware tells you what’s practically fast.
11.9 Further Reading
- Hoefler et al. (2021). “Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks”
- Fedus et al. (2021). “Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity”
- Mishra et al. (2021). “Accelerating Sparse Deep Neural Networks” (NVIDIA 2:4 sparsity)
- Beltagy et al. (2020). “Longformer: The Long-Document Transformer”