Linearity: Why Batching Works

Subscribe to Software Bits to get new articles in your inbox.

Here’s something that should seem stranger than it does:

You can train a neural network on 1,000 samples almost as fast as on 10.

Not 100x slower. Almost the same speed.

How? The answer is a single property: linearity.

The Property

A function $f$ is linear if:

\[f(\alpha x + \beta y) = \alpha f(x) + \beta f(y)\]

Scale the input, scale the output. Add inputs, add outputs. Combinations work as expected.

Matrix multiplication is the canonical linear operation:

\[f(x) = xW\]

Check: $(\alpha x + \beta y)W = \alpha xW + \beta yW$. ✓

This simple property is why modern deep learning is computationally tractable.

Why Batching Works

Consider a linear layer:

\[y = xW\]

For a single input $x \in \mathbb{R}^{1 \times d_{in}}$ (a row vector), you get output $y \in \mathbb{R}^{1 \times d_{out}}$.

Now consider a batch of $n$ inputs, stacked as rows:

\[X = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n \times d_{in}}\]

The batched computation:

\[Y = XW\]

gives you all $n$ outputs in one matrix multiply. Same $W$, same operation—just more rows.

This only works because the operation is linear.

If $f$ weren’t linear, you couldn’t factor through the batch. You’d have to compute $f(x_1), f(x_2), \ldots$ separately.

But for linear operations, batching is free—mathematically. You’re computing the same thing, just organized differently.

Why GPUs Love Linearity

Matrix multiplication is the most optimized operation in computing.

NVIDIA Tensor Cores: Designed specifically for GEMM (General Matrix Multiply)
Memory bandwidth: Amortized across the batch
Parallelism: Thousands of multiply-adds happening simultaneously

When you increase batch size:

Batch Size	Matrix Dimensions	GPU Utilization
1	(1, d_in) × (d_in, d_out)	Low (memory bound)
32	(32, d_in) × (d_in, d_out)	Better
256	(256, d_in) × (d_in, d_out)	High (compute bound)

The weight matrix $W$ is loaded once. Each additional sample in the batch is nearly free—you’re just doing more arithmetic while the data is already in fast memory.

Linearity turns “process n samples” into “one big matrix multiply.”

Gradient Accumulation

Here’s another consequence of linearity.

When you train on a batch, your loss is typically:

\[L = \frac{1}{n} \sum_{i=1}^{n} L_i\]

The gradient:

\[\nabla L = \frac{1}{n} \sum_{i=1}^{n} \nabla L_i\]

Sum is linear. So:

Compute gradients on samples 1-100, sum them
Compute gradients on samples 101-200, sum them
Add the partial sums

Same result as computing on all 200 at once.

This is gradient accumulation. When your batch doesn’t fit in memory, split it. Accumulate gradients across passes. Linearity guarantees correctness.

The same principle enables distributed training: compute gradients on different machines, sum them (all-reduce). Works because gradient aggregation is linear.

Why We Need Non-Linearity

If linearity is so great, why not make everything linear?

Because composition of linear functions is linear:

\[f(g(x)) = (xW_g)W_f = x(W_g W_f) = xW_{combined}\]

A 100-layer linear network equals a 1-layer linear network. No matter how deep you go, you can only learn linear functions.

Non-linearities create expressivity.

ReLU, GELU, softmax—these break linearity. They let deep networks approximate arbitrary functions.

The architecture of a neural network is:

Linear → Non-linear → Linear → Non-linear → ... → Linear

Linear operations: expensive, but batch-friendly, GPU-optimized. Non-linear operations: cheap (element-wise), parallel across the batch but no GEMM speedup.

This isn’t accidental. It’s engineered for hardware.

Where Linearity Breaks (And It Matters)

Batch Normalization

\[\text{BatchNorm}(x) = \gamma \cdot \frac{x - \mu_B}{\sigma_B} + \beta\]

The mean $\mu_B$ and standard deviation $\sigma_B$ depend on which samples are in the batch.

Change the batch composition → change the normalization → change the output.

This is why:

BatchNorm behaves differently in training vs. inference
Small batches give noisy estimates
BatchNorm can’t be cleanly gradient-accumulated

BatchNorm is not linear over the batch dimension.

Softmax in Attention

\[\text{softmax}(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}\]

Every output depends on all inputs. You can’t compute softmax on parts and combine.

(Well, you can—that’s what we showed in the associativity article. But it requires the correction factor trick. It’s not trivially decomposable.)

Dropout

Stochastic. Different mask each time. Can’t be factored cleanly.

Backpropagation: Linearity of Differentiation

Here’s a deeper consequence.

Backpropagation relies on the chain rule:

\[\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x}\]

But it also relies on differentiation being a linear operator:

\[\frac{\partial}{\partial x}(f + g) = \frac{\partial f}{\partial x} + \frac{\partial g}{\partial x}\] \[\frac{\partial}{\partial x}(\alpha f) = \alpha \frac{\partial f}{\partial x}\]

Gradients add linearly. Scale linearly. This is why:

Gradient of a sum = sum of gradients
Gradient accumulation works
Automatic differentiation is efficient

If differentiation weren’t linear, we couldn’t train neural networks.

The entire training paradigm—backprop, SGD, Adam—relies on gradients being linear in how they combine.

Practical Implications

Batch Size Tuning

Larger batches → better GPU utilization → faster per-sample processing.

But: larger batches can hurt generalization (sharper minima, less noise).

The trade-off is between:

Hardware efficiency (wants large batches, because linearity makes them cheap)
Optimization dynamics (sometimes wants smaller batches, for noise/regularization)

Gradient Checkpointing

To save memory, you can:

Discard intermediate activations during forward pass
Recompute them during backward pass

This works because the forward pass is deterministic—same input, same output. Recompute any segment, get identical activations, get identical gradients.

LoRA and Adapter Merging

Low-Rank Adaptation adds a small update:

\[W' = W + BA\]

where $B$ and $A$ are low-rank matrices.

After training, you can merge the adapter back:

\[W_{merged} = W + BA\]

One matrix, no overhead at inference.

This works because matrix addition is linear. The adaptation is just a linear modification to the weights.

The Architecture of Efficiency

Modern neural networks are carefully designed around linearity:

Component	Linear?	Implication
Linear layers	Yes	Batching works, GEMM-optimized
Convolutions	Yes	Same benefits
Attention (QK^T, V multiply)	Yes	Batching works
ReLU, GELU	No (but element-wise)	Cheap, parallelizes trivially
Softmax	No	Requires FlashAttention tricks
BatchNorm	No (batch-dependent)	Training/inference difference
LayerNorm	Yes (per-sample)	Better for batching

Notice the trend: we use LayerNorm instead of BatchNorm in Transformers. Why? LayerNorm normalizes within each sample, not across the batch. It’s linear over the batch dimension.

Architecture choices reflect the desire to preserve linearity where it matters.

The Takeaway

Linearity is why batching works.

\[f(\text{batch}) = \text{batch of } f\]

For linear operations, processing a batch is just one big matrix multiply. GPUs are optimized for exactly this.

This single property enables:

Batched inference: 1000 samples nearly as fast as 1
Batched training: gradients over many samples at once
Gradient accumulation: split batches, sum gradients
Distributed training: sum gradients across machines
Backpropagation itself: gradients combine linearly

Neural networks are towers of linear operations with strategic non-linearities. The linear parts enable efficiency. The non-linear parts enable expressivity.

Lose linearity carelessly, and you lose the ability to batch. That’s why BatchNorm is tricky. That’s why softmax needed FlashAttention.

The algebra isn’t abstract. It’s why training is tractable at all.

Next in this series: Domain Transformations—why logarithms prevent underflow, why Fourier transforms speed up convolutions, and the art of finding easier spaces.

Taras Tsugrii