Smoothness: The License to Go Deep


In 1991, Sepp Hochreiter proved something discouraging.

In his diploma thesis, he showed that gradients in deep networks either shrink exponentially (vanish) or grow exponentially (explode). Either way, training fails. The deeper the network, the worse the problem.

This wasn’t a bug to be fixed. It was a theorem. Deep networks were mathematically untrainable.

Then in 2015, Kaiming He trained a 152-layer network. It won ImageNet. A year later, researchers trained networks with 1,000+ layers.

What changed? A single property: smoothness.


The Problem

Consider a 50-layer network. During backpropagation, the gradient flows from output to input:

\[\frac{\partial L}{\partial W_1} = \frac{\partial L}{\partial h_{50}} \cdot \frac{\partial h_{50}}{\partial h_{49}} \cdot \frac{\partial h_{49}}{\partial h_{48}} \cdots \frac{\partial h_2}{\partial h_1} \cdot \frac{\partial h_1}{\partial W_1}\]

Each $\frac{\partial h_{i+1}}{\partial h_i}$ is the Jacobian of layer $i$. Gradients multiply through 50 of these.

If each Jacobian has spectral norm 0.9:

\[0.9^{50} \approx 0.005\]

The gradient reaching the first layer is roughly 200× smaller than the one at the output. The early layers barely learn.

If each Jacobian has spectral norm 1.1:

\[1.1^{50} \approx 117\]

The gradient explodes. Training diverges.

This is the vanishing/exploding gradient problem. It’s not about bad hyperparameters. It’s about exponential dynamics. Anything other than exactly 1.0 compounds across layers.
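The compounding can be checked in a couple of lines. A minimal sketch, assuming (as above) that every layer's Jacobian has the same spectral norm, which real networks only approximate:

```python
def gradient_scale(per_layer_norm: float, depth: int) -> float:
    """Overall gradient scale after backpropagating through `depth`
    layers whose Jacobians all have spectral norm `per_layer_norm`."""
    return per_layer_norm ** depth

for norm in (0.9, 1.0, 1.1):
    print(f"{norm}: depth 50 -> {gradient_scale(norm, 50):.4g}")
# 0.9 -> ~0.005 (vanishes), 1.0 -> 1 (stable), 1.1 -> ~117 (explodes)
```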


The Property

Smoothness means the rate of change is bounded:

\[\|f(x) - f(y)\| \leq L \|x - y\|\]

A function is L-Lipschitz if it doesn’t change faster than $L$ times the input change.

For neural networks, this translates to: gradient magnitudes stay in a useful range.

  • Too small → vanishing (layers don’t learn)
  • Too large → exploding (training diverges)
  • Just right → stable training at any depth

Every technique that enables deep training works by enforcing smoothness in some way.
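As a quick illustration, ReLU (which appears later in this article) is 1-Lipschitz, and the bound can be checked empirically on random inputs:

```python
import random

def relu(x):
    return max(0.0, x)

# Check |relu(x) - relu(y)| <= 1 * |x - y| on random pairs.
random.seed(0)
for _ in range(10_000):
    x, y = random.uniform(-5, 5), random.uniform(-5, 5)
    assert abs(relu(x) - relu(y)) <= abs(x - y)
print("ReLU respects the 1-Lipschitz bound on all sampled pairs")
```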


Residual Connections: The Gradient Highway

Here’s the breakthrough that changed everything.

Instead of learning $h_{l+1} = F(h_l)$, ResNets learn:

\[h_{l+1} = h_l + F(h_l)\]

The input passes through unchanged; the network only learns the residual.

Why does this help gradients? Take the derivative:

\[\frac{\partial h_{l+1}}{\partial h_l} = I + \frac{\partial F}{\partial h_l}\]

The gradient is the identity plus something. Even if $\frac{\partial F}{\partial h_l}$ vanishes completely, the gradient is still $I$—the identity matrix with eigenvalues of exactly 1.

┌─────────────────────────────────────────────────────────────────┐
│                                                                 │
│   PLAIN NETWORK                    RESIDUAL NETWORK             │
│                                                                 │
│   x ──▶ [Layer] ──▶ [Layer] ──▶ y    x ──┬──▶ [Layer] ──┬──▶ y │
│              │           │               │              │       │
│              ▼           ▼               └──────────────┘       │
│         Gradient:                              +                │
│         ∂y/∂x = J₁ · J₂                                        │
│                                          Gradient:              │
│         If J₁, J₂ < 1:                   ∂y/∂x = I + J          │
│         Product vanishes                                        │
│                                          Even if J → 0:         │
│         If J₁, J₂ > 1:                   Gradient → I           │
│         Product explodes                 (never vanishes!)      │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

The skip connection creates a gradient highway—a path where gradients flow without attenuation. No matter how deep the network, gradients have a direct route to every layer.

This is why ResNet trained 152 layers when previous networks struggled past 20. The architecture guarantees a floor on gradient magnitude.
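A scalar caricature makes the difference vivid. Suppose every one of 20 layers is weak, with local slope 0.1 (the numbers are illustrative, not from any real network):

```python
depth, slope = 20, 0.1

plain = slope ** depth              # plain net: slopes multiply
residual = (1.0 + slope) ** depth   # residual net: identity + slope per layer

print(f"plain:    {plain:.1e}")     # ~1e-20: vanished
print(f"residual: {residual:.2f}")  # ~6.73: still usable
```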


Normalization: Controlling the Scale

Residual connections fix the multiplication problem. Normalization fixes another issue: activations drifting out of range.

Without normalization, activations can grow or shrink as they propagate through layers. Large activations lead to large gradients (explosion). Small activations lead to saturation in nonlinearities (vanishing).

Batch Normalization fixes this by standardizing activations:

\[\hat{x} = \frac{x - \mu}{\sigma}\]

Every layer sees inputs standardized to mean 0 and variance 1 (a learnable scale and shift are applied afterward). No drift. No explosion.

Layer Normalization does the same per-sample (better for variable-length sequences):

\[\hat{x}_i = \frac{x_i - \mu_i}{\sigma_i}\]

Both enforce smoothness by bounding the scale of activations—and therefore the scale of gradients.
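Layer Normalization is only a few lines. A minimal per-sample sketch in plain Python, with the customary small epsilon for numerical stability:

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize one sample's features to mean 0, variance ~1."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

h = [2.0, 4.0, 6.0, 8.0]
big = [200.0, 400.0, 600.0, 800.0]
print(layer_norm(h))
print(layer_norm(big))  # essentially the same output: scale drift is erased
```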

The Deeper Effect

Normalization does more than control scale. Santurkar et al. (2018) showed it smooths the loss landscape itself.

The loss surface of an unnormalized network is highly curved—small steps in weight space cause large changes in loss. The optimizer struggles to find a good direction.

Normalization flattens these curves. The loss landscape becomes more predictable. Larger learning rates become safe.

This is why normalized networks train faster, not just more stably. The optimization problem itself becomes easier.


Gradient Clipping: The Hard Limit

Sometimes you want explicit control.

Gradient clipping caps the gradient norm:

\[g \leftarrow \begin{cases} g & \text{if } \|g\| \leq \tau \\ \tau \cdot \frac{g}{\|g\|} & \text{otherwise} \end{cases}\]

If the gradient is too large, scale it down to the threshold $\tau$. Direction is preserved; magnitude is bounded.

This is common in:

  • RNNs and LSTMs: Long sequences compound gradients
  • Transformer training: Large models with unstable early dynamics
  • Reinforcement learning: High-variance gradients from rewards

Gradient clipping is a blunt instrument—it throws away information when gradients exceed the threshold. But it prevents catastrophic divergence, which is often worth the trade-off.
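The rule above is nearly a one-liner. A minimal sketch over a flat gradient vector (deep learning frameworks expose equivalent clip-by-global-norm utilities):

```python
import math

def clip_by_global_norm(g, tau):
    """If ||g|| > tau, rescale g to have norm tau; direction unchanged."""
    norm = math.sqrt(sum(v * v for v in g))
    if norm <= tau:
        return list(g)
    return [v * tau / norm for v in g]

print(clip_by_global_norm([3.0, 4.0], tau=1.0))  # [0.6, 0.8]
print(clip_by_global_norm([0.3, 0.4], tau=1.0))  # unchanged: norm 0.5 <= 1.0
```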


Learning Rate Warmup: Patience at the Start

Why do transformers need learning rate warmup?

At initialization, weights are random. The loss landscape is highly curved. Gradients point in roughly the right direction, but their magnitudes are unreliable.

If you start with a large learning rate, you take large steps in this unreliable landscape. The optimizer overshoots, bounces around, and may never recover.

Warmup solves this:

# Linear warmup over first 1000 steps
if step < warmup_steps:
    lr = max_lr * (step / warmup_steps)
else:
    lr = max_lr * decay_schedule(step)

Small learning rates at the start let the network find a smoother region of the loss landscape. Once it’s in a stable basin, larger learning rates accelerate training.

This is smoothness through patience: wait until the gradients are reliable before trusting them.
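Filling in the sketch above, a common concrete choice is linear warmup followed by cosine decay; the constants here are placeholders, not recommendations:

```python
import math

def lr_at(step, max_lr=3e-4, warmup_steps=1000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay to 0."""
    if step < warmup_steps:
        return max_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return max_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(0))        # 0.0: start from nothing
print(lr_at(1000))     # 3e-4: full rate right after warmup
print(lr_at(100_000))  # ~0.0: decayed away
```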


Weight Initialization: Starting Smooth

Initialization determines where you start on the loss landscape.

Xavier initialization (2010) sets weights so that activations maintain variance across layers:

\[W \sim \mathcal{N}\left(0, \frac{1}{n_{in}}\right)\]

He initialization (2015) accounts for ReLU’s asymmetry:

\[W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)\]

Both are derived from the same principle: gradients should neither grow nor shrink on average.

If you initialize with $W \sim \mathcal{N}(0, 1)$, activations explode immediately. If you initialize too small, gradients vanish before training begins.

Proper initialization ensures you start in a smooth region where training is possible.
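The variance argument can be checked by simulation. A small pure-Python sketch pushes a random input through a stack of ReLU layers (width and depth kept small so it runs quickly):

```python
import math
import random

random.seed(0)

def final_second_moment(weight_std, depth=8, width=64):
    """Mean squared activation after `depth` ReLU layers with
    weights drawn from N(0, weight_std^2)."""
    x = [random.gauss(0, 1) for _ in range(width)]
    for _ in range(depth):
        x = [max(0.0, sum(random.gauss(0, weight_std) * v for v in x))
             for _ in range(width)]
    return sum(v * v for v in x) / width

print(final_second_moment(math.sqrt(2 / 64)))  # He init: stays O(1)
print(final_second_moment(1.0))                # N(0, 1): blows up
```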


LSTM: Constant Error Flow

Before ResNets, there was LSTM.

Recurrent networks face the gradient problem across time: backpropagating through 1000 timesteps means multiplying 1000 Jacobians.

LSTM (1997) solved this with gated memory:

\[c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\]

The cell state $c$ flows through time, modulated by gates $f$ (forget) and $i$ (input).

The key insight: when the forget gate $f_t = 1$ and input gate $i_t = 0$, the cell state is copied unchanged:

\[c_t = c_{t-1}\]

The gradient is exactly 1. No vanishing. No explosion. Information (and gradients) flow through time without decay.

This is the same principle as residual connections, discovered 18 years earlier for sequences.
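A scalar sketch of the cell-state update shows the constant error flow directly (gates held at the copy setting, values illustrative):

```python
def cell_step(c_prev, f, i, c_tilde):
    """One LSTM cell-state update: c_t = f * c_{t-1} + i * c_tilde."""
    return f * c_prev + i * c_tilde

c = 3.0
for _ in range(1000):
    c = cell_step(c, f=1.0, i=0.0, c_tilde=0.5)
print(c)  # 3.0: copied unchanged through 1000 steps, so dc_t/dc_0 = 1
```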


The Activation Function Matters

ReLU replaced sigmoid for a reason.

Sigmoid saturates:

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

For large $|x|$, the derivative approaches 0. Gradients vanish in saturated regions.

ReLU doesn’t saturate (for positive inputs):

\[\text{ReLU}(x) = \max(0, x)\]

The derivative is exactly 1 for $x > 0$. Gradients pass through unchanged.

ReLU has its own problem—“dead neurons” whose inputs are always negative, so their gradients are always zero—but it’s less severe than systematic gradient decay across all neurons.

Modern activations like GELU and SiLU offer smooth approximations to ReLU without the dead neuron problem:

\[\text{GELU}(x) = x \cdot \Phi(x)\]

where $\Phi$ is the Gaussian CDF. Smooth everywhere, approximately linear for large positive inputs, and near zero for large negative ones.
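Comparing the derivatives directly makes the saturation concrete:

```python
import math

def dsigmoid(x):
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)   # peaks at 0.25, decays toward 0 as |x| grows

def drelu(x):
    return 1.0 if x > 0 else 0.0

for x in (0.0, 5.0, 10.0):
    print(f"x={x}: sigmoid' = {dsigmoid(x):.2e}, relu' = {drelu(x)}")
# sigmoid's derivative at x=10 is ~4.5e-05; stack a few saturated layers
# and the gradient is gone. ReLU's derivative stays exactly 1 for x > 0.
```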


The Unified View

Every technique fits the same pattern:

  • Residual connections: gradient = $I$ + something (a floor at the identity)
  • Normalization: bounded activation scale → bounded gradient scale
  • Gradient clipping: explicit cap on gradient magnitude
  • Learning rate warmup: wait for a smooth region before taking large steps
  • Weight initialization: start with unit-variance gradients
  • LSTM gates: constant error flow through time
  • ReLU: gradient = 1 for positive inputs

All of them prevent gradients from vanishing or exploding. All of them enforce some form of Lipschitz continuity.

This is why modern architectures stack these techniques:

# Pre-norm transformer block: residual + normalization at each sublayer
def transformer_block(x):
    x = x + attention(layer_norm(x))
    x = x + ffn(layer_norm(x))
    return x

Residual connections ensure gradient flow. Normalization ensures stable activations. Together, they enable arbitrary depth.


When Smoothness Breaks

These techniques don’t guarantee training will work. They create necessary conditions, not sufficient ones.

Smoothness can break from:

  • Architecture pathology: Some designs create gradient bottlenecks even with residuals
  • Extreme depth without scaling: Very deep networks may need additional techniques (ReZero, SkipInit)
  • Batch size mismatch: BatchNorm statistics become unreliable with very small batches
  • Numerical precision: FP16 training can underflow gradients that FP32 preserves

Signs of smoothness failure:

  • Loss goes to NaN or infinity → explosion
  • Loss plateaus immediately → vanishing
  • Training only works with tiny learning rates → landscape is too curved
  • Deeper versions of your model perform worse → gradient degradation

When training fails mysteriously, ask: are gradients flowing?


The Historical Perspective

The history of deep learning is largely the history of solving the smoothness problem:

  • 1991: Hochreiter identifies vanishing gradients (the problem is discovered)
  • 1997: LSTM (constant error flow via gating)
  • 2010: Xavier initialization (unit-variance gradients)
  • 2012: AlexNet uses ReLU (non-saturating activations)
  • 2015: Batch Normalization (activation scale control)
  • 2015: ResNet (skip connections)
  • 2016: Layer Normalization (per-sample normalization)
  • 2017: Transformer (LayerNorm + residuals + warmup)
  • 2020+: GPT-3 and successors (all of the above + gradient clipping)

Each breakthrough added another tool for gradient control. Modern architectures use all of them.


The Connection to Earlier Articles

Smoothness is the property that makes trainability possible:

  • Symmetry: efficient architectures (architecture design)
  • Linearity: batching (forward-pass efficiency)
  • Associativity: parallelization (computation structure)
  • Smoothness: trainability (gradient flow)

The other properties tell you what to compute. Smoothness tells you whether you can learn it.

A perfectly designed architecture with the right symmetries and optimal factorization is useless if gradients don’t flow. Smoothness is the foundation that makes everything else trainable.


The Takeaway

Smoothness is the license to go deep.

\[\text{Bounded gradients} \implies \text{Stable training} \implies \text{Arbitrary depth}\]

Every technique that enables deep training—residual connections, normalization, gradient clipping, proper initialization, learning rate warmup—works by enforcing gradient bounds.

The mathematics is simple: gradients multiply through layers. Anything that compounds—shrinking or growing—becomes exponential. Only bounded dynamics remain stable.

This is why:

  • ResNets train at 152 layers while plain networks fail at 20
  • Transformers need LayerNorm and warmup
  • LSTMs revolutionized sequence modeling
  • ReLU replaced sigmoid

The pattern: keep gradients in a useful range. Everything else follows.

When your deep network won’t train, don’t tune hyperparameters randomly. Ask: where are gradients vanishing or exploding? The answer points to the fix.

The algebra isn’t abstract. It’s why deep learning works at all.


Next: Locality: The License to Focus

Previous: Symmetry: The Property That Designs Architectures


Further Reading