14 Symmetry
When Structure Enables Sharing
A convolutional neural network has millions of parameters.
But each filter is only 3×3×C weights. The same weights are applied at every spatial location.
This isn’t just compression. It’s exploiting a fundamental symmetry of the problem.
Symmetry exists when a computation is invariant (or equivariant) under some transformation:
\[f(g(x)) = f(x) \quad \text{or} \quad f(g(x)) = g(f(x))\]
When symmetry exists, we can:
- Share weights across symmetric positions
- Reduce computation by computing once and transforming
- Build invariant features that ignore irrelevant variation
- Apply Fourier methods for translation-symmetric operations
Symmetry is why CNNs work for images, why Transformers work for sequences, and why certain algorithms are vastly more efficient than brute force.
Emmy Noether proved one of the most profound results in physics [1]: every continuous symmetry of a physical system's action corresponds to a conserved quantity. Symmetry under translation in space implies conservation of momentum. Rotation symmetry implies conservation of angular momentum.
This deep connection between symmetry and structure explains why exploiting symmetry is so powerful. When a problem has symmetry, there’s hidden structure—and structure enables efficiency. The weight-sharing in CNNs isn’t just an engineering trick; it’s acknowledging that images have translation symmetry, so a “good feature detector” should work everywhere.
Noether’s abstract approach—studying structure independent of specific elements—is exactly the mindset this book advocates.
14.1 The Power of Invariance
Consider image classification:
Question: "Is there a cat in this image?"
The answer shouldn't change if we:
- Shift the cat 10 pixels left
- Slightly rotate the image
- Change the lighting
These are symmetries of the problem: transformations that don’t change the answer.
A good classifier should be invariant to these transformations. But a naive fully-connected network gives every pixel position its own weights, so it has to learn from data that position doesn’t matter.
CNNs build in translation invariance:
Fully-connected:
- Each output depends on all inputs
- Parameters: H×W×C × num_outputs
- Must learn translation invariance from data
Convolutional:
- Same filter applied everywhere
- Parameters: k×k×C × num_filters
- Translation invariance built in
14.2 Weight Sharing: The Convolution Story
14.2.1 How Convolution Exploits Symmetry
A 2D convolution applies the same kernel everywhere:
import numpy as np

def conv2d_explicit(image, kernel):
    """Convolution as weight sharing across positions."""
    H, W = image.shape
    kH, kW = kernel.shape
    # "Valid" output: only positions where the kernel fits inside the image
    output = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            # Same kernel weights at every position
            # (no kernel flip: this is cross-correlation, the deep learning convention)
            patch = image[i:i+kH, j:j+kW]
            output[i, j] = np.sum(patch * kernel)
    return output

The kernel weights are shared across all spatial positions. This sharing is valid because we assume translation symmetry: features can appear anywhere in the image.
14.2.2 Parameter Reduction
For an image layer mapping 256 channels to 256 channels:
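Separate weights at every position (a 256 × 256 matrix per pixel): 256 × 256 × H × W parameters
(For 224×224: ~3.3 billion parameters; a fully connected layer over all positions would need even more)
Convolution (3×3): 256 × 256 × 3 × 3 ≈ 590K parameters
(~5,600× fewer parameters)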
This isn’t a trick—it’s recognizing that the problem has symmetry.
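The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function names `per_position_params` and `conv_params` are illustrative, and biases are ignored.

```python
def per_position_params(c_in, c_out, height, width):
    """Weights when every spatial position has its own c_in x c_out matrix."""
    return c_in * c_out * height * width


def conv_params(c_in, c_out, k):
    """Weights in one shared k x k kernel bank (biases ignored)."""
    return c_in * c_out * k * k


unshared = per_position_params(256, 256, 224, 224)
shared = conv_params(256, 256, 3)
ratio = unshared / shared
print(f"unshared: {unshared:,}  shared: {shared:,}  ratio: {ratio:,.0f}x")
```

The shared count does not depend on the image size at all, which is the point of the comparison: the ratio grows with H × W.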
14.2.3 Translation Equivariance
CNNs are not just invariant to translation at the output. They’re equivariant in the intermediate layers:
Equivariant: Shifting the input shifts the feature maps by the same amount.
Input: [🐱 ] Shift→ [ 🐱]
Features: [✨ ] Shift→ [ ✨]
This equivariance propagates through the network until pooling or fully-connected layers create invariance.
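Both properties can be checked numerically. The sketch below assumes circular (wrap-around) borders so that equivariance holds exactly; with zero padding it holds only away from the edges. The helper `circular_corr2d` is illustrative and not part of any library.

```python
import numpy as np


def circular_corr2d(image, kernel):
    """Apply the same kernel at every position, wrapping around the borders."""
    out = np.zeros_like(image, dtype=float)
    for u in range(kernel.shape[0]):
        for v in range(kernel.shape[1]):
            # Rolling by (-u, -v) brings image[i+u, j+v] to position (i, j)
            out += kernel[u, v] * np.roll(image, shift=(-u, -v), axis=(0, 1))
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
kernel = rng.standard_normal((3, 3))
shift = (2, 3)

shifted_image = np.roll(image, shift, axis=(0, 1))
features = circular_corr2d(image, kernel)
features_of_shifted = circular_corr2d(shifted_image, kernel)

# Equivariance: shifting the input shifts the feature map by the same amount
equivariant = np.allclose(features_of_shifted, np.roll(features, shift, axis=(0, 1)))

# Invariance: a global average over positions discards the shift
invariant = np.isclose(features.mean(), features_of_shifted.mean())
print(equivariant, invariant)
```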
14.3 Fourier Methods: Convolution as Multiplication
The convolution theorem provides a remarkable efficiency gain:
\[\mathcal{F}(f * g) = \mathcal{F}(f) \cdot \mathcal{F}(g)\]
Convolution in the spatial domain equals element-wise multiplication in the Fourier domain.
import numpy as np
from numpy.fft import fft2, ifft2

def conv2d_fourier(image, kernel, shape):
    """Convolution via FFT - exploits circulant structure."""
    # Pad kernel to the transform size
    kernel_padded = np.zeros(shape)
    kernel_padded[:kernel.shape[0], :kernel.shape[1]] = kernel
    # Convolve via FFT. The result is circular (it wraps around) unless shape
    # is at least image size + kernel size - 1 in each dimension.
    F_image = fft2(image, s=shape)
    F_kernel = fft2(kernel_padded, s=shape)
    F_result = F_image * F_kernel
    return np.real(ifft2(F_result))

14.3.1 Complexity Comparison
Direct convolution: O(N² × K²) for N×N image, K×K kernel
FFT convolution: O(N² log N) for any kernel size
For large kernels (K > ~15), FFT wins. This is exploiting the translation symmetry of convolution—the operation’s structure enables the Fourier shortcut.
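As a sanity check on the convolution theorem, the following sketch compares a direct 1D convolution with an FFT-based one. It uses the 1D case for brevity, and the name `fft_convolve1d` and the sizes 1024 and 31 are arbitrary choices for illustration.

```python
import numpy as np


def fft_convolve1d(signal, kernel):
    """Linear convolution via FFT, zero-padded to avoid wrap-around."""
    n = len(signal) + len(kernel) - 1  # length of the full linear convolution
    spectrum = np.fft.rfft(signal, n) * np.fft.rfft(kernel, n)
    return np.fft.irfft(spectrum, n)


rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)
kernel = rng.standard_normal(31)

direct = np.convolve(signal, kernel)      # sliding sum, O(n * k)
via_fft = fft_convolve1d(signal, kernel)  # O(n log n), independent of k
max_error = np.max(np.abs(direct - via_fft))
print(max_error)
```

The two results agree up to floating-point rounding. Timing both on your own hardware is the reliable way to locate the crossover, since constant factors dominate for small kernels.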
14.4 Group Equivariance: Beyond Translation
Translation is just one symmetry. What about rotation, scaling, or reflection?
14.4.1 The Limitation of Standard CNNs
A standard CNN learns rotation invariance from data:
Training data: Cats at all orientations
Network: Learns separate "cat at 0°", "cat at 45°", "cat at 90°" detectors
Problem: Wasteful—these are the same feature rotated
14.4.2 Group Equivariant Networks
Group equivariant CNNs (G-CNNs) build rotation equivariance into the architecture:
# Conceptually: a G-CNN layer (group, conv2d, and stack are placeholders)
def g_conv2d(image, kernel, group):
    """Convolution equivariant to a symmetry group."""
    outputs = []
    for g in group.elements():  # e.g., 0°, 90°, 180°, 270° rotations
        rotated_kernel = group.transform(kernel, g)
        output = conv2d(image, rotated_kernel)
        outputs.append(output)
    return stack(outputs)

The output has an extra dimension: orientation. The same kernel detects the same feature at all orientations, with orientation explicitly tracked.
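For the group of 90° rotations, the conceptual layer can be made concrete in NumPy. This is a sketch under stated assumptions rather than the construction from any particular G-CNN library: it handles only the first ("lifting") layer, uses a square kernel, and the helper names `corr2d_valid` and `c4_lifting_conv` are made up for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def corr2d_valid(image, kernel):
    """Cross-correlation over all positions where the kernel fits."""
    windows = sliding_window_view(image, kernel.shape)
    return np.einsum("ijuv,uv->ij", windows, kernel)


def c4_lifting_conv(image, kernel):
    """Apply one square kernel at 0, 90, 180, and 270 degrees."""
    return np.stack([corr2d_valid(image, np.rot90(kernel, r)) for r in range(4)])


rng = np.random.default_rng(0)
image = rng.standard_normal((9, 9))
kernel = rng.standard_normal((3, 3))

out = c4_lifting_conv(image, kernel)                       # (orientation, H', W')
out_rotated_input = c4_lifting_conv(np.rot90(image), kernel)

# Rotating the input rotates each feature map and shifts the orientation axis by one
expected = np.rot90(np.roll(out, 1, axis=0), axes=(1, 2))
is_equivariant = np.allclose(out_rotated_input, expected)
print(out.shape, is_equivariant)
```

The check shows what "orientation explicitly tracked" means: a rotated input does not produce new features, only the same features moved to a different orientation channel.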
14.4.3 Benefits of Group Equivariance
Standard CNN: Learns 4 copies of "vertical edge" (0°, 90°, 180°, 270°)
G-CNN: Learns 1 "edge" kernel, applied at 4 orientations
Parameter reduction: 4×
Sample efficiency: Better generalization
Applications:
- Medical imaging: Tumors can appear at any orientation
- Satellite imagery: No “up” direction
- Molecular modeling: Molecules rotate freely
14.5 Symmetry in Attention
The Transformer architecture has its own symmetry properties.
14.5.1 Permutation Equivariance
Self-attention is permutation equivariant for the input set:
def self_attention(X):
    """Self-attention is equivariant to input permutation."""
    # W_Q, W_K, W_V (learned projections), d (key dimension), softmax (row-wise),
    # and sqrt are assumed to be defined elsewhere
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    return softmax(Q @ K.T / sqrt(d)) @ V

If you permute the rows of X, the output is permuted the same way. This makes attention suitable for sets (like point clouds) where there’s no inherent ordering.
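The permutation claim is easy to verify numerically. The sketch below is a self-contained single-head variant that takes the projection matrices as explicit arguments (an assumption made here so the example runs on its own); the sizes are arbitrary.

```python
import numpy as np


def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    exp = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention with explicit projection matrices."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V


rng = np.random.default_rng(0)
n_tokens, d = 5, 4
X = rng.standard_normal((n_tokens, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n_tokens)
out = self_attention(X, W_Q, W_K, W_V)
out_permuted = self_attention(X[perm], W_Q, W_K, W_V)

# Permuting the input rows permutes the output rows the same way
is_equivariant = np.allclose(out_permuted, out[perm])
print(is_equivariant)
```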
14.5.2 Breaking Symmetry with Position
For sequences, we want position to matter. Position embeddings break the permutation symmetry:
def transformer_with_position(X, positions):
    """Position embeddings break permutation symmetry."""
    # position_embedding is assumed to map each position to a vector of X's width
    X = X + position_embedding(positions)  # Break symmetry
    return self_attention(X)

This is a design choice: Transformers start permutation-equivariant and add position information.
Rotary Position Embeddings (RoPE) exploit a different symmetry to encode position. RoPE applies 2D rotations to pairs of embedding dimensions, leveraging the SO(2) rotation group’s special property:
\[R_m^T R_n = R_{n-m}\]
When computing the attention score \((R_m q)^T (R_n k) = q^T R_{n-m} k\), only the relative position \(n - m\) remains, which is exactly what attention needs.
This is symmetry exploitation in a sophisticated form: the rotation group’s algebraic structure encodes relative position naturally. See Chapter: Long-Context Attention for the full derivation.
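The relative-position identity can be checked on a single pair of dimensions. This is a minimal sketch: the base angle `theta = 0.1` and the positions used are arbitrary, and real RoPE applies a different angle to each pair of dimensions.

```python
import numpy as np


def rotation(position, theta=0.1):
    """2D rotation by angle position * theta (theta is an arbitrary choice)."""
    angle = position * theta
    return np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])


rng = np.random.default_rng(0)
q = rng.standard_normal(2)
k = rng.standard_normal(2)


def score(m, n):
    """Attention logit between a query at position m and a key at position n."""
    return (rotation(m) @ q) @ (rotation(n) @ k)


# The same offset n - m = 4 at different absolute positions gives the same score
same_offset = np.isclose(score(3, 7), score(10, 14))
# The score matches the closed form q^T R_{n-m} k
closed_form = np.isclose(score(3, 7), q @ rotation(4) @ k)
print(same_offset, closed_form)
```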
14.6 Symmetry in Optimization
Some optimizations exploit symmetry of the parameter space.
14.6.1 Weight Initialization
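Neural network loss surfaces have symmetries:
- Permuting neurons in a layer (together with the matching weights of the next layer) doesn’t change the function
- Scaling one layer by a positive constant and inversely scaling the next is equivalent, for activations such as ReLU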
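Both symmetries can be demonstrated on a tiny network. The sketch assumes a two-layer ReLU MLP without biases; the function name `mlp` and the layer sizes are illustrative.

```python
import numpy as np


def mlp(x, W1, W2):
    """Two-layer ReLU network without biases."""
    return np.maximum(x @ W1, 0.0) @ W2


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 3))
W1 = rng.standard_normal((3, 5))
W2 = rng.standard_normal((5, 2))
baseline = mlp(x, W1, W2)

# Permutation symmetry: reorder the hidden units consistently in both layers
perm = rng.permutation(5)
permuted_same = np.allclose(mlp(x, W1[:, perm], W2[perm, :]), baseline)

# Scaling symmetry: ReLU(a * z) = a * ReLU(z) for a > 0
a = 3.0
scaled_same = np.allclose(mlp(x, a * W1, W2 / a), baseline)
print(permuted_same, scaled_same)
```

Different parameter settings give the same function, so the loss surface contains many equivalent points.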
Good initialization respects these symmetries:
import numpy as np

def he_initialization(fan_in, fan_out):
    """He initialization: symmetric about zero, scaled to preserve variance."""
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std

14.6.2 Batch Normalization
BatchNorm exploits a symmetry: scaling and shifting activations doesn’t fundamentally change what the layer can compute.
Before BN: Layer must learn scale and shift from data
With BN: Scale/shift factored out, then re-added as learnable
The symmetric "scale-shift" is handled separately from the "what to compute"
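A small experiment illustrates the factoring-out. This sketch uses a simplified training-mode batch norm (no running statistics), and the name `batch_norm` and the constants 5.0 and 2.0 are arbitrary choices for illustration.

```python
import numpy as np


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch, then apply learnable scale and shift."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 8))

# Rescale and shift the incoming activations, as an upstream layer might
transformed = 5.0 * x + 2.0
max_diff = np.max(np.abs(batch_norm(transformed) - batch_norm(x)))
print(max_diff)
```

The outputs agree up to a small difference caused by `eps`, so the scale and shift of the incoming activations carry no information past the normalization; only `gamma` and `beta` set them.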
14.7 Exploiting Symmetry in Practice
14.7.1 Pattern: Identify the Symmetry Group
For any problem, ask: “What transformations leave the answer unchanged?”
| Problem | Symmetry | How to Exploit |
|---|---|---|
| Image classification | Translation | Convolution |
| Image (any orientation) | Translation + rotation | G-CNNs |
| Point clouds | Permutation | PointNet, attention |
| Molecules | SE(3) (rotation + translation) | Equivariant GNNs |
| Time series | Translation | 1D convolution |
14.7.2 Pattern: Build Equivariance, Then Invariance
Input → [Equivariant layers] → [Pooling] → [Invariant output]
Equivariant: Preserve information about transformation
Invariant: Discard transformation, keep content
Example in CNN:
Image → [Conv, Conv, Conv] → [Global Average Pool] → Class prediction
        Equivariant          Creates invariance
14.7.3 Pattern: Fourier for Translation-Symmetric Operations
When computation has translation structure:
- Convolution → FFT
- Circular correlation → FFT
- Filtering → FFT
The FFT exploits the circulant symmetry of these operations.
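To make the circulant connection concrete, the sketch below builds a small circulant matrix explicitly and compares a dense matrix-vector product with the FFT route. The size `n = 8` and the variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
c = rng.standard_normal(n)  # first column of the circulant matrix
x = rng.standard_normal(n)

# Circulant matrix: C[i, j] = c[(i - j) mod n], so each row is a shifted copy
idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
C = c[idx]

dense = C @ x                                                   # O(n^2)
via_fft = np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))   # O(n log n)
matches = np.allclose(dense, via_fft)
print(matches)
```

The matrix has n² entries but only n free values, and the FFT uses exactly that redundancy.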
14.8 The Symmetry Design Space
When designing architectures, symmetry is a key choice:
More Symmetry ←─────────────────→ Less Symmetry
│ │
│ • G-CNNs │
│ • Full rotation invariance │
│ • Less expressive │
│ │
│ CNN │
│ Translation only │
│ │
│ │ • MLP
│ │ • No built-in symmetry
│ │ • Most expressive
└───────────────────────────────────┘
More symmetry means:
- Fewer parameters (sharing)
- Better sample efficiency (built-in inductive bias)
- Less expressiveness (can’t break the symmetry)
The right choice depends on whether the symmetry actually holds for your problem.
14.9 Key Takeaways
Symmetry = transformation invariance: When the answer doesn’t change under transformation, build that into the architecture.
Weight sharing exploits symmetry: Convolution shares weights because images have translation symmetry.
Equivariance preserves information: Features should transform with the input, creating invariance only at the output.
Fourier methods exploit circulant structure: Convolution is multiplication in Fourier space—a massive efficiency gain.
Design choice: Match the symmetry group of your problem. Too much symmetry loses expressiveness; too little wastes parameters.
14.10 The Hardware Connection
Symmetry optimizations interact with hardware in specific ways:
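Parameter reduction → Memory savings: A convolutional layer stores one k×k×C_in×C_out kernel bank regardless of image size, while a layer with separate weights at every position grows with H×W (about 5,600× more in the 256-channel example of Section 14.2.2). Fewer weights mean less memory traffic to load them, which is critical for memory-bound inference.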
Fourier convolution → Compute savings: FFT-based convolution runs in O(N² log N) versus O(N² K²) for direct convolution, so it pays off for large kernels. On GPUs, cuFFT is highly optimized, and the crossover point where FFT convolution becomes faster than direct convolution is typically around kernel size 11-15, depending on hardware and image size.
Weight sharing → Cache efficiency: Shared weights are loaded once and reused across spatial positions. This dramatically improves cache hit rates compared to unique weights per position.
The Connection:
Mathematical Property Hardware Constraint Exploitation
───────────────────────────────────────────────────────────────
Translation symmetry × Memory bandwidth → Weight sharing (CNN)
Circulant structure × FFT hardware → Fourier convolution
Rotation equivariance × Parameter count → G-CNNs
14.11 Exercises
A fully-connected layer maps a 32×32 image (1024 pixels) to a 1024-dimensional output.
- How many parameters does the FC layer have?
- A CNN with 32 3×3 filters on the same input has how many parameters?
- What is the compression ratio? What symmetry assumption enables this?
Solution: (a) 1024 × 1024 = 1,048,576 parameters (ignoring biases). (b) 32 × (3 × 3 × 1) + 32 biases = 320 parameters. (c) 1,048,576 / 320 ≈ 3,277× compression. This is enabled by translation symmetry: the assumption that the same pattern detector is useful at every spatial position.
Give an example where translation symmetry does not hold for images. What happens to CNN performance in this case?
Solution: Object detection at image boundaries — features near edges are systematically different from center features (they have less context). Handwriting recognition where character position matters (e.g., subscripts vs. superscripts). In these cases, CNNs may underperform position-aware architectures, or need positional encoding to break the symmetry.
Direct convolution of a signal of length \(n\) with a kernel of size \(k\) costs \(O(nk)\) operations. FFT convolution costs \(O(n \log n)\) (independent of \(k\)).
- At what kernel size \(k\) does FFT convolution become cheaper?
- For \(n = 1024\), what is the crossover \(k\)?
- Why isn’t FFT convolution always used for 3×3 kernels in deep learning?
Solution: (a) When \(nk > n \log n\), i.e., \(k > \log n\). (b) \(\log_2(1024) = 10\), so \(k > 10\). (c) For small kernels (3×3), direct convolution is faster due to lower constant factors, better cache behavior, and optimized cuDNN implementations. FFT has overhead from the forward/inverse transforms and requires complex arithmetic.
14.12 Further Reading
- Cohen & Welling (2016). “Group Equivariant Convolutional Networks”
- Bronstein et al. (2021). “Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges” — comprehensive treatment of symmetry in deep learning
- Noether, E. (1918). “Invariante Variationsprobleme” — the original symmetry-conservation connection
The accompanying notebook walks through:
- Comparing CNN vs. FC parameter counts and performance
- Implementing FFT convolution and measuring the crossover point
- Visualizing learned CNN filters to see translation symmetry in action
Notebook support for this chapter is in progress. For now, run the exercises locally and record crossover points on your hardware.
These problems have no provided solutions. They require combining multiple properties.
Challenge 1: Property Discovery. Take an optimization you use regularly (e.g., gradient checkpointing, mixed precision training, weight tying in language models). Which of the six properties does it exploit? Can you identify one that uses properties from BOTH tiers (algebraic + structural)?
Challenge 2: Novel Combination. The framework identifies six properties. Most known optimizations use 1-2. Can you design an optimization that simultaneously exploits THREE or more properties? Describe the problem, the properties, and the optimization. (Hint: consider a sparse, low-rank, quantized attention mechanism.)
Challenge 3: Counter-Framework. Find a significant optimization that genuinely doesn’t fit any of the six properties. What does this tell you about the framework’s boundaries? (This is a serious question: the framework should be tested honestly.)