14  Symmetry

When Structure Enables Sharing

A convolutional neural network has millions of parameters.

But each filter is only 3×3×C weights. The same weights are applied at every spatial location.

This isn’t just compression. It’s exploiting a fundamental symmetry of the problem.

NoteProperty Spotlight: Symmetry

Symmetry exists when a computation is invariant (or equivariant) under some transformation:

\[f(g(x)) = f(x) \quad \text{or} \quad f(g(x)) = g(f(x))\]

When symmetry exists, we can:

- Share weights across symmetric positions
- Reduce computation by computing once and transforming
- Build invariant features that ignore irrelevant variation
- Apply Fourier methods for translation-symmetric operations

Symmetry is why CNNs work for images, why Transformers work for sequences, and why certain algorithms are vastly more efficient than brute force.

TipHistorical Note: Noether’s Theorem (1918)

Emmy Noether proved one of the most profound results in physics [1]: every symmetry corresponds to a conserved quantity. Translation symmetry implies conservation of momentum. Rotation symmetry implies conservation of angular momentum.

This deep connection between symmetry and structure explains why exploiting symmetry is so powerful. When a problem has symmetry, there’s hidden structure—and structure enables efficiency. The weight-sharing in CNNs isn’t just an engineering trick; it’s acknowledging that images have translation symmetry, so a “good feature detector” should work everywhere.

Noether’s abstract approach—studying structure independent of specific elements—is exactly the mindset this book advocates.

14.1 The Power of Invariance

Consider image classification:

Question: "Is there a cat in this image?"

The answer shouldn't change if we:
- Shift the cat 10 pixels left
- Slightly rotate the image
- Change the lighting

These are symmetries of the problem: transformations that don’t change the answer.

A good classifier should be invariant to these transformations. But a naive fully-connected network treats each pixel independently—it has to learn that position doesn’t matter.

CNNs build in translation invariance:

Fully-connected:
  - Each output depends on all inputs
  - Parameters: H×W×C × num_outputs
  - Must learn translation invariance from data

Convolutional:
  - Same filter applied everywhere
  - Parameters: k×k×C × num_filters
  - Translation invariance built in

14.2 Weight Sharing: The Convolution Story

14.2.1 How Convolution Exploits Symmetry

A 2D convolution applies the same kernel everywhere:

import numpy as np

def conv2d_explicit(image, kernel):
    """Convolution as weight sharing across positions."""
    H, W = image.shape
    kH, kW = kernel.shape
    output = np.zeros((H - kH + 1, W - kW + 1))

    for i in range(output.shape[0]):
        for j in range(output.shape[1]):
            # Same kernel weights at every position
            patch = image[i:i+kH, j:j+kW]
            output[i, j] = np.sum(patch * kernel)

    return output

The kernel weights are shared across all spatial positions. This sharing is valid because we assume translation symmetry: features can appear anywhere in the image.

14.2.2 Parameter Reduction

For an image layer mapping 256 channels to 256 channels:

No weight sharing (a separate 256×256 matrix at each spatial position):
                   256 × 256 × H × W parameters
                   (For 224×224: ~3.3 billion parameters)

Convolution (3×3): 256 × 256 × 3 × 3 ≈ 590K parameters
                   (~5,600× fewer parameters)

This isn’t a trick—it’s recognizing that the problem has symmetry.
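The arithmetic behind these counts is easy to check directly (a quick sketch using the shapes quoted above; `unshared` is the dense, one-matrix-per-position count):

```python
H, W = 224, 224        # spatial size
C_in, C_out = 256, 256 # channels
k = 3                  # kernel size

# One 256x256 weight matrix per spatial position (no sharing):
unshared = C_in * C_out * H * W
# One shared 3x3 kernel per (input, output) channel pair:
conv = C_in * C_out * k * k

print(f"{unshared:,}")          # 3,288,334,336  (~3.3 billion)
print(f"{conv:,}")              # 589,824        (~590K)
print(f"{unshared / conv:,.0f}x")  # ~5,575x, the "~5,600x" quoted above
```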

14.2.3 Translation Equivariance

CNNs are not just invariant to translation at the output. They’re equivariant in the intermediate layers:

Equivariant: Shifting the input shifts the feature maps by the same amount.

Input:    [🐱    ]     Shift→    [    🐱]
Features: [✨    ]     Shift→    [    ✨]

This equivariance propagates through the network until pooling or fully-connected layers create invariance.
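The shift-equivariance claim can be checked numerically in one dimension (a minimal sketch with circular boundary conditions so shifts wrap cleanly; the signal and filter values are arbitrary):

```python
import numpy as np

def circ_corr1d(x, w):
    """Correlate x with filter w, wrapping at the boundary."""
    n = len(x)
    return np.array([sum(w[a] * x[(i + a) % n] for a in range(len(w)))
                     for i in range(n)])

x = np.array([0., 0., 1., 0., 0., 0., 0., 0.])  # an impulse: "the cat"
w = np.array([1., -1., 2.])                      # the shared filter

# Shifting the input, then filtering == filtering, then shifting the features
shifted_then_filtered = circ_corr1d(np.roll(x, 3), w)
filtered_then_shifted = np.roll(circ_corr1d(x, w), 3)
assert np.allclose(shifted_then_filtered, filtered_then_shifted)
```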

14.3 Fourier Methods: Convolution as Multiplication

The convolution theorem provides a remarkable efficiency gain:

\[\mathcal{F}(f * g) = \mathcal{F}(f) \cdot \mathcal{F}(g)\]

Convolution in the spatial domain equals element-wise multiplication in the Fourier domain.

import numpy as np
from numpy.fft import fft2, ifft2

def conv2d_fourier(image, kernel, shape):
    """Convolution via FFT - exploits circulant structure.

    Note: this computes *circular* convolution. For linear convolution,
    choose shape >= image.shape + kernel.shape - 1 and crop the result.
    """
    # Pad kernel to the transform size
    kernel_padded = np.zeros(shape)
    kernel_padded[:kernel.shape[0], :kernel.shape[1]] = kernel

    # Convolve: multiply in the Fourier domain, transform back
    F_image = fft2(image, s=shape)
    F_kernel = fft2(kernel_padded, s=shape)
    F_result = F_image * F_kernel

    return np.real(ifft2(F_result))
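As a sanity check, the FFT route should agree with the explicit loop. Two details matter: the FFT computes circular convolution (so we crop to the wrap-free region), and true convolution flips the kernel, while the loop version computes cross-correlation (so we flip it back). A self-contained sketch:

```python
import numpy as np
from numpy.fft import fft2, ifft2

def conv2d_direct(image, kernel):
    """Valid cross-correlation via explicit loops."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def conv2d_fft(image, kernel, shape):
    """Circular convolution via the convolution theorem."""
    return np.real(ifft2(fft2(image, s=shape) * fft2(kernel, s=shape)))

rng = np.random.default_rng(0)
image = rng.standard_normal((16, 16))
kernel = rng.standard_normal((3, 3))

direct = conv2d_direct(image, kernel)
# Flip the kernel (convolution vs correlation), then drop the first
# k-1 rows/columns, where circular wraparound occurs.
circular = conv2d_fft(image, kernel[::-1, ::-1], image.shape)
assert np.allclose(direct, circular[2:, 2:])
```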

14.3.1 Complexity Comparison

Direct convolution: O(N² × K²)  for N×N image, K×K kernel
FFT convolution:    O(N² log N)  for any kernel size

For large kernels (K > ~15), FFT wins. This is exploiting the translation symmetry of convolution—the operation’s structure enables the Fourier shortcut.
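The raw operation counts can be compared directly (a rough model only; real crossover points depend heavily on implementation constants):

```python
import math

N = 224  # image side

for K in range(1, 32):
    direct = N * N * K * K        # direct convolution: O(N^2 K^2)
    fft = N * N * math.log2(N)    # FFT route: O(N^2 log N), constants ignored
    if direct > fft:
        break

print(K)  # raw operation counts cross at K = 3
```

Asymptotic counts alone cross almost immediately; the practical crossover near K ≈ 15 quoted above comes from constant factors: transform overhead, complex arithmetic, and heavily tuned direct-convolution kernels.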

14.4 Group Equivariance: Beyond Translation

Translation is just one symmetry. What about rotation, scaling, or reflection?

14.4.1 The Limitation of Standard CNNs

A standard CNN learns rotation invariance from data:

Training data: Cats at all orientations
Network: Learns separate "cat at 0°", "cat at 45°", "cat at 90°" detectors
Problem: Wasteful—these are the same feature rotated

14.4.2 Group Equivariant Networks

Group equivariant CNNs (G-CNNs) build rotation equivariance into the architecture:

# Conceptually: a G-CNN layer
def g_conv2d(input, kernel, group):
    """Convolution equivariant to a symmetry group."""
    outputs = []
    for g in group.elements():  # e.g., 0°, 90°, 180°, 270° rotations
        rotated_kernel = group.transform(kernel, g)
        output = conv2d(input, rotated_kernel)
        outputs.append(output)
    return stack(outputs)

The output has an extra dimension: orientation. The same kernel detects the same feature at all orientations, with orientation explicitly tracked.
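For the four-rotation group C4, the conceptual sketch above can be made concrete with `np.rot90` as the group action (a standalone toy version; `conv2d` here is plain valid cross-correlation). It also lets us verify the equivariance: rotating the input rotates each feature map and cyclically shifts the orientation axis.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid cross-correlation via explicit loops."""
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

def c4_conv2d(image, kernel):
    """One kernel, applied at all four rotations; axis 0 = orientation."""
    return np.stack([conv2d(image, np.rot90(kernel, g)) for g in range(4)])

rng = np.random.default_rng(0)
image = rng.standard_normal((10, 10))
kernel = rng.standard_normal((3, 3))

out = c4_conv2d(image, kernel)
out_rot = c4_conv2d(np.rot90(image), kernel)
for g in range(4):
    # Rotated input -> rotated maps, orientation channel shifted by one
    assert np.allclose(out_rot[g], np.rot90(out[(g - 1) % 4]))
```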

14.4.3 Benefits of Group Equivariance

Standard CNN:  Learns 4 copies of "vertical edge" (0°, 90°, 180°, 270°)
G-CNN:         Learns 1 "edge" kernel, applied at 4 orientations

Parameter reduction: 4×
Sample efficiency:   Better generalization

Applications:

- Medical imaging: Tumors can appear at any orientation
- Satellite imagery: No “up” direction
- Molecular modeling: Molecules rotate freely

14.5 Symmetry in Attention

The Transformer architecture has its own symmetry properties.

14.5.1 Permutation Equivariance

Self-attention is permutation equivariant for the input set:

def self_attention(X):
    """Self-attention is equivariant to input permutation."""
    Q = X @ W_Q
    K = X @ W_K
    V = X @ W_V
    return softmax(Q @ K.T / sqrt(d)) @ V

If you permute the rows of X, the output is permuted the same way. This makes attention suitable for sets (like point clouds) where there’s no inherent ordering.
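This claim is easy to verify numerically (a self-contained sketch with explicit weight matrices; the shapes are arbitrary):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, W_Q, W_K, W_V):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    return softmax(Q @ K.T / np.sqrt(W_Q.shape[1])) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.standard_normal((n, d))
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))

perm = rng.permutation(n)
out = attention(X, W_Q, W_K, W_V)
out_perm = attention(X[perm], W_Q, W_K, W_V)
assert np.allclose(out_perm, out[perm])  # permuted input -> permuted output
```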

14.5.2 Breaking Symmetry with Position

For sequences, we want position to matter. Position embeddings break the permutation symmetry:

def transformer_with_position(X, positions):
    """Position embeddings break permutation symmetry."""
    X = X + position_embedding(positions)  # Break symmetry
    return self_attention(X)

This is a design choice: Transformers start permutation-equivariant and add position information.

NoteRotation Symmetry in Positional Encoding

Rotary Position Embeddings (RoPE) exploit a different symmetry to encode position. RoPE applies 2D rotations to pairs of embedding dimensions, leveraging the SO(2) rotation group’s special property:

\[R_m^T R_n = R_{n-m}\]

When computing attention \((R_m q)^T (R_n k) = q^T R_{n-m} k\), only the relative position remains—exactly what attention needs.

This is symmetry exploitation in a sophisticated form: the rotation group’s algebraic structure encodes relative position naturally. See Chapter: Long-Context Attention for the full derivation.
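The identity can be checked numerically for a single 2D pair (a minimal sketch; the angle and vectors are arbitrary illustrative values standing in for one frequency of RoPE):

```python
import numpy as np

def rot(theta):
    """2D rotation matrix R(theta) in SO(2)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

theta = 0.1                 # per-position rotation angle
q = np.array([1.3, -0.7])   # query pair
k = np.array([0.4, 2.1])    # key pair

# Attention score between position m (query) and position n (key)
score = lambda m, n: (rot(m * theta) @ q) @ (rot(n * theta) @ k)

# Only the relative offset n - m survives:
assert np.isclose(score(3, 10), q @ rot(7 * theta) @ k)
assert np.isclose(score(3, 10), score(7, 14))  # same offset, same score
```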

14.6 Symmetry in Optimization

Some optimizations exploit symmetry of the parameter space.

14.6.1 Weight Initialization

Neural network loss surfaces have symmetries:

- Permuting neurons in a layer doesn’t change the function
- Scaling one layer and inversely scaling the next is equivalent

Good initialization respects these symmetries:

import numpy as np

def he_initialization(fan_in, fan_out):
    """He initialization: symmetric about zero, scaled to preserve variance."""
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std
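The "preserve variance" claim can be checked empirically: with unit-variance inputs, He-scaled weights keep the post-ReLU second moment near the input variance (a quick Monte-Carlo sketch; the layer sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in, fan_out, batch = 512, 512, 4096

x = rng.standard_normal((batch, fan_in))            # Var[x] = 1
W = rng.standard_normal((fan_in, fan_out)) * np.sqrt(2.0 / fan_in)
h = np.maximum(0.0, x @ W)                          # ReLU

# ReLU halves the second moment; the factor 2 in He init compensates.
second_moment = np.mean(h ** 2)
print(second_moment)  # close to 1.0
assert abs(second_moment - 1.0) < 0.1
```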

14.6.2 Batch Normalization

BatchNorm exploits a symmetry: scaling and shifting activations doesn’t fundamentally change what the layer can compute.

Before BN: Layer must learn scale and shift from data
With BN:   Scale/shift factored out, then re-added as learnable parameters

The symmetric "scale-shift" degrees of freedom are handled separately
from the "what to compute" weights.
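A minimal sketch of that factoring (training-mode statistics over the batch; `gamma` and `beta` are the re-added learnable scale and shift):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize out scale/shift, then re-introduce them as parameters."""
    mu = x.mean(axis=0)                    # per-feature mean over the batch
    var = x.var(axis=0)                    # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # the "what to compute" part
    return gamma * x_hat + beta            # the symmetric scale-shift part

rng = np.random.default_rng(0)
x = 5.0 * rng.standard_normal((256, 8)) + 3.0  # arbitrary scale and shift in
y = batch_norm(x, gamma=2.0, beta=-1.0)

assert np.allclose(y.mean(axis=0), -1.0, atol=1e-6)  # mean -> beta
assert np.allclose(y.std(axis=0), 2.0, atol=1e-2)    # std  -> gamma
```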

14.7 Exploiting Symmetry in Practice

14.7.1 Pattern: Identify the Symmetry Group

For any problem, ask: “What transformations leave the answer unchanged?”

Problem                   Symmetry                         How to Exploit
─────────────────────────────────────────────────────────────────────────
Image classification      Translation                      Convolution
Image (any orientation)   Translation + rotation           G-CNNs
Point clouds              Permutation                      PointNet, attention
Molecules                 SE(3) (rotation + translation)   Equivariant GNNs
Time series               Translation                      1D convolution

14.7.2 Pattern: Build Equivariance, Then Invariance

Input → [Equivariant layers] → [Pooling] → [Invariant output]

Equivariant: Preserve information about transformation
Invariant:   Discard transformation, keep content

Example in CNN:

Image → [Conv, Conv, Conv] → [Global Average Pool] → Class prediction
         Equivariant            Creates invariance
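The whole pattern can be demonstrated end-to-end in a few lines (a toy sketch: the "network" is a single filter plus global average pooling, with circular boundary conditions so shifts wrap cleanly):

```python
import numpy as np

def circ_corr2d(image, kernel):
    """Cross-correlation with wraparound: cleanly shift-equivariant."""
    out = np.zeros_like(image, dtype=float)
    for a in range(kernel.shape[0]):
        for b in range(kernel.shape[1]):
            out += kernel[a, b] * np.roll(np.roll(image, -a, axis=0),
                                          -b, axis=1)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))
kernel = rng.standard_normal((3, 3))

feat = circ_corr2d(image, kernel)
feat_shifted = circ_corr2d(np.roll(image, 5, axis=1), kernel)

# Conv layer: equivariant (features shift with the input)
assert np.allclose(feat_shifted, np.roll(feat, 5, axis=1))
# Global average pool: invariant (the shift disappears)
assert np.isclose(feat.mean(), feat_shifted.mean())
```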

14.7.3 Pattern: Fourier for Translation-Symmetric Operations

When computation has translation structure:

- Convolution → FFT
- Circular correlation → FFT
- Filtering → FFT

The FFT exploits the circulant symmetry of these operations.

14.8 The Symmetry Design Space

When designing architectures, symmetry is a key choice:

More Symmetry ←─────────────────→ Less Symmetry
     │                                   │
     │  • G-CNNs                         │
     │  • Full rotation invariance       │
     │  • Less expressive                │
     │                                   │
     │          CNN                      │
     │          Translation only         │
     │                                   │
     │                                   │  • MLP
     │                                   │  • No built-in symmetry
     │                                   │  • Most expressive
     └───────────────────────────────────┘

More symmetry means:

- Fewer parameters (sharing)
- Better sample efficiency (built-in inductive bias)
- Less expressiveness (can’t break the symmetry)

The right choice depends on whether the symmetry actually holds for your problem.

14.9 Key Takeaways

TipSymmetry Enables Structure
  1. Symmetry = transformation invariance: When the answer doesn’t change under transformation, build that into the architecture.

  2. Weight sharing exploits symmetry: Convolution shares weights because images have translation symmetry.

  3. Equivariance preserves information: Features should transform with the input, creating invariance only at the output.

  4. Fourier methods exploit circulant structure: Convolution is multiplication in Fourier space—a massive efficiency gain.

  5. Design choice: Match the symmetry group of your problem. Too much symmetry loses expressiveness; too little wastes parameters.

14.10 The Hardware Connection

Symmetry optimizations interact with hardware in specific ways:

Parameter reduction → Memory savings: Because a CNN shares one 3×3×C kernel across all H×W spatial positions, it stores and loads H×W× fewer weights than a locally connected layer with the same receptive fields. That reduction in weight traffic is critical for memory-bound inference.

Fourier convolution → Compute savings: FFT-based convolution runs in O(n log n) vs O(n²) for large kernels. On GPUs, cuFFT is highly optimized and the crossover point for when FFT convolution is faster than direct convolution is typically around kernel size 11-15.

Weight sharing → Cache efficiency: Shared weights are loaded once and reused across spatial positions. This dramatically improves cache hit rates compared to unique weights per position.

The Connection:

Mathematical Property     Hardware Constraint       Exploitation
───────────────────────────────────────────────────────────────
Translation symmetry  ×   Memory bandwidth     →   Weight sharing (CNN)
Circulant structure   ×   FFT hardware         →   Fourier convolution
Rotation equivariance ×   Parameter count      →   G-CNNs

14.11 Exercises

A fully-connected layer maps a 32×32 image (1024 pixels) to a 1024-dimensional output.

  1. How many parameters does the FC layer have?
  2. A CNN with 32 3×3 filters on the same input has how many parameters?
  3. What is the compression ratio? What symmetry assumption enables this?

Solution: (a) 1024 × 1024 = 1,048,576 parameters. (b) 32 × (3 × 3 × 1) + 32 biases = 320 parameters. (c) ~3,277× compression. This is enabled by translation symmetry — the assumption that the same pattern detector is useful at every spatial position.

Give an example where translation symmetry does not hold for images. What happens to CNN performance in this case?

Solution: Object detection at image boundaries — features near edges are systematically different from center features (they have less context). Handwriting recognition where character position matters (e.g., subscripts vs. superscripts). In these cases, CNNs may underperform position-aware architectures, or need positional encoding to break the symmetry.

Direct convolution of a signal of length \(n\) with a kernel of size \(k\) costs \(O(nk)\) operations. FFT convolution costs \(O(n \log n)\) (independent of \(k\)).

  1. At what kernel size \(k\) does FFT convolution become cheaper?
  2. For \(n = 1024\), what is the crossover \(k\)?
  3. Why isn’t FFT convolution always used for 3×3 kernels in deep learning?

Solution: (a) When \(nk > n \log n\), i.e., \(k > \log n\). (b) \(\log_2(1024) = 10\), so \(k > 10\). (c) For small kernels (3×3), direct convolution is faster due to lower constant factors, better cache behavior, and optimized cuDNN implementations. FFT has overhead from the forward/inverse transforms and requires complex arithmetic.

14.12 Further Reading

  • Cohen & Welling (2016). “Group Equivariant Convolutional Networks”
  • Bronstein et al. (2021). “Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges” — comprehensive treatment of symmetry in deep learning
  • Noether, E. (1918). “Invariante Variationsprobleme” — the original symmetry-conservation connection
NoteTry It Yourself

The accompanying notebook walks through:

  • Comparing CNN vs. FC parameter counts and performance
  • Implementing FFT convolution and measuring the crossover point
  • Visualizing learned CNN filters to see translation symmetry in action

Notebook support for this chapter is in progress. For now, run the exercises locally and record crossover points on your hardware.


ImportantPart II Challenge Problems

These problems have no provided solutions. They require combining multiple properties.

Challenge 1: Property Discovery Take an optimization you use regularly (e.g., gradient checkpointing, mixed precision training, weight tying in language models). Which of the six properties does it exploit? Can you identify one that uses properties from BOTH tiers (algebraic + structural)?

Challenge 2: Novel Combination The framework identifies six properties. Most known optimizations use 1-2. Can you design an optimization that simultaneously exploits THREE or more properties? Describe the problem, the properties, and the optimization. (Hint: consider a sparse, low-rank, quantized attention mechanism.)

Challenge 3: Counter-Framework Find a significant optimization that genuinely doesn’t fit any of the six properties. What does this tell you about the framework’s boundaries? (This is a serious question — the framework should be tested honestly.)


Part III: Algorithm Investigations →

[1]
E. Noether, “Invariante Variationsprobleme,” Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, pp. 235–257, 1918.