13  Redundancy

When Less Information Means More Speed

A 70-billion-parameter model stores weights in 32-bit floats: 280 GB of data.

Reduce to 4 bits: 35 GB. The model barely notices.

We just threw away 87.5% of the bits. Why does it still work?

Property Spotlight: Redundancy

Redundancy exists when the information content of data is less than its representation size.

\[H(X) < \text{bits}(X)\]

When data is redundant, we can:

- Compress without losing essential information
- Quantize to lower precision
- Cache computed results (computation is “redundant” if repeated)
- Approximate with simpler representations

This chapter explores how redundancy enables some of the most practical optimizations in ML systems.

Historical Note: Shannon’s Information Theory (1948)

Claude Shannon’s “A Mathematical Theory of Communication” formalized the concept of information. His key insight: the information content of a message is distinct from its representation. A message with predictable patterns carries less information than its bit-length suggests.

Shannon defined entropy as the measure of true information content. When entropy is less than representation size, redundancy exists—and compression is possible. This principle underlies quantization: neural network weights have low entropy (they’re structured, not random), so aggressive bit-reduction works.

13.1 The Information-Theoretic View

Claude Shannon defined information as surprise. If you can predict something, it carries no information.

Neural network weights are not random. They’re structured:

Typical weight distribution:
- Concentrated near zero
- Roughly Gaussian
- Correlated across layers
- Low effective rank (as we saw with LoRA)

This structure means the weights contain less information than their representation suggests.

13.1.1 Measuring Redundancy

For a weight tensor W stored in b bits per element:

\[\text{Redundancy} = b - H(W)\]

where H(W) is the entropy of the weight distribution.

Empirically, for trained neural networks:

- FP32 uses 32 bits per weight
- Actual entropy: often 4-8 bits per weight
- Redundancy: 24-28 bits per weight

This is why quantization works.
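
To check this empirically, one rough option is a histogram-based entropy estimate. The sketch below assumes a PyTorch weight tensor and treats the binned value distribution as a proxy for H(W), so the number depends on the bin count:

import torch

def estimate_redundancy(weights, bits_per_weight=32, n_bins=256):
    """Estimate redundancy = stored bits per weight minus empirical entropy (bits)."""
    hist = torch.histc(weights.float().flatten(), bins=n_bins)
    probs = hist / hist.sum()
    probs = probs[probs > 0]                         # drop empty bins
    entropy = -(probs * probs.log2()).sum().item()   # at most log2(n_bins) bits
    return bits_per_weight - entropy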

13.2 Quantization: Exploiting Numeric Redundancy

Quantization reduces the bits per value:

FP32 → FP16:  32 → 16 bits  (2× compression)
FP16 → INT8:  16 → 8 bits   (2× compression)
INT8 → INT4:   8 → 4 bits   (2× compression)
FP32 → INT4:  32 → 4 bits   (8× compression)
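
These ratios map directly onto memory footprint. A back-of-the-envelope check for the 70-billion-parameter example from the chapter opening:

params = 70e9   # 70 billion parameters
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB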

13.2.1 Why It Works

Neural networks are trained with noise:

- Stochastic gradient descent
- Dropout
- Data augmentation
- Mini-batch variance

The training process makes networks robust to small perturbations. Quantization error is just another small perturbation.

import torch

def quantization_error_analysis(weights, bits=8):
    """Analyze quantization error relative to weight magnitude."""
    # Quantize
    scale = (weights.max() - weights.min()) / (2**bits - 1)
    quantized = torch.round((weights - weights.min()) / scale)
    dequantized = quantized * scale + weights.min()

    # Error analysis
    error = (weights - dequantized).abs()
    relative_error = error / (weights.abs() + 1e-10)

    print(f"Bits: {bits}")
    print(f"Max absolute error: {error.max():.6f}")
    print(f"Mean relative error: {relative_error.mean():.2%}")
    print(f"Weights affected >1%: {(relative_error > 0.01).float().mean():.2%}")

13.2.2 The Outlier Problem

Not all weights are equal. Some have outsize impact:

Weight distribution in LLMs:
    │
    │   ████████████████
    │  ██████████████████████
    │ █████████████████████████
    │██████████████████████████████
────┼──────────────────────────────────•──────
    │                                outliers
    └─────────────────────────────────────────→

A few outlier weights carry critical information. Quantizing them aggressively destroys model quality.

Solutions:

- Mixed precision: Keep outliers in higher precision (as in the code below)
- Per-channel scaling: Different scale factors for different channels (sketched after the mixed-precision example)
- GPTQ/AWQ: Optimize quantization to minimize output error

import torch

def quantize_to_int4(weights):
    """Minimal uniform 4-bit quantizer: returns integer codes, scale, zero-point."""
    zero_point = weights.min()
    scale = (weights.max() - zero_point) / 15   # 2**4 - 1 levels
    codes = torch.round((weights - zero_point) / scale)
    return codes, scale, zero_point

def mixed_precision_quantize(weights, outlier_threshold=3.0):
    """Keep outliers in FP16, quantize the rest to INT4."""
    std = weights.std()
    outlier_mask = weights.abs() > outlier_threshold * std

    # Outliers stay FP16
    outliers = weights[outlier_mask].half()

    # Rest goes to INT4
    normal = weights[~outlier_mask]
    normal_quantized = quantize_to_int4(normal)

    return outliers, outlier_mask, normal_quantized
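
Per-channel scaling can be sketched just as briefly. The version below is a simplified symmetric INT8 scheme with one scale per output row; real libraries differ in details such as zero-points and group sizes:

import torch

def per_channel_quantize_int8(weights):
    """Symmetric INT8 quantization with one scale per output channel (row)."""
    scales = weights.abs().amax(dim=1, keepdim=True) / 127.0
    scales = scales.clamp(min=1e-8)                    # avoid division by zero
    codes = torch.clamp(torch.round(weights / scales), -127, 127).to(torch.int8)
    return codes, scales

def per_channel_dequantize(codes, scales):
    return codes.float() * scales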

13.3 Caching: Exploiting Computational Redundancy

If you compute the same thing twice, the second computation is redundant.

13.3.1 KV Cache in Autoregressive Generation

During LLM generation:

Step 1: Compute K, V for token 1         → Store K₁, V₁
Step 2: Compute K, V for token 2         → Store K₂, V₂
        Attention needs K₁, V₁           → Retrieved from cache
Step 3: Compute K, V for token 3         → Store K₃, V₃
        Attention needs K₁, V₁, K₂, V₂   → Retrieved from cache
...

Without cache: O(n²) key-value computations for n tokens. With cache: O(n) key-value computations.

The redundancy: at step n, we would otherwise recompute K₁…Kₙ₋₁ and V₁…Vₙ₋₁, but they’re identical to what we computed before.
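
A minimal sketch of the mechanism, with single-head attention and a toy cache that simply concatenates new keys and values onto the stored ones:

import torch

class KVCache:
    def __init__(self):
        self.keys = None      # (tokens_so_far, d_head)
        self.values = None

    def append(self, k_new, v_new):
        # k_new, v_new: (1, d_head) for the newly generated token
        self.keys = k_new if self.keys is None else torch.cat([self.keys, k_new])
        self.values = v_new if self.values is None else torch.cat([self.values, v_new])
        return self.keys, self.values

def attend_one_step(q_new, k_new, v_new, cache):
    """Attention for one new token, reusing cached K/V from all earlier steps."""
    keys, values = cache.append(k_new, v_new)
    scores = (q_new @ keys.T) / keys.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ values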

13.3.2 Prefix Caching

Many prompts share prefixes:

Request 1: "You are a helpful assistant. User: What is Python?"
Request 2: "You are a helpful assistant. User: What is Java?"
Request 3: "You are a helpful assistant. User: What is Rust?"

The system prompt is identical. Computing KV for it three times is redundant.

Prefix caching stores KV for common prefixes:

class PrefixCache:
    def __init__(self, max_size_gb=10):
        self.cache = {}  # hash(prefix) → KV tensors
        self.max_size = max_size_gb * 1e9

    def get_or_compute(self, prefix_tokens, model):
        key = hash(tuple(prefix_tokens))

        if key in self.cache:
            return self.cache[key]  # Cache hit!

        # Cache miss: compute and store
        kv = model.compute_kv(prefix_tokens)
        self.cache[key] = kv
        self.maybe_evict()

        return kv

    def maybe_evict(self):
        # Placeholder policy: evict oldest entries first (dicts preserve
        # insertion order). A production cache would use LRU or reference
        # counting rather than this simple FIFO.
        while self.cache and self._total_bytes() > self.max_size:
            self.cache.pop(next(iter(self.cache)))

    def _total_bytes(self):
        # Assumes each cached value is a tuple or list of tensors
        return sum(t.nelement() * t.element_size()
                   for kv in self.cache.values() for t in kv)

13.3.3 Semantic Caching

Even more aggressive: cache based on meaning, not exact match.

Request 1: "What's the capital of France?"
Request 2: "What is the capital city of France?"
Request 3: "France's capital?"

These are semantically identical. If we’ve answered one, we can return the cached answer for the others.

def semantic_cache_lookup(query, cache, threshold=0.95):
    """Find semantically similar cached queries.

    Assumes `embed` returns a vector from some sentence-embedding model and
    `cosine_similarity` is standard cosine similarity. In practice you would
    store each query's embedding alongside its cache entry rather than
    re-embedding every cached query on each lookup.
    """
    query_embedding = embed(query)

    for cached_query, cached_response in cache.items():
        similarity = cosine_similarity(query_embedding, embed(cached_query))
        if similarity > threshold:
            return cached_response  # Semantic cache hit!

    return None  # Cache miss

13.4 Approximation: Trading Precision for Speed

Sometimes exact computation is redundant for the use case.

13.4.1 Speculative Decoding

Use a small “draft” model to generate candidates, verify with the large model:

Draft model (1B params):  Generates 5 tokens quickly
Large model (70B params): Verifies/corrects in one forward pass

If draft is 80% accurate:
  - Expected accepted tokens per step: ~4
  - Speedup: ~3-4×

The draft model’s approximate answers are “good enough” most of the time. We’re exploiting the redundancy between what the small and large models produce.
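
A toy sketch of the draft-then-verify loop makes the flow concrete. This is the greedy-verification variant (the published algorithm uses rejection sampling to preserve the target distribution exactly), and draft_model.generate / target_model.predict_next are hypothetical interfaces:

def speculative_step(context, draft_model, target_model, k=5):
    """One draft-then-verify step over a list of token ids."""
    # 1. Draft k tokens cheaply with the small model
    drafted = draft_model.generate(context, num_tokens=k)

    # 2. One expensive large-model pass predicts the next token at every
    #    position of context + drafted
    predictions = target_model.predict_next(context + drafted)

    # 3. Accept drafted tokens while the large model agrees; on the first
    #    disagreement, substitute the large model's token and stop
    accepted = []
    for i, token in enumerate(drafted):
        target_token = predictions[len(context) + i - 1]
        if token == target_token:
            accepted.append(token)
        else:
            accepted.append(target_token)
            break
    else:
        # Every draft accepted: the same pass also yields one bonus token
        accepted.append(predictions[len(context) + k - 1])
    return accepted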

13.4.2 Approximate Attention

Full attention: Every token attends to every other token.

But most attention weights are near zero:

Typical attention pattern:

  Q │ ██░░░░░░░░░░░░░░██
  u │ ███░░░░░░░░░░░░███
  e │ ░███░░░░░░░░░░███░
  r │ ░░███░░░░░░░░███░░
  y │ ░░░████░░░░████░░░
    │ ░░░░░█████████░░░░
    │ ░░░░░░░██████░░░░░
    └─────────────────────
              Key

█ = High attention (matters)
░ = Low attention (redundant)

Sparse attention exploits this: only compute the non-zero parts.
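
One deliberately simple instance is local (sliding-window) attention: each query attends only to its most recent keys. The loop below is written for clarity, not speed; real implementations fuse this into block-sparse kernels:

import torch

def local_attention(q, k, v, window=128):
    """Each query i attends only to keys max(0, i-window)..i (causal, sliding window)."""
    n, d = q.shape
    out = torch.zeros_like(v)
    for i in range(n):
        lo = max(0, i - window)
        scores = (q[i] @ k[lo:i + 1].T) / d ** 0.5
        out[i] = torch.softmax(scores, dim=-1) @ v[lo:i + 1]
    return out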

13.5 Redundancy in Model Architecture

13.5.1 Weight Sharing

CNNs share weights across spatial positions:

Image convolution:
Same 3×3 kernel applied everywhere
  - Instead of: H×W×9 parameters per layer
  - We have: 9 parameters per layer
  - Compression factor: H×W (often >1000×)

This works because images have translation invariance: the same features appear in different locations.
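
The parameter counts above are easy to verify. The snippet below uses a single-channel 3×3 convolution and a 224×224 input purely as an illustrative example:

import torch.nn as nn

conv = nn.Conv2d(in_channels=1, out_channels=1, kernel_size=3, bias=False)
shared = sum(p.numel() for p in conv.parameters())    # 9: one shared 3×3 kernel

H, W = 224, 224
unshared = H * W * 9                                  # a separate 3×3 filter per position
print(shared, unshared, unshared // shared)           # 9 451584 50176  (>1000×)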

13.5.2 Knowledge Distillation

A student network learns to mimic a teacher:

Teacher (100B params) → [Knowledge] → Student (1B params)

The student captures the “essence” of the teacher’s knowledge. The redundancy is in the teacher—it uses far more parameters than necessary to represent what it knows.
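
In practice the student is trained against the teacher’s softened output distribution plus the usual hard labels. A minimal sketch of that loss (temperature T and mixing weight alpha are hyperparameters to tune):

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-target KL term (scaled by T²) mixed with the usual hard-label loss."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard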

13.5.3 Mixture of Experts Routing

MoE activates only k of N experts per token:

N = 64 experts, k = 2 activated

The 62 non-activated experts are "redundant" for this token.
Total capacity: 64× a dense model
Compute cost: 2× a single expert ≈ dense model

Different tokens need different experts. The redundancy is conditional—what’s redundant depends on the input.
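
A sketch of the routing step, assuming a per-token hidden vector, a learned router matrix, and a plain Python list of expert modules:

import torch

def topk_route(token_hidden, router_weights, experts, k=2):
    """Route one token's hidden state through the top-k scored experts.
    `experts` is a list of N callables; `router_weights` has shape (d, N)."""
    scores = token_hidden @ router_weights         # one score per expert
    topk_scores, topk_idx = scores.topk(k)
    gates = torch.softmax(topk_scores, dim=-1)     # renormalize over the chosen k

    # Only k of the N experts do any work for this token; the rest are skipped
    return sum(g * experts[i](token_hidden) for g, i in zip(gates, topk_idx.tolist()))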

13.6 The Redundancy-Quality Tradeoff

All redundancy exploitation involves a tradeoff:

More Compression ←────────────────────→ Higher Quality
     │                                        │
     │    • 2-bit quantization               │
     │    • Aggressive pruning               │
     │    • Small draft models               │
     │                                        │
     │              Sweet spot               │
     │                 ↓                      │
     │            • 4-bit quant              │
     │            • Moderate pruning         │
     │            • Good draft models        │
     │                                        │
     │                                        │    • 16-bit
     │                                        │    • No pruning
     └────────────────────────────────────────┘

Finding the sweet spot:

  1. Measure baseline quality (perplexity, accuracy, human eval)
  2. Apply compression incrementally
  3. Track quality degradation
  4. Stop when quality/size ratio is optimal

def find_optimal_quantization(model, eval_dataset, quality_threshold=0.95):
    """Find the most aggressive quantization that still maintains quality.

    Assumes `quantize` and `evaluate` are provided elsewhere, and that
    `evaluate` returns a higher-is-better score (e.g., accuracy, not perplexity).
    """
    baseline = evaluate(model, eval_dataset)
    best = None

    for bits in [8, 6, 4, 3, 2]:  # progressively more aggressive
        quantized = quantize(model, bits=bits)
        quality = evaluate(quantized, eval_dataset)

        if quality / baseline >= quality_threshold:
            print(f"{bits}-bit maintains {quality/baseline:.1%} of quality")
            best = (bits, quantized)  # keep going: fewer bits may still pass
        else:
            print(f"{bits}-bit degrades to {quality/baseline:.1%} - too aggressive")
            break  # quality only gets worse with fewer bits

    return best if best is not None else (8, quantize(model, bits=8))  # fall back to INT8

13.7 Key Takeaways

Redundancy Enables Compression
  1. Information < Representation: Neural network weights contain far less information than their bit-width suggests.

  2. Quantization exploits numeric redundancy: 4-bit weights often match 32-bit quality because the extra precision is noise.

  3. Caching exploits computational redundancy: Don’t recompute what you’ve already computed.

  4. Approximation exploits accuracy redundancy: “Good enough” is often good enough.

  5. The tradeoff is measurable: Quantify quality vs. compression and find the sweet spot.


Next: Symmetry →