13 Redundancy
When Less Information Means More Speed
A 70-billion-parameter model stores weights in 32-bit floats: 280 GB of data.
Reduce to 4 bits: 35 GB. The model barely notices.
We just threw away 87.5% of the bits. Why does it still work?
Redundancy exists when the information content of data is less than its representation size.
\[H(X) < \text{bits}(X)\]
When data is redundant, we can:
- Compress without losing essential information
- Quantize to lower precision
- Cache computed results (computation is “redundant” if repeated)
- Approximate with simpler representations
This chapter explores how redundancy enables some of the most practical optimizations in ML systems.
Claude Shannon’s “A Mathematical Theory of Communication” formalized the concept of information. His key insight: the information content of a message is distinct from its representation. A message with predictable patterns carries less information than its bit-length suggests.
Shannon defined entropy as the measure of true information content. When entropy is less than representation size, redundancy exists—and compression is possible. This principle underlies quantization: neural network weights have low entropy (they’re structured, not random), so aggressive bit-reduction works.
13.1 The Information-Theoretic View
Claude Shannon defined information as surprise. If you can predict something, it carries no information.
Neural network weights are not random. They’re structured:
Typical weight distribution:
- Concentrated near zero
- Roughly Gaussian
- Correlated across layers
- Low effective rank (as we saw with LoRA)
This structure means the weights contain less information than their representation suggests.
13.1.1 Measuring Redundancy
For a weight tensor W stored in b bits per element:
\[\text{Redundancy} = b - H(W)\]
where H(W) is the entropy of the weight distribution.
Empirically, for trained neural networks:
- FP32 uses 32 bits per weight
- Actual entropy: often 4-8 bits per weight
- Redundancy: 24-28 bits per weight
This is why quantization works.
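You can get a rough feel for this yourself by estimating the empirical entropy of a weight tensor from a histogram of its values. The sketch below is only an illustrative approximation (histogram entropy is a crude proxy for information content, and estimate_weight_entropy is a name chosen here, not a library function):

import torch

def estimate_weight_entropy(weights, bins=256):
    """Rough empirical entropy (bits per weight) via a value histogram."""
    hist = torch.histc(weights.float().flatten(), bins=bins)
    probs = hist / hist.sum()
    probs = probs[probs > 0]  # drop empty bins
    return -(probs * torch.log2(probs)).sum().item()

# Example: Gaussian-like weights measured at 8-bit (256-bin) resolution
w = torch.randn(1_000_000) * 0.02
print(f"Estimated entropy: {estimate_weight_entropy(w):.2f} bits (stored in 32)")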
13.2 Quantization: Exploiting Numeric Redundancy
Quantization reduces the bits per value:
FP32 → FP16: 32 → 16 bits (2× compression)
FP16 → INT8: 16 → 8 bits (2× compression)
INT8 → INT4: 8 → 4 bits (2× compression)
FP32 → INT4: 32 → 4 bits (8× compression)
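In concrete terms, here is the memory footprint of the 70-billion-parameter model from the chapter opening at each bit width (simple arithmetic, ignoring quantization metadata such as scale factors):

params = 70e9
for bits in [32, 16, 8, 4]:
    print(f"{bits:>2}-bit: {params * bits / 8 / 1e9:.0f} GB")
# 32-bit: 280 GB, 16-bit: 140 GB, 8-bit: 70 GB, 4-bit: 35 GB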
13.2.1 Why It Works
Neural networks are trained with noise:
- Stochastic gradient descent
- Dropout
- Data augmentation
- Mini-batch variance
The training process makes networks robust to small perturbations. Quantization error is just another small perturbation.
import torch

def quantization_error_analysis(weights, bits=8):
    """Analyze quantization error relative to weight magnitude."""
    # Affine (min-max) quantization: map the weight range onto 2**bits levels
    scale = (weights.max() - weights.min()) / (2**bits - 1)
    quantized = torch.round((weights - weights.min()) / scale)
    dequantized = quantized * scale + weights.min()
    # Error analysis: compare the reconstructed weights to the originals
    error = (weights - dequantized).abs()
    relative_error = error / (weights.abs() + 1e-10)
    print(f"Bits: {bits}")
    print(f"Max absolute error: {error.max():.6f}")
    print(f"Mean relative error: {relative_error.mean():.2%}")
    print(f"Weights affected >1%: {(relative_error > 0.01).float().mean():.2%}")

13.2.2 The Outlier Problem
Not all weights are equal. Some have outsize impact:
Weight distribution in LLMs:

      │
      │           ████████████████
      │        ██████████████████████
      │       █████████████████████████
      │    ██████████████████████████████
 ─────┼─────────────────────────────────────•──────
      │                                  outliers
      └────────────────────────────────────────────→
A few outlier weights carry critical information. Quantizing them aggressively destroys model quality.
Solutions:
- Mixed precision: Keep outliers in higher precision
- Per-channel scaling: Different scale factors for different channels
- GPTQ/AWQ: Optimize quantization to minimize output error
def quantize_to_int4(weights):
    # Min-max affine quantization to 4 bits (the same scheme as in 13.2.1)
    scale = (weights.max() - weights.min()) / (2**4 - 1)
    return torch.round((weights - weights.min()) / scale).to(torch.uint8)

def mixed_precision_quantize(weights, outlier_threshold=3.0):
    """Keep outliers in FP16, quantize the rest to INT4."""
    std = weights.std()
    outlier_mask = weights.abs() > outlier_threshold * std
    # Outliers stay FP16
    outliers = weights[outlier_mask].to(torch.float16)
    # Rest goes to INT4
    normal = weights[~outlier_mask]
    normal_quantized = quantize_to_int4(normal)
    return outliers, outlier_mask, normal_quantized

13.3 Caching: Exploiting Computational Redundancy
If you compute the same thing twice, the second computation is redundant.
13.3.1 KV Cache in Autoregressive Generation
During LLM generation:
Step 1: Compute K, V for token 1 → Store K₁, V₁
Step 2: Compute K, V for token 2 → Store K₂, V₂
        Attention needs K₁, V₁ → Retrieved from cache
Step 3: Compute K, V for token 3 → Store K₃, V₃
        Attention needs K₁, V₁, K₂, V₂ → Retrieved from cache
...
Without cache: O(n²) key-value computations for n tokens. With cache: O(n) key-value computations.
The redundancy: at step n, we’d recompute K₁…Kₙ₋₁ and V₁…Vₙ₋₁. But they’re identical to what we computed before.
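A minimal greedy decoding loop that exploits this, assuming a Hugging Face-style causal LM whose forward pass accepts and returns past_key_values (a sketch, not the only way to wire up a KV cache):

import torch

@torch.no_grad()
def generate_with_kv_cache(model, input_ids, max_new_tokens=32):
    """Greedy decoding that reuses cached K/V instead of recomputing them."""
    past_key_values = None
    generated = input_ids
    next_input = input_ids          # first step: the full prompt
    for _ in range(max_new_tokens):
        out = model(next_input, past_key_values=past_key_values, use_cache=True)
        past_key_values = out.past_key_values          # all K, V computed so far
        next_token = out.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_token], dim=-1)
        next_input = next_token     # only the newest token is fed back in
    return generated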
13.3.2 Prefix Caching
Many prompts share prefixes:
Request 1: "You are a helpful assistant. User: What is Python?"
Request 2: "You are a helpful assistant. User: What is Java?"
Request 3: "You are a helpful assistant. User: What is Rust?"
The system prompt is identical. Computing KV for it three times is redundant.
Prefix caching stores KV for common prefixes:
class PrefixCache:
    def __init__(self, max_size_gb=10):
        self.cache = {}  # hash(prefix) → KV tensors
        self.max_size = max_size_gb * 1e9

    def get_or_compute(self, prefix_tokens, model):
        key = hash(tuple(prefix_tokens))
        if key in self.cache:
            return self.cache[key]  # Cache hit!
        # Cache miss: compute and store
        kv = model.compute_kv(prefix_tokens)
        self.cache[key] = kv
        self.maybe_evict()
        return kv

    def maybe_evict(self):
        # Placeholder: evict entries (e.g., least recently used) once the
        # cached tensors exceed self.max_size bytes.
        pass

13.3.3 Semantic Caching
Even more aggressive: cache based on meaning, not exact match.
Request 1: "What's the capital of France?"
Request 2: "What is the capital city of France?"
Request 3: "France's capital?"
These are semantically identical. If we’ve answered one, we can return the cached answer for the others.
def semantic_cache_lookup(query, cache, threshold=0.95):
    """Find semantically similar cached queries.

    `embed` is any sentence-embedding model and `cosine_similarity` compares
    two embedding vectors. A production cache would store precomputed
    embeddings rather than re-embedding every cached query on each lookup.
    """
    query_embedding = embed(query)
    for cached_query, cached_response in cache.items():
        similarity = cosine_similarity(query_embedding, embed(cached_query))
        if similarity > threshold:
            return cached_response  # Semantic cache hit!
    return None  # Cache miss

13.4 Approximation: Trading Precision for Speed
Sometimes exact computation is redundant for the use case.
13.4.1 Speculative Decoding
Use a small “draft” model to generate candidates, verify with the large model:
Draft model (1B params): Generates 5 tokens quickly
Large model (70B params): Verifies/corrects in one forward pass
If draft is 80% accurate:
- Expected accepted tokens per step: ~4
- Speedup: ~3-4×
The draft model’s approximate answers are “good enough” most of the time. We’re exploiting the redundancy between what the small and large models produce.
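A sketch of one greedy speculative-decoding step, under the assumption that both models are callables mapping a token list to per-position next-token logits (the rejection-sampling variant used when sampling is omitted):

import torch

def speculative_decode_step(draft_model, target_model, context, k=5):
    """One greedy speculative step: the draft proposes k tokens, the target
    verifies them in a single forward pass, and we keep the agreeing prefix."""
    # 1. Draft model proposes k tokens autoregressively (cheap).
    draft_tokens, draft_ctx = [], list(context)
    for _ in range(k):
        tok = int(draft_model(draft_ctx)[-1].argmax())
        draft_tokens.append(tok)
        draft_ctx.append(tok)
    # 2. Target model scores the context plus all k drafts in one pass.
    logits = target_model(list(context) + draft_tokens)
    target_preds = logits[len(context) - 1:].argmax(dim=-1)
    # 3. Accept draft tokens while the target agrees; on the first
    #    disagreement, take the target's own token instead and stop.
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if int(target_preds[i]) == tok:
            accepted.append(tok)
        else:
            accepted.append(int(target_preds[i]))
            break
    return accepted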
13.4.2 Approximate Attention
Full attention: Every token attends to every other token.
But most attention weights are near zero:
Typical attention pattern:
Q │ ██░░░░░░░░░░░░░░██
u │ ███░░░░░░░░░░░░███
e │ ░███░░░░░░░░░░███░
r │ ░░███░░░░░░░░███░░
y │ ░░░████░░░░████░░░
  │ ░░░░░█████████░░░░
  │ ░░░░░░░██████░░░░░
  └─────────────────────
Key
█ = High attention (matters)
░ = Low attention (redundant)
Sparse attention exploits this: only compute the non-zero parts.
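A mask-based sketch of causal sliding-window attention. Note that this version still materializes the full score matrix and merely zeroes out the masked part, so it shows the pattern rather than the speedup; real sparse-attention kernels skip the masked blocks entirely:

import torch

def sliding_window_attention(q, k, v, window=128):
    """Each query attends only to itself and the previous `window` keys."""
    n, d = q.shape
    scores = (q @ k.T) / d**0.5                       # [n, n] attention scores
    pos = torch.arange(n)
    in_window = (pos[:, None] - pos[None, :]).abs() <= window
    causal = pos[:, None] >= pos[None, :]
    scores = scores.masked_fill(~(in_window & causal), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v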
13.5 Redundancy in Model Architecture
13.5.1 Weight Sharing
CNNs share weights across spatial positions:
Image convolution:
Same 3×3 kernel applied everywhere
- Instead of: H×W×9 parameters per layer (a separate, untied 3×3 filter at every position)
- We have: 9 parameters per layer
- Compression factor: H×W (often >1000×)
This works because images have translation invariance: the same features appear in different locations.
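A quick back-of-the-envelope check of that ratio (the feature-map size is illustrative):

H, W, k = 224, 224, 3
untied = H * W * k * k   # a separate 3×3 filter at every spatial position
shared = k * k           # one 3×3 kernel reused everywhere
print(untied, shared, untied // shared)   # 451584 9 50176 → ~50,000× fewer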
13.5.2 Knowledge Distillation
A student network learns to mimic a teacher:
Teacher (100B params) → [Knowledge] → Student (1B params)
The student captures the “essence” of the teacher’s knowledge. The redundancy is in the teacher—it uses far more parameters than necessary to represent what it knows.
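The standard way to transfer that knowledge is a soft-target loss: the student matches the teacher’s temperature-softened output distribution in addition to the ordinary label loss. A minimal sketch, with T and alpha as illustrative hyperparameters:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend KL-to-teacher (soft targets) with cross-entropy on true labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                      # T² keeps the gradient scale comparable
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard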
13.5.3 Mixture of Experts Routing
MoE activates only k of N experts per token:
N = 64 experts, k = 2 activated
The 62 non-activated experts are "redundant" for this token.
Total capacity: 64 experts' worth of parameters
Compute cost: only 2 experts per token ≈ a dense model's FLOPs
Different tokens need different experts. The redundancy is conditional—what’s redundant depends on the input.
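A minimal top-k routing sketch (the router and experts are assumed to be callables, e.g. small nn.Linear or MLP modules; load balancing and capacity limits are omitted):

import torch

def moe_forward(x, experts, router, k=2):
    """Route each token to its top-k experts; the other N-k experts never run."""
    gate = torch.softmax(router(x), dim=-1)          # [tokens, N] routing weights
    weights, idx = torch.topk(gate, k, dim=-1)       # keep only the top-k experts
    out = torch.zeros_like(x)
    for t in range(x.shape[0]):
        for w, e in zip(weights[t], idx[t]):
            out[t] += w * experts[int(e)](x[t])      # only k expert calls per token
    return out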
13.6 The Redundancy-Quality Tradeoff
All redundancy exploitation involves a tradeoff:
More Compression ←──────────────────────────────────────→ Higher Quality

  • 2-bit quantization       • 4-bit quant             • 16-bit
  • Aggressive pruning       • Moderate pruning        • No pruning
  • Small draft models       • Good draft models
                                    ↑
                                Sweet spot
Finding the sweet spot:
- Measure baseline quality (perplexity, accuracy, human eval)
- Apply compression incrementally
- Track quality degradation
- Stop when quality/size ratio is optimal
def find_optimal_quantization(model, eval_dataset, quality_threshold=0.95):
    """Find the most aggressive quantization that maintains quality.

    `quantize` and `evaluate` are assumed helpers: `quantize` returns a
    bit-reduced copy of the model, `evaluate` returns a quality score.
    """
    baseline = evaluate(model, eval_dataset)
    best = None
    for bits in [8, 6, 4, 3, 2]:
        quantized = quantize(model, bits=bits)
        quality = evaluate(quantized, eval_dataset)
        if quality / baseline >= quality_threshold:
            print(f"{bits}-bit maintains {quality/baseline:.1%} of quality")
            best = (bits, quantized)  # Acceptable: remember it and try fewer bits
        else:
            print(f"{bits}-bit degrades to {quality/baseline:.1%} - too aggressive")
            break  # Quality dropped: stop and keep the last acceptable setting
    return best if best is not None else (8, quantize(model, bits=8))  # Fall back to INT8

13.7 Key Takeaways
Information < Representation: Neural network weights contain far less information than their bit-width suggests.
Quantization exploits numeric redundancy: 4-bit weights often match 32-bit quality because the extra precision is noise.
Caching exploits computational redundancy: Don’t recompute what you’ve already computed.
Approximation exploits accuracy redundancy: “Good enough” is often good enough.
The tradeoff is measurable: Quantify quality vs. compression and find the sweet spot.