18  Investigation: Quantization

Why 4 Bits Can Do the Work of 32

Take a 70-billion-parameter model. Reduce every weight from 32 bits to 4 bits.

You’ve just thrown away 87.5% of the information.

The model still works. How?

Note: Property Spotlight: Redundancy

This chapter is a case study in redundancy—the fifth property from our Algebraic Framework.

When the information content of data is less than its representation size, we have redundancy. Neural network weights contain far less information than their 32-bit encoding suggests. Quantization exploits this gap.

This chapter investigates why neural networks tolerate aggressive quantization and how to push further without breaking.

18.1 The Paradox

Neural network weights are typically stored as 32-bit or 16-bit floating-point numbers. That precision seems necessary—after all, training involves subtle gradient updates.

But at inference time, something remarkable happens:

LLaMA-2 70B

FP16:  140 GB memory, 100 tokens/sec
INT8:   70 GB memory, 150 tokens/sec
INT4:   35 GB memory, 200 tokens/sec

Perplexity change from FP16 → INT4: +0.3 (barely noticeable)
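The memory arithmetic is easy to verify (a minimal sketch; the tokens/sec figures above depend on hardware and serving stack, so only the memory side is reproduced here):

params = 70e9  # LLaMA-2 70B

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    gb = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: {gb:.0f} GB")

# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB
# Going from 32 bits to 4 bits discards 28 of every 32 bits: 87.5%.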

We reduced precision by 4×, memory by 4×, and the model barely noticed.

This chapter investigates why.

18.2 The Mathematics of Quantization

Quantization maps continuous values to a discrete set:

\[q(x) = \text{round}\left(\frac{x - z}{s}\right)\]

where:

  • \(x\) is the original value
  • \(s\) is the scale factor
  • \(z\) is the zero point
  • the result is an integer in a fixed range (0 to 255 for the unsigned 8-bit scheme used below, or -128 to 127 for signed INT8)

Dequantization recovers an approximation:

\[\hat{x} = s \cdot q(x) + z\]

The quantization error is:

\[\epsilon = x - \hat{x} = x - s \cdot \text{round}\left(\frac{x - z}{s}\right) - z\]

For uniform quantization with \(n\) bits, the scale is:

\[s = \frac{x_{max} - x_{min}}{2^n - 1}\]
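Modeling the rounding error as uniform on \([-s/2, s/2]\), a standard assumption for fine quantization, gives

\[\mathbb{E}[|\epsilon|] = \frac{s}{4}, \qquad \mathbb{E}[\epsilon^2] = \frac{s^2}{12}\]

so each bit removed doubles \(s\) and quadruples the error power. The implementation below measures this error directly.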

import numpy as np

def quantize(x, bits=8):
    """Quantize to n-bit integer."""
    x_min, x_max = x.min(), x.max()
    scale = (x_max - x_min) / (2**bits - 1)
    zero_point = x_min

    q = np.round((x - zero_point) / scale).astype(int)
    q = np.clip(q, 0, 2**bits - 1)

    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Dequantize back to float."""
    return q.astype(float) * scale + zero_point

# Example
x = np.random.randn(1000) * 0.1  # Typical weight distribution
q, s, z = quantize(x, bits=8)
x_hat = dequantize(q, s, z)

error = np.abs(x - x_hat)
print(f"Max error: {error.max():.6f}")
print(f"Mean error: {error.mean():.6f}")
print(f"Relative error: {(error / np.abs(x).mean()).mean() * 100:.2f}%")

# Example output (values vary with the random seed):
# Max error: 0.000780
# Mean error: 0.000195
# Relative error: 2.44%

18.3 Why Neural Networks Tolerate Quantization

Several factors explain the surprising robustness:

18.3.1 Training Noise Exceeds Quantization Noise

Neural networks are trained with stochastic gradient descent:

  • Mini-batches introduce variance
  • Dropout injects noise
  • Data augmentation varies inputs

The model learns to be robust to perturbations larger than quantization noise.

# Typical sources of noise during training (illustrative magnitudes;
# gradients_batch32, gradients_full, and scale would come from an actual run)

# SGD mini-batch variance (batch size 32)
sgd_noise = np.std(gradients_batch32 - gradients_full)  # typically ~0.01-0.1

# Dropout noise (p=0.1)
dropout_noise = 0.1  # 10% of activations zeroed

# INT8 quantization noise (at most half a quantization step)
int8_noise = scale / 2  # ~0.001 for typical weight ranges

# int8_noise << sgd_noise:
# the network already handles much larger perturbations than quantization adds

18.3.2 Flat Minima and Robustness

Networks converge to flat regions of the loss landscape where the loss is insensitive to small weight changes.

Loss landscape visualization:

Sharp minimum:          Flat minimum:
  ╲      ╱             ╲          ╱
   ╲    ╱               ╲        ╱
    ╲  ╱                 ╲      ╱
     ╲╱                     ────

Quantization = small perturbation.
Flat minimum → small loss change.

Modern training techniques (large batches, weight decay, learning rate schedules) encourage convergence to flat minima.
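A one-dimensional caricature makes the point (a sketch with made-up curvatures, not measurements): approximate the loss near a minimum by \(L(w) = \tfrac{1}{2} h (w - w^*)^2\) and apply a quantization-sized perturbation.

def loss_increase(curvature, perturbation):
    """Loss change for a quadratic minimum: dL = 0.5 * h * delta^2."""
    return 0.5 * curvature * perturbation**2

delta = 0.001  # roughly an INT8-sized weight perturbation

print(loss_increase(curvature=100.0, perturbation=delta))  # sharp minimum: 5e-05
print(loss_increase(curvature=0.1,   perturbation=delta))  # flat minimum:  5e-08

# Same perturbation, 1000× smaller loss increase at the flat minimum.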

18.3.3 Overparameterization Creates Redundancy

A 70B parameter model has 70 billion numbers. But its “effective capacity” is much lower:

  • Intrinsic dimensionality is often <1% of the parameter count
  • Many weight configurations produce identical behavior
  • The network is robust to removing or perturbing individual weights

This redundancy means quantization removes information the network doesn’t need.
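A toy example of “many weight configurations produce identical behavior” (a sketch with arbitrary dimensions): if the inputs a layer ever sees occupy only a low-dimensional subspace, any weight change orthogonal to that subspace is invisible at the output.

import numpy as np
rng = np.random.default_rng(0)

d, k = 512, 32                                          # weight dim vs. input subspace dim
basis = np.linalg.qr(rng.standard_normal((d, k)))[0]    # inputs live in span(basis)
X = basis @ rng.standard_normal((k, 100))               # 100 inputs of dimension d

w = rng.standard_normal(d)                  # original weights
null_dir = rng.standard_normal(d)
null_dir -= basis @ (basis.T @ null_dir)    # remove the part the inputs can see
w_perturbed = w + 10.0 * null_dir           # large change, orthogonal to all inputs

print(np.abs(w @ X - w_perturbed @ X).max())  # ≈ 0: the outputs are identical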

18.3.4 The Lipschitz Property

Well-trained networks have bounded sensitivity to input perturbations:

\[\|f(x + \delta) - f(x)\| \leq L \cdot \|\delta\|\]

where \(L\) is the Lipschitz constant. The same idea applies to the weights: on a bounded input domain the output is also Lipschitz in the parameters, so quantization, which is just a small parameter perturbation, produces a correspondingly small change in the output.
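For a single linear layer the Lipschitz constant is just the spectral norm of the weight matrix, so the bound is easy to check numerically (minimal sketch):

import numpy as np
rng = np.random.default_rng(1)

W = rng.standard_normal((64, 128)) * 0.05
L = np.linalg.norm(W, 2)                  # spectral norm = Lipschitz constant of x -> W x

x = rng.standard_normal(128)
delta = rng.standard_normal(128) * 1e-3   # small input perturbation

lhs = np.linalg.norm(W @ (x + delta) - W @ x)
rhs = L * np.linalg.norm(delta)
print(lhs <= rhs + 1e-12)                 # True: the bound holds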

18.4 The Outlier Problem

Not all quantization is easy. Large language models have a specific challenge: outliers.

18.4.1 Emergent Features

As LLMs scale, they develop “emergent features”—activations that are 10-100× larger than typical:

# Activation statistics from LLaMA-65B

def analyze_activations(model, data):
    """Collect per-layer activation statistics (PyTorch-style sketch)."""
    activations = []
    acts = data
    for layer in model.layers:
        acts = layer(acts)  # feed each layer the previous layer's output
        activations.append({
            'mean': acts.abs().mean().item(),
            'max': acts.abs().max().item(),
            'ratio': (acts.abs().max() / acts.abs().mean()).item()
        })
    return activations

# Typical results:
# Layer 0:  mean=0.42, max=8.3,   ratio=19.8
# Layer 15: mean=0.38, max=47.2,  ratio=124.2  <- Outlier!
# Layer 30: mean=0.41, max=112.8, ratio=275.1  <- Extreme outlier!

18.4.2 Why Outliers Break Quantization

Uniform quantization uses the same scale for all values:

\[s = \frac{x_{max} - x_{min}}{2^n - 1}\]

If \(x_{max} = 100\) but most values are near 0.4, then:

  • the INT8 scale is ~0.8 (a roughly symmetric ±100 range spread over 256 levels)
  • most typical values map onto only 1-2 quantization levels
  • effective precision for typical values: ~1 bit

Value distribution with outliers:

         Typical values      Outliers
         ↓                   ↓
─────────┼───────────────────────────────────────┼───
         0                                     100

INT8 levels: ████████████████████████████████████████

Most values crammed into few levels → severe precision loss
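A minimal experiment, reusing the quantize/dequantize helpers from Section 18.2, shows how a single outlier wrecks precision for everything else:

x = np.random.randn(1000) * 0.1          # typical weights
x_outlier = x.copy()
x_outlier[0] = 100.0                     # one emergent-feature-sized outlier

for name, data in [("no outlier", x), ("with outlier", x_outlier)]:
    q, s, z = quantize(data, bits=8)
    err = np.abs(data - dequantize(q, s, z))[1:]   # error on the typical values only
    print(f"{name}: scale={s:.4f}, mean error on typical values={err.mean():.4f}")

# Without the outlier the scale is ~0.003; with it the scale jumps to ~0.4,
# and the typical values suffer a roughly 100× larger quantization error.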

18.4.3 Solutions to the Outlier Problem

Per-channel quantization: Different scale per output channel.

def per_channel_quantize(weight, bits=8):
    """Quantize each output channel separately."""
    n_channels = weight.shape[0]
    scales = np.zeros(n_channels)
    zero_points = np.zeros(n_channels)
    q = np.zeros_like(weight, dtype=int)

    for c in range(n_channels):
        channel = weight[c]
        q[c], scales[c], zero_points[c] = quantize(channel, bits)

    return q, scales, zero_points

# Per-channel handles the case where different channels have different ranges

Mixed precision: Keep outlier channels in FP16, quantize the rest.

def mixed_precision_quantize(weight, outlier_threshold=6.0, bits=4):
    """Keep outliers in FP16, quantize rest to 4-bit."""
    # Find outlier channels
    channel_max = np.abs(weight).max(axis=1)
    median_max = np.median(channel_max)
    outliers = channel_max > outlier_threshold * median_max

    # Quantize non-outliers
    q_weight = np.zeros_like(weight, dtype=int)
    scales = np.zeros(weight.shape[0])

    for c in range(weight.shape[0]):
        if not outliers[c]:
            q_weight[c], scales[c], _ = quantize(weight[c], bits)

    return q_weight, scales, outliers, weight[outliers]  # Keep outliers in FP16

18.5 Modern Quantization Techniques

18.5.1 GPTQ: Optimal Brain Quantization

GPTQ (Generative Pretrained Transformer Quantization) adapts the Optimal Brain Quantization idea, itself descended from Optimal Brain Surgeon: quantize a layer's weights one column at a time, updating the not-yet-quantized weights to compensate for the error introduced.

def gptq_quantize_layer(W, H, bits=4):
    """
    GPTQ (simplified sketch): quantize columns one at a time, using Hessian
    information to redistribute each column's quantization error onto the
    columns that have not been quantized yet.

    W: weight matrix to quantize (modified in place)
    H: Hessian approximation (X^T X from calibration data)
    """
    n, m = W.shape
    Q = np.zeros_like(W)  # Quantized weights

    # Process columns in order of Hessian diagonal (most important first)
    order = np.argsort(np.diag(H))[::-1]

    for pos, i in enumerate(order):
        # Current column
        w = W[:, i]

        # Quantize this column
        q, scale, zp = quantize(w, bits)
        Q[:, i] = dequantize(q, scale, zp)

        # Compute quantization error
        error = w - Q[:, i]

        # Update the remaining (not-yet-quantized) columns to compensate
        # This is the key insight: distribute the error
        for j in order[pos + 1:]:
            W[:, j] -= error * H[i, j] / H[i, i]

    return Q

GPTQ achieves near-lossless 4-bit quantization by distributing quantization error across unquantized weights.

18.5.2 AWQ: Activation-Aware Weight Quantization

AWQ observes that some weights are more important than others—specifically, weights that interact with large activations:

def awq_quantize(W, activations, bits=4):
    """
    AWQ: Scale weights before quantization based on activation magnitude.
    """
    # Compute per-channel activation magnitude
    act_scale = activations.abs().mean(dim=0)

    # Scale up important weights (those with high activation)
    # This gives them more quantization bins
    importance = act_scale / act_scale.mean()
    scaled_W = W * importance.unsqueeze(0)

    # Quantize the scaled weights
    Q, scales, zp = quantize(scaled_W, bits)

    # Compensate in scale factors
    # Final: Q * scale / importance = W_quantized
    adjusted_scales = scales / importance

    return Q, adjusted_scales

AWQ’s insight: if a weight interacts with large activations, its quantization error gets amplified. Protect those weights.
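The amplification is visible on a single multiply (a sketch with arbitrary numbers):

w = 0.40
dw = 0.01                      # per-weight quantization error
x_small, x_large = 0.5, 50.0   # typical vs. outlier activation

print(abs(dw * x_small))  # 0.005: error barely visible in the output
print(abs(dw * x_large))  # 0.5:   same weight error, 100× larger output error

# AWQ protects the weights that meet large activations, since their errors dominate.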

18.5.3 SmoothQuant: Migrating Difficulty

SmoothQuant observes that activations are harder to quantize than weights (more outliers). Solution: mathematically migrate the difficulty from activations to weights:

\[Y = XW = (X \cdot \text{diag}(s)^{-1}) \cdot (\text{diag}(s) \cdot W) = \hat{X}\hat{W}\]

where \(s\) is chosen to balance the quantization difficulty:

import torch

def smooth_quant_transform(X, W, alpha=0.5):
    """
    Transform to balance quantization difficulty between X and W.

    X: activations, shape (n_tokens, d_in)
    W: weights,     shape (d_in, d_out), so that Y = X @ W

    alpha controls the migration:
      alpha=0: all difficulty stays in X
      alpha=1: all difficulty moves to W
      alpha=0.5: balanced (typically best)
    """
    # Per-input-channel ranges
    act_scales = X.abs().amax(dim=0)       # activation range per input channel
    weight_scales = W.abs().amax(dim=1)    # weight range per input channel

    # Smooth factor
    s = (act_scales ** alpha) / (weight_scales ** (1 - alpha))

    # Apply the transformation: (X / s) @ (s * W) == X @ W
    X_smooth = X / s                       # divide activation channels by s
    W_smooth = W * s.unsqueeze(1)          # scale the matching weight rows by s

    return X_smooth, W_smooth
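Because the transformation is an exact algebraic identity, a quick check confirms the product is unchanged while the activation outlier shrinks (usage sketch against the function above; shapes and the injected outlier are arbitrary):

import torch

X = torch.randn(16, 64)            # activations (n_tokens, d_in)
X[:, 3] *= 50                      # inject an outlier activation channel
W = torch.randn(64, 32) * 0.05     # weights (d_in, d_out)

X_s, W_s = smooth_quant_transform(X, W, alpha=0.5)

print(torch.allclose(X @ W, X_s @ W_s, atol=1e-4))    # True: same product
print(X.abs().max().item(), X_s.abs().max().item())   # activation range shrinks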

18.6 Hardware Acceleration

Quantization isn’t just about memory—it’s about compute speed.

18.6.1 Integer Tensor Cores

Modern GPUs have specialized hardware for integer matrix multiply:

NVIDIA A100 (whole GPU, dense):
  FP32:              19.5 TFLOPS
  FP16 Tensor Core:  312 TFLOPS (16× faster)
  INT8 Tensor Core:  624 TOPS   (32× faster than FP32)

The speedup compounds with memory savings:
  FP16 → INT8: 2× less memory traffic and 2× peak compute,
  i.e. up to ~2-4× throughput depending on where the bottleneck is

But there’s a catch: quantization overhead.

18.6.2 The Quantization-Dequantization Overhead

Quantized compute requires:

  1. Quantize inputs (or keep them in the integer domain from the previous layer)
  2. Compute in integer arithmetic
  3. Dequantize or requantize outputs (if chaining quantized layers)

def quantized_matmul(Q_W, scale_W, X, scale_X):
    """
    Quantized matrix multiply with scale bookkeeping
    (symmetric quantization, no zero points).
    """
    # Quantize input with the provided scale
    Q_X = np.clip(np.round(X / scale_X), -128, 127).astype(np.int32)

    # Integer matmul (fast on GPU): INT8 × INT8 → INT32 accumulation
    Q_Y = Q_X @ Q_W

    # Dequantize output
    Y = Q_Y * (scale_X * scale_W)

    return Y

The overhead is amortized over large matrix multiplies, making quantization most beneficial for:

  • Large models (more compute per unit of overhead)
  • Memory-bound operations (memory savings dominate)
  • Batched inference (per-batch overhead is amortized)

18.6.3 Where Quantization Helps Most

Inference regime analysis:

Batch size 1 (latency-sensitive):
  Memory-bound → quantization helps via memory reduction
  Speedup: ~2-4× from INT4 vs FP16

Large batch (throughput):
  Compute-bound → quantization helps via faster compute
  Speedup: ~2-8× depending on model and hardware

Training:
  Gradient precision matters more
  Typically FP16 or FP8, not INT4
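For the batch-size-1 case, a roofline-style estimate makes the memory-bound argument concrete (a sketch with assumed hardware numbers: ~2 TB/s is roughly A100-class HBM bandwidth, while the Section 18.1 figures additionally assume batching across requests and multiple GPUs):

def decode_tokens_per_sec(params, bits_per_weight, bandwidth_bytes_per_sec=2e12):
    """Upper bound for single-stream decoding: every weight is read once per token."""
    bytes_per_token = params * bits_per_weight / 8
    return bandwidth_bytes_per_sec / bytes_per_token

for name, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: ~{decode_tokens_per_sec(70e9, bits):.0f} tokens/sec upper bound")

# FP16: ~14/sec, INT8: ~29/sec, INT4: ~57/sec for a 70B model on one GPU:
# the speedup tracks the memory reduction, not the arithmetic.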

18.7 Practical Quantization

18.7.1 When to Use Each Technique

Technique        Bits   Quality     Speed vs FP16   Best For
──────────────────────────────────────────────────────────────────
FP16             16     Baseline    Baseline        Training, high-quality
INT8 (naive)     8      Good        ~1.5-2×         Simple deployment
INT8 (smooth)    8      Near FP16   ~1.5-2×         Production LLM serving
INT4 (GPTQ)      4      Good        3-4×            Memory-constrained
INT4 (AWQ)       4      Better      3-4×            Quality-sensitive apps
INT2-3           2-3    Degraded    4-6×            Extreme compression

18.7.2 Quality Evaluation

Always measure perplexity (or task-specific metrics) before and after:

def evaluate_quantization(model, quantized_model, eval_data):
    """Compare original and quantized model quality."""

    # Perplexity (language modeling)
    ppl_original = compute_perplexity(model, eval_data)
    ppl_quantized = compute_perplexity(quantized_model, eval_data)

    print(f"Original perplexity: {ppl_original:.2f}")
    print(f"Quantized perplexity: {ppl_quantized:.2f}")
    print(f"Perplexity increase: {(ppl_quantized - ppl_original):.2f} ({(ppl_quantized/ppl_original - 1)*100:.1f}%)")

    # Rule of thumb:
    # < 1% perplexity increase: Excellent
    # 1-5% increase: Good for most applications
    # > 5% increase: May affect downstream tasks

# Typical results for LLaMA-2 70B:
# FP16: 3.12 perplexity
# INT8: 3.14 perplexity (+0.6%)
# INT4 (GPTQ): 3.18 perplexity (+1.9%)
# INT4 (AWQ): 3.15 perplexity (+1.0%)

18.7.3 Common Pitfalls

  1. Quantizing without calibration data: Results in poor scale estimates
  2. Ignoring outliers: Uniform quantization fails on LLMs
  3. Wrong granularity: Per-tensor is too coarse; per-channel is usually needed
  4. Evaluating on wrong metric: Perplexity may hide task-specific degradation
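On pitfall 1, the simplest upgrade over min/max scaling is percentile clipping on a calibration sample: accept a little saturation on rare extremes in exchange for much finer resolution on typical values. A minimal sketch (the 99.9th-percentile choice is an assumption, tuned per model in practice):

import numpy as np

def calibrated_quantize(x, calibration, bits=8, percentile=99.9):
    """Pick the quantization range from calibration percentiles instead of min/max."""
    lo = np.percentile(calibration, 100 - percentile)
    hi = np.percentile(calibration, percentile)
    scale = (hi - lo) / (2**bits - 1)

    q = np.clip(np.round((x - lo) / scale), 0, 2**bits - 1).astype(int)
    return q, scale, lo

# Values outside [lo, hi] saturate, but typical values now span nearly the
# full 2^bits levels even when the calibration data contains outliers.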

18.8 The Derivation Pattern

How would you discover modern quantization if it didn’t exist?

  1. Observe: Models work fine at lower precision during inference

  2. Measure: Find that 8-bit is mostly fine, but 4-bit has outlier problems

  3. Analyze: Discover that outliers are concentrated in specific channels/layers

  4. Design solutions:

    • Per-channel scales (different ranges per channel)
    • Mixed precision (keep outliers in FP16)
    • Error compensation (GPTQ)
    • Preemptive scaling (AWQ, SmoothQuant)
  5. Validate: Measure perplexity and downstream tasks

The theme: understand the failure mode (outliers), then design around it.

18.9 Key Takeaways

  1. Neural networks are robust to quantization: Training noise, flat minima, and overparameterization create tolerance

  2. Outliers are the challenge: LLMs have emergent features 100× larger than typical values

  3. Solutions exist: GPTQ, AWQ, SmoothQuant each address outliers differently

  4. Hardware matters: Integer tensor cores plus reduced memory traffic typically yield 2-4× end-to-end speedups over FP16

  5. Always measure: Perplexity changes hide task-specific degradation

18.10 Advanced Quantization Methods

18.10.1 GGUF and the llama.cpp Ecosystem

GGUF (GPT-Generated Unified Format) is the standard format for running quantized LLMs on consumer hardware, particularly CPUs.

GGUF Quantization Types:

Type      Bits    Quality     Size (7B model)    Use Case
────────────────────────────────────────────────────────────
Q8_0      8       Excellent   7.2 GB             Best quality
Q6_K      6       Very Good   5.5 GB             Good balance
Q5_K_M    5       Good        4.8 GB             Recommended default
Q4_K_M    4       Good        4.1 GB             Memory-limited
Q4_0      4       Decent      3.8 GB             Fast, lower quality
Q3_K_M    3       Acceptable  3.3 GB             Extreme compression
Q2_K      2       Degraded    2.7 GB             Experimental

K-Quants: The K variants use a block-wise scheme: weights are grouped into small blocks inside super-blocks, each block carries its own quantized scale and minimum, and the _S/_M/_L suffixes mix bit-widths across tensors so that more sensitive tensors keep more precision:

# Conceptual sketch of importance-weighted bit allocation (not the actual
# llama.cpp K-quant code; compute_importance, allocate_bits, and
# quantize_to_bits are placeholders)
def k_quant_block(weights, block_size=32):
    """
    Mix precision within and across blocks: more important weights
    get more effective bits.
    """
    importance = compute_importance(weights)

    # High-importance weights: more bits
    # Low-importance weights: fewer bits
    bits_allocation = allocate_bits(importance, target_avg_bits=4)

    quantized = []
    for w, bits in zip(weights, bits_allocation):
        quantized.append(quantize_to_bits(w, bits))

    return quantized

Using GGUF models:

# With llama-cpp-python
from llama_cpp import Llama

# Load quantized model
llm = Llama(
    model_path="./llama-7b.Q4_K_M.gguf",
    n_ctx=4096,
    n_gpu_layers=35  # Offload layers to GPU
)

# Generate
output = llm("Explain quantum computing:", max_tokens=256)

When to use GGUF:

  • CPU inference (optimized for AVX2/AVX-512)
  • Consumer hardware (gaming GPUs, Macs)
  • Mixed CPU/GPU execution
  • Easy deployment (single file, no dependencies)

18.10.2 HQQ: Half-Quadratic Quantization

HQQ achieves high-quality quantization without calibration data:

# HQQ key insight: optimize the quantization parameters directly
# (conceptual sketch; optimize_params stands in for HQQ's closed-form
#  half-quadratic update)
def hqq_quantize(W, bits=4, dim=1, num_iterations=20):
    """
    HQQ: Zero-shot quantization via half-quadratic optimization.

    No calibration data needed!
    """
    # Initialize scale and zero point from the per-row weight range
    w_min = W.min(dim=dim, keepdim=True).values
    w_max = W.max(dim=dim, keepdim=True).values
    scale = (w_max - w_min) / (2**bits - 1)
    zero = w_min

    for _ in range(num_iterations):
        # Quantize with current parameters
        Q = torch.round((W - zero) / scale).clamp(0, 2**bits - 1)
        W_hat = Q * scale + zero  # current reconstruction

        # Optimize scale and zero to minimize the reconstruction error
        # (a sparsity-promoting loss in the real method) via
        # half-quadratic splitting
        scale, zero = optimize_params(W, Q, scale, zero)

    return Q, scale, zero

# Usage with HQQ library
from hqq.core.quantize import HQQLinear

# Replace linear layers with HQQ quantized versions
quantized_layer = HQQLinear(
    linear_layer,
    quant_config={'weight_quant_params': {'nbits': 4, 'group_size': 64}}
)

HQQ advantages:

  1. No calibration data required
  2. Fast quantization (minutes, not hours)
  3. Competitive quality with GPTQ/AWQ
  4. Good for dynamic quantization scenarios

18.10.3 AQLM: Additive Quantization for LLMs

AQLM uses vector quantization with learned codebooks:

# AQLM: Instead of scalar quantization, use vector quantization with learned
# codebooks (conceptual sketch: learn_codebook, lookup, and find_nearest are
# placeholders for k-means-style codebook learning and nearest-codeword search)
def aqlm_concept(W, num_codebooks=2, codebook_size=256):
    """
    AQLM: Represent weight vectors as sums of codebook entries.

    W[i] ≈ C1[idx1[i]] + C2[idx2[i]] + ...
    """
    # Learn codebooks on calibration data
    codebooks = []
    for c in range(num_codebooks):
        codebook = learn_codebook(W, size=codebook_size)
        codebooks.append(codebook)
        # Subtract codebook contribution for next round
        W = W - lookup(codebook, W)

    # Quantized representation: just indices
    indices = [[find_nearest(W_row, cb) for cb in codebooks] for W_row in W]

    return indices, codebooks

# Dequantization: sum codebook entries
def aqlm_dequantize(indices, codebooks):
    W_reconstructed = 0
    for idx, cb in zip(indices, codebooks):
        W_reconstructed += cb[idx]
    return W_reconstructed

AQLM achieves extreme compression:

Model        Method    Bits    Perplexity
─────────────────────────────────────────
LLaMA-7B     FP16      16      5.68
LLaMA-7B     GPTQ      4       5.85 (+3.0%)
LLaMA-7B     AQLM      2       6.12 (+7.7%)
LLaMA-7B     AQLM      1.5     6.89 (+21%)

AQLM enables <2-bit quantization with reasonable quality.

18.10.4 SpQR: Sparse Quantization with Outliers

SpQR handles outliers by keeping them sparse in high precision:

def spqr_quantize(W, bits=4, outlier_fraction=0.01):
    """
    SpQR (sketch): Quantize most weights, keep outliers in FP16.

    Outliers are stored in sparse format for efficiency; to_sparse and
    from_sparse stand in for a CSR-style sparse container.
    """
    # Identify outliers (top fraction by magnitude)
    threshold = torch.quantile(W.abs(), 1 - outlier_fraction)
    outlier_mask = W.abs() > threshold

    # Store outliers in sparse format
    outliers_sparse = to_sparse(W[outlier_mask])

    # Quantize the rest
    W_normal = W.clone()
    W_normal[outlier_mask] = 0  # Zero out outliers
    Q, scale, zp = quantize(W_normal, bits)

    return Q, scale, zp, outliers_sparse, outlier_mask

def spqr_dequantize(Q, scale, zp, outliers_sparse, outlier_mask):
    W = dequantize(Q, scale, zp)
    W[outlier_mask] = from_sparse(outliers_sparse)
    return W

SpQR insight: 1% of weights in FP16 + 99% in INT3 ≈ 3.1 bits effective with minimal quality loss.
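The effective bit count is simple bookkeeping (sketch; the per-outlier index overhead is an assumption for illustration, not a figure from the SpQR paper):

def effective_bits(dense_bits=3, outlier_fraction=0.01, outlier_bits=16, index_bits=16):
    dense = (1 - outlier_fraction) * dense_bits
    sparse = outlier_fraction * (outlier_bits + index_bits)
    return dense + sparse

print(effective_bits(index_bits=0))   # 3.13: ignoring index storage
print(effective_bits(index_bits=16))  # 3.29: with 16-bit column indices per outlier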

18.11 Quantization-Aware Training (QAT)

18.11.1 Post-Training vs Training-Aware

All methods above are post-training quantization (PTQ)—applied after training.

Quantization-aware training (QAT) simulates quantization during training:

import torch
import torch.nn as nn
import torch.nn.functional as F

class QATLinear(nn.Module):
    """Linear layer with quantization-aware training."""

    def __init__(self, in_features, out_features, bits=4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features))
        self.bits = bits

        # Learnable quantization parameters
        self.scale = nn.Parameter(torch.ones(1))
        self.zero_point = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # Fake quantization: quantize then dequantize, so the forward pass
        # sees quantized weights while gradients flow through via the
        # straight-through estimator. (A deployed model would replace this
        # with true integer kernels.)
        W_q = fake_quantize(self.weight, self.scale, self.zero_point, self.bits)
        return F.linear(x, W_q)

def fake_quantize(W, scale, zero_point, bits):
    """
    Fake quantization with straight-through estimator.

    Forward: quantize → dequantize (simulates quantization)
    Backward: gradients pass through unchanged
    """
    # Quantize
    W_int = torch.round(W / scale + zero_point)
    W_int = W_int.clamp(0, 2**bits - 1)

    # Dequantize
    W_fake = (W_int - zero_point) * scale

    # Straight-through estimator: use W_fake for forward, W.grad for backward
    return W + (W_fake - W).detach()
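A quick check that gradients really do pass through the fake-quantization op (usage sketch relying on the fake_quantize above and plain PyTorch autograd):

import torch

W = torch.randn(4, 4, requires_grad=True)
scale = torch.tensor(0.1)
zero_point = torch.tensor(0.0)

W_q = fake_quantize(W, scale, zero_point, bits=4)
loss = (W_q ** 2).sum()
loss.backward()

print(W.grad is not None)                         # True: rounding did not block the gradient
print(torch.allclose(W.grad, 2 * W_q.detach()))   # True: gradient as if rounding were identity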

18.11.2 QAT Workflow

def qat_training(model, train_loader, epochs=3):
    """
    Quantization-aware training workflow (sketch: prepare_qat and
    convert_to_quantized stand in for framework-specific conversion steps,
    and the model is assumed to return its training loss directly).
    """
    # 1. Replace layers with QAT versions (e.g., QATLinear above)
    model = prepare_qat(model, bits=4)

    # 2. Train with fake quantization in the forward pass
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

    for epoch in range(epochs):
        for batch in train_loader:
            loss = model(batch)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

    # 3. Convert to actual quantized model
    quantized_model = convert_to_quantized(model)

    return quantized_model

# PyTorch native QAT
import torch.quantization as quant

model = MyModel()

# Prepare for QAT (the model must be in train mode)
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
model_prepared = quant.prepare_qat(model.train())

# Train...
train(model_prepared)

# Convert to a quantized model (switch to eval mode first)
model_quantized = quant.convert(model_prepared.eval())

18.11.3 When to Use QAT

PTQ (Post-Training Quantization):
  ✓ No training data/compute needed
  ✓ Fast (minutes to hours)
  ✓ Good for 8-bit and 4-bit
  ✗ Quality degrades at extreme compression

QAT (Quantization-Aware Training):
  ✓ Best quality at low bit-widths
  ✓ Model adapts to quantization
  ✓ Essential for 2-3 bit quantization
  ✗ Requires training (expensive)
  ✗ Needs training data

Decision guide:

  • INT8: PTQ is usually sufficient
  • INT4: PTQ (GPTQ/AWQ) often works; try QAT if quality is insufficient
  • INT2-3: QAT is almost always needed

18.12 Choosing the Right Method

flowchart TD
    A[Start] --> B{Need < 2 bits?}
    B -->|Yes| C[AQLM or QAT]
    B -->|No| D{Have calibration data?}
    D -->|No| E[HQQ<br/>zero-shot]
    D -->|Yes| F{Target platform?}
    F -->|CPU/Mac| G[GGUF<br/>llama.cpp]
    F -->|AMD GPU| H[GGUF or AWQ]
    F -->|NVIDIA GPU| I{Quality sensitivity?}
    I -->|High| J[AWQ or GPTQ<br/>group_size=32]
    I -->|Medium| K[GPTQ<br/>group_size=128]
    I -->|Low| L[GPTQ<br/>group_size=-1]

    style C fill:#e0f2fe,stroke:#0284c7
    style E fill:#fef3c7,stroke:#d97706
    style G fill:#dcfce7,stroke:#16a34a
    style H fill:#dcfce7,stroke:#16a34a
    style J fill:#f3e8ff,stroke:#9333ea
    style K fill:#f3e8ff,stroke:#9333ea
    style L fill:#f3e8ff,stroke:#9333ea

Decision tree for choosing a quantization method

For memory-constrained scenarios:

flowchart LR
    A{Memory extremely limited?} -->|Yes| B[SpQR 3-bit<br/>or AQLM 2-bit]
    A -->|No| C[AWQ or GPTQ<br/>4-bit]

    style B fill:#fee2e2,stroke:#dc2626
    style C fill:#dcfce7,stroke:#16a34a

Memory-constrained quantization choices

18.13 Connections

Chapter 5 (Factoring): Quantization and low-rank share a theme—the model doesn’t need all its precision

Chapter 12 (LoRA): QLoRA combines quantization with LoRA for efficient fine-tuning

Chapter 2 (Bandwidth): Quantization’s benefit is often memory bandwidth, not compute

Note: Try It Yourself

The accompanying notebook walks through:

  • Implementing uniform quantization from scratch
  • Visualizing the outlier problem in LLMs
  • Comparing GPTQ, AWQ, HQQ, and naive quantization
  • Converting models to GGUF format
  • Measuring perplexity impact

Open In Colab

18.14 Further Reading

  • Dettmers et al. (2022). “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”
  • Frantar et al. (2022). “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”
  • Lin et al. (2023). “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”
  • Xiao et al. (2023). “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models”
  • Badri & Shaji (2023). “HQQ: Half-Quadratic Quantization”
  • Egiazarian et al. (2024). “AQLM: Extreme Compression of Large Language Models via Additive Quantization”
  • Dettmers et al. (2023). “SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression”
  • GGUF Format Specification