flowchart TD
A[Start] --> B{Need < 2 bits?}
B -->|Yes| C[AQLM or QAT]
B -->|No| D{Have calibration data?}
D -->|No| E[HQQ<br/>zero-shot]
D -->|Yes| F{Target platform?}
F -->|CPU/Mac| G[GGUF<br/>llama.cpp]
F -->|AMD GPU| H[GGUF or AWQ]
F -->|NVIDIA GPU| I{Quality sensitivity?}
I -->|High| J[AWQ or GPTQ<br/>group_size=32]
I -->|Medium| K[GPTQ<br/>group_size=128]
I -->|Low| L[GPTQ<br/>group_size=-1]
style C fill:#e0f2fe,stroke:#0284c7
style E fill:#fef3c7,stroke:#d97706
style G fill:#dcfce7,stroke:#16a34a
style H fill:#dcfce7,stroke:#16a34a
style J fill:#f3e8ff,stroke:#9333ea
style K fill:#f3e8ff,stroke:#9333ea
style L fill:#f3e8ff,stroke:#9333ea
18 Investigation: Quantization
Why 4 Bits Can Do the Work of 32
Take a 70-billion-parameter model. Reduce every weight from 32 bits to 4 bits.
You’ve just thrown away 87.5% of the bits.
The model still works. How?
This chapter is a case study in redundancy—the fifth property from our Algebraic Framework.
When the information content of data is less than its representation size, we have redundancy. Neural network weights contain far less information than their 32-bit encoding suggests. Quantization exploits this gap.
This chapter investigates why neural networks tolerate aggressive quantization and how to push further without breaking.
18.1 The Paradox
Neural network weights are typically stored as 32-bit or 16-bit floating-point numbers. That precision seems necessary—after all, training involves subtle gradient updates.
But at inference time, something remarkable happens:
LLaMA-2 70B
FP16: 140 GB memory, 100 tokens/sec
INT8: 70 GB memory, 150 tokens/sec
INT4: 35 GB memory, 200 tokens/sec
Perplexity change from FP16 → INT4: +0.3 (barely noticeable)
We reduced precision by 4×, memory by 4×, and the model barely noticed.
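The arithmetic behind those memory numbers is easy to check (a rough estimate that counts only the weights):
params = 70e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {params * bits / 8 / 1e9:.0f} GB")
# FP32: 280 GB, FP16: 140 GB, INT8: 70 GB, INT4: 35 GB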
This chapter investigates why.
18.2 The Mathematics of Quantization
Quantization maps continuous values to a discrete set:
\[q(x) = \text{round}\left(\frac{x - z}{s}\right)\]
where:
- \(x\) is the original value
- \(s\) is the scale factor
- \(z\) is the zero point
- The result is an integer in a fixed range (e.g., -128 to 127 for INT8)
Dequantization recovers an approximation:
\[\hat{x} = s \cdot q(x) + z\]
The quantization error is:
\[\epsilon = x - \hat{x} = x - s \cdot \text{round}\left(\frac{x - z}{s}\right) - z\]
For uniform quantization with \(n\) bits, the scale is:
\[s = \frac{x_{max} - x_{min}}{2^n - 1}\]
import numpy as np
def quantize(x, bits=8):
"""Quantize to n-bit integer."""
x_min, x_max = x.min(), x.max()
scale = (x_max - x_min) / (2**bits - 1)
zero_point = x_min
q = np.round((x - zero_point) / scale).astype(int)
q = np.clip(q, 0, 2**bits - 1)
return q, scale, zero_point
def dequantize(q, scale, zero_point):
"""Dequantize back to float."""
return q.astype(float) * scale + zero_point
# Example
x = np.random.randn(1000) * 0.1 # Typical weight distribution
q, s, z = quantize(x, bits=8)
x_hat = dequantize(q, s, z)
error = np.abs(x - x_hat)
print(f"Max error: {error.max():.6f}")
print(f"Mean error: {error.mean():.6f}")
print(f"Relative error: {(error / np.abs(x).mean()).mean() * 100:.2f}%")
# Example output (exact values vary with the random draw):
# Max error: 0.000780
# Mean error: 0.000195
# Relative error: 2.44%
18.3 Why Neural Networks Tolerate Quantization
Several factors explain the surprising robustness:
18.3.1 1. Training Noise Exceeds Quantization Noise
Neural networks are trained with stochastic gradient descent:
- Mini-batches introduce variance
- Dropout injects noise
- Data augmentation varies inputs
The model learns to be robust to perturbations larger than quantization noise.
# Typical sources of noise during training
# SGD mini-batch variance (batch size 32)
sgd_noise = np.std(gradients_batch32 - gradients_full) # ~0.01-0.1
# Dropout noise (p=0.1)
dropout_noise = 0.1 # 10% of activations zeroed
# INT8 quantization noise
int8_noise = scale / 2 # ~0.001 for typical weights
# int8_noise << sgd_noise
# The network already handles much larger perturbations
18.3.2 2. Flat Minima and Robustness
Networks converge to flat regions of the loss landscape where the loss is insensitive to small weight changes.
Loss landscape visualization:
Sharp minimum:          Flat minimum:
      ╱╲                    ────
     ╱  ╲                  ╱    ╲
    ╱    ╲                ╱      ╲
   ╱      ╲              ╱        ╲
Quantization = small perturbation.
Flat minimum → small loss change.
Modern training techniques (large batches, weight decay, learning rate schedules) encourage convergence to flat minima.
18.3.3 3. Overparameterization Creates Redundancy
A 70B parameter model has 70 billion numbers. But its “effective capacity” is much lower:
- Intrinsic dimensionality is often <1% of parameters
- Many weight configurations produce identical behavior
- The network is robust to removing or perturbing individual weights
This redundancy means quantization removes information the network doesn’t need.
18.3.4 4. The Lipschitz Property
Well-trained networks have bounded sensitivity to input perturbations:
\[\|f(x + \delta) - f(x)\| \leq L \cdot \|\delta\|\]
where \(L\) is the Lipschitz constant. Weight quantization is equivalent to a small perturbation in \(f\), and Lipschitz continuity bounds the output change.
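A small illustrative check (not a result from this chapter): quantize the weights of a random two-layer network to INT8 using the quantize/dequantize helpers from Section 18.2 and compare outputs against full precision. The network and inputs are synthetic; the point is only that the output moves by roughly as much as the weights do, not more.
import numpy as np
rng = np.random.default_rng(0)
W1 = rng.normal(0, 0.1, size=(256, 128))
W2 = rng.normal(0, 0.1, size=(64, 256))
x = rng.normal(0, 1.0, size=(32, 128))
def forward(x, W1, W2):
    h = np.maximum(x @ W1.T, 0)          # ReLU hidden layer
    return h @ W2.T
def int8_weights(W):
    q, s, z = quantize(W, bits=8)        # helpers from Section 18.2
    return dequantize(q, s, z)
y = forward(x, W1, W2)
y_q = forward(x, int8_weights(W1), int8_weights(W2))
print(f"Relative output change: {np.linalg.norm(y - y_q) / np.linalg.norm(y):.2%}")
# Typically around 1%: the outputs move by about as much as the weights did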
18.4 The Outlier Problem
Not all quantization is easy. Large language models have a specific challenge: outliers.
18.4.1 Emergent Features
As LLMs scale, they develop “emergent features”—activations that are 10-100× larger than typical:
# Activation statistics from LLaMA-65B
def analyze_activations(model, data):
activations = []
for layer in model.layers:
acts = layer.forward(data)
activations.append({
'mean': acts.abs().mean().item(),
'max': acts.abs().max().item(),
'ratio': acts.abs().max().item() / acts.abs().mean().item()
})
return activations
# Typical results:
# Layer 0: mean=0.42, max=8.3, ratio=19.8
# Layer 15: mean=0.38, max=47.2, ratio=124.2 <- Outlier!
# Layer 30: mean=0.41, max=112.8, ratio=275.1 <- Extreme outlier!
18.4.2 Why Outliers Break Quantization
Uniform quantization uses the same scale for all values:
\[s = \frac{x_{max} - x_{min}}{2^n - 1}\]
If \(x_{max} = 100\) but most values are near 0.4, then:
- The scale is ~0.8 for INT8 (assuming a range of roughly -100 to 100)
- Most values map to only 1-2 quantization levels
- Effective precision for typical values: ~1 bit
Value distribution with outliers:
  Typical values                             Outliers
         ↓                                       ↓
─────────┼───────────────────────────────────────┼───
         0                                       100
INT8 levels: ████████████████████████████████████████
Most values crammed into few levels → severe precision loss
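This is easy to reproduce with the quantize/dequantize helpers from Section 18.2 (an illustrative sketch; the magnitudes are made up to mirror the activation statistics above): add one large outlier to an otherwise well-behaved vector and compare the INT8 error.
rng = np.random.default_rng(0)
w = rng.normal(0, 0.4, size=4096)        # typical values, magnitude ~0.4
w_outlier = w.copy()
w_outlier[0] = 100.0                     # one emergent-feature-style outlier
for name, vec in [("no outlier", w), ("with outlier", w_outlier)]:
    q, s, z = quantize(vec, bits=8)
    err = np.abs(vec - dequantize(q, s, z))
    print(f"{name:>12}: scale={s:.4f}  mean |error|={err.mean():.4f}")
# The scale jumps from ~0.01 to ~0.4, so typical values lose roughly
# 5 bits of effective precision to a single outlier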
18.4.3 Solutions to the Outlier Problem
Per-channel quantization: Different scale per output channel.
def per_channel_quantize(weight, bits=8):
"""Quantize each output channel separately."""
n_channels = weight.shape[0]
scales = np.zeros(n_channels)
zero_points = np.zeros(n_channels)
q = np.zeros_like(weight, dtype=int)
for c in range(n_channels):
channel = weight[c]
q[c], scales[c], zero_points[c] = quantize(channel, bits)
return q, scales, zero_points
# Per-channel handles the case where different channels have different ranges
Mixed precision: Keep outlier channels in FP16, quantize the rest.
def mixed_precision_quantize(weight, outlier_threshold=6.0, bits=4):
"""Keep outliers in FP16, quantize rest to 4-bit."""
# Find outlier channels
channel_max = np.abs(weight).max(axis=1)
median_max = np.median(channel_max)
outliers = channel_max > outlier_threshold * median_max
# Quantize non-outliers
q_weight = np.zeros_like(weight, dtype=int)
scales = np.zeros(weight.shape[0])
for c in range(weight.shape[0]):
if not outliers[c]:
q_weight[c], scales[c], _ = quantize(weight[c], bits)
    return q_weight, scales, outliers, weight[outliers]  # Keep outliers in FP16
18.5 Modern Quantization Techniques
18.5.1 GPTQ: Optimal Brain Quantization
GPTQ (Generative Pre-trained Transformer Quantization) builds on Optimal Brain Quantization, a descendant of the classic optimal brain damage idea: quantize weights one at a time, in order of importance, and update the remaining weights to compensate for the error.
def gptq_quantize_layer(W, H, bits=4):
    """
    GPTQ (simplified sketch): quantize weights using Hessian information.
    W: weight matrix to quantize, shape (n, m)
    H: Hessian proxy, approximated as X^T X from calibration data, shape (m, m)
    """
    W = W.copy()          # work on a copy; columns are updated as we go
    n, m = W.shape
    Q = np.zeros_like(W)  # Quantized weights
    # Process columns in order of the Hessian diagonal (most important first)
    order = np.argsort(np.diag(H))[::-1]
    for pos, i in enumerate(order):
        # Quantize column i
        w = W[:, i]
        q, scale, zp = quantize(w, bits)
        Q[:, i] = dequantize(q, scale, zp)
        # Compute the quantization error for this column
        error = w - Q[:, i]
        # The key insight: distribute the error onto the not-yet-quantized columns,
        # weighted by how strongly they co-vary with column i
        # (the real algorithm uses the inverse Hessian; plain H keeps this sketch simple)
        for j in order[pos + 1:]:
            W[:, j] -= error * (H[i, j] / H[i, i])
    return Q
GPTQ achieves near-lossless 4-bit quantization by distributing quantization error across unquantized weights.
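A minimal usage sketch for the function above (the shapes, the random data, and the small damping term added to H are illustrative assumptions, not defaults from the GPTQ paper):
X_calib = np.random.randn(512, 64)              # calibration activations (n_samples, in_features)
H = X_calib.T @ X_calib + 1e-2 * np.eye(64)     # Hessian proxy X^T X, damped for stability
W = np.random.randn(128, 64) * 0.05             # layer weights (out_features, in_features)
Q = gptq_quantize_layer(W, H, bits=4)
rel_err = np.linalg.norm(X_calib @ W.T - X_calib @ Q.T) / np.linalg.norm(X_calib @ W.T)
print(f"Relative output error on calibration data: {rel_err:.2%}")
Measuring the error on the layer's outputs rather than on the weights themselves matches what GPTQ actually optimizes.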
18.5.2 AWQ: Activation-Aware Weight Quantization
AWQ observes that some weights are more important than others—specifically, weights that interact with large activations:
def awq_quantize(W, activations, bits=4):
    """
    AWQ (sketch): scale weights before quantization based on activation magnitude.
    W: weight matrix, shape (out_features, in_features)
    activations: calibration inputs, shape (n_samples, in_features)
    """
    # Per-input-channel activation magnitude
    act_scale = np.abs(activations).mean(axis=0)
    # Scale up important weight columns (those fed by large activations)
    # This gives them more quantization bins
    importance = act_scale / act_scale.mean()
    scaled_W = W * importance[None, :]
    # Quantize the scaled weights
    Q, scales, zp = quantize(scaled_W, bits)
    # Compensate in the scale factors (zero point omitted for clarity)
    # Final: Q * scales / importance ≈ W
    adjusted_scales = scales / importance
    return Q, adjusted_scales
AWQ’s insight: if a weight interacts with large activations, its quantization error gets amplified. Protect those weights.
18.5.3 SmoothQuant: Migrating Difficulty
SmoothQuant observes that activations are harder to quantize than weights (more outliers). Solution: mathematically migrate the difficulty from activations to weights:
\[Y = XW = (X \cdot \text{diag}(s)^{-1}) \cdot (\text{diag}(s) \cdot W) = \hat{X}\hat{W}\]
where \(s\) is chosen to balance the quantization difficulty:
def smooth_quant_transform(X, W, alpha=0.5):
    """
    Transform to balance quantization difficulty between X and W.
    X: activations, shape (n_tokens, in_features)
    W: weights, shape (in_features, out_features)
    alpha controls the migration:
    alpha=0: all difficulty stays in X
    alpha=1: all difficulty moves to W
    alpha=0.5: balanced (typically best)
    """
    # Per-input-channel ranges
    act_scales = np.abs(X).max(axis=0)     # activation range per input channel
    weight_scales = np.abs(W).max(axis=1)  # weight range per input channel
    # Smoothing factor
    s = (act_scales ** alpha) / (weight_scales ** (1 - alpha))
    # Apply the transformation: (X / s) @ (s[:, None] * W) == X @ W
    X_smooth = X / s
    W_smooth = W * s[:, None]
    return X_smooth, W_smooth
18.6 Hardware Acceleration
Quantization isn’t just about memory—it’s about compute speed.
18.6.1 Integer Tensor Cores
Modern GPUs have specialized hardware for integer matrix multiply:
NVIDIA A100 (whole GPU, dense):
FP32: 19.5 TFLOPS
FP16 Tensor Core: 312 TFLOPS (16× faster)
INT8 Tensor Core: 624 TOPS (32× faster than FP32)
The compute speedup compounds with memory savings:
FP16 → INT8: half the memory traffic plus 2× peak tensor-core throughput, typically 2-4× end-to-end
But there’s a catch: quantization overhead.
18.6.2 The Quantization-Dequantization Overhead
Quantized compute requires:
1. Quantize inputs (or keep them in the integer domain)
2. Compute in integer arithmetic
3. Dequantize or requantize outputs (if chaining quantized layers)
def quantized_matmul(Q_W, scale_W, X, scale_X):
    """
    Quantized matrix multiply with scale bookkeeping
    (symmetric quantization; zero points omitted for clarity).
    """
    # Quantize the input on the fly
    Q_X = np.clip(np.round(X / scale_X), -128, 127).astype(np.int32)
    # Integer matmul (fast on GPU): INT8 × INT8 → INT32 accumulation
    Q_Y = Q_X @ Q_W
    # Dequantize the output
    Y = Q_Y.astype(float) * (scale_X * scale_W)
    return Y
The overhead is amortized over large matrix multiplies, making quantization most beneficial for:
- Large models (more compute per unit of overhead)
- Memory-bound operations (memory savings dominate)
- Batched inference (overhead amortized across the batch)
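A quick numerical check of the scale bookkeeping above (symmetric per-tensor scales; the shapes are arbitrary):
X = np.random.randn(16, 64)
W = np.random.randn(64, 32) * 0.05
scale_X = np.abs(X).max() / 127
scale_W = np.abs(W).max() / 127
Q_W = np.clip(np.round(W / scale_W), -128, 127).astype(np.int32)
Y_int8 = quantized_matmul(Q_W, scale_W, X, scale_X)
Y_fp = X @ W
rel_err = np.abs(Y_int8 - Y_fp).mean() / np.abs(Y_fp).mean()
print(f"Mean relative error vs. FP matmul: {rel_err:.2%}")   # typically a percent or two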
18.6.3 Where Quantization Helps Most
Inference regime analysis:
Batch size 1 (latency-sensitive):
Memory-bound → quantization helps via memory reduction
Speedup: ~2-4× from INT4 vs FP16
Large batch (throughput):
Compute-bound → quantization helps via faster compute
Speedup: ~2-8× depending on model and hardware
Training:
Gradient precision matters more
Typically FP16 or FP8, not INT4
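A back-of-envelope for the batch-1 case (assumed numbers: a 70B model, ~2 TB/s of HBM bandwidth, weights only; real systems add KV-cache traffic and other overhead). Each generated token has to stream every weight once, so the token rate is bounded by bandwidth divided by weight bytes:
params = 70e9
bandwidth = 2e12                                  # bytes/sec, roughly A100-class HBM (assumption)
for name, bytes_per_param in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    weight_bytes = params * bytes_per_param
    print(f"{name}: {weight_bytes / 1e9:.0f} GB of weights -> at most {bandwidth / weight_bytes:.0f} tokens/sec")
# FP16 ~14 tok/s, INT8 ~29 tok/s, INT4 ~57 tok/s: the ratios track the memory reduction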
18.7 Practical Quantization
18.7.1 When to Use Each Technique
| Technique | Bits | Quality | Speed | Best For |
|---|---|---|---|---|
| FP16 | 16 | Baseline | Baseline | Training, high-quality |
| INT8 (naive) | 8 | Good | 2× | Simple deployment |
| INT8 (smooth) | 8 | Near FP16 | 2× | Production LLM serving |
| INT4 (GPTQ) | 4 | Good | 3-4× | Memory-constrained |
| INT4 (AWQ) | 4 | Better | 3-4× | Quality-sensitive apps |
| INT2-3 | 2-3 | Degraded | 4-6× | Extreme compression |
18.7.2 Quality Evaluation
Always measure perplexity (or task-specific metrics) before and after:
def evaluate_quantization(model, quantized_model, eval_data):
"""Compare original and quantized model quality."""
# Perplexity (language modeling)
ppl_original = compute_perplexity(model, eval_data)
ppl_quantized = compute_perplexity(quantized_model, eval_data)
print(f"Original perplexity: {ppl_original:.2f}")
print(f"Quantized perplexity: {ppl_quantized:.2f}")
print(f"Perplexity increase: {(ppl_quantized - ppl_original):.2f} ({(ppl_quantized/ppl_original - 1)*100:.1f}%)")
# Rule of thumb:
# < 1% perplexity increase: Excellent
# 1-5% increase: Good for most applications
# > 5% increase: May affect downstream tasks
# Typical results for LLaMA-2 70B:
# FP16: 3.12 perplexity
# INT8: 3.14 perplexity (+0.6%)
# INT4 (GPTQ): 3.18 perplexity (+1.9%)
# INT4 (AWQ): 3.15 perplexity (+1.0%)
18.7.3 Common Pitfalls
- Quantizing without calibration data: Results in poor scale estimates
- Ignoring outliers: Uniform quantization fails on LLMs
- Wrong granularity: Per-tensor is too coarse; per-channel is usually needed
- Evaluating on wrong metric: Perplexity may hide task-specific degradation
18.8 The Derivation Pattern
How would you discover modern quantization if it didn’t exist?
Observe: Models work fine at lower precision during inference
Measure: Find that 8-bit is mostly fine, but 4-bit has outlier problems
Analyze: Discover that outliers are concentrated in specific channels/layers
Design solutions:
- Per-channel scales (different ranges per channel)
- Mixed precision (keep outliers in FP16)
- Error compensation (GPTQ)
- Preemptive scaling (AWQ, SmoothQuant)
Validate: Measure perplexity and downstream tasks
The theme: understand the failure mode (outliers), then design around it.
18.9 Key Takeaways
Neural networks are robust to quantization: Training noise, flat minima, and overparameterization create tolerance
Outliers are the challenge: LLMs have emergent features 100× larger than typical values
Solutions exist: GPTQ, AWQ, SmoothQuant each address outliers differently
Hardware matters: integer tensor cores roughly double peak throughput over FP16 while halving memory traffic
Always measure: Perplexity changes hide task-specific degradation
18.10 Advanced Quantization Methods
18.10.1 GGUF and the llama.cpp Ecosystem
GGUF (GPT-Generated Unified Format) is the standard format for running quantized LLMs on consumer hardware, particularly CPUs.
GGUF Quantization Types:
Type     Bits  Quality     Size (7B model)  Use Case
────────────────────────────────────────────────────────────
Q8_0      8    Excellent   7.2 GB           Best quality
Q6_K      6    Very Good   5.5 GB           Good balance
Q5_K_M    5    Good        4.8 GB           Recommended default
Q4_K_M    4    Good        4.1 GB           Memory-limited
Q4_0      4    Decent      3.8 GB           Fast, lower quality
Q3_K_M    3    Acceptable  3.3 GB           Extreme compression
Q2_K      2    Degraded    2.7 GB           Experimental
K-Quants: The _K_ variants use non-uniform quantization with importance weighting:
# Conceptual K-quant approach
def k_quant_block(weights, block_size=32):
"""
K-quants use mixed precision within blocks.
More important weights get more bits.
"""
importance = compute_importance(weights)
# High-importance weights: more bits
# Low-importance weights: fewer bits
bits_allocation = allocate_bits(importance, target_avg_bits=4)
quantized = []
for w, bits in zip(weights, bits_allocation):
quantized.append(quantize_to_bits(w, bits))
    return quantized
Using GGUF models:
# With llama-cpp-python
from llama_cpp import Llama
# Load quantized model
llm = Llama(
model_path="./llama-7b.Q4_K_M.gguf",
n_ctx=4096,
n_gpu_layers=35 # Offload layers to GPU
)
# Generate
output = llm("Explain quantum computing:", max_tokens=256)
When to use GGUF:
- CPU inference (optimized for AVX2/AVX-512)
- Consumer hardware (gaming GPUs, Mac)
- Mixed CPU/GPU execution
- Easy deployment (single file, no dependencies)
18.10.2 HQQ: Half-Quadratic Quantization
HQQ achieves high-quality quantization without calibration data:
# HQQ key insight: optimize quantization parameters directly
def hqq_quantize(W, bits=4, axis=1, num_iterations=20):
    """
    HQQ (sketch): zero-shot quantization via half-quadratic optimization.
    No calibration data needed!
    """
    # Initialize scale and zero point from the per-group range
    w_min = W.min(dim=axis, keepdim=True).values
    w_max = W.max(dim=axis, keepdim=True).values
    scale = (w_max - w_min) / (2**bits - 1)
    zero = w_min
    for iteration in range(num_iterations):
        # Quantize with the current parameters
        Q = torch.round((W - zero) / scale).clamp(0, 2**bits - 1)
        W_hat = Q * scale + zero
        # Optimize scale and zero to minimize ||W - W_hat||^2
        # (this is the half-quadratic splitting step; the solver is omitted in this sketch)
        scale, zero = optimize_params(W, Q, scale, zero)
    return Q, scale, zero
# Usage with HQQ library
from hqq.core.quantize import HQQLinear
# Replace linear layers with HQQ quantized versions
quantized_layer = HQQLinear(
    linear_layer,
    quant_config={'weight_quant_params': {'nbits': 4, 'group_size': 64}}
)
HQQ advantages:
1. No calibration data required
2. Fast quantization (minutes, not hours)
3. Competitive quality with GPTQ/AWQ
4. Good for dynamic quantization scenarios
18.10.3 AQLM: Additive Quantization for LLMs
AQLM uses vector quantization with learned codebooks:
# AQLM: Instead of scalar quantization, use vector quantization
def aqlm_concept(W, num_codebooks=2, codebook_size=256):
"""
AQLM: Represent weight vectors as sums of codebook entries.
W[i] ≈ C1[idx1[i]] + C2[idx2[i]] + ...
"""
# Learn codebooks on calibration data
codebooks = []
for c in range(num_codebooks):
codebook = learn_codebook(W, size=codebook_size)
codebooks.append(codebook)
# Subtract codebook contribution for next round
W = W - lookup(codebook, W)
# Quantized representation: just indices
indices = [[find_nearest(W_row, cb) for cb in codebooks] for W_row in W]
return indices, codebooks
# Dequantization: sum codebook entries
def aqlm_dequantize(indices, codebooks):
W_reconstructed = 0
for idx, cb in zip(indices, codebooks):
W_reconstructed += cb[idx]
    return W_reconstructed
AQLM achieves extreme compression:
Model     Method  Bits  Perplexity
─────────────────────────────────────────
LLaMA-7B  FP16    16    5.68
LLaMA-7B  GPTQ     4    5.85 (+3.0%)
LLaMA-7B  AQLM     2    6.12 (+7.7%)
LLaMA-7B  AQLM     1.5  6.89 (+21%)
AQLM enables <2-bit quantization with reasonable quality.
18.10.4 SpQR: Sparse Quantization with Outliers
SpQR handles outliers by keeping them sparse in high precision:
def spqr_quantize(W, bits=4, outlier_fraction=0.01):
"""
SpQR: Quantize most weights, keep outliers in FP16.
Outliers stored in sparse format for efficiency.
"""
# Identify outliers (top fraction by magnitude)
threshold = torch.quantile(W.abs(), 1 - outlier_fraction)
outlier_mask = W.abs() > threshold
# Store outliers in sparse format
outliers_sparse = to_sparse(W[outlier_mask])
# Quantize the rest
W_normal = W.clone()
W_normal[outlier_mask] = 0 # Zero out outliers
Q, scale, zp = quantize(W_normal, bits)
return Q, scale, zp, outliers_sparse, outlier_mask
def spqr_dequantize(Q, scale, zp, outliers_sparse, outlier_mask):
W = dequantize(Q, scale, zp)
W[outlier_mask] = from_sparse(outliers_sparse)
    return W
SpQR insight: 1% of weights in FP16 + 99% in INT3 ≈ 3.1 bits effective with minimal quality loss.
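The effective bit-rate is simple arithmetic (ignoring the extra bits needed to store the sparse outlier indices, which push the true figure slightly higher):
outlier_fraction = 0.01
effective_bits = (1 - outlier_fraction) * 3 + outlier_fraction * 16
print(f"{effective_bits:.2f} bits per weight")    # 3.13, before sparse-index overhead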
18.11 Quantization-Aware Training (QAT)
18.11.1 Post-Training vs Training-Aware
All methods above are post-training quantization (PTQ)—applied after training.
Quantization-aware training (QAT) simulates quantization during training:
class QATLinear(nn.Module):
"""Linear layer with quantization-aware training."""
def __init__(self, in_features, out_features, bits=4):
super().__init__()
self.weight = nn.Parameter(torch.randn(out_features, in_features))
self.bits = bits
# Learnable quantization parameters
self.scale = nn.Parameter(torch.ones(1))
self.zero_point = nn.Parameter(torch.zeros(1))
def forward(self, x):
if self.training:
# Fake quantization: quantize then dequantize
# Gradients flow through using straight-through estimator
W_q = fake_quantize(self.weight, self.scale, self.zero_point, self.bits)
else:
W_q = real_quantize(self.weight, self.scale, self.zero_point, self.bits)
return F.linear(x, W_q)
def fake_quantize(W, scale, zero_point, bits):
"""
Fake quantization with straight-through estimator.
Forward: quantize → dequantize (simulates quantization)
Backward: gradients pass through unchanged
"""
# Quantize
W_int = torch.round(W / scale + zero_point)
W_int = W_int.clamp(0, 2**bits - 1)
# Dequantize
W_fake = (W_int - zero_point) * scale
    # Straight-through estimator: W_fake is used in the forward pass,
    # while gradients flow to W as if the quantization were the identity
    return W + (W_fake - W).detach()
18.11.2 QAT Workflow
def qat_training(model, train_loader, epochs=3):
"""
Quantization-aware training workflow.
"""
# 1. Replace layers with QAT versions
model = prepare_qat(model, bits=4)
# 2. Train with fake quantization
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
for epoch in range(epochs):
for batch in train_loader:
loss = model(batch)
loss.backward()
optimizer.step()
optimizer.zero_grad()
# 3. Convert to actual quantized model
quantized_model = convert_to_quantized(model)
return quantized_model
# PyTorch native QAT
import torch.quantization as quant
model = MyModel()
# Prepare for QAT
model.qconfig = quant.get_default_qat_qconfig('fbgemm')
model_prepared = quant.prepare_qat(model)
# Train...
train(model_prepared)
# Convert to quantized
model_quantized = quant.convert(model_prepared)
18.11.3 When to Use QAT
PTQ (Post-Training Quantization):
✓ No training data/compute needed
✓ Fast (minutes to hours)
✓ Good for 8-bit and 4-bit
✗ Quality degrades at extreme compression
QAT (Quantization-Aware Training):
✓ Best quality at low bit-widths
✓ Model adapts to quantization
✓ Essential for 2-3 bit quantization
✗ Requires training (expensive)
✗ Needs training data
Decision:
- INT8: PTQ is usually sufficient
- INT4: PTQ (GPTQ/AWQ) often works, try QAT if quality insufficient
- INT2-3: QAT almost always needed
18.12 Choosing the Right Method
For memory-constrained scenarios:
flowchart LR
A{Memory extremely limited?} -->|Yes| B[SpQR 3-bit<br/>or AQLM 2-bit]
A -->|No| C[AWQ or GPTQ<br/>4-bit]
style B fill:#fee2e2,stroke:#dc2626
style C fill:#dcfce7,stroke:#16a34a
18.13 Connections
Chapter 5 (Factoring): Quantization and low-rank share a theme—the model doesn’t need all its precision
Chapter 12 (LoRA): QLoRA combines quantization with LoRA for efficient fine-tuning
Chapter 2 (Bandwidth): Quantization’s benefit is often memory bandwidth, not compute
18.14 Further Reading
- Dettmers et al. (2022). “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale”
- Frantar et al. (2022). “GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers”
- Lin et al. (2023). “AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration”
- Xiao et al. (2023). “SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models”
- Badri & Shaji (2023). “HQQ: Half-Quadratic Quantization”
- Egiazarian et al. (2024). “AQLM: Extreme Compression of Large Language Models via Additive Quantization”
- Dettmers et al. (2023). “SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression”
- GGUF Format Specification