Quantization & Inference


A 70B model in bfloat16 weighs 140GB, too big for any single GPU. Quantization compresses each weight to 4 bits, shrinking the model to 35GB with roughly 2% quality loss. Understanding quantization methods and formats (GPTQ, GGUF, AWQ), serving infrastructure (vLLM, TGI), and batching strategies is what separates a prototype from a production system.

Course: Advanced.

This lesson covers 5 concepts: The Memory Problem, How Quantization Works, GPTQ vs GGUF vs AWQ, Speed vs Quality, Serving at Scale.

The Memory Problem

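A 70B model in bfloat16 requires 140GB of GPU VRAM for its weights alone (70 billion parameters × 2 bytes), which means at least 2 × A100 80GB GPUs. INT8 halves that to 70GB (1 × A100 80GB, with little room left over). INT4 halves it again to 35GB. KV cache quantization does not shrink the weights further; it reduces the separate per-request cache, which grows with context length and batch size. Each halving enables running on cheaper hardware.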
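
The figures above follow from parameters × bytes per parameter. A minimal sketch of that arithmetic, assuming weights only (no KV cache, activations, or quantization metadata such as scales) and decimal gigabytes; the function name `weight_memory_gb` is illustrative, not part of any library:

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight-only memory in decimal GB."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billion * 1e9 params) * bytes_per_weight / (1e9 bytes per GB)
    return params_billion * bytes_per_weight


for name, bits in [("bfloat16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"70B in {name}: {weight_memory_gb(70, bits):.0f} GB")
```

Real deployments need headroom on top of these numbers, so treat them as a lower bound when sizing hardware.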

Quantization is the key that unlocks running large models on accessible hardware. Without it, LLaMA 3 70B requires an $80K+ GPU cluster. With 4-bit quantization, its 35GB of weights fit on a single 48GB GPU or across two 24GB consumer GPUs.

Model size is the gating constraint for running large models. Quantization removes the constraint without removing the capability.

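AWS p4d.24xlarge (8×A100 40GB, 320GB total, about $32/hr): runs 70B in bfloat16. A single Lambda Cloud A10 (24GB, $0.60/hr) is too small for the 35GB of 4-bit weights, which need a 48GB card or two 24GB cards. Even with two A10-class GPUs at about $1.20/hr combined, the cost difference is roughly 27×. Quantization makes the difference.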

How Quantization Works

Quantization maps near-continuous weight values (float32: 4 bytes, roughly 7 significant decimal digits) to discrete integers (int4: 0.5 bytes, 16 values). The heatmap shows original float weights; after quantization, each cell snaps to one of 16 discrete levels, introducing small rounding errors.
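
The snapping step can be sketched in a few lines. This is a simplified min/max (affine) uniform quantizer over one small list of weights, assuming a single scale for the whole list; production methods quantize per group or per channel and store the scales alongside the codes. The function names are illustrative:

```python
def quantize_uniform(weights, bits=4):
    """Map floats to integer codes in [0, 2**bits - 1] using equal steps."""
    max_code = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / max_code if hi > lo else 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_uniform(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


weights = [-0.42, -0.07, 0.0, 0.13, 0.31, 0.58]
codes, scale, lo = quantize_uniform(weights)
restored = dequantize_uniform(codes, scale, lo)

# Rounding to the nearest level bounds each error by half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, f"step={scale:.4f}", f"max_error={max_error:.4f}")
```

Only the integer codes plus `scale` and `lo` need to be stored, which is where the memory saving comes from.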

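The quality of quantization depends on how cleverly the float → int mapping is chosen. Uniform quantization (equal steps) is simple. Non-uniform quantization (NF4) places more levels where weights are densest, near zero. AWQ takes a different route: it keeps a uniform grid but rescales the weight channels that matter most for the activations before rounding, so the important weights lose less precision.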
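
To see what "more steps where weights are densest" means, the sketch below builds 16 levels at evenly spaced quantiles of a standard normal distribution and compares their spacing with a uniform grid. This is a simplified illustration of the idea behind NF4, not the exact level table that bitsandbytes uses (which, among other differences, includes an exact zero); all names here are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_levels(n_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    dist = NormalDist()
    raw = [dist.inv_cdf((i + 0.5) / n_levels) for i in range(n_levels)]
    biggest = max(abs(v) for v in raw)
    return [v / biggest for v in raw]


def uniform_levels(n_levels=16):
    """Equally spaced levels covering [-1, 1]."""
    step = 2 / (n_levels - 1)
    return [-1 + i * step for i in range(n_levels)]


def nearest_level(x, levels):
    """Quantize x to the closest available level."""
    return min(levels, key=lambda v: abs(v - x))


nonuniform = normal_quantile_levels()
uniform = uniform_levels()

inner_gap = nonuniform[8] - nonuniform[7]    # spacing near zero
outer_gap = nonuniform[15] - nonuniform[14]  # spacing in the tail
uniform_gap = uniform[1] - uniform[0]
print(f"inner={inner_gap:.3f} outer={outer_gap:.3f} uniform={uniform_gap:.3f}")
```

The quantile grid spends its levels near zero, where most normally distributed weights sit, and accepts coarser steps in the tails.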

Like converting a 24-megapixel photo to a 1-megapixel thumbnail. You lose some detail, but the image is still recognisable and useful. Quantization does the same to model weights.

A weight of 0.823 in float32 might quantize to 0.8125 on a 4-bit grid with a step of 0.0625 (the nearest available value). Error: 0.0105. Small, but the errors add up across 70 billion weights. Calibration-based methods minimise this accumulation.
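
The arithmetic in this example can be checked directly. The value 0.8125 equals 13/16, so the sketch assumes a uniform grid with step 1/16; that grid is an assumption made for illustration and is not the actual NF4 level table.

```python
def snap_to_grid(value: float, step: float) -> float:
    """Round a value to the nearest multiple of step."""
    return round(value / step) * step


step = 1 / 16  # assumed grid spacing for this illustration
original = 0.823
quantized = snap_to_grid(original, step)
error = abs(original - quantized)

print(f"{original} -> {quantized}, error {error:.4f}, worst case {step / 2:.5f}")
```

The worst-case error for any single weight on this grid is half a step, 0.03125, so 0.0105 is a fairly typical value rather than an unusually good one.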

GPTQ vs GGUF vs AWQ

Three quantization options serve different deployment scenarios: GPTQ for GPU inference, GGUF (the llama.cpp file format) for CPU/hybrid inference on consumer hardware, AWQ for best quality at 4-bit on GPU. Choosing the right one for your deployment environment is as important as the bit width.

Format choice determines what hardware you can deploy on. GGUF enables running a 13B model on a MacBook Pro, which GPTQ cannot do because its inference kernels target GPUs (primarily CUDA).

All three achieve similar final model quality — the choice is about what hardware you are deploying to and what tooling you want to use.

Production API: use AWQ on A100s for best quality-throughput. Local development: GGUF on your MacBook for zero-cost iteration. Fine-tuning: bitsandbytes NF4 during training, convert to AWQ/GPTQ post-merge for deployment.
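
The guidance in this section amounts to a lookup from deployment target to format. A small sketch of that mapping follows; the helper `choose_format` and its target names are hypothetical and encode only the recommendations stated above.

```python
def choose_format(target: str) -> str:
    """Return the format this lesson recommends for a deployment target."""
    recommendations = {
        # Best quality-throughput on datacenter GPUs.
        "production_gpu": "AWQ",
        # General GPU inference.
        "gpu_inference": "GPTQ",
        # CPU or hybrid inference on consumer hardware.
        "local_cpu_or_mac": "GGUF",
        # Used during training; convert to AWQ/GPTQ after merging.
        "fine_tuning": "bitsandbytes NF4",
    }
    if target not in recommendations:
        options = ", ".join(sorted(recommendations))
        raise ValueError(f"unknown target {target!r}; expected one of: {options}")
    return recommendations[target]


print(choose_format("local_cpu_or_mac"))
```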

Speed vs Quality

Indicative quality by quantization level, relative to the baseline (exact figures vary by model and benchmark): bfloat16 (baseline 100%), INT8 (99%), INT4 AWQ (98%), INT4 GPTQ (97%), INT2 (88%). The practical sweet spot is INT4 AWQ: 4× memory reduction for about 2% quality loss.

INT8 offers 2× memory reduction at near-zero quality cost. INT4 offers 4× reduction at a 2-3% cost. The jump from INT4 to INT2 costs about 10 more points of quality for only 2× more compression, which is rarely worth it.
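
The trade-off is easier to compare when each halving is set against the quality it costs. The sketch below uses the indicative percentages quoted in this section as input data; the numbers are the lesson's, not measurements, and the function name is illustrative.

```python
# (name, bits per weight, quality as percent of the bfloat16 baseline)
LEVELS = [
    ("bfloat16", 16, 100),
    ("INT8", 8, 99),
    ("INT4 AWQ", 4, 98),
    ("INT2", 2, 88),
]


def tradeoff_steps(levels):
    """For each step down, return (from, to, compression gain, points lost)."""
    steps = []
    for (name_a, bits_a, q_a), (name_b, bits_b, q_b) in zip(levels, levels[1:]):
        steps.append((name_a, name_b, bits_a / bits_b, q_a - q_b))
    return steps


for src, dst, gain, lost in tradeoff_steps(LEVELS):
    print(f"{src} -> {dst}: {gain:.0f}x smaller, {lost} quality points lost")
```

Every step buys the same 2× compression, but the last one costs ten times as many quality points as the first two.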

Think of it like audio quality: 320kbps (bfloat16) vs 128kbps (INT8, nearly identical) vs 64kbps (INT4, noticeable but fine) vs 16kbps (INT2, audibly bad). INT4 is the sweet spot.

Decision: 70B bfloat16 (100% quality, 140GB) vs 70B INT4 AWQ (98% quality, 35GB). Most applications cannot distinguish the 2% quality gap, and the 4× memory saving allows the model to run on substantially cheaper hardware.

Serving at Scale

Production inference serving relies on four pieces: vLLM for GPU inference (2–24× throughput over naive serving), continuous batching to admit waiting requests into the running batch as soon as earlier ones finish so the GPU does not sit idle, speculative decoding to reduce latency, and distillation to serve cheaper models that retain much of a larger model's quality.
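
The benefit of continuous batching can be seen in a toy simulation. The sketch below counts decode steps for a queue of requests with different output lengths, assuming each step produces one token per active request and that batch size is the only limit. It is a simplified model for illustration, not how vLLM's scheduler is implemented, and the function names are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Fixed batches: every batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(lengths), batch_size):
        steps += max(lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next queued request."""
    queue = deque(lengths)
    active = []  # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        # Fill any free slots before the next decode step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # One token generated per active request; drop the finished ones.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


output_lengths = [10, 2, 2, 2, 10, 2, 2, 2]
static_steps = static_batching_steps(output_lengths, batch_size=4)
continuous_steps = continuous_batching_steps(output_lengths, batch_size=4)
print(static_steps, continuous_steps)
```

With static batching the short requests wait for the long one in their batch, leaving slots empty; continuous batching reuses those slots, so the same work finishes in fewer steps.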

At 10,000 requests/day, compute costs are on the order of $200 with naive serving versus $20 with vLLM, a 10× difference. The infrastructure investment pays for itself in days.

A single large model can serve hundreds of concurrent users efficiently if the serving infrastructure is right. Without it, each user gets the full GPU to themselves — impossibly expensive at scale.

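vLLM + AWQ 4-bit, illustrative numbers at A10G pricing ($0.75/hr): 12 tokens/sec per request with 6 concurrent requests gives 72 tokens/sec total, versus 12 tokens/sec for naive serving with 1 concurrent request, a 6× throughput improvement. Note that a single 24GB A10G cannot hold the 35GB of 4-bit weights of a 70B model; that model size needs a 48GB-class GPU or two 24GB cards.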
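
The throughput figures translate directly into cost per token. A minimal sketch of that arithmetic, using the lesson's illustrative numbers and assuming per-request speed stays constant as concurrency grows and the GPU is billed for every hour it runs; the function names are hypothetical:

```python
def aggregate_tokens_per_sec(per_request_tps: float, concurrent: int) -> float:
    """Total throughput when every concurrent request runs at the same speed."""
    return per_request_tps * concurrent


def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_sec: float) -> float:
    """Dollar cost to generate one million tokens at a given throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000


batched_tps = aggregate_tokens_per_sec(12, 6)
naive_tps = aggregate_tokens_per_sec(12, 1)

batched_cost = cost_per_million_tokens(0.75, batched_tps)
naive_cost = cost_per_million_tokens(0.75, naive_tps)

print(f"batched: {batched_tps:.0f} tok/s, ${batched_cost:.2f} per 1M tokens")
print(f"naive:   {naive_tps:.0f} tok/s, ${naive_cost:.2f} per 1M tokens")
```

Because the hourly price is fixed, the cost per token falls by the same factor that throughput rises.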