Quantization, the reduction of the bit width of model weights, is the second great democratization of the LLM world after LoRA. What was met with skepticism in 2022 is standard in 2026: training and running models in 4 bits, often with no noticeable quality loss. This article explains how it works, which methods fit which scenario, and where the limits are.
1. The memory problem of modern LLMs
A modern LLM has 8 to 70 billion parameters. At full precision (FP32, 32 bits) the weights of a 70B model alone need 280 GB, more than common single GPUs offer. At half precision (FP16, 16 bits) it is 140 GB, still too much for a single 80-GB GPU. Training adds several times the model size on top for gradients and optimizer state.
Quantization solves this. Instead of storing each value in 16 bits, the model stores it in 8, 4, or even 2 bits. At 4-bit, the weights of a 70B model shrink to about 35 GB, which fits on a single A100-80GB or L40S-48GB.
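The figures above follow from one line of arithmetic: parameters times bits per parameter, divided by 8 to get bytes. A minimal sketch, assuming 1 GB = 10^9 bytes and counting weights only (no scaling factors, activations, or KV cache); the function name is illustrative:

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Weight memory of a 70B model at the bit widths discussed in this section.
for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit: {weight_memory_gb(70e9, bits):6.1f} GB")
```

Real deployments need additional memory on top of these numbers, so the results are lower bounds rather than sizing recommendations.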
2. Quantization — the basics
The core idea: instead of storing each parameter as an independent floating-point number, define a codebook of the most common values and reference each parameter by a short index. At 4 bits there are 16 possible values per block; at 8 bits, 256.
For this to work without quality loss, the codebook is chosen to approximate the true weight distribution well. Per-block scaling factors (say every 64 or 128 parameters) further reduce quantization error.
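To make the codebook and per-block scaling concrete, here is a minimal pure-Python sketch. It assumes the simplest possible choices: an evenly spaced 16-entry codebook in [-1, 1] and absmax scaling per block of 64 weights. Real methods use better codebooks and packed storage; all function names are illustrative.

```python
import random


def make_uniform_codebook(bits: int) -> list[float]:
    """Evenly spaced levels in [-1, 1]; 2**bits entries (16 at 4-bit)."""
    n = 2 ** bits
    return [-1 + 2 * i / (n - 1) for i in range(n)]


def quantize_block(block, codebook):
    """Return (scale, indices): one scaling factor plus one short index per weight."""
    scale = max(abs(w) for w in block) or 1.0  # absmax; avoid dividing by zero
    indices = [
        min(range(len(codebook)), key=lambda i: abs(codebook[i] - w / scale))
        for w in block
    ]
    return scale, indices


def quantize(weights, codebook, block_size=64):
    """Split the weights into blocks and quantize each block independently."""
    blocks = [weights[i:i + block_size] for i in range(0, len(weights), block_size)]
    return [quantize_block(b, codebook) for b in blocks]


def dequantize(packed, codebook):
    """Rebuild approximate weights from scales and codebook indices."""
    restored = []
    for scale, indices in packed:
        restored.extend(scale * codebook[i] for i in indices)
    return restored


random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # toy stand-in for a weight tensor
codebook = make_uniform_codebook(4)
packed = quantize(weights, codebook, block_size=64)
restored = dequantize(packed, codebook)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"{len(codebook)} levels, {len(packed)} blocks, max abs error {max_err:.5f}")
```

Because each block is scaled by its own largest weight, a single outlier only degrades the resolution of its own block, which is why smaller blocks reduce error at the cost of storing more scaling factors.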
Two modes matter:
- Post-training quantization (PTQ). A finished model is quantized. Fast, often with minimal quality loss.
- Quantization-aware training (QAT). The model is trained with quantization in mind. More effort, but needed for very low bit widths (2-bit).
3. QLoRA: 4-bit training that works
QLoRA (Quantized LoRA), published in 2023, was the breakthrough. The idea:
- Base model in 4-bit (NF4). Frozen, no gradient required. Saves 75% memory versus FP16.
- LoRA adapters in FP16 or BF16. Trained at full precision, but small.
- Double quantization. Scaling factors are themselves quantized — minimal quality cost, further savings.
- Paged optimizer. Optimizer state pages between GPU and CPU on demand, smoothing memory spikes.
The result: a 70B model can be fine-tuned on a single 48-GB GPU (e.g. L40S or RTX 6000 Ada), and on an 80-GB A100 or H100 with comfortable headroom. More on LoRA itself in LoRA explained.
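One common way to assemble these ingredients is with the Hugging Face libraries transformers, peft, and bitsandbytes. The following is a sketch under assumptions, not a reference setup: the adapter hyperparameters (rank 16, alpha 32, dropout 0.05) and the target module names (typical for Llama-style architectures) are illustrative choices, and the heavy imports sit inside the function so the settings can be read without a GPU stack installed.

```python
# Keyword arguments for transformers.BitsAndBytesConfig, one per QLoRA ingredient.
BNB_KWARGS = {
    "load_in_4bit": True,               # frozen base model in 4-bit
    "bnb_4bit_quant_type": "nf4",       # NF4 data type
    "bnb_4bit_use_double_quant": True,  # quantize the scaling factors too
}

# Assumed adapter hyperparameters; tune per model and task.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names
    "task_type": "CAUSAL_LM",
}

# Value for transformers.TrainingArguments(optim=...) to enable a paged optimizer.
PAGED_OPTIMIZER = "paged_adamw_8bit"


def build_qlora_model(model_id: str):
    """Load a 4-bit base model and attach trainable LoRA adapters."""
    # Imported lazily: requires torch, transformers, peft and bitsandbytes.
    import torch
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    bnb_config = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.bfloat16, **BNB_KWARGS)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    model = prepare_model_for_kbit_training(model)  # prepares the frozen model for adapter training
    return get_peft_model(model, LoraConfig(**LORA_KWARGS))
```

Calling `build_qlora_model` with a model identifier of your choice returns a model in which only the adapter weights are trainable; passing `optim=PAGED_OPTIMIZER` to `TrainingArguments` adds the paged optimizer.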
4. NF4, GPTQ, AWQ, GGUF — methods compared
NF4 (Normal-Float 4). Optimized 4-bit format leveraging the typical normal distribution of LLM weights. Standard for QLoRA training. Very good quality, moderate speed.
GPTQ. Inference-oriented 4-bit quantization. Uses approximate Hessian (second-order) information to minimize each layer's quantization error. Computed offline (calibration set needed), then runs very fast with vLLM and TGI.
AWQ (Activation-aware Weight Quantization). Similar to GPTQ but accounts for activations — which weights matter most for the output. Slightly better quality than GPTQ at similar speed.
GGUF. llama.cpp’s format, optimized for CPU and Mac (Apple Silicon) inference. Multiple bit variants (Q2, Q3, Q4, Q5, Q6, Q8). Standard for local inference and edge deployments.
FP8. Native 8-bit floating-point support on H100 and newer GPUs. Very close to FP16 quality, up to about twice as fast, half the memory. Standard for modern training setups on large models.
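The idea behind NF4, placing codebook levels where normally distributed weights are dense, can be illustrated with quantiles of the standard normal distribution. This is a simplified sketch and does not reproduce the actual NF4 table, which is constructed asymmetrically so that it contains an exact zero; the function name is illustrative.

```python
from statistics import NormalDist


def normal_quantile_codebook(bits: int = 4) -> list[float]:
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    n = 2 ** bits
    std_normal = NormalDist()
    # Midpoint probabilities (i + 0.5) / n avoid the infinite quantiles at 0 and 1.
    quantiles = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    largest = max(abs(q) for q in quantiles)
    return [q / largest for q in quantiles]


nf_like = normal_quantile_codebook(4)
gaps = [b - a for a, b in zip(nf_like, nf_like[1:])]
print("levels:", [round(v, 3) for v in nf_like])
print(f"gap near zero: {gaps[7]:.3f}, gap at the edge: {gaps[0]:.3f}")
```

The levels crowd around zero, where most weights lie, and spread out toward the tails. An evenly spaced 4-bit grid spends the same resolution everywhere, which is the inefficiency a normal-aware format avoids.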
5. Trade-offs: memory, quality, speed
A practical rule of thumb:
- FP16/BF16: Full quality, full memory. Standard for pretraining.
- FP8: ~99% quality, half memory, double speed. Standard for modern training.
- INT8: ~99% quality, half memory, same speed as FP16. Popular for inference.
- NF4 / 4-bit: ~97–98% quality, quarter memory, slightly slower inference than INT8.
- 3-bit: ~94–96% quality, less memory, noticeably slower.
- 2-bit: ~88–92% quality, minimal memory, often usable only after QAT.
For most production systems, 4-bit is the optimum: massive memory savings, near-unchanged quality, still fast enough.
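The rule of thumb can be turned into a small lookup that picks the highest-quality format whose weights fit a given GPU. This is a sketch under stated assumptions: the quality figures are the lower ends of the ranges listed above, and `weight_share=0.8` (the fraction of GPU memory available for weights, with the rest left for activations and KV cache) is an illustrative default rather than a measured value.

```python
from typing import Optional

# (name, bits per weight, rough quality retained), ordered from best quality down.
FORMATS = [
    ("FP16/BF16", 16, 1.00),
    ("FP8", 8, 0.99),
    ("INT8", 8, 0.99),
    ("4-bit (NF4)", 4, 0.97),
    ("3-bit", 3, 0.94),
    ("2-bit", 2, 0.88),
]


def weights_gb(n_params: float, bits: float) -> float:
    """Weight memory in GB (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9


def best_format(n_params: float, gpu_gb: float, weight_share: float = 0.8) -> Optional[str]:
    """Return the first (highest-quality) format whose weights fit the budget."""
    budget = gpu_gb * weight_share
    for name, bits, _quality in FORMATS:
        if weights_gb(n_params, bits) <= budget:
            return name
    return None  # does not fit on this GPU at any listed bit width


print(best_format(70e9, 48))   # 70B on a 48-GB GPU
print(best_format(8e9, 24))    # 8B on a 24-GB GPU
print(best_format(405e9, 24))  # 405B on a 24-GB GPU
```

Under these assumptions a 70B model on a 48-GB GPU lands on 4-bit, which matches the recommendation above; the helper ignores speed, which still has to be weighed separately.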
6. Practice: which method when?
- Training (fine-tuning): QLoRA with NF4 is the de facto standard. Alternative on H100/MI300: FP8 with full-parameter training.
- Server inference: GPTQ or AWQ at 4-bit, served with vLLM or TGI. Fast inference, good quality.
- Local / Mac inference: GGUF with llama.cpp or Ollama. Q4_K_M is a good quality/speed compromise.
- Edge / mobile: GGUF at Q2 or Q3 for very small models. See Small language models and edge AI.
In consulting projects we almost always recommend a split: training with QLoRA/NF4, inference with GPTQ or AWQ. Each side benefits from its specialized method.
7. Limits and risks
Quantization isn’t a magic bullet:
- Very small models suffer more. A 1B model at 4-bit can lose quality noticeably. Below 3B, 8-bit is often safer.
- Reasoning tasks are more sensitive. In multi-step reasoning, small quantization errors can accumulate. More in Reasoning models.
- Edge cases get harder to measure. Quantized models can fail unexpectedly on rare inputs. Eval pipelines must account for this — see Guardrails, evals and prompt injection.
- Calibration data shapes GPTQ/AWQ. Choice of calibration data affects quantization quality. Representative domain data improves inference quality noticeably.
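One way an eval pipeline can account for these risks is to compare the quantized model with its baseline per slice (reasoning tasks, rare inputs, and so on) instead of looking only at an aggregate score. A minimal sketch follows; the slice names, the accuracy figures, and the 2-point threshold are made up for illustration and are not measurements.

```python
def find_regressions(baseline: dict, quantized: dict, max_drop: float = 0.02) -> dict:
    """Return {slice: accuracy drop} for slices that lose more than max_drop."""
    regressions = {}
    for name, base_acc in baseline.items():
        drop = base_acc - quantized[name]
        if drop > max_drop:
            regressions[name] = round(drop, 4)
    return regressions


# Hypothetical per-slice accuracies, invented purely to exercise the function.
baseline_scores = {"general_qa": 0.81, "multi_step_reasoning": 0.64, "rare_inputs": 0.58}
quantized_scores = {"general_qa": 0.80, "multi_step_reasoning": 0.59, "rare_inputs": 0.51}

flagged = find_regressions(baseline_scores, quantized_scores)
print(flagged)
```

In this invented example the aggregate drop looks small while two slices fall clearly below the threshold, which is the pattern an aggregate benchmark score would hide.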
In 2026, quantization and QLoRA are no longer experimental but indispensable tools for production LLM setups. Skipping them means either paying too much for hardware or leaving adaptation options unused. With some discipline in tooling and evals, a single GPU and a curated dataset yield a production-ready, domain-specific system: sovereign, controllable, with predictable costs.
Frequently asked questions.
/ 01 What does 4-bit quantization mean concretely?
Instead of storing each model parameter in 16 bits (FP16), it's reduced to 4 bits. That cuts memory by 4×. It's made possible by a codebook representing the most common values efficiently, combined with per-block scaling factors that minimize quantization error.
/ 02 Does quantization degrade model quality?
Slightly. With modern 4-bit methods like NF4, GPTQ or AWQ, quality loss on standard benchmarks usually stays below 2%. At 8-bit it's near zero. At 2-bit the loss becomes noticeable — the limit where things still work in practice.
/ 03 What's the difference between QLoRA and LoRA?
QLoRA combines LoRA with quantization. The base model is held in 4-bit and stays frozen; adapter matrices are trained in FP16 or BF16. That lets you fine-tune a 70B model on a single 48-GB GPU — impossible with plain LoRA in FP16.
/ 04 Which quantization method is best?
Depends on the use case. NF4 is the standard for training (QLoRA). GPTQ and AWQ are optimized for inference — faster, slightly less quality loss. GGUF is the format of the llama.cpp world and standard for local CPU/Mac inference. For server inference with vLLM, GPTQ and AWQ are common.
/ 05 Can I run quantized models on-premise?
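Yes, and that is one of the main advantages. A quantized 70B model runs on a single 80-GB GPU at reasonable speed. A workstation with 4× 24-GB GPUs (96 GB in total) can hold a 4-bit 70B model with room for longer contexts; a 405B model needs roughly 200 GB for its weights even at 4-bit, so it calls for a multi-GPU server rather than a workstation. On-premise deployments become substantially more realistic; see Secure AI integration.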
/ 06 Does quantization work with smaller models too?
Yes. Smaller models (1–8B) benefit especially because they then run on consumer hardware or even mobile devices. However, very small models react more sensitively to aggressive quantization — 4-bit is usually fine, 2-bit problematic. More in Small language models and edge AI.