Why is token generation sequential?

An LLM generates tokens autoregressively: each new token depends on all previous ones, so token N+1 can only be computed after token N. This sequential dependency limits parallelism during generation and therefore sets a floor on latency. Techniques such as speculative decoding try to work around it.

What is the KV cache?

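The key-value (KV) cache stores the attention keys and values of all tokens processed so far. Without it, every new token would require reprocessing the entire preceding sequence, and latency would be unusable. With the KV cache, each new token only computes its own keys and values and attends over the cached ones, so the work for generating an answer grows roughly linearly with its length instead of quadratically.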

What is vLLM and why does it matter?

vLLM is a high-performance open-source inference stack for LLMs. It brings PagedAttention (KV cache management like virtual memory), continuous batching (efficient parallelism), and many optimizations together. On common hardware vLLM reaches 5–20× higher throughput than naive implementations.

How are latency and throughput related?

Often opposed. Low per-request latency = small batches = low throughput. High throughput = large batches = higher per-request latency. Continuous batching tries to combine both by dynamically adding and removing requests from the batch.

What does an inference token cost?

Commercial APIs charge 0.001–0.1 USD per 1,000 output tokens, depending on the model. Self-hosting on-premises or in the cloud can land at 0.0001–0.01 USD per 1,000 tokens, depending on model size, hardware utilization and electricity. At very high volume, self-hosting usually wins.
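The self-hosting figure follows from simple arithmetic once an hourly GPU price and a sustained throughput are known. The sketch below uses hypothetical inputs (2 USD per GPU hour, 1,000 versus 100 tokens per second); neither number comes from a real price list or benchmark.

```python
def cost_per_1k_tokens(gpu_cost_per_hour_usd: float, tokens_per_second: float) -> float:
    """USD per 1,000 generated tokens at a given sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour_usd / tokens_per_hour * 1000


# Hypothetical inputs: the same GPU, well utilized versus poorly utilized.
well_utilized = cost_per_1k_tokens(2.0, 1000.0)
poorly_utilized = cost_per_1k_tokens(2.0, 100.0)

print(f"well utilized:   {well_utilized:.5f} USD per 1,000 tokens")
print(f"poorly utilized: {poorly_utilized:.5f} USD per 1,000 tokens")
```

Under these assumptions a tenfold drop in utilization raises the cost per token tenfold, which is why utilization appears next to model size and electricity in the list of cost drivers.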

What is speculative decoding?

An acceleration technique: a small, fast draft model proposes multiple tokens, the large model verifies them in parallel. When the draft model is often correct, it speeds up answers 2–4×. Standard in modern inference stacks.

LLM Inference: Latency, Cost, Throughput Explained (2026)

An LLM answer can arrive in 100 ms or in 5 seconds. It can cost 0.0001 USD or 0.1 USD. These differences are not magic — they’re the result of concrete technical decisions: model size, hardware, quantization, inference stack, batching strategy. Operating LLMs in production without understanding these mechanics either burns money or ships unusable UX. This article explains the building blocks.

1. What happens during an LLM answer

LLM inference has two phases:

Prefill. The entire input (system prompt, context, user question) is processed in one pass. All tokens are computed in parallel — that’s GPU-efficient and fast.
Decode. The answer is generated token by token. Each new token depends on all previous ones. Sequential, and the main source of latency.

Important: with long inputs, prefill can take longer than decode. With short inputs and long answers, decode dominates. Both phases have different performance characteristics and optimization strategies.
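A rough way to see which phase dominates is to model total latency as prefill time plus decode time. The rates below (5,000 tokens per second for prefill, 50 tokens per second for decode) are illustrative assumptions, not measurements.

```python
def estimate_latency_s(input_tokens, output_tokens, prefill_tok_per_s, decode_tok_per_s):
    """Return (prefill_seconds, decode_seconds) for one request."""
    prefill = input_tokens / prefill_tok_per_s
    decode = output_tokens / decode_tok_per_s
    return prefill, decode


# Hypothetical throughput figures, chosen only to illustrate the two regimes.
PREFILL_RATE = 5_000
DECODE_RATE = 50

long_input = estimate_latency_s(50_000, 100, PREFILL_RATE, DECODE_RATE)
short_input = estimate_latency_s(500, 1_000, PREFILL_RATE, DECODE_RATE)

print("long input, short answer (prefill, decode):", long_input)
print("short input, long answer (prefill, decode):", short_input)
```

With these numbers the long-context request spends most of its time in prefill, while the chat-style request is dominated by decode.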

2. Prefill and decode

Prefill is compute-bound: the GPU runs at full compute, memory is less critical. On modern GPUs prefill processes thousands of tokens per second.

Decode is memory-bandwidth-bound: for each token the GPU must read all model weights from memory to compute the next one. GPU compute is barely used; the bottleneck is memory bandwidth. An A100 has ~2 TB/s memory bandwidth. For a 70B model in FP16 (140 GB of weights, which in practice must be sharded across several GPUs because a single A100 holds at most 80 GB), that bandwidth allows roughly 14 forward passes per second, i.e. ~14 tokens per second per stream.

Consequence: per-token latency depends mostly on model size and memory bandwidth. Faster answers require either shrinking the model (quantization, distillation) or upgrading hardware (H100, MI300).
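The bandwidth bound can be written as a one-line estimate: tokens per second is at most memory bandwidth divided by the bytes of weights read per token. This sketch deliberately ignores KV cache reads, multi-GPU sharding and kernel overheads, so it is an upper bound rather than a prediction.

```python
def decode_tokens_per_second(params_billion, bytes_per_param, bandwidth_tb_per_s):
    """Upper bound on single-stream decode speed when weight reads dominate."""
    model_bytes = params_billion * 1e9 * bytes_per_param
    bandwidth_bytes_per_s = bandwidth_tb_per_s * 1e12
    return bandwidth_bytes_per_s / model_bytes


fp16 = decode_tokens_per_second(70, 2.0, 2.0)   # 70B parameters, 2 bytes each
int4 = decode_tokens_per_second(70, 0.5, 2.0)   # same model at 4 bits per weight

print(f"FP16:  {fp16:.1f} tokens/s")
print(f"4-bit: {int4:.1f} tokens/s")
```

The same formula shows why shrinking the weights helps: at 4 bits per parameter the bound is four times higher on identical hardware.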

3. The KV cache — boon and memory problem

If decode re-processed the whole input each time, complexity would be quadratic in sequence length. The solution: the KV cache stores the key and value tensors of attention layers for all previous tokens. Each new token computes only its own contribution and appends to the cache.
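The effect can be shown with a toy single-head attention in plain Python. The `project` function is a hypothetical stand-in for the learned key and value projections (a fixed scaling), and the raw token vector is used as the query; the point is only to count how many projections each strategy computes.

```python
import math


def attend(query, keys, values):
    """Softmax-weighted sum of the value vectors for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def project(token_vec, factor):
    """Hypothetical stand-in for a learned K or V projection."""
    return [factor * x for x in token_vec]


def decode_without_cache(tokens):
    """Recompute keys and values for the whole prefix at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        prefix = tokens[:t]
        keys = [project(x, 0.5) for x in prefix]
        values = [project(x, 2.0) for x in prefix]
        projections += 2 * t
        outputs.append(attend(prefix[-1], keys, values))
    return outputs, projections


def decode_with_cache(tokens):
    """Compute keys and values once per token and append them to the cache."""
    key_cache, value_cache = [], []
    outputs, projections = [], 0
    for x in tokens:
        key_cache.append(project(x, 0.5))
        value_cache.append(project(x, 2.0))
        projections += 2
        outputs.append(attend(x, key_cache, value_cache))
    return outputs, projections


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
slow_out, slow_work = decode_without_cache(tokens)
fast_out, fast_work = decode_with_cache(tokens)
print("projections without cache:", slow_work)
print("projections with cache:   ", fast_work)
```

Both variants produce the same outputs; only the number of key/value projections differs, and the gap widens with every additional token.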

The KV cache is powerful but expensive. Each token consumes tens to hundreds of KB of memory (depending on layer count, the number of key/value heads and numeric precision). A request with a 100,000-token context can therefore need tens of GB of GPU memory just for the cache. That limits how many concurrent requests fit on a GPU.
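The per-token footprint follows directly from the model shape: one key and one value vector per layer and per key/value head. The configuration below (80 layers, 8 key/value heads, head dimension 128, FP16) is a hypothetical shape in the range of current 70B-class models, not the specification of a particular one.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value=2):
    """Bytes of KV cache added per token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape with shared key/value heads.
per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
context_gb = per_token * 100_000 / 1e9

# Same shape without key/value sharing (one KV head per attention head).
per_token_unshared = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=64, head_dim=128)

print(f"{per_token / 1024:.0f} KiB per token, {context_gb:.1f} GB for 100,000 tokens")
print(f"without sharing: {per_token_unshared / per_token:.0f}x larger")
```

For this assumed shape the cache takes 327,680 bytes per token, so a 100,000-token context occupies about 32.8 GB, and it would be eight times larger without shared key/value heads.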

Modern inference stacks like vLLM use PagedAttention — a memory management technique analogous to OS virtual memory. KV cache blocks are allocated and freed on demand, making memory usage dramatically more efficient.
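The idea of block-wise allocation can be sketched with a small allocator that hands out fixed-size blocks on demand and keeps a per-request block table. This is an illustration of the concept only; the class and method names are hypothetical and do not mirror vLLM's internal API.

```python
class PagedKVAllocator:
    """Toy block allocator: each request owns a list of physical block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}   # request id -> list of physical block ids
        self.token_counts = {}   # request id -> number of cached tokens

    def append_token(self, request_id):
        count = self.token_counts.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        # A new block is needed only when all owned blocks are full.
        if count == len(table) * self.block_size:
            if not self.free_blocks:
                raise MemoryError("no free KV cache blocks")
            table.append(self.free_blocks.pop())
        self.token_counts[request_id] = count + 1

    def release(self, request_id):
        """Return all blocks of a finished request to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.token_counts.pop(request_id, None)


alloc = PagedKVAllocator(num_blocks=4, block_size=16)
for _ in range(17):
    alloc.append_token("a")          # 17 tokens occupy 2 blocks
for _ in range(5):
    alloc.append_token("b")          # 5 tokens occupy 1 block
blocks_in_use = 4 - len(alloc.free_blocks)
alloc.release("a")                   # request "a" finishes, its blocks are reusable
print("blocks in use before release:", blocks_in_use)
print("free blocks after release:   ", len(alloc.free_blocks))
```

Because memory is reserved one block at a time, a request never holds more than one partially filled block, instead of a buffer sized for the maximum possible context.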

4. Batching and throughput

A single request uses the GPU poorly — it’s mostly waiting on memory bandwidth. With batching, multiple requests run in parallel: the model is loaded once but serves several requests.

Three batching modes:

Static batching. All requests in the batch must start and end together. Bad, because short requests wait on long ones.
Continuous batching. Requests are dynamically added and removed as they complete. Standard in modern stacks (vLLM, TGI).
Speculative batching. Speculative tokens are verified in parallel (see below).

Continuous batching raises throughput 3–10× over static batching without materially worsening latency. It’s the single most important engineering lever for production LLM inference.
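The difference between static and continuous batching can be seen in a small scheduling simulation. It counts decode steps only, treats every step as equally expensive regardless of how many slots are occupied, and uses made-up output lengths; it is a sketch of the scheduling idea, not a model of any particular server.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request has finished."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A finished request frees its slot for the next waiting request."""
    waiting = list(lengths)
    running = []
    steps = 0
    while waiting or running:
        # Fill free slots before every decode step.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request emits one token; finished ones leave the batch.
        running = [remaining - 1 for remaining in running if remaining > 1]
    return steps


# Hypothetical output lengths in tokens: two long answers among many short ones.
lengths = [100, 5, 5, 5, 100, 5, 5, 5]
static = static_batching_steps(lengths, batch_size=4)
continuous = continuous_batching_steps(lengths, batch_size=4)
print("static batching steps:    ", static)
print("continuous batching steps:", continuous)
```

In this toy workload the second long request starts as soon as a short one finishes instead of waiting for the first batch to drain, which roughly halves the number of decode steps.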

5. FlashAttention and PagedAttention

FlashAttention is a memory-efficient implementation of attention. Instead of materializing the full attention matrix, it computes in small blocks fitting GPU SRAM. Result: 2–4× faster attention with less memory. Now standard in almost every inference stack.
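The block-wise computation rests on an online softmax: a running maximum and running sum are updated block by block, so the full row of attention scores never has to exist at once. The plain-Python sketch below shows that numerical idea for a single query; the real FlashAttention is a fused GPU kernel that also tiles the queries, which this example does not attempt to reproduce.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention_full(query, keys, values):
    """Reference: materialize all scores, then apply softmax."""
    scores = [dot(query, key) for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def attention_blockwise(query, keys, values, block_size):
    """Process keys/values in blocks, keeping only running statistics."""
    running_max = float("-inf")
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), block_size):
        k_block = keys[start:start + block_size]
        v_block = values[start:start + block_size]
        scores = [dot(query, key) for key in k_block]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum (0.0 on the first block).
        rescale = math.exp(running_max - new_max)
        running_sum *= rescale
        acc = [a * rescale for a in acc]
        for s, v in zip(scores, v_block):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / running_sum for a in acc]


query = [0.3, -1.2]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0], [0.5, 0.5]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
full = attention_full(query, keys, values)
tiled = attention_blockwise(query, keys, values, block_size=2)
print(full)
print(tiled)
```

Both functions return the same result up to floating-point rounding, which is why the block-wise variant changes speed and memory use but not answer quality.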

PagedAttention (vLLM) brings virtualized KV cache management. It makes GPU memory nearly twice as efficient and enables many more parallel requests per GPU.

Combined, FlashAttention plus PagedAttention already deliver 5–10× better throughput than naive implementations — at identical answer quality.

6. Speculative decoding and other tricks

Speculative decoding sidesteps decode sequentiality: a small, fast draft model proposes several tokens (e.g. 4–8 at once), the large model verifies them in parallel. With high agreement, 2–4× speedup is possible. Prerequisite: a good draft model, often a distillate of the large one.
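The propose-and-verify loop can be sketched for greedy decoding with two toy "models" that are just Python functions. `target_next` and `draft_next` are hypothetical stand-ins; in a real system the verification of all proposed tokens is a single batched forward pass of the large model, and sampled decoding uses a rejection-sampling rule instead of the exact-match check shown here.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round: the draft proposes k tokens, the target keeps the agreeing part."""
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(prefix + proposal))

    accepted = []
    for token in proposal:
        expected = target_next(prefix + accepted)
        if token != expected:
            accepted.append(expected)   # target's own token replaces the first mismatch
            return accepted
        accepted.append(token)
    accepted.append(target_next(prefix + accepted))  # extra token when all k match
    return accepted


def target_next(tokens):
    """Toy target model: counts upward modulo 10."""
    return (tokens[-1] + 1) % 10


def draft_next(tokens):
    """Toy draft model: agrees with the target except after the token 3."""
    return 0 if tokens[-1] == 3 else (tokens[-1] + 1) % 10


def generate(prefix, num_tokens, k):
    out = list(prefix)
    rounds = 0
    while len(out) - len(prefix) < num_tokens:
        out.extend(speculative_step(out, draft_next, target_next, k))
        rounds += 1
    return out[len(prefix):len(prefix) + num_tokens], rounds


tokens, rounds = generate([0], num_tokens=8, k=4)
print("generated:", tokens)
print("target verification rounds:", rounds)
```

The output is identical to what the target alone would have produced, but it needs fewer verification rounds than tokens, which is where the speedup comes from when the draft model agrees often.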

Medusa heads are a variant: instead of a separate draft model, multiple prediction heads are attached to the same model to estimate tokens in parallel.

Chunked prefill breaks very long inputs into parts to smooth peak memory usage.

Multi-query and grouped-query attention reduce KV cache memory by sharing key/value tensors across attention heads.

These techniques stack. A modern inference stack with vLLM, FlashAttention, PagedAttention, continuous batching and speculative decoding is 20–50× faster than a naive PyTorch loop — on the same model.

7. Practice: steering latency and cost

These levers have proven themselves in consulting projects:

Model size matched to use case. An 8B model is 10× cheaper than a 70B. If quality suffices, the choice is clear. Distillation can pull this lever — see Model distillation.
Quantization. 4-bit quantization cuts weight memory to roughly a quarter of FP16 and noticeably speeds up inference. Details in QLoRA and quantization.
Right inference stack. vLLM, TGI or SGLang instead of naive Transformers loops. 10× throughput.
Continuous batching at high batch size. Maximizes GPU utilization.
Stream the answer. Even high total latency feels fast if first tokens appear within 200 ms.
Cache frequent requests. Identical prompts don’t need recomputation each time.
Right hardware. H100, MI300 or L40S — memory bandwidth is the decisive factor, not raw FLOPS.
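Of the levers above, caching frequent requests is the simplest to illustrate. The sketch below is an exact-match cache keyed by a hash of model name and prompt; `PromptCache` and `fake_generate` are hypothetical names, and the approach only makes sense when identical prompts should yield identical answers (for example with deterministic decoding settings).

```python
import hashlib


class PromptCache:
    """Exact-match response cache in front of an arbitrary generate function."""

    def __init__(self, generate):
        self._generate = generate
        self._store = {}
        self.hits = 0
        self.misses = 0

    def _key(self, model, prompt):
        # A separator byte keeps ("ab", "c") and ("a", "bc") from colliding.
        return hashlib.sha256(f"{model}\x00{prompt}".encode("utf-8")).hexdigest()

    def complete(self, model, prompt):
        key = self._key(model, prompt)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        answer = self._generate(model, prompt)
        self._store[key] = answer
        return answer


def fake_generate(model, prompt):
    """Placeholder for a real inference call."""
    return f"[{model}] answer to: {prompt}"


cache = PromptCache(fake_generate)
first = cache.complete("demo-model", "What is the KV cache?")
second = cache.complete("demo-model", "What is the KV cache?")
print("hits:", cache.hits, "misses:", cache.misses)
```

A production version would additionally need an eviction policy and a size limit; those are omitted here to keep the idea visible.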

LLM inference in 2026 is a precisely measurable engineering field. Mastering it lets you run production applications at costs and latencies that seemed impossible two years ago. Ignoring it burns budget on two axes: too-expensive cloud APIs or too-slow self-hosted setups. With the right stack and hardware, a GPU investment becomes a productive system that lasts for years. For the broader operational picture, see LLMOps.
