Tokens are what LLMs actually compute on: not words, not characters, but subword units produced by a tokenizer. Anyone running LLMs in production must understand tokenization and context windows, or they will overpay and see unexpected drops in quality. This article explains the mechanics and gives practical strategies for long documents.
1. What a token is
A token is the smallest unit an LLM works with internally. Tokens are typically subwords: parts of words, whole words, or single characters, depending on frequency. Which sequences are merged into a single token is learned when the tokenizer is trained, before the model itself is trained.
Example: the word “understanding” could be one token (if frequent in training), two tokens (under + standing), or several. It depends on the tokenizer.
Rules of thumb:
- English: ~4 chars/token, ~0.75 words/token.
- German: ~3.5 chars/token, ~0.5–0.7 words/token (long compounds split).
- Code: ~5 chars/token, language dependent.
- Logographic languages (Chinese, Japanese): ~1–2 chars/token.
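The rules of thumb above can be turned into a quick estimator for early budgeting. This is a minimal sketch: the ratios are the article's rough averages, the midpoint of 1.5 for logographic scripts is an assumption, and the function name `estimate_tokens` is hypothetical. For billing or hard limits, count with the model's real tokenizer.

```python
import math

# Approximate characters per token, taken from the rules of thumb above.
# These are rough averages, not the output of any real tokenizer.
CHARS_PER_TOKEN = {
    "english": 4.0,
    "german": 3.5,
    "code": 5.0,
    "logographic": 1.5,  # assumed midpoint of the 1-2 chars/token range
}


def estimate_tokens(text: str, kind: str = "english") -> int:
    """Return a rough token estimate for `text`."""
    if not text:
        return 0
    # Round up so that budgets err on the safe side.
    return math.ceil(len(text) / CHARS_PER_TOKEN[kind])


# Usage: the same 700 characters are estimated differently per text type.
sample = "x" * 700
english_estimate = estimate_tokens(sample, "english")  # 175
german_estimate = estimate_tokens(sample, "german")    # 200
```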
2. Tokenizers compared
Two main families dominate:
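- BPE (Byte Pair Encoding). Iteratively merges the most frequent adjacent symbol pairs. Used by OpenAI's tiktoken (GPT line) and Llama-3; Llama-1/2 and early Mistral models use BPE as implemented in the SentencePiece library.
- Unigram. Starts from a large candidate vocabulary and prunes it using a probabilistic model of subword units. Also implemented in SentencePiece; used by T5 and ALBERT, for example.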
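To make the BPE merge procedure concrete, here is a toy trainer on a handful of words. It is a sketch for illustration only: production tokenizers work on bytes, handle pre-tokenization and special tokens, and break ties differently. The function names and the sample corpus are made up for this example.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a sequence of single characters, weighted by count.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken lexicographically
        # (an arbitrary choice to keep this toy deterministic).
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[tuple(merge_pair(list(symbols), best))] += freq
        vocab = new_vocab
    return merges


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize a word by applying the learned merges in training order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(corpus, num_merges=4)
tokens = encode("lowest", merges)  # ['low', 'est']
```

Note that "lowest" never appears in the corpus, yet it encodes into two tokens built from frequent pieces. That is the property that lets subword tokenizers handle unseen words.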
Practical consequences:
- Multilingual coverage. English-centric tokenizers (GPT-2/3, Llama-1) are often inefficient for non-English text. Newer ones (Llama-3, tiktoken's cl100k_base, Mistral) are better balanced.
- Vocabulary size. A larger vocabulary encodes text more efficiently (fewer tokens per text) but needs a larger embedding matrix. Typical sizes are 32K–256K tokens.
- Special tokens. <|begin_of_text|>, <|user|>, <|tool|>, etc. are critical for chat formats and tool calling.
Relevant for enterprises: when choosing between models, compare tokenizer efficiency on your texts. For heavily German workloads, a good tokenizer can save 30%+ in cost.
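A comparison like this can be scripted in a few lines. The sketch below measures tokens per 1,000 characters for any set of encoder functions. The two encoders in the usage example are deliberately trivial stand-ins so that the code runs without dependencies; in a real test you would pass the `encode` method of each candidate model's tokenizer instead (for example, the encoder returned by `tiktoken.get_encoding("cl100k_base")`). All function names here are hypothetical.

```python
from typing import Callable, Sequence

Encoder = Callable[[str], Sequence]


def tokens_per_1k_chars(texts: Sequence[str], encode: Encoder) -> float:
    """Average number of tokens produced per 1,000 characters of input."""
    total_chars = sum(len(t) for t in texts)
    if total_chars == 0:
        raise ValueError("sample texts are empty")
    total_tokens = sum(len(encode(t)) for t in texts)
    return 1000 * total_tokens / total_chars


def compare_tokenizers(
    texts: Sequence[str], encoders: dict[str, Encoder]
) -> dict[str, float]:
    """Run every encoder over the same sample and report its token density."""
    return {name: tokens_per_1k_chars(texts, enc) for name, enc in encoders.items()}


def relative_saving(baseline: float, candidate: float) -> float:
    """Fraction of tokens saved by `candidate` compared with `baseline`."""
    return 1 - candidate / baseline


# Stand-in encoders for demonstration only; replace with real tokenizers.
samples = ["Die Donaudampfschifffahrt ist ein langes Wort."]
report = compare_tokenizers(samples, {"per_char": list, "per_word": str.split})
saving = relative_saving(report["per_char"], report["per_word"])
```

Use a sample that reflects production traffic (language mix, formatting, code snippets); a tokenizer that wins on clean prose may lose on tables or log output.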
3. Context windows in 2026
Usable input length has grown dramatically:
- 8K tokens (2023 standard). Roughly a dozen pages of text plus answer. Good for short tasks.
- 32K (GPT-4-32K, Mixtral). Several dozen pages. Sensible conversations, mid-length documents.
- 128K (GPT-4-Turbo, Llama-3.1). A full-length novel, large code files. 2026 standard for most production models.
- 200K–1M (Claude 3 and later at 200K, Gemini 1.5 and later at 1M and above). Full codebases, book collections, whole database exports. Available in production in 2026.
- Research at 10M+. Experimental. Rarely needed in practice.
Important: nominal context size says nothing about how well the model actually uses that context. Evaluation on realistic long-context tasks is mandatory.
4. What long contexts really deliver
Long context solves real problems but has weaknesses:
Strengths:
- Whole codebase analysis without chunking.
- Contract review across all pages simultaneously.
- Book summaries from full content.
- Long dialog memory without separate retrieval.
Weaknesses:
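- Lost in the middle. Information in the center of the context is retrieved less reliably than information at the start and end.
- Cost scales. Inference cost and latency grow with every added token, and attention compute grows quadratically with sequence length.
- KV cache memory. At 1M tokens the cache can need tens to hundreds of GB of GPU memory per request, depending on model architecture and precision.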
- Quality drift. Models nominally supporting 1M tokens are often suboptimal already at 100K.
Practice: long context where it fits, augmented by RAG for very large knowledge bases.
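A back-of-the-envelope estimate for the KV cache point above. The sketch assumes a standard transformer that stores one key and one value vector per layer, per KV head, and per token. The model shape used in the example (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an illustrative assumption in the range of an 8B-parameter model with grouped-query attention, not a figure from this article.

```python
def kv_cache_bytes(
    seq_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes = 16-bit values (fp16/bf16)
) -> int:
    """Memory needed to cache keys and values for one sequence."""
    # Factor 2: one key vector and one value vector per layer/head/token.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * seq_len


# Assumed model shape, for illustration only.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(1, **shape)                   # 131072 bytes = 128 KiB
at_128k = kv_cache_bytes(128 * 1024, **shape) / 2**30    # 16.0 GiB
at_1m = kv_cache_bytes(1_000_000, **shape) / 1e9         # about 131 GB
```

The result depends heavily on the architecture: more layers or KV heads increase it, while a quantized cache or fewer KV heads reduce it proportionally. The linear growth with sequence length holds either way.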
5. Token cost and latency
Tokens cost money on two levels:
- API prices. Typically 0.001–0.1 USD per 1,000 output tokens. Input tokens are usually cheaper, commonly by a factor of two to five.
- Self-hosted cost. Hardware utilization, electricity. At very high volume this can be 5–10× cheaper than an API.
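Latency grows linearly with output tokens (decode phase, one token at a time) and roughly linearly with input tokens at moderate lengths (prefill phase, parallelizable); for very long inputs the quadratic attention term becomes noticeable. Long inputs raise prefill time; long outputs raise decode time. Streaming the answer reduces perceived latency because the first tokens appear as soon as prefill finishes. More in LLM inference.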
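The cost and latency relationships above fit into a small planning model. This is a sketch under simplifying assumptions: cost is strictly per token, latency is the sum of a linear prefill term and a linear decode term, and queueing, network time, and the quadratic attention term are ignored. The prices and throughput figures in the example are hypothetical placeholders, not quotes from any provider.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    usd_per_1k_input: float      # price per 1,000 input tokens
    usd_per_1k_output: float     # price per 1,000 output tokens
    prefill_tokens_per_s: float  # input tokens processed per second
    decode_tokens_per_s: float   # output tokens generated per second


def request_cost_usd(p: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Per-request cost under simple per-token pricing."""
    return (
        input_tokens / 1000 * p.usd_per_1k_input
        + output_tokens / 1000 * p.usd_per_1k_output
    )


def time_to_first_token_s(p: ModelProfile, input_tokens: int) -> float:
    """With streaming, the user waits roughly for prefill only."""
    return input_tokens / p.prefill_tokens_per_s


def total_latency_s(p: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Prefill time plus decode time for the complete answer."""
    return time_to_first_token_s(p, input_tokens) + output_tokens / p.decode_tokens_per_s


# Hypothetical profile; measure or look up real values for your model.
profile = ModelProfile(
    usd_per_1k_input=0.005,
    usd_per_1k_output=0.015,
    prefill_tokens_per_s=5000,
    decode_tokens_per_s=50,
)
cost = request_cost_usd(profile, input_tokens=10_000, output_tokens=1_000)
first_token = time_to_first_token_s(profile, input_tokens=10_000)
total = total_latency_s(profile, input_tokens=10_000, output_tokens=1_000)
```

In this example the 1,000 output tokens dominate total latency (20 of 22 seconds) even though the input is ten times longer, which is why capping output length is often the most effective latency lever.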
6. Strategies for long documents
When documents exceed the context window:
- Chunking. Split documents into semantically meaningful pieces, process individually. See Embeddings and vector databases.
- Map-reduce. Summarize each chunk separately, then merge. Scales arbitrarily but can lose details.
- Hierarchical summarization. Multi-level: summarize sections, then summarize the summaries, etc.
- Retrieval memory. Select relevant parts by embedding search and only load those into context.
- Sliding window. For long conversations: summarize old content, keep recent in detail.
The right strategy depends on the use case. For “question about a 1,000-page document” RAG usually beats long context. For “refactoring a codebase” long context is often appropriate.
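The chunking and map-reduce strategies above can be sketched as follows. The `summarize` parameter is a hypothetical callable that wraps whatever LLM call you use; chunk size is measured in characters here for simplicity, whereas a real pipeline would measure in tokens with the model's tokenizer.

```python
from typing import Callable


def chunk_paragraphs(text: str, max_chars: int) -> list[str]:
    """Greedily pack paragraphs into chunks of at most `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole rather than
    cut mid-sentence, so individual chunks can exceed the limit.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def map_reduce_summary(
    text: str, summarize: Callable[[str], str], max_chars: int
) -> str:
    """Summarize each chunk on its own (map), then merge the results (reduce)."""
    chunks = chunk_paragraphs(text, max_chars)
    if not chunks:
        return ""
    partials = [summarize(chunk) for chunk in chunks]
    if len(partials) == 1:
        return partials[0]
    # One reduce step; if the joined partials are still too long, apply
    # map_reduce_summary to them again (hierarchical summarization).
    return summarize("\n\n".join(partials))


# Usage with a stand-in "model" that just keeps the first word.
document = "alpha one\n\nbeta two\n\ngamma three"
result = map_reduce_summary(document, lambda t: t.split()[0], max_chars=20)
```

Splitting on paragraph boundaries is the simplest way to keep chunks semantically coherent; headings or section markers are better split points when the document has them.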
7. Practice: planning context architecture
From consulting practice, four steps:
- Define per-request token budget. What’s the max cost an answer may incur?
- Test tokenizer on your texts. Simulate the real application, measure tokens. Compare across models if needed.
- Choose a context strategy. RAG, long context, hybrid — suited to data structure.
- Eval on edge cases. Lost-in-the-middle tests, very long and very short inputs, multilingual mixtures.
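For the lost-in-the-middle tests mentioned in the last step, a common approach is to hide one known fact (a "needle") at different depths of otherwise irrelevant context and check whether the model still finds it. The sketch below builds such prompts and records a hit or miss per depth. `ask_model` is a hypothetical callable standing in for your LLM client, and the substring check is a deliberately crude scoring rule.

```python
from typing import Callable, Sequence


def build_needle_prompt(
    filler: Sequence[str], needle: str, question: str, depth: float
) -> str:
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    position = int(depth * len(filler))
    passages = list(filler[:position]) + [needle] + list(filler[position:])
    return "\n".join(passages) + "\n\nQuestion: " + question


def accuracy_by_depth(
    filler: Sequence[str],
    needle: str,
    question: str,
    expected: str,
    ask_model: Callable[[str], str],
    depths: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> dict[float, bool]:
    """Ask the same question with the needle at each depth; record hit or miss."""
    results = {}
    for depth in depths:
        prompt = build_needle_prompt(filler, needle, question, depth)
        results[depth] = expected in ask_model(prompt)
    return results


# Usage with a fake model that only "reads" the first line of its prompt.
filler = [f"Filler sentence number {i}." for i in range(8)]
needle = "The access code is 4711."
fake_model = lambda prompt: prompt.split("\n")[0]
hits = accuracy_by_depth(filler, needle, "What is the access code?", "4711", fake_model)
```

Run the same grid at several total context lengths and with filler drawn from your own documents; a model that passes at 8K may already degrade at 100K.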
Tokenization and context windows are not just backend details but direct architecture decisions affecting cost, latency, and quality. In 2026 the choice of models and strategies is rich enough to find a fitting combination for almost any use case, provided you know the trade-offs. Teams that understand the mechanics build applications whose economics work out. Teams that ignore them either overpay or ship answers that miss key parts of the input.
Frequently asked questions.
/ 01 What is a token?
A token is the smallest unit an LLM processes. Tokens are typically subwords — a word like 'understanding' may split into multiple tokens, while a common word like 'and' is a single token. Rule of thumb: a token ≈ 0.75 English words or ~4 characters. German tends to produce slightly more tokens per character.
/ 02 What does a 128K context window mean?
The model can process at most 128,000 tokens at once, input plus generated answer together. 128K tokens correspond to roughly 96,000 English words (at ~0.75 words per token), about the length of a full novel. Longer inputs must be truncated or handled with chunking strategies.
/ 03 Are 1M-token-context models always better?
No. Long context windows solve a memory problem, not necessarily a comprehension problem. Models often exhibit 'lost in the middle' effects: information at the start and end of context is used better than in the middle. Evaluation on realistic long-context tasks remains mandatory.
/ 04 How do different tokenizers affect token cost?
Substantially. A 1,000-word German text costs noticeably more tokens with an English-optimized tokenizer (e.g. GPT-2/3) than with a multilingual one (e.g. tiktoken's cl100k_base or Llama-3). For German applications, tokenizer differences can mean 30–50% cost variation.
/ 05 When do I need long context vs. RAG?
Long context is worth it when the entire content could be relevant to the answer (codebase analysis, long contracts, book summarization). RAG is better when sources are large but only small parts are relevant per query (knowledge bases, product catalogs). Often combined: RAG for preselection, long context for detailed analysis.
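/ 06 What's the cost of long-context inference?
API billing is linear in token count, so a 100K-token input costs about 100× as much as a 1K-token input at the same per-token price; the ratio for the whole request is lower when the output length stays the same. On the compute side, attention remains quadratic in sequence length, even though modern implementations keep its memory use manageable. The KV cache also needs much more GPU memory. More in LLM inference.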