For years the rule was: bigger models, more data, more GPU hours — that’s how AI gets better. In 2024 and 2025 a second scaling axis arrived: test-time compute. Instead of growing the model, you give it more compute at answer time to think things through. OpenAI’s o-series, DeepSeek-R1 and Gemini Thinking are the prominent examples. This article explains what’s behind it, where it helps — and where it doesn’t.
1. What reasoning models are
A classic LLM answers a question in one pass: tokens come in, tokens come out, linear and unrevised. For easy tasks that works well. For multi-step problems — math with intermediate steps, code generation with architectural decisions, contract analysis with cross-references — the model quickly loses precision because it can’t look back or change direction.
Reasoning models change that. They first produce an internal chain of thought — a longer argument, often thousands of tokens, in which the model proposes hypotheses, checks intermediate results, abandons paths and corrects itself. Only then does the actual, concise answer follow. What looks like a short output to the user was internally a long deliberation.
Crucially: reasoning isn’t just a longer prompt. The models are trained, typically with reinforcement learning on verifiable answers, to produce chains of thought that more often lead to correct outputs. So they learn not only what to answer, but also how to think on the way.
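To make "reinforcement learning on verifiable answers" more tangible, here is a minimal sketch of a verifiable reward function. It is an illustration, not any vendor's training code: the `Answer:` line format and the function names `extract_final_answer` and `verifiable_reward` are assumptions made for this example.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last 'Answer: ...' line of a completion (assumed output format)."""
    matches = re.findall(r"^Answer:\s*(.+?)\s*$", completion, flags=re.MULTILINE)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Reward 1.0 if the final answer matches the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0  # no parseable answer counts as wrong
    return 1.0 if answer == reference.strip() else 0.0


completion = "17 * 3 = 51\n51 + 4 = 54\nWait, 51 + 4 = 55.\nAnswer: 55"
print(verifiable_reward(completion, "55"))
```

The reward only looks at the final answer, not at the chain of thought itself. That is why the chain can contain false starts and corrections as long as it ends in a checkable result.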
2. Test-time compute — the second scaling axis
Until recently, LLM quality scaled almost exclusively along the training axis: more parameters, more data, more pre-training GPU hours. Reasoning models introduce the inference axis: at fixed model size, quality keeps improving as you allow the model more tokens for its internal argument.
As a rough, order-of-magnitude pattern on benchmark-style tasks (math, programming, logic), you typically see the following; the exact thresholds vary by model and task:
- 1,000 reasoning tokens: solid answer.
- 10,000 reasoning tokens: clearly more accurate.
- 50,000+ reasoning tokens: a plateau at a higher level.
For enterprises that means model size is no longer the only lever. A smaller reasoning model with longer thinking time can outperform a larger classic model — often at lower hardware investment but higher inference cost per request.
3. From chain of thought to trained reasoning
The conceptual predecessor is chain-of-thought prompting, a 2022 technique that showed LLMs get better when the prompt asks them to think step by step, for example with “Let’s think step by step.” It still works today, but it is limited: the model wasn’t trained for it, only persuaded into it.
Reasoning models internalize the technique. Instead of triggering it through a prompt hack, it’s part of the training process. Two consequences:
- More reliable. The chain of thought no longer depends on prompt tricks.
- Hideable. Vendors often expose only a summary of the chain, not the raw trace. Open-weight models like DeepSeek-R1 expose it in full.
From an engineering standpoint: the chain of thought is not a trustworthy explanation of the answer — it’s a computation trace. For a real explanation you need other tools, as discussed in our piece on Mechanistic Interpretability.
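For open-weight models that expose the raw trace, it is useful to separate trace and answer before logging or displaying anything. DeepSeek-R1 wraps its chain of thought in `<think>` and `</think>` tags; the sketch below assumes that format, and the helper name `split_reasoning` is hypothetical.

```python
import re

# Assumes the trace is wrapped in <think>...</think>, as DeepSeek-R1 emits it.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(raw: str) -> tuple[str, str]:
    """Split raw model output into (trace, answer)."""
    match = THINK_BLOCK.search(raw)
    if match is None:
        return "", raw.strip()  # no trace found: treat everything as the answer
    trace = match.group(1).strip()
    answer = (raw[:match.start()] + raw[match.end():]).strip()
    return trace, answer


raw_output = "<think>17 * 3 = 51, then 51 + 4 = 55.</think>The result is 55."
trace, answer = split_reasoning(raw_output)
print(answer)  # The result is 55.
```

Keeping the two apart makes it easy to store the trace for debugging while showing users only the answer, without treating the trace as an explanation.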
4. When reasoning models are genuinely useful
Not every task benefits. If you’re building an inbound-email classifier, you don’t need reasoning: the 10–100× token overhead would be wasted. Reasoning pays off when the task has these properties:
- Multi-step. Several intermediate results that build on each other.
- Verifiable. A defined correctness criterion exists.
- Non-trivially branching. Several plausible solution paths.
- Costly errors. A single wrong result costs more than ten correct ones are worth.
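The four properties above can be turned into a simple routing check. This is a sketch under assumptions: the names `TaskProfile` and `suggest_model` are hypothetical, and requiring all four properties is one possible reading of the list, not a validated threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    multi_step: bool      # intermediate results build on each other
    verifiable: bool      # a defined correctness criterion exists
    branching: bool       # several plausible solution paths
    costly_errors: bool   # a wrong result is expensive


def suggest_model(task: TaskProfile, min_properties: int = 4) -> str:
    """Suggest 'reasoning' if enough properties hold, else 'classic'."""
    score = sum([task.multi_step, task.verifiable, task.branching, task.costly_errors])
    return "reasoning" if score >= min_properties else "classic"


email_triage = TaskProfile(multi_step=False, verifiable=True, branching=False, costly_errors=False)
contract_review = TaskProfile(multi_step=True, verifiable=True, branching=True, costly_errors=True)
print(suggest_model(email_triage))     # classic
print(suggest_model(contract_review))  # reasoning
```

Lowering `min_properties` makes the router more generous. Whether that pays off is a question for an evaluation, not for the heuristic.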
Concrete 2026 examples:
- Code reviews and architecture proposals. Reasoning models catch deeper bugs and weigh architectural trade-offs more carefully. See also AI in software development.
- Contract analysis with cross-references. When clauses reference each other, a longer argument trace helps.
- Multi-step mathematical and financial models. Calculations, optimizations, what-if analyses.
- Complex tool-use sequences for agents. Which API, in which order, with which arguments.
5. Reasoning as the backbone for AI agents
AI agents — autonomous systems that operate multiple tools and coordinate subgoals — benefit disproportionately from reasoning. A classic LLM as an agent brain tends to be impulsive: it calls the first plausible tool without checking the overall plan.
A reasoning model formulates a plan internally, identifies dependencies and checks conditions before the first tool call. In practice we see, in agent workflows with reasoning backbones:
- 30–60% fewer failed tool calls.
- More robust behavior when tool outputs are unexpected.
- Better abort and resume decisions.
The downside: higher latency and cost per step. For more on agent architectures see What is an AI agent?
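The plan-first behavior described above can also be checked outside the model, before any tool runs. The following sketch assumes the agent emits its plan as a mapping from each tool call to the calls it depends on; the function name `order_plan` and the tool names are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def order_plan(plan: dict[str, set[str]], available_tools: set[str]) -> list[str]:
    """Validate a tool plan and return a dependency-respecting execution order.

    `plan` maps each tool call to the set of calls whose output it needs.
    """
    steps = set(plan) | {dep for deps in plan.values() for dep in deps}
    unknown = steps - available_tools
    if unknown:
        raise ValueError(f"plan uses unknown tools: {sorted(unknown)}")
    try:
        return list(TopologicalSorter(plan).static_order())
    except CycleError as exc:
        raise ValueError("plan has circular dependencies") from exc


plan = {
    "fetch_invoice": set(),
    "check_budget": {"fetch_invoice"},
    "send_approval": {"check_budget"},
}
tools = {"fetch_invoice", "check_budget", "send_approval"}
print(order_plan(plan, tools))  # ['fetch_invoice', 'check_budget', 'send_approval']
```

Rejecting a plan with unknown tools or circular dependencies before the first call is cheap, and it catches the kind of error that would otherwise show up as a failed tool call.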
6. Limits and trade-offs
Reasoning models are not a silver bullet:
- Latency. Instead of seconds, answers can take minutes. Real-time chats need UX adjustments.
- Cost. 10–100× more tokens per answer is normal.
- Hallucinations don’t disappear. Long deliberation can still produce plausible-sounding wrong answers. Eval remains mandatory.
- Training bias. Reasoning models were trained mostly on math, code and logic. For open or creative tasks, classic models are often on par or better.
- Data privacy at US vendors. With the o-series, your inputs go to OpenAI, and the internal chain of thought is generated and processed on OpenAI’s infrastructure. For sensitive data, consider open-weight models like DeepSeek-R1 on German infrastructure. See Secure AI integration.
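To put a number on the cost item above, here is a small per-request estimate. It assumes reasoning tokens are billed as output tokens. The prices and token counts are hypothetical placeholders, not any vendor's price list.

```python
def cost_per_request(
    input_tokens: int,
    reasoning_tokens: int,
    answer_tokens: int,
    price_in_per_m: float,
    price_out_per_m: float,
) -> float:
    """Estimate the cost of one request; prices are per million tokens."""
    # Assumption: the internal chain of thought is billed like visible output.
    output_tokens = reasoning_tokens + answer_tokens
    return (input_tokens * price_in_per_m + output_tokens * price_out_per_m) / 1_000_000


# Hypothetical prices: 1.00 per million input tokens, 4.00 per million output tokens.
classic = cost_per_request(2_000, 0, 500, 1.0, 4.0)
reasoning = cost_per_request(2_000, 20_000, 500, 1.0, 4.0)
print(f"classic:   {classic:.4f}")    # classic:   0.0040
print(f"reasoning: {reasoning:.4f}")  # reasoning: 0.0840
```

With these placeholder numbers the reasoning request costs about 21 times as much, which sits inside the 10–100× range named above. Multiply by the expected request volume to see whether the use case carries that.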
7. Practice: when to deploy reasoning models
A simple heuristic from our advisory practice: if a human expert would visibly need to think about the task, a reasoning model is a candidate. If a human caseworker handles it in seconds, a classic model is enough.
Concrete steps for an evaluation:
- Isolate the use case. One specific task, one measurable metric.
- Build an eval set. 30–100 real cases with clear correctness criteria.
- Run side-by-side. Classic model versus reasoning model, identical prompt.
- Score four axes. Accuracy, latency, cost per request, reproducibility.
- Decide, document, reproduce. A conscious “no” is a good decision too — if it’s reproducible. See also Why AI projects fail.
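The evaluation steps above fit into a small harness. This is a sketch under assumptions: a model is any callable that returns an answer and a billed token count, the two stub models stand in for real API clients, and the names `Case`, `Scores` and `evaluate` are hypothetical.

```python
import time
from dataclasses import dataclass
from typing import Callable

# A model is any callable: prompt -> (answer, billed_tokens).
Model = Callable[[str], tuple[str, int]]


@dataclass
class Case:
    prompt: str
    expected: str


@dataclass
class Scores:
    accuracy: float         # share of cases answered correctly (first run)
    mean_latency_s: float   # average wall-clock seconds per call
    mean_cost: float        # average cost per call
    reproducibility: float  # share of cases with identical answers across runs


def evaluate(model: Model, cases: list[Case], price_per_token: float, repeats: int = 3) -> Scores:
    correct = stable = 0
    latency = cost = 0.0
    for case in cases:
        answers = []
        for _ in range(repeats):
            start = time.perf_counter()
            answer, tokens = model(case.prompt)
            latency += time.perf_counter() - start
            cost += tokens * price_per_token
            answers.append(answer.strip())
        correct += answers[0] == case.expected
        stable += len(set(answers)) == 1
    runs = len(cases) * repeats
    return Scores(correct / len(cases), latency / runs, cost / runs, stable / len(cases))


# Stub models for illustration only; replace with real client calls.
def classic_stub(prompt: str) -> tuple[str, int]:
    return "54", 50


def reasoning_stub(prompt: str) -> tuple[str, int]:
    return "56", 5_000


cases = [Case("What is 7 * 8?", "56")]
for name, model in [("classic", classic_stub), ("reasoning", reasoning_stub)]:
    print(name, evaluate(model, cases, price_per_token=1e-6))
```

Exact string match is the simplest correctness criterion. Real eval sets often need a task-specific comparison, which would replace the `answers[0] == case.expected` check.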
Reasoning models in 2026 are no longer an experimental playground but a productive part of the AI stack — if you know where they belong. Use them everywhere and you overpay. Use them nowhere and you miss quality leaps on the tasks that actually need them.
Frequently asked questions
/ 01 How does a reasoning model differ from a classic LLM?
A classic LLM produces its answer in a single generation pass: token by token, with no backtracking. A reasoning model first produces an internal chain of thought in which it formulates hypotheses, checks intermediate results, and discards dead ends before emitting the actual answer. This isn't just a different prompt, it's a different training objective: the model learns to take its time.
/ 02 What is test-time compute?
Test-time compute is the compute a model spends at inference time rather than during training. The key observation is that spending more of it improves answer quality. Classically, you scaled models via parameters and training data; reasoning models add a second axis: the length of the internal chain of thought. More thinking time means a better answer, up to a plateau.
/ 03 Are reasoning models more expensive to run?
Yes, significantly. A single answer can consume 10–100× more tokens than the same answer from a classic model, because the internal chain of thought is billed too. For routine tasks, reasoning models are overkill. For planning-heavy multi-step tasks, the quality gain usually justifies the cost.
/ 04 Which reasoning models matter in 2026?
Three families dominate the discourse: OpenAI's o-series (o1, o3), DeepSeek-R1 (open-weight, self-hostable in Germany) and Google's Gemini Thinking. Alongside them are other open-weight reasoning models such as QwQ, distilled variants of DeepSeek-R1, and several smaller research models.
/ 05 What are the main enterprise use cases?
Complex code reviews, software architecture, mathematical and financial analyses, multi-step planning and negotiation tasks, legal reasoning over contract documents, and agents that need to coordinate several tools. For simple classification or extraction, classic models are the better fit.
/ 06 Can I run a reasoning model on-premise?
Yes. DeepSeek-R1 and several open reasoning distillates can be run on-premise or in a sovereign German cloud. Memory requirements are higher than for a classic model of the same size, because the KV cache grows with every token of the internal chain of thought. We help with capacity planning in a discovery workshop.
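For a first feel of the KV cache effect, the usual back-of-the-envelope formula for a transformer with standard multi-head or grouped-query attention is shown below. The model dimensions are hypothetical, and architectures with compressed caches (such as the multi-head latent attention used in the full DeepSeek-R1) need less than this estimate.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes = fp16/bf16
) -> int:
    """Estimate KV cache size for one sequence with standard (grouped-query) attention."""
    # Factor 2: one key vector and one value vector per layer and KV head.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical model: 64 layers, 8 KV heads, head dimension 128, fp16 cache.
GIB = 1024 ** 3
for tokens in (1_000, 32_000):
    size = kv_cache_bytes(tokens, layers=64, kv_heads=8, head_dim=128)
    print(f"{tokens:>6} tokens: {size / GIB:.2f} GiB")
```

For this hypothetical configuration a 32,000-token chain of thought needs roughly 7.8 GiB of cache per sequence, on top of the model weights. The figure scales linearly with the number of concurrent requests.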