Few questions come up as often in consulting calls: “Do we need RAG or fine-tuning?” The answer is rarely one or the other alone. To put an LLM into production for a use case in 2026, you have five main methods to choose from — with very different costs, effort and effects. This article makes the trade-offs visible and gives a decision matrix for practice.
1. Why the question keeps coming up
A generic LLM can do a lot, but rarely exactly what a company needs: the right tone, your own vocabulary, current products, internal process rules, domain-specific classifications. That gap creates the need for adaptation — and with it the question of the right method.
Wrong method choice is one of the most common reasons for failed AI projects. Using fine-tuning for a knowledge problem trains a model on a data state that’s outdated three weeks later. Using RAG for a behavior problem leaves you wondering why the model still sounds like a generic assistant. More in Why AI projects fail.
2. The five methods at a glance
Prompt engineering. Steer model behavior via carefully formulated inputs — system prompts, few-shot examples, structured output schemas. No training, no extra hardware. Effective but limited: a prompt can’t teach a model anything new, only retrieve existing skills better.
RAG (retrieval-augmented generation). External data sources are searched at query time; relevant hits flow into the prompt. The model answers using that additional context. Ideal for factual knowledge that changes frequently. See Embeddings and vector databases.
LoRA / parameter-efficient fine-tuning. Instead of changing all model weights, only small adapter matrices are trained. Cheap, fast, doable on consumer hardware, without damaging the base model. More in LoRA explained.
Full fine-tuning. All model weights are further trained. Maximum effect, maximum cost, highest risk (catastrophic forgetting). Only worthwhile for deep behavior changes.
Model distillation. A smaller model is trained to imitate a larger one’s answers. Reduces latency and cost in production, often with only mild quality loss. More in Model distillation.
3. Six decision criteria
The right method depends on six factors:
- Data volatility. Does content change daily, weekly, or hardly? Volatile ⇒ RAG. Stable ⇒ fine-tuning.
- Type of adaptation. Knowledge (facts, documents) or behavior (style, format, domain language)? Knowledge ⇒ RAG. Behavior ⇒ fine-tuning.
- Data volume. Under 100 examples ⇒ prompt engineering or RAG. 500–5,000 examples ⇒ LoRA. 50,000+ examples ⇒ consider full fine-tuning.
- Budget and hardware. No GPUs ⇒ prompt engineering or API-RAG. One GPU ⇒ LoRA. Multi-GPU cluster ⇒ full fine-tuning or distillation.
- Compliance and data sovereignty. Sensitive data often can’t flow into API calls. On-premise LoRA or distillation on German infrastructure ⇒ see Secure AI integration.
- Inference cost and latency. Distillation and smaller LoRA-adapted models are cheaper in production than a large generic model with elaborate prompts.
4. Decision matrix
A simplified heuristic:
- “Answer in our company’s style, based on our handbook”: Fine-tuning (style) + RAG (handbook).
- “Answer questions about our product documentation, which we update weekly”: Pure RAG.
- “Extract structured data from standardized contracts”: Prompt engineering with JSON schema, possibly LoRA if the contracts are highly domain-specific.
- “Generate code in our internal framework”: LoRA on a code model, because the framework is unknown to the model.
- “Classify support tickets into 30 categories with high accuracy”: LoRA fine-tuning on historical tickets.
- “We want our expensive cloud model cheaper in production”: Distillation.
5. When to combine methods
In reality, a combination almost always wins:
- RAG + prompt engineering: Standard for most knowledge applications. Structured system prompts steer behavior, RAG delivers context.
- LoRA + RAG: Domain language plus current knowledge. Example: a medically fine-tuned LLM with RAG on current guidelines.
- Distillation + LoRA: Big model for eval and data generation, small model for production — with LoRA adapters for domain specifics.
- Prompt engineering + structured outputs: Pydantic or JSON schemas force the model into a checkable form. See also Tool calling, function calling and MCP.
6. Common wrong choices
- Fine-tuning a knowledge problem. Training a model on a dataset that constantly changes is a dead end — the first data update makes the training worthless.
- RAG without clean data. RAG doesn’t help if sources are contradictory, outdated or unstructured. Data work is the actual work.
- Full fine-tuning for style. LoRA reaches almost the same effect at a fraction of the cost.
- Prompt engineering as permanent solution. Prompts get long, fragile and hard to maintain over time. Once you need reproducible quality, LoRA or RAG is the more robust choice.
- Distillation without eval suite. Without hard evaluation you can’t measure whether the smaller model is really equivalent. The “smaller will do” assumption is dangerous.
7. Recommended approach
From our consulting practice:
- Define the use case cleanly. One task, one measurable metric, one acceptance criterion.
- Start with the simplest method. Prompt engineering plus a generic model. If that’s enough — lucky you.
- Add RAG when knowledge needs grow. Vector database, chunking, hybrid search.
- LoRA when behavior must be consistently adapted. Clean datasets, reproducible training, eval.
- Distillation only at the end. Reduction to a smaller production model is worthwhile only once the method choice is clear.
- Eval from day one. Without reproducible evaluation, every method choice is gut feeling. More in Guardrails, evals and prompt injection.
The honest takeaway: there is no universally best method. Only the best method for a concrete task — and that emerges from a clean analysis of data, requirements and constraints. Skipping that analysis and jumping straight into fine-tuning or RAG because it sounded good in a LinkedIn post means building on sand. Doing that analysis gets you a production AI solution that genuinely fits your business within a few weeks.
Frequently asked questions.
/ 01Is RAG always the alternative to fine-tuning?
No. RAG and fine-tuning solve different problems. RAG brings knowledge into the context — good for facts, documents, frequently changing content. Fine-tuning changes the model's behavior — good for tone, style, domain-specific language, structured outputs. Confusing knowledge with behavior leads to the wrong method.
/ 02What does fine-tuning an LLM realistically cost?
Heavily dependent on method and model size. LoRA on an 8B model runs on a single consumer GPU in a few hours for under €100 in electricity. Full fine-tuning of a 70B model needs a multi-node GPU cluster and four- to five-figure budgets. Engineering time for dataset curation and eval is usually more expensive than the GPU hours.
/ 03When is prompt engineering alone sufficient?
When the base model can do the task in principle, the needed information fits the context window, and the requirements stay stable. Example: structured data extraction from standardized forms, simple classification, short translation. As soon as you need your own data, large knowledge bases or specific language styles, prompt engineering isn't enough.
/ 04Is LoRA worth it compared to full fine-tuning?
Almost always. LoRA reaches 90–95% of the quality of full fine-tuning in most cases at a fraction of the cost and hardware. Full fine-tuning is only justified when the model's behavior should be fundamentally changed — rare in business contexts. Details in LoRA explained.
/ 05What about model distillation?
Distillation produces a smaller, cheaper model from a large, expensive one — imitating its style and answers. Useful when a use case is qualitatively solved by a big model but production needs to run cheaper. See Model distillation.
/ 06Can RAG and fine-tuning be combined?
Yes, and in many enterprise setups it's the best architecture. Fine-tuning shapes the model with style, format and domain language; RAG delivers current facts at runtime. Combined you get a domain-specific model that stays current — without retraining for every data update.