Fine-tuning sounds like sovereignty: your own model that speaks your language, knows your data, runs on your infrastructure. In practice, many fine-tuning projects fail, and the cause is rarely the technology: it is wrong expectations, bad data, and missing evaluation. This article shows when fine-tuning is actually the right lever, and when other methods reach the goal more cheaply and more reliably.
1. Why fine-tuning often overpromises
Fine-tuning is often pitched as a universal solution: “We train a model on your data, then it knows your business.” In reality, fine-tuning means considerable work, cost, and risk, plus a set of prerequisites without which the result cannot be used in production.
The most common error is confusing knowledge with behavior. Fine-tuning is not the right way to teach a model facts — that’s what RAG is for. Fine-tuning is the right way to teach a model style, format and domain language. Confusing these burns budget.
2. What fine-tuning really changes
Fine-tuning modifies the weights of an LLM, the internal parameters that drive its output behavior. This change affects:
- Style and tone. A model fine-tuned on legal texts adopts the phrasing and register of legal writing.
- Format and structure. A model trained consistently on JSON output adheres to the format much more reliably.
- Domain-specific vocabulary. Terms rare in pretraining are used more precisely.
- Classification behavior. On tasks with defined categories, accuracy can be substantially raised.
What fine-tuning does not reliably change:
- Factual knowledge in areas the model saw little of during pretraining.
- Context understanding over very long documents.
- Truthfulness — fine-tuning does not eliminate hallucinations.
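The point about format adherence can be made measurable. The following is a minimal sketch, assuming the model is expected to return a JSON object with a fixed set of keys; the function name `json_adherence_rate`, the key names, and the sample outputs are hypothetical and only illustrate the idea.

```python
import json


def json_adherence_rate(outputs, required_keys):
    """Return the fraction of outputs that parse as a JSON object
    and contain every key in required_keys."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            parsed = json.loads(text)
        except json.JSONDecodeError:
            continue  # not valid JSON at all
        if isinstance(parsed, dict) and all(key in parsed for key in required_keys):
            valid += 1
    return valid / len(outputs)


# Hypothetical model outputs: two well-formed, one missing a key, one not JSON.
sample_outputs = [
    '{"category": "invoice", "confidence": 0.92}',
    '{"category": "contract", "confidence": 0.81}',
    '{"category": "invoice"}',
    "Sure! Here is the JSON you asked for: {...}",
]

rate = json_adherence_rate(sample_outputs, ["category", "confidence"])
print(f"format adherence: {rate:.0%}")  # prints "format adherence: 50%"
```

Running the same check on base-model and fine-tuned outputs for identical prompts gives a simple before/after number for the "format and structure" effect.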
3. Five hard prerequisites
From our consulting practice: if any of these is missing, fine-tuning should be postponed.
- Clearly defined use case. One task, one input shape, one expected output. “Better answers” is not a use case.
- Your own data in sufficient quality and quantity. At least several hundred high-quality examples, cleanly labeled, consistently formatted.
- Eval suite before training. At least 30 real test cases with scoring criteria. Without an eval suite, fine-tuning means flying blind.
- Hardware plan. GPU access (local or cloud), defined budgets, privacy posture clarified.
- Engineering discipline. Reproducible pipelines, versioned datasets, logging. A bash script doesn’t cut it.
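Two of these prerequisites, consistently formatted data and versioned datasets, can be checked mechanically. The sketch below assumes each training example is a record with an `input` and an `output` string; that schema and the helper names `validate_examples` and `dataset_fingerprint` are assumptions for illustration, not something prescribed here.

```python
import hashlib
import json

# Assumed record schema for illustration; the article does not prescribe one.
REQUIRED_FIELDS = ("input", "output")


def validate_examples(examples):
    """Return (clean, problems): normalized records plus (index, reason) pairs
    for records that are incomplete or exact duplicates."""
    clean, problems, seen = [], [], set()
    for index, example in enumerate(examples):
        missing = [
            field for field in REQUIRED_FIELDS
            if not isinstance(example.get(field), str) or not example[field].strip()
        ]
        if missing:
            problems.append((index, "missing or empty: " + ", ".join(missing)))
            continue
        key = tuple(example[field].strip() for field in REQUIRED_FIELDS)
        if key in seen:
            problems.append((index, "duplicate"))
            continue
        seen.add(key)
        clean.append(dict(zip(REQUIRED_FIELDS, key)))
    return clean, problems


def dataset_fingerprint(examples):
    """Stable SHA-256 hash of the dataset content, usable as a version id."""
    canonical = json.dumps(examples, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical raw data: one good record, one duplicate, one with empty output.
raw = [
    {"input": "Classify: invoice 4711", "output": "invoice"},
    {"input": "Classify: invoice 4711 ", "output": "invoice"},
    {"input": "Classify: lease agreement", "output": ""},
]

clean, problems = validate_examples(raw)
print(len(clean), "clean examples")
print(problems)
print("dataset version:", dataset_fingerprint(clean)[:12])
```

Logging the fingerprint with every training run ties each model back to the exact dataset it was trained on.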
4. What does fine-tuning realistically cost?
Three cost drivers:
- Data work. The biggest line item — typically 60–80% of total effort. Collecting, cleaning, labeling, validating training data.
- GPU hours. LoRA on 8B models: 10–100 euros per iteration. Full fine-tuning on 70B models: 1,000–10,000 euros per iteration. Multiple iterations are the rule.
- Eval and deployment. Building the eval suite, packaging the model, setting up monitoring, defining rollback. Time, not hardware.
A first productive LoRA iteration is realistic at 5,000–25,000 euros total. Full fine-tuning of a 70B model with curated dataset and eval lands closer to 80,000–300,000 euros. The math only works if the added value is clearly measurable.
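To see why GPU hours are rarely the dominant cost for LoRA, the per-iteration ranges above can be multiplied out. This is a back-of-the-envelope sketch; the number of iterations (5) is an assumption chosen for illustration, and the euro ranges are the ones quoted in this section.

```python
def gpu_cost_range(iterations, low_per_iteration, high_per_iteration):
    """Return (low, high) total GPU cost in euros for a number of iterations."""
    return iterations * low_per_iteration, iterations * high_per_iteration


# Per-iteration GPU cost ranges in euros, as quoted in the text.
LORA_8B = (10, 100)
FULL_70B = (1_000, 10_000)

iterations = 5  # assumption: five training iterations until a usable model

lora_low, lora_high = gpu_cost_range(iterations, *LORA_8B)
full_low, full_high = gpu_cost_range(iterations, *FULL_70B)

print(f"LoRA 8B, {iterations} iterations: {lora_low} to {lora_high} euros")
print(f"Full 70B, {iterations} iterations: {full_low} to {full_high} euros")

# Compared with the lower end of the 5,000 to 25,000 euro total for a first
# productive LoRA result, GPU time stays a small share under this assumption.
max_gpu_share = lora_high / 5_000
print(f"GPU share of a 5,000 euro LoRA project: at most {max_gpu_share:.0%}")
```

Under this assumption the GPU bill for LoRA stays at or below 10% of the smallest project total, which is consistent with data work being the biggest line item.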
5. Fine-tuning vs. RAG
A pragmatic heuristic:
- If the problem is “answers wrongly because knowledge is missing” → RAG.
- If the problem is “answers stylistically wrong or formally inconsistent” → fine-tuning.
- If both apply → try RAG first, add fine-tuning where needed.
In most enterprise setups RAG solves 60–80% of the issues originally labeled “fine-tuning needed.” Only the remaining 20–40% justify fine-tuning’s effort. See also RAG, fine-tuning or prompt engineering.
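The heuristic can be written down as a small decision function. This is only a sketch of the three rules above; the function name `recommend` and the fallback for the case where neither problem applies are assumptions added for completeness.

```python
def recommend(knowledge_missing: bool, style_or_format_wrong: bool) -> str:
    """Map the two problem types from the heuristic to a recommendation."""
    if knowledge_missing and style_or_format_wrong:
        return "RAG first, then fine-tuning where needed"
    if knowledge_missing:
        return "RAG"
    if style_or_format_wrong:
        return "fine-tuning"
    # Not covered by the heuristic; this fallback is an assumption.
    return "neither (heuristic does not apply)"


print(recommend(knowledge_missing=True, style_or_format_wrong=False))
print(recommend(knowledge_missing=False, style_or_format_wrong=True))
print(recommend(knowledge_missing=True, style_or_format_wrong=True))
```

Forcing a project to answer the two yes/no questions explicitly is often the useful part, more than the function itself.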
6. Risks and pitfalls
- Bad data quality. Fine-tuning on inconsistent, faulty or biased data produces a model that systematically reproduces those flaws. More in Why AI projects fail.
- Catastrophic forgetting. Especially in full fine-tuning, the model can lose general capabilities it had before training. LoRA mitigates this risk because the base weights stay frozen.
- Overfitting. Too small or too homogeneous datasets cause the model to memorize the training data and fail on new examples.
- Becoming outdated. Facts correct today are outdated next year. Fine-tuning on such content locks you into a re-training cycle.
- License and compliance risk. Open-weight models ship under different licenses (for example the Llama community license, Apache 2.0, MIT, or more restrictive custom terms). Clarify before fine-tuning whether the intended use is allowed.
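Overfitting in particular is easy to check for if part of the data is held out from training. The sketch below assumes that a score between 0 and 1 (for example accuracy) is available for both the training set and the held-out set; the helper names and the 0.10 gap threshold are illustrative assumptions, not established limits.

```python
import random


def train_val_split(examples, val_fraction=0.2, seed=42):
    """Shuffle deterministically and hold out a fraction for validation."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_val = max(1, int(len(shuffled) * val_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def looks_overfit(train_score, val_score, max_gap=0.10):
    """Flag a run whose training score exceeds the held-out score by more
    than max_gap. The threshold is an assumption to be tuned per task."""
    return (train_score - val_score) > max_gap


train_set, val_set = train_val_split(range(10))
print(len(train_set), "training examples,", len(val_set), "held out")

# Hypothetical scores from two training runs.
print(looks_overfit(train_score=0.98, val_score=0.71))  # large gap
print(looks_overfit(train_score=0.90, val_score=0.87))  # small gap
```

The fixed seed keeps the split reproducible, so scores from different iterations are compared on the same held-out examples.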
7. Three hard decision rules
Crystallized from 50+ consulting cases:
- No fine-tuning investment without an eval suite. Eval is not optional. Without it, success isn’t measurable.
- LoRA before full fine-tuning. In our experience, LoRA is enough in roughly 90% of cases. Full fine-tuning is the exception. Details in LoRA explained.
- RAG before fine-tuning. When both seem viable, RAG first — cheaper, more flexible, more maintainable.
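One reason for the "LoRA before full fine-tuning" rule is the number of parameters that have to be trained. LoRA keeps a weight matrix of shape d × k frozen and trains two small matrices of shapes d × r and r × k instead, so it trains r · (d + k) values rather than d · k. The sketch below computes that ratio; the layer size 4096 × 4096 and the rank 16 are assumed example values, not figures from this article.

```python
def lora_trainable_params(d, k, r):
    """Trainable parameters of a rank-r LoRA adapter on a d x k weight matrix."""
    return r * (d + k)


def full_trainable_params(d, k):
    """Trainable parameters when the d x k weight matrix itself is updated."""
    return d * k


# Assumed example values for a single square projection layer.
d = k = 4096
r = 16

lora = lora_trainable_params(d, k, r)
full = full_trainable_params(d, k)

print(f"LoRA adapter: {lora:,} parameters")
print(f"Full matrix:  {full:,} parameters")
print(f"LoRA trains {lora / full:.2%} of the layer")  # prints 0.78%
```

Fewer trainable parameters mean less GPU memory for gradients and optimizer state, which is what makes single-GPU runs feasible.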
Fine-tuning in 2026 is not a secret weapon but a precise tool for a clearly bounded task. Used as a universal solution it burns budget. Used surgically, where preconditions hold, it builds a real competitive advantage — a model that speaks your language, runs on your infrastructure, and improves iteration by iteration. The path leads through clean data, sharp use cases, and a mature eval pipeline — not through vendor pitches.
Frequently asked questions
/ 01 How many training examples do I need for fine-tuning?
For LoRA fine-tuning, 500–5,000 high-quality examples are often enough. For full fine-tuning, 10,000–100,000 examples are typical. Quality beats volume: a smaller set of well-labeled, representative, consistently formatted examples usually outperforms a larger but noisy one.
/ 02 How do I tell if fine-tuning or RAG fits?
Rule of thumb: RAG for knowledge, fine-tuning for behavior. If you need to provide frequently changing facts, RAG. If you want to enforce consistent tone, format, domain language or structured outputs, fine-tuning. Details in RAG, fine-tuning or prompt engineering.
/ 03 Is fine-tuning worthwhile for a small business?
With LoRA: yes, often very much. Hardware requirements are moderate (one GPU suffices), and costs are a few hundred to a thousand euros per iteration. Preconditions: clearly defined task, your own example data, and an eval strategy.
/ 04 How quickly does fine-tuning become outdated?
Behavioral adaptations (style, format, domain language) age slowly — often usable for years. Knowledge adaptations age fast because the model learns facts that change. The latter is the most common reason fine-tuning projects fail: knowledge was trained instead of behavior.
/ 05 Do I need my own GPUs for fine-tuning?
No. Cloud providers like Hetzner, Together AI, Modal or Lambda Labs rent GPUs by the hour. In GPU rental alone, a LoRA iteration typically costs 10–50 euros. If you work with sensitive data, choose German or European providers; see Secure AI integration.
/ 06 How do I evaluate whether my fine-tuning succeeded?
With an eval suite defined before training: 30 to 100 real test cases (30 at minimum), clear scoring criteria (rule-based or LLM-as-judge), and a side-by-side comparison against the base model. Without an eval suite, you cannot tell whether the model got better or worse.
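A side-by-side comparison can start very small. The following sketch assumes each model is available as a Python callable that maps a prompt to a text answer and uses a rule-based exact-match scorer; `compare_models`, `exact_match`, the stand-in models, and the test cases are all hypothetical placeholders for a real setup.

```python
def exact_match(output, expected):
    """Rule-based scorer: 1.0 if the answer matches the expected label."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def compare_models(test_cases, base_model, tuned_model, score):
    """Score both models on every (prompt, expected) pair and count outcomes."""
    result = {"tuned_better": 0, "base_better": 0, "tie": 0}
    for prompt, expected in test_cases:
        base_score = score(base_model(prompt), expected)
        tuned_score = score(tuned_model(prompt), expected)
        if tuned_score > base_score:
            result["tuned_better"] += 1
        elif base_score > tuned_score:
            result["base_better"] += 1
        else:
            result["tie"] += 1
    return result


# Hypothetical test cases; a real suite would hold 30 or more real examples.
test_cases = [
    ("Classify: invoice 4711", "invoice"),
    ("Classify: lease agreement", "contract"),
    ("Classify: payment reminder", "reminder"),
]

# Stand-in models so the sketch runs without any model or GPU.
tuned_answers = {
    "Classify: invoice 4711": "invoice",
    "Classify: lease agreement": "contract",
    "Classify: payment reminder": "invoice",
}


def base_model(prompt):
    return "invoice"  # stand-in: always answers with the same label


def tuned_model(prompt):
    return tuned_answers.get(prompt, "")


result = compare_models(test_cases, base_model, tuned_model, exact_match)
print(result)  # {'tuned_better': 1, 'base_better': 0, 'tie': 2}
```

Swapping `exact_match` for another scorer, including an LLM-as-judge call, leaves the comparison loop unchanged.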