Fine-Tuning LLMs: When Is It Worth Adapting a Model?

Fine-tuning promises a domain-specific model — but costs time, money and data work. When the effort really pays off: cost calculation, data requirements, eval mandate, risks and hard decision rules from practice.

By createIF Labs
  • Fine-tuning
  • Model adaptation & training
  • Cost calculation
  • Data quality
  • Eval
[Figure: decision tree for choosing between LLM fine-tuning, RAG and prompt engineering. Starting from a concrete use case, it asks in turn about data quality, data volume, type of adaptation (behavior vs. knowledge), hardware budget and eval maturity. Each path ends in a concrete recommendation: prompt engineering, RAG, LoRA, full fine-tuning or distillation.]

Fine-tuning sounds like sovereignty: your own model that speaks your language, knows your data, runs on your infrastructure. In practice, many fine-tuning projects fail — not on the tech, but on wrong expectations, bad data, and missing evaluation. This article shows when fine-tuning is really the right lever, and when other methods reach the goal cheaper and more reliably.

1. Why fine-tuning often overpromises

Fine-tuning is often pitched as a universal solution: “We train a model on your data, then it knows your business.” In reality, fine-tuning first of all means work, cost and risk, plus a set of prerequisites without which the result cannot be used in production.

The most common error is confusing knowledge with behavior. Fine-tuning is not the right way to teach a model facts — that’s what RAG is for. Fine-tuning is the right way to teach a model style, format and domain language. Confusing these burns budget.

2. What fine-tuning really changes

Fine-tuning modifies the weights of an LLM, the internal parameters that determine its output behavior. This change affects:

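  • Style and tone. A model fine-tuned on legal texts writes in the phrasing and register of legal documents. That is a matter of style, not a guarantee that the content is legally correct.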
  • Format and structure. A model trained consistently on JSON output adheres to the format much more reliably.
  • Domain-specific vocabulary. Terms rare in pretraining are used more precisely.
  • Classification behavior. On tasks with defined categories, accuracy can be substantially raised.

What fine-tuning does not reliably change:

  • Factual knowledge in areas where the model saw little data during pretraining.
  • Context understanding over very long documents.
  • Truthfulness — fine-tuning does not eliminate hallucinations.

3. Five hard prerequisites

From our consulting practice: if any of these is missing, fine-tuning should be postponed.

  1. Clearly defined use case. One task, one input shape, one expected output. “Better answers” is not a use case.
  2. Your own data in sufficient quality and quantity. At least several hundred high-quality examples, cleanly labeled, consistently formatted.
  3. Eval suite before training. At least 30 real test cases with scoring criteria. Without eval, fine-tuning means flying blind.
  4. Hardware plan. GPU access (local or cloud), defined budgets, privacy posture clarified.
  5. Engineering discipline. Reproducible pipelines, versioned datasets, logging. A bash script doesn’t cut it.
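Prerequisite 2 can be checked mechanically before any GPU time is spent. The following is a minimal sketch, assuming training examples are stored as JSONL with two hypothetical fields, `prompt` and `completion`. The field names and the specific checks are illustrative assumptions, not a fixed standard.

```python
import json

REQUIRED_FIELDS = ("prompt", "completion")  # hypothetical schema, adjust to your data


def validate_examples(lines):
    """Return (valid_examples, problems) for an iterable of JSONL lines."""
    valid, problems, seen = [], [], set()
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((number, "not valid JSON"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        # A field counts as missing if it is absent, not a string, or blank.
        missing = [f for f in REQUIRED_FIELDS
                   if not isinstance(record.get(f), str) or not record[f].strip()]
        if missing:
            problems.append((number, "missing or empty: " + ", ".join(missing)))
            continue
        key = (record["prompt"].strip(), record["completion"].strip())
        if key in seen:
            problems.append((number, "duplicate example"))
            continue
        seen.add(key)
        valid.append(record)
    return valid, problems


sample = [
    '{"prompt": "Summarize clause 4.", "completion": "Clause 4 limits liability."}',
    '{"prompt": "Summarize clause 4.", "completion": "Clause 4 limits liability."}',
    '{"prompt": "Classify this ticket."}',
    'not json',
]
valid, problems = validate_examples(sample)
print(len(valid), "valid;", problems)
```

A check like this only catches formal defects. Whether the labels are correct and the examples representative still needs human review.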

4. What does fine-tuning realistically cost?

Three cost drivers:

  • Data work. The biggest line item — typically 60–80% of total effort. Collecting, cleaning, labeling, validating training data.
  • GPU hours. LoRA on 8B models: 10–100 euros per iteration. Full fine-tuning on 70B models: 1,000–10,000 euros per iteration. Multiple iterations are the rule.
  • Eval and deployment. Building the eval suite, packaging the model, setting up monitoring, defining rollback. Time, not hardware.

A first productive LoRA iteration is realistic at 5,000–25,000 euros total. Full fine-tuning of a 70B model with curated dataset and eval lands closer to 80,000–300,000 euros. The math only works if the added value is clearly measurable.
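Whether the math works can be made concrete with a break-even estimate. The sketch below is a back-of-the-envelope calculation: the function name, the monthly value and the simple linear payback model are assumptions for illustration. Only the project cost range is taken from the LoRA figures above.

```python
def break_even_months(project_cost_eur, monthly_value_eur):
    """Months until cumulative measurable value covers the project cost."""
    if monthly_value_eur <= 0:
        raise ValueError("no measurable value, so no break-even")
    return project_cost_eur / monthly_value_eur


# LoRA project cost range from the text; the monthly value is a hypothetical input.
low_cost, high_cost = 5_000, 25_000
monthly_value = 2_500  # hypothetical, e.g. saved manual review time in euros

best_case = break_even_months(low_cost, monthly_value)
worst_case = break_even_months(high_cost, monthly_value)
print(f"break-even between {best_case:.0f} and {worst_case:.0f} months")
```

If no defensible number can be entered for the monthly value, that is itself a sign the added value is not yet measurable.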

5. Fine-tuning vs. RAG

A pragmatic heuristic:

  • If the problem is “answers wrongly because knowledge is missing” → RAG.
  • If the problem is “answers stylistically wrong or formally inconsistent” → fine-tuning.
  • If both apply → try RAG first, add fine-tuning where needed.

In most enterprise setups RAG solves 60–80% of the issues originally labeled “fine-tuning needed.” Only the remaining 20–40% justify fine-tuning’s effort. See also RAG, fine-tuning or prompt engineering.
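The heuristic above is small enough to write down as a function. This is a sketch of the three rules as stated. The function name is hypothetical, and the fallback to prompt engineering when neither symptom applies is an assumption, not one of the three rules.

```python
def recommend(knowledge_missing: bool, behavior_wrong: bool) -> str:
    """Map the two symptoms from the heuristic to a first step."""
    if knowledge_missing and behavior_wrong:
        return "RAG first, then fine-tuning where needed"
    if knowledge_missing:
        return "RAG"
    if behavior_wrong:
        return "fine-tuning"
    # Assumption: with neither symptom, start with the cheapest lever.
    return "prompt engineering"


print(recommend(knowledge_missing=True, behavior_wrong=False))
```

In practice the hard part is the diagnosis, not the mapping: deciding whether a wrong answer comes from missing knowledge or from wrong behavior usually requires looking at concrete failing examples.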

6. Risks and pitfalls

  • Bad data quality. Fine-tuning on inconsistent, faulty or biased data produces a model that systematically reproduces those flaws. More in Why AI projects fail.
  • Catastrophic forgetting. Especially in full fine-tuning, the model can lose general capabilities it had before training. LoRA mitigates this risk.
  • Overfitting. Too small or too homogeneous datasets cause the model to memorize the training data and fail on new examples.
  • Becoming outdated. Facts correct today are outdated next year. Fine-tuning on such content locks you into a re-training cycle.
  • License and compliance risk. Open-weight models ship under different licenses (Llama license, Apache, MIT, or more restrictive terms). Clarify before fine-tuning whether the intended use is allowed.
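Overfitting only becomes visible on examples the model never trained on, so a held-out set has to be separated before training starts. The following is a minimal sketch of a deterministic split, assuming each example has a stable string ID. The function name, the ID format and the 10% share are illustrative assumptions.

```python
import hashlib


def is_held_out(example_id: str, holdout_percent: int = 10) -> bool:
    """Assign an example to the held-out set based on a hash of its ID."""
    digest = hashlib.sha256(example_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100 < holdout_percent


ids = [f"example-{i}" for i in range(1000)]  # hypothetical example IDs
train = [i for i in ids if not is_held_out(i)]
held_out = [i for i in ids if is_held_out(i)]
print(len(train), "training examples,", len(held_out), "held out")
```

Hashing the ID instead of shuffling keeps the assignment stable across dataset versions: an example that was held out once stays held out when new data is added.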

7. Three hard decision rules

Crystallized from 50+ consulting cases:

  1. No fine-tuning investment without an eval suite. Eval is not optional. Without it, success isn’t measurable.
  2. LoRA before full fine-tuning. LoRA is enough in 90% of cases. Full fine-tuning is the exception. Details in LoRA explained.
  3. RAG before fine-tuning. When both seem viable, RAG first — cheaper, more flexible, more maintainable.
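One reason LoRA is the cheaper first step is the number of trainable parameters. LoRA freezes a weight matrix of shape d_out × d_in and trains two low-rank factors of rank r instead, which gives r · (d_out + d_in) trainable values. The sketch below computes that ratio for a single matrix. The dimensions and rank are illustrative assumptions, not the configuration of any particular model.

```python
def lora_trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of trainable parameters for one LoRA-adapted weight matrix."""
    full = d_out * d_in            # parameters updated by full fine-tuning
    lora = rank * (d_out + d_in)   # parameters in the two low-rank factors
    return lora / full


fraction = lora_trainable_fraction(d_out=4096, d_in=4096, rank=16)
print(f"{fraction:.2%}")  # 0.78%
```

The figure covers one matrix only. The share for a whole model depends on which layers receive adapters.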

Fine-tuning in 2026 is not a secret weapon but a precise tool for a clearly bounded task. Used as a universal solution it burns budget. Used surgically, where preconditions hold, it builds a real competitive advantage — a model that speaks your language, runs on your infrastructure, and improves iteration by iteration. The path leads through clean data, sharp use cases, and a mature eval pipeline — not through vendor pitches.

Frequently asked questions.

  1. How many training examples do I need for fine-tuning?

    For LoRA fine-tuning, 500–5,000 high-quality examples are often enough. For full fine-tuning, 10,000–100,000 examples are typical. Quality beats volume: a smaller set of well-labeled, representative, consistently formatted examples outperforms a larger but noisier one.

  2. How do I tell if fine-tuning or RAG fits?

    Rule of thumb: RAG for knowledge, fine-tuning for behavior. If you need to provide frequently changing facts, RAG. If you want to enforce consistent tone, format, domain language or structured outputs, fine-tuning. Details in RAG, fine-tuning or prompt engineering.

  3. Is fine-tuning worthwhile for a small business?

    With LoRA: yes, often very much. Hardware requirements are moderate (one GPU suffices), and costs are a few hundred to a thousand euros per iteration. Preconditions: clearly defined task, your own example data, and an eval strategy.

  4. How quickly does fine-tuning become outdated?

    Behavioral adaptations (style, format, domain language) age slowly — often usable for years. Knowledge adaptations age fast because the model learns facts that change. The latter is the most common reason fine-tuning projects fail: knowledge was trained instead of behavior.

  5. Do I need my own GPUs for fine-tuning?

    No. Cloud providers like Hetzner, Together AI, Modal or Lambda Labs rent GPUs by the hour. In GPU rental alone, a LoRA iteration typically costs 10–50 euros. If you work with sensitive data, choose German or European providers (see Secure AI integration).

  6. How do I evaluate whether my fine-tuning succeeded?

    With an eval suite defined before training: 30–100 real test cases (30 at the very least), clear scoring criteria (rule-based or LLM-as-judge), and a side-by-side comparison against the base model. If you train without eval, you cannot tell whether the model got better or worse.
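A side-by-side comparison with rule-based scoring can be sketched in a few lines. Everything named here is hypothetical: `EvalCase`, `is_json_with_keys` and `pass_rate` are illustrative helpers, and the two model functions are stand-ins that return fixed strings where a real setup would call an inference client.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # rule-based scoring criterion


def is_json_with_keys(*keys):
    """Build a check that passes when the output is a JSON object with these keys."""
    def check(output: str) -> bool:
        try:
            data = json.loads(output)
        except json.JSONDecodeError:
            return False
        return isinstance(data, dict) and all(k in data for k in keys)
    return check


def pass_rate(model: Callable[[str], str], cases) -> float:
    """Fraction of cases whose output satisfies the case's check."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


# Stand-ins for real model calls; replace with your inference client.
def base_model(prompt: str) -> str:
    return "Sure! The category is billing."


def tuned_model(prompt: str) -> str:
    return '{"category": "billing", "priority": "high"}'


cases = [
    EvalCase("Classify: invoice is wrong", is_json_with_keys("category", "priority")),
    EvalCase("Classify: cannot log in", is_json_with_keys("category", "priority")),
]
base_score = pass_rate(base_model, cases)
tuned_score = pass_rate(tuned_model, cases)
print(f"base {base_score:.0%} vs. tuned {tuned_score:.0%}")
```

Running the same cases against both models is the point of the exercise: a pass rate for the fine-tuned model means little without the base model's number next to it.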
