Training a big model is expensive. Running a big model in production is even more expensive — every day, every token. Model distillation solves that: it transfers the capabilities of a large teacher model to a smaller, cheaper student model. What began as a research technique in 2018 is in 2026 a productive strategy for anyone running LLMs at volume. This article explains when the effort really pays off.
1. Why distillation
Three drivers:
- Inference cost. A 70B model is 5–10× more expensive per token than a 7B. With millions of daily queries that quickly becomes a four- to six-figure monthly line item. More in LLM inference.
- Latency. A small model answers in 100–500 ms. A big one needs 1–5 seconds. For interactive applications that’s a decisive UX gap.
- Hardware sovereignty. A 7B model runs on a single consumer GPU or even a Mac. A 70B needs data-center hardware. For on-premise and edge deployments this is a major difference.
2. How distillation works
The basic principle is simple:
- Teacher generates data. The large model answers a set of inputs — produces answers, reasoning traces, explanations.
- Training dataset from teacher outputs. Input + teacher answer becomes a training pair.
- Student is trained on it. A smaller model learns to reproduce the teacher’s answers — usually with standard cross-entropy loss or additional loss terms.
The inputs can come from real data or be self-generated (prompts synthesized to diversify the dataset). More in Synthetic data.
3. Three variants: soft, hard, behavioral
Soft distillation. The student learns not only the final answer but the teacher’s full probability distribution over tokens. Transfers finer nuance. Prerequisite: access to teacher logits — often unavailable with closed-API teachers.
Hard distillation. The student learns only the final token sequence. Works with closed-API teachers since no logits are needed. Slightly lossier, but practical.
Behavioral / black-box distillation. Extends hard distillation with behavioral aspects: the student is trained to follow the teacher’s multi-turn interactions, tool-use patterns, or reasoning traces. Especially relevant for agents and reasoning models. See Reasoning models.
4. Practical examples in 2026
- OpenAI o1-mini, o3-mini. Distillates of their larger reasoning models, much cheaper, comparable on many tasks.
- DeepSeek-R1 distillates. Open-weight models (1.5B, 7B, 14B, 32B) bringing R1 reasoning behavior into smaller models. Freely available.
- Llama-3.1-8B as workhorse. Many companies distill GPT-4 answers into 8B Llamas for specific domains — on-premise capable, cheap in production.
- Code distillation. Models like DeepSeek-Coder-V2-Lite or Qwen-Coder distillates enable local coding assistants without API calls.
5. When distillation pays off
A pragmatic heuristic: distillation pays off when:
- High request volumes. Above roughly 100,000 requests per month the training cost amortizes quickly.
- Bounded domain. When the task covers a narrow area (support questions, code in a framework, classification), the student can almost fully imitate the teacher.
- Existing eval suite. Without eval, success isn’t measurable. With eval, distillation is controllable.
- License question clarified. Open-weight teachers (Llama, Mistral, DeepSeek) are much less problematic than closed-API teachers.
When request volumes are low or the task is very open, distillation often doesn’t pay off — training costs don’t amortize.
6. Typical pitfalls
- Bad data curation. A teacher can hallucinate or err. Those errors are learned by the student. Filter and validation pipeline mandatory.
- Student too small. If the student is too small it can’t reproduce the teacher’s behavior. Check minimum size.
- Lack of dataset diversity. Narrow datasets hurt generalization. Synthetic diversification helps.
- Missing edge-case eval. A distilled model can shine on standard inputs and fail on edge cases. Eval must include edge cases — see Guardrails, evals and prompt injection.
- License question forgotten. Using outputs of a commercial API to train a competing open-source model can be legally problematic.
7. Recommended approach
A proven four-step path:
- Define use case and eval. What should the student do? What metrics?
- Generate teacher dataset. 50,000–500,000 examples, ideally diversified. Filter for consistency and correctness.
- Train student with LoRA or full fine-tuning. Depending on model size and hardware. Side-by-side vs. teacher.
- Eval, iterate, deploy. Clear acceptance criteria. If the student stays too weak: larger student or better teacher dataset.
Model distillation in 2026 is no longer experimental but a cost-efficient standard strategy for productive volume LLM setups. Running big models in production without considering distillation leaves double-digit efficiency on the table. Doing it structurally — with clean data, eval, and license clarity — builds a long-term operable AI stack at a fraction of vendor costs.
Frequently asked questions.
/ 01What's the difference between distillation and fine-tuning?
Fine-tuning adapts a model to a task using human-labeled or curated data. Distillation trains a smaller model to imitate a larger model's answers — the training data comes from the teacher model, not humans. Distillation is a special form of fine-tuning with synthetic data.
/ 02How much quality does a distilled model lose?
Heavily task- and size-dependent. On well-bounded tasks, a 7B student can reach 90–98% of a 70B teacher's quality. On open or reasoning-heavy tasks the loss is larger — often 5–15%. An eval suite is mandatory to measure the actual loss.
/ 03Which models work as teachers?
Any model whose answers are qualitatively convincing. In practice, large frontier models (GPT-4 class, Claude, Gemini, DeepSeek-V3, Llama-405B). Ensemble setups with multiple teachers imitated by one student are also possible.
/ 04Is distillation legally problematic?
Yes, a serious question. Many commercial APIs prohibit using their outputs to train competing models. Open-weight models like Llama, Mistral, and DeepSeek usually allow it explicitly. License terms must be checked before every distillation project.
/ 05Can distillation be combined with LoRA?
Yes, a popular combination. Distillation produces a strong base student. Then LoRA fine-tunes for specific domains or use cases. Details in LoRA explained.
/ 06How much training effort does distillation need?
Lower than pretraining, higher than pure fine-tuning. Typically 50,000–500,000 teacher-generated examples. GPU hours for student training: similar to a normal fine-tune. Main cost: API calls or inference hours on the teacher for data generation.