Instruction Tuning, RLHF and DPO: Aligning LLMs With Human Preferences

A raw pretrained LLM is not a useful assistant. Instruction tuning, RLHF, DPO and newer methods turn it into one. This article covers how these methods work, how they differ, and which trade-offs they carry, for teams that want to train models themselves.

By createIF Labs
  • RLHF
  • DPO
  • Instruction tuning
  • Model adaptation & training
  • Alignment
Pipeline: pretraining → SFT → preference dataset → RLHF or DPO → aligned model
Visualization of the typical modern LLM pipeline: pretraining on raw text, followed by supervised fine-tuning (SFT) on instruction data, then producing a preference dataset (chosen/rejected pairs). From these the model is aligned to human preferences either via RLHF (with reward model and PPO) or via DPO (directly, no separate reward model).

A raw pretrained LLM is impressive but not directly usable. It can continue text but doesn’t reliably answer questions, follow instructions, or avoid harm. Turning it into a useful, safe assistant happens through alignment — a training pipeline of instruction tuning, preference data, and optimization methods like RLHF or DPO. This article explains the methods for teams that want to align models themselves.

1. Why a raw LLM isn’t directly usable

In pretraining, an LLM learns to predict the next token across billions of documents from the web, books, and code. The result is powerful: world knowledge, language understanding, implicit reasoning. But it never learned to follow instructions. Ask it for a summary and it might comply, or it might simply continue the prompt, because continuing text is all that pretraining rewarded.

Three properties are missing in a raw model:

  • Instruction following. Actually answer the question, not keep talking.
  • Helpfulness. Concrete and useful, not evasive.
  • Safety. Refuse harmful requests, reduce hallucinations.

These are trained during alignment.

2. Instruction tuning (SFT) — the foundation

Supervised fine-tuning (SFT) is the first step: the model is trained on example pairs of instruction and desired response. Datasets like Alpaca, Dolly, OpenOrca or UltraChat contain between tens of thousands and several million such examples.

Effect:

  • The model follows instructions.
  • It answers in the expected format.
  • It adopts the style and tone of the training data.

Important hyperparameters: a relatively low learning rate, few epochs (1–3), and cross-entropy loss computed only on the response tokens (not the instruction). With LoRA, SFT is often doable on a single GPU; see LoRA explained for details.
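To make "loss on the response portion only" concrete, here is a minimal sketch in plain Python. It assumes the common convention of marking ignored positions with the label -100 (the default `ignore_index` of PyTorch's cross-entropy loss). The token ids and per-token losses are made-up illustration values, and the one-position label shift that causal-LM training needs is omitted because training frameworks typically handle it internally.

```python
IGNORE_INDEX = -100  # assumption: the usual "skip this position" label value

def build_sft_example(prompt_ids, response_ids, ignore_index=IGNORE_INDEX):
    """Concatenate prompt and response; mask the prompt positions in the labels."""
    input_ids = list(prompt_ids) + list(response_ids)
    labels = [ignore_index] * len(prompt_ids) + list(response_ids)
    return {"input_ids": input_ids, "labels": labels}

def masked_mean_nll(token_nlls, labels, ignore_index=IGNORE_INDEX):
    """Average per-token negative log-likelihood over response tokens only."""
    kept = [nll for nll, label in zip(token_nlls, labels) if label != ignore_index]
    if not kept:
        raise ValueError("no response tokens to train on")
    return sum(kept) / len(kept)

# Hypothetical token ids: three prompt tokens, two response tokens.
example = build_sft_example(prompt_ids=[11, 12, 13], response_ids=[21, 22])

# Toy per-token losses; the three prompt positions do not contribute.
loss = masked_mean_nll([5.0, 5.0, 5.0, 1.0, 3.0], example["labels"])
```

The high losses on the prompt tokens are ignored, so the model is not trained to reproduce instructions, only to answer them.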

SFT alone is often not enough. The model learns to imitate good answers but gets no signal about which of its own outputs are worse, so it has no mechanism for steering away from unwanted ones. That requires preference data.

3. RLHF — Reinforcement Learning from Human Feedback

RLHF was the original breakthrough behind InstructGPT and later ChatGPT. Three phases:

  1. SFT. As above.
  2. Train a reward model. Humans rate answer pairs (which is better?). A reward model learns to score answers numerically.
  3. PPO optimization. The LLM is trained via Proximal Policy Optimization to produce answers the reward model rates highly, with a regularization term (typically a KL penalty) that keeps it from drifting too far from the SFT model.
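Phases 2 and 3 can be sketched numerically. The snippet below is an illustration under stated assumptions, not a training loop: it shows the pairwise (Bradley-Terry style) loss commonly used for reward models, and a reward that is reduced by a penalty for drifting from the SFT model. The function names and the `kl_coef` value are hypothetical, and the exact penalty form varies between implementations.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def reward_model_loss(score_chosen, score_rejected):
    """Pairwise loss: small when the chosen answer scores above the rejected one."""
    return -log_sigmoid(score_chosen - score_rejected)

def kl_shaped_reward(reward, policy_logprob, sft_logprob, kl_coef=0.1):
    """Reward-model score minus a penalty for moving away from the SFT model.

    The log-probability difference is a single-sample estimate of the KL term.
    """
    return reward - kl_coef * (policy_logprob - sft_logprob)

# Reward model agrees with the human label: low loss.
agree = reward_model_loss(score_chosen=2.0, score_rejected=0.0)
# Reward model disagrees with the human label: high loss.
disagree = reward_model_loss(score_chosen=0.0, score_rejected=2.0)

# Same raw reward, but the second answer is far more likely under the
# policy than under the SFT model, so its effective reward is lower.
no_drift = kl_shaped_reward(1.0, policy_logprob=-5.0, sft_logprob=-5.0)
drifted = kl_shaped_reward(1.0, policy_logprob=-2.0, sft_logprob=-5.0)
```

PPO then maximizes the shaped reward, which is why tuning the penalty coefficient matters: too low invites reward hacking, too high leaves the model close to its SFT starting point.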

Pros: high quality, robust, well established. Cons: complex (three training phases), unstable (PPO is sensitive to hyperparameters), expensive (the policy, a frozen reference copy, the reward model and usually a value model are all held in memory).

RLHF remains relevant for frontier models in 2026 but is overkill for most enterprise applications.

4. DPO and its relatives

DPO (Direct Preference Optimization), introduced in 2023, skips the separate reward model. The loss is built directly from preference pairs: the model is trained to raise the likelihood of the chosen answer relative to the rejected one, measured against a frozen reference model (usually the SFT model). Mathematically elegant, and empirically comparable to RLHF on many tasks, sometimes better.
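The DPO loss for a single preference pair fits in a few lines. This is a minimal sketch in plain Python, assuming the inputs are summed log-probabilities of each full answer under the trained policy and under the frozen reference model. The numbers are made up and `beta=0.1` is an illustrative setting, not a recommendation.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair."""
    # How much more (or less) likely each answer became relative to the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # The loss falls as the chosen answer gains on the rejected one.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)

# Policy identical to the reference: margin 0, loss log(2).
start = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has shifted probability toward the chosen answer: lower loss.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

Because the reference log-probabilities come from a frozen model, they can be computed once per pair and cached, which is part of why DPO training is lighter than PPO.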

Pros:

  • Easier to implement.
  • More stable training, no PPO tuning.
  • Less memory (no separate reward or value model; only a frozen reference copy of the SFT model).

Related methods in 2026:

  • IPO. Identity Preference Optimization, a regularized variant designed to curb overfitting to the preference data.
  • KTO. Kahneman-Tversky Optimization, learns from single ratings instead of pairs.
  • ORPO. Odds Ratio Preference Optimization, combines SFT and preference in one step.
  • SimPO. Simple Preference Optimization, drops the reference model and scores answers by length-normalized log-probability.
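Two of these variants differ from DPO only in the loss applied to a preference pair, which a short sketch can show. The code below is written from the published formulations as I understand them and is not taken from any library. Parameter names and default values are illustrative assumptions.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def ipo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, tau=0.1):
    """IPO-style loss: pull the log-ratio margin toward 1 / (2 * tau).

    Unlike a sigmoid loss, the squared term stops rewarding ever-larger margins.
    """
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    return (margin - 1.0 / (2.0 * tau)) ** 2

def simpo_loss(policy_chosen_logp, policy_rejected_logp,
               chosen_len, rejected_len, beta=2.0, gamma=1.0):
    """SimPO-style loss: no reference model, length-normalized log-probabilities."""
    chosen_reward = beta * policy_chosen_logp / chosen_len
    rejected_reward = beta * policy_rejected_logp / rejected_len
    # gamma is a target margin the chosen answer should win by.
    return -log_sigmoid(chosen_reward - rejected_reward - gamma)

# Margin of 5 equals the IPO target 1 / (2 * 0.1), so the loss is zero.
ipo_at_target = ipo_loss(-5.0, -12.0, -8.0, -10.0, tau=0.1)
# Two 10-token answers; the chosen one has the higher average log-probability.
simpo_example = simpo_loss(-10.0, -20.0, chosen_len=10, rejected_len=10)
```

The practical difference is operational: the SimPO-style loss needs no forward pass through a reference model, while the IPO-style loss keeps the reference but bounds how far the policy is pushed.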

For most enterprise applications DPO is the pragmatic standard. Teams chasing the last percent of quality can try ORPO or SimPO.

5. Constitutional AI and RLAIF

Human feedback is expensive and scales poorly. Constitutional AI (Anthropic) and RLAIF (Reinforcement Learning from AI Feedback) partly replace human annotators with models that rate answers against a written “constitution”: a list of principles such as “Be helpful. Avoid harm. Be honest.”

Pros:

  • Scalable.
  • Transparent (principles are explicit).
  • Iterative (principles can be adjusted without new annotations).

Useful for enterprises: your own constitution can encode domain-specific values and constraints (e.g. sector-specific compliance). That makes alignment more precisely adjustable than pure human feedback.
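A rough sketch of how a constitution can drive preference labeling: each candidate answer is scored against every principle by a judge, and the higher-scoring answer becomes the "chosen" one. The `toy_judge` below is a hypothetical keyword heuristic standing in for a real judge-model call, and the record layout (`prompt`, `chosen`, `rejected`) is an assumed convention.

```python
# Principles quoted from the example constitution in the text.
CONSTITUTION = ["Be helpful.", "Avoid harm.", "Be honest."]

def toy_judge(principle, prompt, answer):
    """Hypothetical placeholder for a judge model; returns a score in [0, 1]."""
    if principle == "Avoid harm.":
        return 0.0 if "bypass" in answer.lower() else 1.0
    return 1.0 if answer.strip() else 0.0

def label_pair(prompt, answer_a, answer_b, judge, constitution=CONSTITUTION):
    """Turn two candidate answers into a preference record, or None on a tie."""
    score_a = sum(judge(p, prompt, answer_a) for p in constitution)
    score_b = sum(judge(p, prompt, answer_b) for p in constitution)
    if score_a == score_b:
        return None  # no clear preference: drop the pair
    if score_a > score_b:
        chosen, rejected = answer_a, answer_b
    else:
        chosen, rejected = answer_b, answer_a
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

record = label_pair(
    "How do I reset a user's password?",
    "Use the admin console and follow the reset policy.",
    "Just bypass the audit log.",
    judge=toy_judge,
)
```

Changing a principle only requires re-running the labeling step, which is the "iterative" advantage listed above.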

6. Practice: your own alignment pipeline

A workable mini-pipeline for your own use cases:

  1. Choose a base model. Open weights with a commercially usable license (for example Llama 3.x, Mistral, Qwen; check the terms of the specific release).
  2. Build an SFT dataset. 1,000–10,000 high-quality instruction-response pairs from your domain.
  3. SFT with LoRA. One to two iterations often already yield a noticeable quality jump.
  4. Generate preference data. Sample two answers per prompt from the SFT model and have experts rate them (or a stronger model whose ratings are checked against your evals).
  5. DPO. 1,000–5,000 preference pairs are often enough. LoRA-DPO fits on one GPU.
  6. Eval. Domain-specific tests, side-by-side vs. base. See Guardrails, evals and prompt injection.
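Steps 4 and 6 mostly come down to bookkeeping, sketched below in plain Python. The field names (`prompt`, `answers`, `ratings`, `chosen`, `rejected`) and the verdict labels are assumptions for illustration, and the sample texts are invented.

```python
import json

def build_preference_records(samples):
    """Convert rated answer pairs into chosen/rejected records; ties are skipped."""
    records = []
    for sample in samples:
        answer_a, answer_b = sample["answers"]
        rating_a, rating_b = sample["ratings"]
        if rating_a == rating_b:
            continue  # a tie carries no preference signal
        if rating_a > rating_b:
            chosen, rejected = answer_a, answer_b
        else:
            chosen, rejected = answer_b, answer_a
        records.append({"prompt": sample["prompt"],
                        "chosen": chosen, "rejected": rejected})
    return records

def win_rate(verdicts):
    """Side-by-side result: share of prompts the tuned model wins (ties count half)."""
    if not verdicts:
        raise ValueError("no verdicts")
    points = {"tuned": 1.0, "tie": 0.5, "base": 0.0}
    return sum(points[v] for v in verdicts) / len(verdicts)

samples = [
    {"prompt": "Summarize clause 4.",
     "answers": ["Short summary.", "Off-topic text."], "ratings": [5, 2]},
    {"prompt": "Explain the refund rule.",
     "answers": ["Answer A.", "Answer B."], "ratings": [3, 3]},
]
records = build_preference_records(samples)
jsonl = "\n".join(json.dumps(r) for r in records)  # one JSON object per line

tuned_win_rate = win_rate(["tuned", "tuned", "base", "tie"])
```

Dropping ties keeps the preference set clean, at the cost of some data. With only a few thousand pairs, label consistency usually matters more than volume.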

Synthetic data can accelerate this pipeline — see Synthetic data.

7. Limits and open questions

Alignment has limits:

  • Goodharting. When optimizing on a metric, the model can exploit gaps without truly improving.
  • Reward hacking. Models can learn to fool the reward model instead of producing genuinely good answers.
  • Helpfulness vs. harmlessness. A trade-off: the more readily a model helps, the easier it often is to misuse.
  • Distribution shift. Alignment on training data doesn’t transfer perfectly to real applications.
  • Interpretability. Even aligned models stay black boxes — see Mechanistic interpretability.

Still, building your own alignment pipeline is achievable and worthwhile in 2026. An open-weight base model plus 5,000–20,000 high-quality training and preference examples can yield a domain-specific assistant that speaks your language, knows your rules, and respects your refusal criteria, fully under your control. That is the operational answer to vendor lock-in and closed-source black boxes.

Frequently asked questions.

  1. What distinguishes instruction tuning from pretraining?

    Pretraining trains the model on raw text: Wikipedia, books, code, web pages. The result is a model that can continue sentences but does not follow instructions. Instruction tuning (SFT) then trains the model on examples of instruction and desired response so that it learns to follow them. Only after this step is the model a usable assistant.

  2. What does RLHF mean concretely?

    Reinforcement Learning from Human Feedback: humans compare model answers pairwise (which is better?). A reward model is trained on these comparisons. The LLM is then optimized with reinforcement learning (typically PPO) to maximize the reward model's score, i.e., to produce answers humans prefer.

  3. Why is DPO simpler than RLHF?

    DPO (Direct Preference Optimization) skips the separate reward model and the PPO phase. Instead, a preference loss is applied directly to the LLM, with a frozen reference model as the anchor. Result: more stable training, easier implementation, and often quality comparable to RLHF. It has been the pragmatic standard since 2024.

  4. Which open-source tools exist for alignment?

    Hugging Face TRL (Transformer Reinforcement Learning) for SFT, DPO, KTO and PPO-based RLHF. Axolotl for end-to-end training. OpenRLHF for scaled RLHF. Unsloth for efficient single-GPU training. All are usable in production with open-weight models in 2026.

  5. Do I need lots of training data for my own alignment?

    With DPO, often surprisingly little: 1,000–10,000 preference pairs can suffice for a specific domain. For SFT, 500–50,000 examples. Quality beats quantity: high-quality, consistent data outperforms any volume of mediocre data.

  6. What is Constitutional AI?

    Anthropic's approach uses a written constitution (a set of principles) to replace much of the human feedback. A model rates or revises answers against these principles, producing the training dataset. Result: scalable alignment with far less of the human-annotation bottleneck. The reinforcement-learning stage that trains on these AI-generated preferences is known as RLAIF (Reinforcement Learning from AI Feedback).
