LoRA is the best-known parameter-efficient adaptation method, but far from the only one. Under the umbrella PEFT — parameter-efficient fine-tuning — sit several methods that adapt LLMs without changing the billions of model weights. Anyone looking for the right setup for their company should know the options. This article explains them.
1. Why parameter-efficient fine-tuning
Full fine-tuning of a 70B model needs hundreds of GB of VRAM and days of GPU time. For most enterprise applications that is overkill, both in cost and in risk (catastrophic forgetting). PEFT addresses this: only a tiny fraction of the parameters is trained while the base model stays frozen. The result is training on far less hardware (often a single GPU), multiple specializations on one base model, and fast iteration.
For the motivation in depth, see LoRA explained and Fine-tuning when worth it.
2. Classical adapters
The original idea before LoRA: insert small extra layers, called adapter modules, inside each Transformer block, typically after the attention and feed-forward sublayers. Each module consists of a down-projection, a nonlinearity, and an up-projection, wrapped in a residual connection. Only the adapter layers are trained.
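The bottleneck structure can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the GELU nonlinearity, the toy dimensions, the zero-initialized up-projection, and all names (`adapter_forward`, `W_down`, `W_up_zero`) are assumptions chosen for the example, and biases are omitted.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def gelu(v):
    """Exact GELU nonlinearity via the error function (an assumed choice)."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def adapter_forward(x, W_down, W_up):
    """Bottleneck adapter: down-project, nonlinearity, up-project, add skip."""
    h = [gelu(v) for v in matvec(W_down, x)]      # d_model -> bottleneck
    delta = matvec(W_up, h)                       # bottleneck -> d_model
    return [xi + di for xi, di in zip(x, delta)]  # residual connection

d_model, bottleneck = 4, 2                        # toy sizes for illustration
x = [1.0, -2.0, 0.5, 3.0]
W_down = [[0.1, 0.2, -0.1, 0.0],
          [0.0, -0.3, 0.2, 0.1]]                  # shape (bottleneck, d_model)
W_up_zero = [[0.0] * bottleneck for _ in range(d_model)]  # shape (d_model, bottleneck)

# With a zero up-projection the adapter starts out as the identity function,
# so inserting it does not disturb the frozen base model before training.
y = adapter_forward(x, W_down, W_up_zero)

# Trainable weights per adapter in this sketch (no biases).
adapter_params = 2 * d_model * bottleneck
print(y, adapter_params)
```

Because the nonlinearity sits between the two projections, the adapter cannot be folded into a neighboring weight matrix, which is why the extra layers stay in the forward pass.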
Pros:
- Modular: adapters can be stored and combined separately.
- Preserves base model fully.
Cons:
- Latency overhead: extra layers in every forward pass.
- Harder to merge than LoRA.
In 2026, classical adapters (Houlsby, Pfeiffer) matter only in special cases; most setups choose LoRA instead.
3. LoRA and variants
LoRA places the adapter logic parallel to the weight matrices instead of between layers: two small matrices, B (d_out × r) and A (r × d_in), approximate the update to a frozen weight matrix W as the low-rank product B·A, scaled by alpha/r. After training, the scaled product can be added into W, so the merged model has zero inference overhead.
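The merge step can be checked numerically. The sketch below uses tiny hand-picked matrices and plain Python lists; the names (`W`, `B`, `A`, `scale`) and all values are illustrative assumptions. It shows that running the frozen path and the low-rank path side by side gives the same output as a single merged matrix.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def matmul(X, Y):
    """Multiply matrix X (n x k) by matrix Y (k x m)."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

d_out, d_in, r, alpha = 3, 4, 2, 4
scale = alpha / r                      # standard LoRA scaling factor

W = [[1.0, 0.0, 2.0, -1.0],
     [0.5, 1.0, 0.0, 0.0],
     [0.0, -1.0, 1.0, 2.0]]            # frozen base weights, (d_out, d_in)
B = [[1.0, 0.0],
     [0.0, 2.0],
     [-1.0, 1.0]]                      # trainable, (d_out, r)
A = [[0.5, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, -0.5]]            # trainable, (r, d_in)
x = [1.0, 2.0, 3.0, 4.0]

# Unmerged: frozen path plus scaled low-rank path, computed separately.
low_rank = matvec(B, matvec(A, x))
y_unmerged = [b + scale * l for b, l in zip(matvec(W, x), low_rank)]

# Merged: fold the scaled product B*A into W once, then do a single matmul.
BA = matmul(B, A)
W_merged = [[W[i][j] + scale * BA[i][j] for j in range(d_in)]
            for i in range(d_out)]
y_merged = matvec(W_merged, x)

# Parameter count for an assumed 4096 x 4096 projection at rank 16.
full_params = 4096 * 4096
lora_params = 16 * (4096 + 4096)
print(y_unmerged, y_merged, lora_params / full_params)
```

At realistic sizes the low-rank pair is well under 1% of the matrix it adapts; at the toy sizes above it is not, which is why the parameter count is computed for the larger assumed shape.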
Important variants:
- LoRA. Standard form, rank typically 8–64.
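- DoRA. Decomposes each pretrained weight matrix into magnitude and direction; the direction is updated with LoRA and the magnitude is trained as a separate vector. Slightly better quality.
- rsLoRA. Scales the update by alpha/√r instead of alpha/r, which keeps training stable at high rank.
- PiSSA. Initializes A and B from the principal singular vectors and values of the base matrix (via SVD) and freezes the residual, for faster convergence.
- VeRA. Shares a single pair of frozen random matrices A and B across layers and trains only small per-layer scaling vectors, which massively reduces the parameter count.
For most use cases standard LoRA is enough. The variants bring roughly 1–5% better quality at the cost of higher complexity, which pays off mainly in mature setups.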
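To make the rsLoRA difference concrete: standard LoRA multiplies the low-rank update by alpha/r, while rsLoRA uses alpha/√r. The short sketch below compares both factors for a fixed alpha; the function names and the chosen ranks are illustrative assumptions.

```python
import math

def lora_scale(alpha, r):
    """Standard LoRA scaling factor."""
    return alpha / r

def rslora_scale(alpha, r):
    """Rank-stabilized scaling factor used by rsLoRA."""
    return alpha / math.sqrt(r)

alpha = 32
for r in (8, 16, 64, 256):
    print(f"r={r:>3}  LoRA: {lora_scale(alpha, r):.3f}  "
          f"rsLoRA: {rslora_scale(alpha, r):.3f}")
```

With standard scaling the factor shrinks quickly as the rank grows, so high-rank adapters contribute less per unit of alpha; the square-root version decays more slowly. In recent versions of the Hugging Face PEFT library, these variants are switched on through `LoraConfig` flags such as `use_rslora=True` and `use_dora=True`.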
4. Prompt tuning and prefix tuning
These methods learn continuous “soft prompts” — sequences of vectors prepended to the input.
Prompt tuning. A sequence of 20–100 learnable embedding vectors is prepended to every input. Only these vectors are trained; the rest of the model stays frozen.
Prefix tuning. Extends the concept: learnable vectors are added not only at the input but in every Transformer layer, where they are prepended to the keys and values of the attention computation. More parameters, more capacity.
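A rough sketch of both ideas in plain Python follows. The prepend step shows what prompt tuning does to the embedded input, and the two counting functions compare the trainable parameters. The model dimensions (hidden size 4096, 32 layers, 20 virtual tokens) are assumptions for illustration, and the prefix count ignores any reparameterization network used during training.

```python
def prepend_soft_prompt(soft_prompt, token_embeddings):
    """Prompt tuning: place the learnable vectors in front of the embedded input."""
    return soft_prompt + token_embeddings

d_model = 4                                                   # toy size
soft_prompt = [[0.01 * (i + 1)] * d_model for i in range(3)]  # 3 learnable vectors
token_embeddings = [[1.0] * d_model, [2.0] * d_model]         # 2 frozen embeddings
model_input = prepend_soft_prompt(soft_prompt, token_embeddings)

def prompt_tuning_params(n_virtual, d_model):
    """One learnable vector per virtual token, at the input only."""
    return n_virtual * d_model

def prefix_tuning_params(n_virtual, d_model, n_layers):
    """One key prefix and one value prefix per virtual token in every layer."""
    return n_layers * 2 * n_virtual * d_model

print(len(model_input))                      # sequence length seen by the model
print(prompt_tuning_params(20, 4096))        # assumed 20 tokens, hidden size 4096
print(prefix_tuning_params(20, 4096, 32))    # same, across an assumed 32 layers
```

Under these assumptions prefix tuning trains 64 times as many parameters as prompt tuning (two prefixes in each of 32 layers), which is where its extra capacity comes from.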
Pros:
- Extremely few parameters (often under 1 million).
- Very modular, fast to load.
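- No changes to the model weights: only the prepended vectors are stored per task. Training still needs gradient access to the frozen model, so this does not work with text-only closed APIs.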
Cons:
- Limited adaptation capacity — good for small behavior changes, worse for deep specialization.
- Sensitive to hyperparameters.
Attractive for small tasks (tone, classification). Inferior for demanding adaptations.
5. IA³ — minimalist scaling
IA³ (Infused Adapter by Inhibiting and Amplifying Inner Activations) is the minimalist variant: only scaling vectors are learned. In each Transformer block, one vector each rescales the attention keys, the attention values, and the intermediate feed-forward activations by element-wise multiplication.
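The element-wise scaling is simple enough to sketch directly. In the example below, the function name `ia3_scale`, the toy values, and the model dimensions used for the parameter count (hidden size 4096, feed-forward size 11008, 32 blocks) are assumptions for illustration only.

```python
def ia3_scale(activations, scaling_vector):
    """Rescale each activation by its learned factor (element-wise product)."""
    return [a * s for a, s in zip(activations, scaling_vector)]

keys = [0.5, -1.0, 2.0]

# Initialized to ones, the scaling vector leaves the frozen model unchanged.
unchanged = ia3_scale(keys, [1.0, 1.0, 1.0])

# After training, entries above 1 amplify and entries below 1 inhibit.
scaled = ia3_scale(keys, [1.5, 0.5, 0.0])

# Trainable parameters: one vector each for keys, values, and the
# feed-forward intermediate activations in every block (assumed sizes).
d_k, d_v, d_ff, n_blocks = 4096, 4096, 11008, 32
ia3_params = (d_k + d_v + d_ff) * n_blocks
print(unchanged, scaled, ia3_params)
```

Even at these assumed sizes the total stays well below a million trainable values, which is what makes IA³ vectors cheap to store and swap per task.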
Pros:
- Extremely few parameters (typically under 0.01% of the model).
- Fast to train.
- Modular and composable.
Cons:
- Among the smallest adaptation capacities of all PEFT methods.
- Clearly inferior to LoRA on complex tasks.
Use cases: very similar tasks, fast personalization, multi-task setups with minimal overhead.
6. Which method when?
Pragmatic recommendations from practice:
- LoRA / QLoRA. Default for 95% of cases. Best balance of adaptation capacity, efficiency, mergeability.
- Classical adapters. Only when an existing setup historically already uses them.
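- Prompt tuning. When only light behavior change is needed and many small task prompts should share one frozen model. It needs gradient access to the model, so it is not an option for closed APIs that accept only text.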
- Prefix tuning. Rare — when depth is needed, LoRA usually wins.
- IA³. Multi-task setups with very similar tasks and demand for extreme modularity.
Rule of thumb: when unsure, pick LoRA with rank 16, alpha 32 and standard hyperparameters. That works in the vast majority of cases without further tuning.
7. Practice: PEFT with Hugging Face
The Hugging Face PEFT library implements LoRA and its variants, prompt tuning, prefix tuning, and IA³ behind a unified API. Example for LoRA:
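from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint name (an assumption); substitute your own base model.
base_model = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")

config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                # rank of the low-rank update
    lora_alpha=32,       # the update is scaled by lora_alpha / r
    lora_dropout=0.05,
    # Attention projections; these are Llama-style module names.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
peft_model = get_peft_model(base_model, config)
peft_model.print_trainable_parameters()  # trainable vs. total parameter counts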
For other methods there are analogous configs (PrefixTuningConfig, PromptTuningConfig, IA3Config). In 2026 the Hugging Face PEFT docs are the pragmatic starting point for every PEFT project.
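As a rough illustration of how the other config classes are used, the fragment below builds one config per method. The number of virtual tokens and the module names are assumptions: the module names follow Llama-style naming and must be adapted to the architecture at hand. Each config would be passed to `get_peft_model` together with a loaded base model.

```python
from peft import IA3Config, PrefixTuningConfig, PromptTuningConfig, TaskType

# Prompt tuning: learnable vectors at the input only.
prompt_config = PromptTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,       # assumed length of the soft prompt
)

# Prefix tuning: learnable key/value prefixes in every layer.
prefix_config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,       # assumed prefix length
)

# IA3: scaling vectors on keys, values, and the feed-forward output projection.
ia3_config = IA3Config(
    task_type=TaskType.CAUSAL_LM,
    target_modules=["k_proj", "v_proj", "down_proj"],  # assumed Llama-style names
    feedforward_modules=["down_proj"],                 # must be a subset of target_modules
)
```

Because the three configs share the same entry point, switching methods in an experiment mostly comes down to swapping the config object.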
In 2026 PEFT is no longer a research playground but the standard approach for any model adaptation beyond trivial prompt tweaks. Sticking with LoRA as the default rarely steers you wrong, and knowing the variants lets you pick the right method for special needs. Data curation, the eval pipeline, and structured iteration remain more important than the choice of method. More in Fine-tuning when worth it.
Frequently asked questions.
/ 01 Is PEFT the same as LoRA?
No. PEFT (parameter-efficient fine-tuning) is the umbrella term. LoRA is a specific PEFT method — currently the most popular by far. Other PEFT methods are classical adapters, prefix tuning, prompt tuning, and IA³. Each has its own trade-offs.
/ 02 When is classical adapter tuning better than LoRA?
Rarely. Classical adapters (Houlsby, Pfeiffer) tend to add latency since they insert extra layers into every forward pass. LoRA adapters can be merged into the base model after training — zero latency overhead. For most use cases LoRA wins.
/ 03 What's the difference between prompt tuning and prompt engineering?
Prompt engineering writes human-readable prompts. Prompt tuning learns a soft prompt — a sequence of continuous vectors prepended to the input and optimized during training. It isn't readable as text but can precisely steer model behavior without changing model weights.
/ 04 Which PEFT method has the fewest parameters?
Prompt tuning, often only thousands to hundreds of thousands of trainable parameters (vs. millions in LoRA). IA³ sits similarly low. But both have less adaptation capacity than LoRA — fit for small behavior changes, not deep specialization.
/ 05 Can I combine multiple PEFT methods?
Yes, that's a research topic (Hybrid-PEFT, MAM Adapter, AdapterFusion). In practice LoRA alone is usually enough. Combining LoRA with prefix tuning can squeeze the last percent of quality — at noticeably higher complexity.
/ 06 How does PEFT save memory?
PEFT methods freeze the base model and learn only small additional parameters. Gradients and optimizer state (with Adam, typically 2× the size of the trained weights) are then needed only for those few parameters instead of for the whole model. With QLoRA, PEFT combines with quantization for further savings; see QLoRA and quantization.
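A back-of-the-envelope calculation makes the saving visible. The sketch below is a simplification under stated assumptions: weights and gradients in 16-bit precision (2 bytes each), two Adam moments in 32-bit precision (8 bytes per trained parameter), no separate master copy of the weights, and activations ignored. The 140 million trainable parameters for the PEFT case are an assumed figure, not a measured one.

```python
def training_memory_gb(total_params, trainable_params,
                       weight_bytes=2, grad_bytes=2, optim_bytes=8):
    """Rough training memory in GB: weights + gradients + optimizer state.

    Gradients and optimizer state are only kept for trainable parameters.
    Activations and framework overhead are deliberately left out.
    """
    weights = total_params * weight_bytes
    grads = trainable_params * grad_bytes
    optim = trainable_params * optim_bytes
    return (weights + grads + optim) / 1e9

total = 70_000_000_000               # 70B-parameter base model
peft_trainable = 140_000_000         # assumed adapter size (0.2% of the model)

full_gb = training_memory_gb(total, total)            # everything is trained
peft_gb = training_memory_gb(total, peft_trainable)   # base model frozen

print(f"full fine-tuning: {full_gb:.1f} GB")
print(f"PEFT:             {peft_gb:.1f} GB")
```

Under these assumptions almost all remaining memory in the PEFT case is the frozen base model itself, which is the part that quantization then shrinks further.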