By spring 2026 the sovereign AI stack looks nothing like the one vendors painted 18 months ago — and quite different from the one a typical pitch deck still draws. Over the past twelve months we’ve deployed systems in banks, insurers, industrial Mittelstand and public agencies. This is the distilled lesson: what works, what stops being fun fast, and which pieces have actually proven themselves.
1. The 2026 status quo
Three trends have consolidated:
- Open-weight models have grown up. Llama 3.x and the Mistral family now match or beat GPT-4 quality on most business workloads, provided you adapt them properly.
- Hosting in Germany and the wider EU is a commodity. Not because it is suddenly cheap, but because there is now enough GPU capacity in Frankfurt, Nuremberg and Helsinki to actually pick from.
- BYOLLM (bring your own LLM) is the default ask. European mid-market buyers no longer accept a solution that locks them into a single model.
What did not stick: fully autonomous agents without a human in the loop, closed-source models for sensitive data, and vector databases as the only retrieval mechanism.
2. The four layers
“Sovereignty is not a property of a single model. It is a property of the entire data path.”
The stack we now ship in nearly every project has four layers:
- Inference layer — usually vLLM or TGI, sometimes Ollama for smaller setups. Container-based, Kubernetes-orchestrated, GPU pinning configured.
- Adapter layer — LoRA and QLoRA for domain adaptation. We only merge adapters into the base model when there is an operational benefit; otherwise we keep them separate.
- Retrieval layer — pgvector for most cases, Qdrant when the scale justifies it. Hybrid search with BM25 + dense retrieval, re-ranking with a cross-encoder. Pure vector search without re-ranking produces too many false positives.
- Orchestration layer — structured state machines, not “agent, figure it out!”. LangGraph or Pydantic-AI. Sometimes custom Python or Go when constraints demand it.
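The hybrid search in the retrieval layer can be sketched as follows. This is a minimal illustration, not the pipeline described above: it assumes the BM25 and dense rankings are merged with reciprocal rank fusion (one common fusion method, chosen here for simplicity), and `rerank_score` is a hypothetical stand-in for a cross-encoder. The document IDs and scores are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def hybrid_search(bm25_ranking, dense_ranking, rerank_score, top_k=3, candidates=10):
    """Fuse both rankings, then let a re-ranker order the top candidates."""
    fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])[:candidates]
    return sorted(fused, key=rerank_score, reverse=True)[:top_k]


# Hypothetical result lists from a BM25 index and a dense (embedding) index.
bm25_hits = ["doc_a", "doc_b", "doc_c"]
dense_hits = ["doc_c", "doc_a", "doc_d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])

# Stand-in for a cross-encoder: a fixed lookup of made-up relevance scores.
stub_scores = {"doc_a": 0.2, "doc_b": 0.1, "doc_c": 0.9, "doc_d": 0.7}
top = hybrid_search(bm25_hits, dense_hits, stub_scores.get, top_k=2)
print(fused, top)
```

Documents found by both retrievers rise in the fused list, and the re-ranker then has the final say over a small candidate set, which is where most false positives from pure vector search get filtered out.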
3. Which models actually deliver
By default in 2026 we run a three-size strategy:
- Small (3–7B): Llama-3.2 3B or Mistral-7B for fast classification, routing, simple extraction. Runs on a single RTX 6000 Ada.
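- Medium (13–70B): Llama-3.3 70B for harder reasoning tasks, provided A100/H100-class GPUs are available. At 16-bit precision the 70B weights alone exceed a single 80 GB card, so plan for several GPUs or quantization.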
- Large (>100B): Only when the use case really demands it. Unnecessary in 80% of projects.
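Mixtral 8x22B is a useful MoE middle ground when throughput rather than VRAM is the constraint: only about 39B of its 141B parameters are active per token, which cuts compute, but all experts still have to fit in memory.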
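A back-of-the-envelope check for which size tier fits which GPU: weight memory is roughly parameter count times bytes per parameter. The sketch below uses nominal parameter counts and ignores KV cache, activations and runtime overhead, so real requirements are higher. For an MoE model the total parameter count is what matters for memory, because all experts have to be resident.

```python
def weight_memory_gb(params_billion, bits_per_param):
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_param / 8


# Nominal sizes in billions of parameters (rounded, for illustration only).
MODELS = {
    "Llama-3.2 3B": 3,
    "Mistral-7B": 7,
    "Llama-3.3 70B": 70,
    "Mixtral 8x22B (all experts)": 141,
}

for name, size in MODELS.items():
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{name:30s} 16-bit: {fp16:6.1f} GB   4-bit: {int4:5.1f} GB")
```

Read against card sizes (48 GB on an RTX 6000 Ada, 80 GB on an H100), this shows why the small tier runs comfortably on one workstation GPU while the 70B tier needs quantization or more than one card.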
4. Deployment patterns
Three patterns dominate:
Pattern A: On-premise, air-gapped. Banks, pharma, government. Customer-owned GPU infra, no internet egress. We ship containers and model weights as signed tarballs; updates arrive the same way and are verified before installation.
Pattern B: EU-cloud managed. Hetzner GEX44, AWS Frankfurt, OVH. We operate the system on the customer’s behalf. SLA, backups, monitoring included. Data never leaves the EU — contractually with AWS Frankfurt, by default with Hetzner (Germany/Finland).
Pattern C: BYOLLM bridge. Customer already has an internal Llama cluster or an Azure-OpenAI tenant. We connect to it without adding our own model hosting alongside.
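For the air-gapped case in Pattern A, the verification step before installing a tarball could look roughly like this. It is a simplified sketch under stated assumptions: it only compares SHA-256 digests against a manifest, and it assumes the manifest itself reaches the customer over a trusted, signed channel. Actual signature verification is out of scope here, and the file name and contents are invented for the demo.

```python
import hashlib
import hmac
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weight files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_hex):
    """Return True only if the file's digest matches the manifest entry."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())


# Demo with a throwaway file standing in for a model-weights tarball.
with tempfile.TemporaryDirectory() as tmp:
    artifact = Path(tmp) / "model-weights.tar"  # hypothetical file name
    artifact.write_bytes(b"pretend these are model weights")
    manifest = {artifact.name: sha256_of(artifact)}

    ok = verify_artifact(artifact, manifest[artifact.name])
    artifact.write_bytes(b"tampered")  # simulate corruption in transit
    tampered_ok = verify_artifact(artifact, manifest[artifact.name])

print(ok, tampered_ok)
```

The install script would refuse to unpack anything for which `verify_artifact` returns False.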
5. Data paths that pass an audit
The hardest test in 2026 is an external auditor walking through your data flows. For a year now we’ve recommended three things:
- Structured audit logs at every model boundary. Every prompt, every completion — with hashes instead of cleartext when personal data may be present.
- Input filters in front of every LLM call. Regex plus a small classifier for PII.
- Output validation with JSON Schema or Pydantic. Structured outputs are no longer optional in 2026 — they’re the default.
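The three recommendations above fit together at a single model boundary roughly as sketched below. Everything here is illustrative: the regex patterns are deliberately simple, the small PII classifier is omitted, the output check is a hand-rolled stand-in for JSON Schema or Pydantic, and all function names, field names and the model name are hypothetical.

```python
import hashlib
import json
import re
from datetime import datetime, timezone

# Deliberately simple, illustrative patterns; production filters need more.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "IBAN": re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b"),
}


def redact_pii(text):
    """Replace regex matches with placeholders; return (clean_text, labels)."""
    found = []
    for label, pattern in PII_PATTERNS.items():
        text, count = pattern.subn(f"[{label}]", text)
        if count:
            found.append(label)
    return text, found


def audit_record(model, prompt, completion, contains_pii):
    """Build one log entry; store SHA-256 hashes instead of cleartext for PII."""
    def field(value):
        if contains_pii:
            return {"sha256": hashlib.sha256(value.encode("utf-8")).hexdigest()}
        return {"text": value}

    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model": model,
        "prompt": field(prompt),
        "completion": field(completion),
    }


def validate_output(raw, required):
    """Parse JSON and check required keys and types; raise ValueError if not."""
    data = json.loads(raw)  # JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for key, expected_type in required.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or mistyped field: {key}")
    return data


prompt = "Refund for max@example.com, IBAN DE89370400440532013000"
clean_prompt, labels = redact_pii(prompt)  # input filter before the LLM call
raw_completion = '{"decision": "approve", "amount": 42.5}'  # stand-in model reply
result = validate_output(raw_completion, {"decision": str, "amount": (int, float)})
record = audit_record("example-model", prompt, raw_completion, bool(labels))
print(json.dumps(record, indent=2))
```

Only `clean_prompt` would be sent to the model, while the audit record keeps a hash of the original so an auditor can match log entries to source data without the log itself holding personal data.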
6. The honest cost math
On-premise becomes cheaper than API above a certain volume — but the crossover point is higher than vendors claim. A rough 2026 rule of thumb:
- Under 100,000 tokens/day: APIs are cheaper.
- 100,000 to 1M tokens/day: managed EU cloud starts paying off.
- Over 1M tokens/day: your own infrastructure begins to win.
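To locate the crossover for a specific project, compare a fixed monthly infrastructure cost with a per-token API price. The sketch below shows only the arithmetic; the prices are placeholders chosen for illustration, not real quotes, and the result depends entirely on them. It also ignores staffing, utilization and hardware depreciation, all of which push the real break-even higher.

```python
def monthly_api_cost(tokens_per_day, price_per_million, days=30):
    """API spend per month for a given daily token volume."""
    return tokens_per_day * days * price_per_million / 1_000_000


def break_even_tokens_per_day(fixed_monthly_cost, api_price_per_million,
                              own_price_per_million=0.0, days=30):
    """Daily volume at which own hosting costs the same as the API.

    Returns None if the API is not more expensive per token, because then
    own hosting never catches up on cost alone.
    """
    saving_per_million = api_price_per_million - own_price_per_million
    if saving_per_million <= 0:
        return None
    return fixed_monthly_cost * 1_000_000 / (saving_per_million * days)


# Placeholder inputs (hypothetical, in any one currency):
FIXED_MONTHLY = 900          # GPU server, power, operations share per month
API_PRICE_PER_MILLION = 15   # blended input/output price per 1M tokens

volume = break_even_tokens_per_day(FIXED_MONTHLY, API_PRICE_PER_MILLION)
print(f"break-even at about {volume:,.0f} tokens/day")
print(f"API cost at that volume: {monthly_api_cost(volume, API_PRICE_PER_MILLION):.2f} per month")
```

Plugging in your own quotes for both sides is usually more informative than any general threshold.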
What flips this math: regulatory requirements. If the data is not allowed to leave the building, the economic argument is settled — the conversation is then about architecture, not about price.
7. Verdict
The 2026 stack is quieter, more boring and more honest than the 2024 hype made it look. Open-weight + EU hosting + clean data paths + structured outputs. That’s the recipe. Anyone selling you something else is selling you something we’ve already tried and abandoned.
If you want to know which of these layers gives you the biggest lever in your specific case: talk to us.