Seven Ways RAG Quietly Fails in Production

By 2026, retrieval-augmented generation has become the default architecture for domain-specific question-answering systems. It has a very good day in the workshop and, very often, a very bad month in production. Here are seven patterns we routinely find during audits, sorted by frequency.

1. Chunk boundaries cut meaning in half

Classic fixed-size 512-token chunks slice through tables, separate definitions from their examples, and split list items. Our 2026 defaults:

Semantic chunking (e.g. using embedding-distance jumps as break points).
Overlap of 15–25%, not 5%.
Hierarchical chunking — each chunk knows which section it belongs to.
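
A minimal sketch of the first default, splitting where the embedding distance between neighbouring sentences jumps. The `toy_embed` bag-of-words function and the `threshold` value are assumptions for illustration only; in a real pipeline you would pass your own embedding function and tune the threshold on your corpus.

```python
import math
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding (assumption): a bag-of-words count vector, not a real model."""
    return Counter(sentence.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity of two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(sentences, embed=toy_embed, threshold=0.8):
    """Start a new chunk wherever neighbouring sentences are farther apart than threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev_vec, cur_vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev_vec, cur_vec) > threshold:
            chunks.append(current)  # distance jump: close the running chunk
            current = []
        current.append(sentence)
    chunks.append(current)
    return chunks


sentences = [
    "the contract term is twelve months",
    "the contract term renews automatically",
    "shipping costs depend on weight",
    "shipping costs exclude customs fees",
]
chunks = semantic_chunks(sentences)
```

With the toy embedding, the break lands between the contract sentences and the shipping sentences, because those neighbours share no tokens.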

Test it: Is there a query whose answer spans two consecutive chunks? If yes, measure how often the second half is retrieved.
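
One way to run that test, as a rough sketch: collect queries whose answers are known to span a chunk boundary, then count how often the second chunk shows up in the top-k. The `retrieve` callable and the chunk ids are hypothetical placeholders for whatever your retriever returns.

```python
def second_half_hit_rate(cases, retrieve, k=5):
    """Fraction of boundary-spanning queries whose second chunk appears in the top-k.

    cases:    list of (query, second_chunk_id) pairs, labeled by hand
    retrieve: callable mapping a query to a ranked list of chunk ids (assumed interface)
    """
    if not cases:
        return 0.0
    hits = sum(
        1 for query, second_chunk_id in cases
        if second_chunk_id in retrieve(query)[:k]
    )
    return hits / len(cases)


# Toy retriever standing in for the real one.
fake_results = {"q1": ["c1", "c2"], "q2": ["c4"]}
rate = second_half_hit_rate(
    [("q1", "c2"), ("q2", "c5")],
    retrieve=lambda q: fake_results[q],
)
```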

2. Embedding drift

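Embeddings age. If you swap embedding models and only re-embed new documents, you’ve put two vector spaces into the same database. Similarity scores between a vector from the old model and a vector from the new one are meaningless, because the two models do not share a coordinate system.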

Hidden tell: you change the model and suddenly “old documents seem to perform worse.”
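
A small guard against this, sketched under the assumption that every stored vector carries the name of the model that produced it. The `embedding_model` field and the model names are hypothetical; the point is only that the check becomes trivial once that metadata is recorded at index time.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class StoredVector:
    doc_id: str
    embedding_model: str  # hypothetical metadata field written at index time
    vector: list


def models_in_index(records):
    """Count vectors per embedding model; more than one key means mixed vector spaces."""
    return Counter(r.embedding_model for r in records)


def stale_documents(records, current_model):
    """Ids of documents whose vectors still come from a different model."""
    return [r.doc_id for r in records if r.embedding_model != current_model]


records = [
    StoredVector("old-1", "model-v1", [0.1, 0.2]),
    StoredVector("old-2", "model-v1", [0.3, 0.1]),
    StoredVector("new-1", "model-v2", [0.9, 0.4]),
]
counts = models_in_index(records)
to_reembed = stale_documents(records, current_model="model-v2")
```

Running this as a pre-deploy check flags the index as mixed and lists exactly which documents need re-embedding.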

3. Missing re-ranking

Pure dense retrieval gives you plausible hits, but rarely the best ones. A cross-encoder re-ranker applied to the top 20 candidates typically improves hit quality by 18–30% in our measurements. Without it, you’re leaving accuracy on the table.
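
The re-ranking step itself is small. The sketch below takes the candidates from the first-stage retriever and re-scores each one jointly with the query. `toy_overlap_score` is a deliberately crude placeholder so the example runs without a model; it is an assumption, not a cross-encoder.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-score every candidate together with the query and keep the best top_n."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_n]]


def toy_overlap_score(query, text):
    """Placeholder scorer (assumption): share of query tokens found in the text."""
    query_tokens = set(query.lower().split())
    if not query_tokens:
        return 0.0
    return len(query_tokens & set(text.lower().split())) / len(query_tokens)


candidates = [
    "termination clauses overview",
    "the notice period for termination is 30 days",
    "office opening hours",
]
best = rerank("notice period for termination", candidates, toy_overlap_score, top_n=2)
```

In practice `score_fn` would wrap a real cross-encoder, for example the `CrossEncoder` class in sentence-transformers, whose `predict` method scores query-text pairs.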

4. No eval suite

We routinely find RAG systems in production with no reproducible evaluation. Translation: nobody knows whether the system is better or worse today than last week. Minimum:

50 hand-labeled Q&A pairs per domain.
A reproducible eval run before every deploy.
Metrics: Hit@K, MRR, faithfulness, answer relevance.

Promptfoo, Inspect AI, or Ragas: any of them will do. Just plug one in.
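
The two retrieval metrics from the list are simple enough to compute by hand, which helps when sanity-checking what a framework reports. A minimal sketch, assuming each labeled case is a question plus the set of document ids that count as relevant, and that `retrieve` (a hypothetical callable) returns ranked ids:

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the first k results, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(cases, retrieve, k=5):
    """Average Hit@K and MRR over (question, relevant_ids) pairs."""
    if not cases:
        return {"hit@k": 0.0, "mrr": 0.0}
    hits = [hit_at_k(retrieve(q), rel, k) for q, rel in cases]
    ranks = [reciprocal_rank(retrieve(q), rel) for q, rel in cases]
    return {"hit@k": sum(hits) / len(cases), "mrr": sum(ranks) / len(cases)}


# Toy retriever standing in for the real pipeline.
fake_results = {"q1": ["b", "a", "c"], "q2": ["x", "y"]}
report = evaluate(
    [("q1", {"a"}), ("q2", {"z"})],
    retrieve=lambda q: fake_results[q],
    k=2,
)
```

Faithfulness and answer relevance need a judge over the generated answer, so they are left to the tools named above.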

5. Overstuffed context

Stuffing more chunks into the context window sounds logical and makes things worse. Models measurably exhibit the “lost-in-the-middle” effect: a relevant chunk at position 7 of 12 is often ignored. We rarely deliver more than 5 chunks to the model.
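
Enforcing the cap is a one-liner, but it is worth making it an explicit, testable step rather than a default buried in a config. A sketch, assuming the retriever hands back `(text, score)` pairs with higher scores meaning more relevant:

```python
def select_context(scored_chunks, max_chunks=5):
    """Keep only the highest-scoring chunks, best first."""
    ranked = sorted(scored_chunks, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:max_chunks]]


# Twelve retrieved chunks with made-up scores.
retrieved = [(f"chunk-{i}", i / 12) for i in range(12)]
context = select_context(retrieved)
```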

6. Ignored metadata

Documents have structure that gets lost in plain text: publication date, version, status, author. A RAG pipeline that doesn’t filter on those fields will happily answer from outdated contracts, deprecated versions, or draft documents. Structured filters in front of the vector search are not optional.
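
A sketch of what "filters in front of the vector search" can look like. The `Doc` fields, the status values, and the `vector_search` callable are all assumptions for illustration; most vector databases offer an equivalent metadata filter, though the syntax differs per product.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    status: str      # hypothetical values: "final", "draft", "deprecated"
    published: date
    text: str


def eligible(docs, allowed_status=("final",), published_after=None):
    """Drop drafts, deprecated versions, and documents older than the cutoff."""
    return [
        d for d in docs
        if d.status in allowed_status
        and (published_after is None or d.published >= published_after)
    ]


def filtered_search(query, docs, vector_search, **filters):
    """Apply structured filters first, then search only the eligible documents."""
    return vector_search(query, eligible(docs, **filters))


docs = [
    Doc("a", "final", date(2025, 1, 1), "current contract"),
    Doc("b", "draft", date(2025, 6, 1), "unsigned draft"),
    Doc("c", "final", date(2020, 1, 1), "superseded contract"),
]
# Stand-in for the real similarity search: returns ids of whatever it is given.
fake_search = lambda query, candidates: [d.doc_id for d in candidates]
result = filtered_search("notice period", docs, fake_search,
                         published_after=date(2024, 1, 1))
```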

7. Hidden hallucinations

The most dangerous failure mode: the model fabricates a plausible-sounding answer that appears to reference the retrieved documents but distorts their content. Without a faithfulness eval, you will not notice.

Fix: systematically verify whether the answer is supported by the source documents. The faithfulness metric in Ragas or a custom cross-encoder both work fine.
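
The shape of such a check, sketched with a pluggable support function: split the answer into sentences and flag every sentence that no source backs. The default `token_support` scorer and the 0.7 threshold are assumptions, and token overlap is a crude stand-in that will miss distortions such as a dropped negation, so a real setup would pass an entailment model or cross-encoder as `support_fn`.

```python
import re


def split_sentences(text):
    """Naive sentence splitter on ., ! and ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def token_support(claim, source):
    """Crude placeholder (assumption): share of claim tokens that occur in the source."""
    claim_tokens = set(re.findall(r"\w+", claim.lower()))
    source_tokens = set(re.findall(r"\w+", source.lower()))
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & source_tokens) / len(claim_tokens)


def unsupported_sentences(answer, sources, support_fn=token_support, threshold=0.7):
    """Return the answer sentences that no source supports above the threshold."""
    flagged = []
    for sentence in split_sentences(answer):
        best = max((support_fn(sentence, src) for src in sources), default=0.0)
        if best < threshold:
            flagged.append(sentence)
    return flagged


sources = ["The notice period is 30 days for both parties."]
answer = "The notice period is 30 days. Early termination costs 500 euros."
flagged = unsupported_sentences(answer, sources)
```

Here the second sentence is flagged because nothing in the retrieved source mentions a termination fee.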

If you operate a RAG system in production and any of this feels uncertain: we run audits. Two days, a clear verdict, no work for your team.