

Seven Ways RAG Quietly Fails in Production

RAG looks great in a demo and often falls apart in production. Seven concrete failure modes we keep finding during audits — and how to test for them before they cost you trust.

By createIF Labs
  • RAG
  • Eval
  • Retrieval
  • Production
Diagram: seven failure modes of a RAG pipeline in production
Visualization of typical RAG failure modes in production pipelines: between document database and LLM, the retrieval pipeline silently breaks — through wrong chunk boundaries, embedding drift, missing re-ranking, overstuffed context, ignored metadata and hidden hallucinations. A systematic eval suite measures faithfulness, relevance, robustness, freshness and coverage, surfacing the gaps before trust is lost.

By 2026, retrieval-augmented generation has become the default architecture for domain-specific question-answering systems. It tends to have a very good day in the workshop and a very bad month in production. Here are seven patterns we routinely find during audits, sorted by frequency.

1. Chunk boundaries cut meaning in half

Classic 512-token chunks slice through tables, separate definitions from their examples, and split list items. Our 2026 defaults:

  • Semantic chunking (e.g. using embedding-distance jumps as break points).
  • Overlap of 15–25%, not 5%.
  • Hierarchical chunking — each chunk knows which section it belongs to.
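A minimal sketch of two of these defaults, overlap and section tagging, might look like the following. It does not implement semantic chunking. The `Chunk` type, the `chunk_with_overlap` name, and the use of a plain token list are illustrative assumptions, not a reference implementation.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    section: str        # heading the chunk belongs to (hierarchical context)
    tokens: list


def chunk_with_overlap(tokens, section, size=512, overlap_ratio=0.2):
    """Split `tokens` into windows of `size` that share ~`overlap_ratio` of their tokens."""
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(size * overlap_ratio)
    step = max(1, size - overlap)   # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(Chunk(section=section, tokens=tokens[start:start + size]))
        if start + size >= len(tokens):
            break                   # the last window already reached the end
    return chunks
```

With `size=10` and `overlap_ratio=0.2`, consecutive chunks share two tokens, and every chunk carries the section name so it can be shown to the model or used as a filter later.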

Test it: Is there a query whose answer spans two consecutive chunks? If yes, measure how often the second half is retrieved.
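One way to turn that test into a number is sketched below. It assumes you have hand-picked queries whose answers span two consecutive chunks, and a `retrieve(query, k)` callable that returns chunk ids; both are hypothetical stand-ins for your own pipeline.

```python
def second_half_recall(cases, retrieve, k=5):
    """Fraction of boundary-spanning queries whose second chunk is retrieved.

    cases: list of (query, first_chunk_id, second_chunk_id) tuples.
    retrieve: callable (query, k) -> list of chunk ids (assumed interface).
    """
    if not cases:
        return 0.0
    hits = 0
    for query, _first_id, second_id in cases:
        if second_id in retrieve(query, k):
            hits += 1
    return hits / len(cases)
```

A low value here suggests the chunking, not the model, is losing the second half of the answer.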

2. Embedding drift

Embeddings age. If you swap embedding models and only re-embed new documents, you’ve put two vector spaces into the same database. Similarity scores between vectors from the old and the new model are not comparable, so any ranking that mixes them is meaningless.

Hidden tell: you change the model and suddenly “old documents seem to perform worse.”
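A cheap guard is to store the embedding model name next to every vector and refuse to serve from a mixed index. The sketch below assumes each record is a dict with an `embedding_model` key; that field name and both function names are illustrative.

```python
from collections import Counter


def embedding_models_in_index(records):
    """Count how many vectors each embedding model produced."""
    return Counter(r.get("embedding_model", "unknown") for r in records)


def assert_single_vector_space(records):
    """Raise if the index mixes vectors from more than one embedding model."""
    counts = embedding_models_in_index(records)
    if len(counts) > 1:
        raise ValueError(f"mixed embedding models in one index: {dict(counts)}")
    return next(iter(counts), None)  # the single model name, or None if empty
```

Running this as part of a deploy check makes a partial re-embed fail loudly instead of quietly degrading old documents.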

3. Missing re-ranking

Pure dense retrieval gives you plausible hits, but rarely the best ones. A cross-encoder re-ranker applied to the top-20 typically improves hit quality by 18–30% in our measurements. Without it, you’re leaving accuracy on the table.
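The two-stage shape can be sketched as follows. The `score_fn` parameter is an assumption: in practice it would wrap a cross-encoder that scores a (query, passage) pair, while `toy_overlap_score` is only a dependency-free stand-in so the example runs.

```python
def rerank(query, candidates, score_fn, top_n=5):
    """Re-order first-stage candidates by a pairwise relevance score."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _score, text in scored[:top_n]]


def toy_overlap_score(query, text):
    """Stand-in scorer: share of query words that appear in the text."""
    q = set(query.lower().split())
    t = set(text.lower().split())
    return len(q & t) / (len(q) or 1)


# First stage would return e.g. the top 20 dense hits; three are shown here.
hits = ["cats sleep a lot", "refund policy for annual plans", "annual report"]
best = rerank("refund policy annual", hits, toy_overlap_score, top_n=2)
```

The design point is that the expensive pairwise scorer only sees the short candidate list, never the whole corpus.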

4. No eval suite

We routinely find RAG systems in production with no reproducible evaluation. Translation: nobody knows whether the system is better or worse today than last week. Minimum:

  • 50 hand-labeled Q&A pairs per domain.
  • A reproducible eval run before every deploy.
  • Metrics: Hit@K, MRR, faithfulness, answer relevance.

Promptfoo, Inspect AI, or Ragas: any of them will do. Just plug one in.
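The two retrieval metrics are small enough to compute by hand before adopting a framework. The sketch below assumes each labeled question yields a ranked list of document ids plus a set of relevant ids; the function names and the `runs` structure are illustrative.

```python
def hit_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant document, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=5):
    """runs: list of (ranked_ids, relevant_ids) pairs, one per labeled question."""
    n = len(runs)
    if n == 0:
        return {"hit@k": 0.0, "mrr": 0.0}
    return {
        "hit@k": sum(hit_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }
```

Storing the result of `evaluate` per deploy is enough to answer whether retrieval is better or worse than last week. Faithfulness and answer relevance need a judge of some kind and are not covered by this sketch.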

5. Overstuffed context

Stuffing more chunks into the context window sounds logical but often makes things worse. Models show a measurable “lost-in-the-middle” effect: the relevant chunk at position 7 of 12 is often ignored. We rarely pass more than 5 chunks to the model.
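A small sketch of capping the context follows. The cap of 5 comes from the text above; placing the strongest chunks at the edges of the prompt is an additional assumption on our part, a common mitigation for lost-in-the-middle rather than something the audits prescribe.

```python
def build_context(ranked_chunks, max_chunks=5):
    """Keep the top `max_chunks` and put the strongest ones at the edges.

    ranked_chunks must be sorted best-first. Even ranks fill the front,
    odd ranks fill the back in reverse, so the weakest kept chunk ends up
    in the middle of the prompt.
    """
    top = ranked_chunks[:max_chunks]
    front, back = [], []
    for i, chunk in enumerate(top):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For seven ranked chunks `c1` to `c7`, this keeps five and orders them `c1, c3, c5, c4, c2`.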

6. Ignored metadata

Documents have structure that gets lost in plain text: publication date, version, status, author. A RAG pipeline that doesn’t filter on those fields will happily answer from outdated contracts, deprecated versions, or draft documents. Structured filters in front of the vector search are not optional.
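A structured pre-filter can be as simple as the sketch below. The `Doc` fields, the status values, and the `eligible` function are hypothetical; most vector stores offer an equivalent filter at query time, and the point is only that it runs before similarity search.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Doc:
    doc_id: str
    status: str       # assumed values: "published", "draft", "deprecated"
    published: date
    text: str


def eligible(docs, min_published, allowed_status=("published",)):
    """Keep only documents that are allowed to be used as answer sources."""
    return [
        d for d in docs
        if d.status in allowed_status and d.published >= min_published
    ]


docs = [
    Doc("contract-v1", "deprecated", date(2021, 3, 1), "..."),
    Doc("contract-v2", "published", date(2025, 6, 1), "..."),
    Doc("contract-v3", "draft", date(2026, 1, 15), "..."),
]
searchable = eligible(docs, min_published=date(2024, 1, 1))
```

Only `contract-v2` survives here, so the vector search never sees the deprecated or draft versions.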

7. Hidden hallucinations

The most dangerous failure mode: the model fabricates a plausible-sounding answer that appears to reference the retrieved documents but distorts their content. Without a faithfulness eval, you will not notice.

Fix: systematically verify whether each claim in the answer is supported by the source documents. The Ragas faithfulness metric or a custom cross-encoder both work fine.
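To show the shape of such a check, here is a deliberately crude sketch that flags answer sentences with little word overlap against any source. It is a lexical stand-in, not the Ragas metric and not a cross-encoder: it will miss distortions that reuse the source's vocabulary. The function name and the 0.6 threshold are arbitrary assumptions.

```python
import re


def _words(text):
    """Lowercased alphanumeric word set of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def unsupported_sentences(answer, sources, threshold=0.6):
    """Return answer sentences whose words are poorly covered by every source."""
    flagged = []
    for sentence in re.split(r"(?<=[.!?])\s+", answer.strip()):
        words = _words(sentence)
        if not words:
            continue
        # Best coverage of this sentence's words by any single source.
        support = max(
            (len(words & _words(s)) / len(words) for s in sources),
            default=0.0,
        )
        if support < threshold:
            flagged.append(sentence)
    return flagged


sources = ["The contract renews every 12 months unless cancelled."]
answer = "The contract renews every 12 months. Cancellation costs 500 euros."
flagged = unsupported_sentences(answer, sources)
```

In a real eval the overlap score would be replaced by an entailment-style judgment per claim, but the loop structure (split the answer, score each claim against the sources, flag what falls below a threshold) stays the same.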


If you operate a RAG system in production and any of this feels uncertain: we run audits. Two days, a clear verdict, no work for your team.
