What exactly is hybrid search?

Hybrid search combines two or more retrieval methods, typically dense embedding search (semantic similarity) and sparse keyword search such as BM25 (exact word matching). Their strengths are complementary: embeddings find synonyms and paraphrases, while BM25 catches exact names, codes, and jargon. Combined, they retrieve the right sources more reliably than either method alone.

How does reranking work?

After an initial search returns the top-50 or top-100 hits, a second, more precise model (typically a cross-encoder) re-orders them. A cross-encoder processes the query and each hit together, so it can make finer relevance decisions than embedding search, which encodes the two separately. It is slow, but it only has to score the already-filtered shortlist.

What is GraphRAG?

GraphRAG combines retrieval with a knowledge graph, a structured representation of entities and relationships. Instead of only fetching text chunks, the system follows relationships: who is connected to whom, what links to what. This is especially useful for complex questions spanning multiple entities and connections.

When is GraphRAG worthwhile?

When the domain is strongly networked: legal (clause-to-clause references), science (citation networks), compliance (rule chains), engineering (dependencies). When answers require hops across several relationships ("Which contracts with suppliers in region X were renewed in 2025?"), GraphRAG often delivers significantly better results than pure embedding RAG.

What does a reranker cost?

Cross-encoders are roughly 10–100× slower than pure embedding search. On a top-100 list that typically means 50–200 ms, which is acceptable. Models like bge-reranker-v2-m3 or Cohere Rerank are production-ready in 2026. On-premise, a reranker needs extra inference resources, often a small GPU.

How do I measure retrieval quality?

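With an eval set of real questions and their expected source documents. Metrics: Recall@K (how often the right source is in the top K), MRR (mean reciprocal rank of the first correct source), and nDCG (normalized discounted cumulative gain, which rewards correct sources ranked higher). Without an eval set, optimization is gambling. More in Guardrails, evals and prompt injection.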
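
The three metrics fit in a few lines of code. The sketch below is a minimal Python version. It assumes binary relevance (a retrieved source is either expected or not) and an eval set stored as pairs of retrieved IDs and expected IDs. The function names are illustrative and do not come from any particular eval library.

```python
import math

def recall_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    # Share of the expected sources that show up in the top-k hits.
    if not expected:
        return 0.0
    return len(set(retrieved[:k]) & expected) / len(expected)

def reciprocal_rank(retrieved: list[str], expected: set[str]) -> float:
    # 1/rank of the first expected source; 0.0 if none was retrieved.
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in expected:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(retrieved: list[str], expected: set[str], k: int) -> float:
    # Binary relevance: a hit at rank r contributes 1/log2(r + 1) to DCG.
    dcg = sum(1.0 / math.log2(rank + 1)
              for rank, doc_id in enumerate(retrieved[:k], start=1)
              if doc_id in expected)
    # Ideal DCG: all expected sources ranked at the very top.
    ideal_hits = min(len(expected), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0

def evaluate(results: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    # results: one (retrieved_ids, expected_ids) pair per eval question.
    n = len(results)
    return {
        f"recall@{k}": sum(recall_at_k(r, e, k) for r, e in results) / n,
        "mrr": sum(reciprocal_rank(r, e) for r, e in results) / n,
        f"ndcg@{k}": sum(ndcg_at_k(r, e, k) for r, e in results) / n,
    }

# Two hypothetical eval questions: one source found at rank 2, one missed.
eval_results = [
    (["doc3", "doc7", "doc1"], {"doc7"}),
    (["doc2", "doc9", "doc4"], {"doc5"}),
]
print(evaluate(eval_results, k=3))  # recall@3 = 0.5, mrr = 0.25
```

With a single expected source per question, Recall@K reduces to "was the right source in the top K", which matches the definition above.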

Hybrid Search, Reranking & GraphRAG: Precise AI Search (2026)

A RAG solution is never better than its retrieval layer. When search returns the wrong sources, the LLM answers wrongly. By 2026, clear best practices exist for substantially improving retrieval quality beyond pure vector search. This article covers the three most important primitives: hybrid search, reranking, and GraphRAG.

1. Why pure vector search rarely suffices

Embedding search is a major advance over classical full-text search: it finds semantically similar content even when the wording differs. But it has weaknesses:

Exact terms. Product codes, person names, technical identifiers. Embeddings can miss them because they’re rare.
Ambiguity. Semantically similar hits are returned even when the user meant one specific term.
Very rare concepts. Things rarely seen in pretraining yield unreliable embeddings.
Entity relationships. Embeddings understand meaning, not structure.

More on embeddings themselves in Embeddings and vector databases. This article covers what becomes necessary on top.

2. Hybrid search — dense and sparse

Hybrid search combines two retrieval methods:

Dense search with embeddings: semantic similarity, good for paraphrased questions.
Sparse search with BM25: exact word matching, good for names, codes, jargon.

Both run in parallel on the same corpus. A fusion layer combines results. Outcome: embedding recall plus BM25 precision. In typical enterprise RAG, hybrid search yields 10–25% better Recall@10 than pure embedding search.
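
Here is a minimal Okapi BM25 scorer in Python that makes the sparse side concrete. It is an illustrative sketch and does not reproduce any particular engine's implementation. The regex tokenizer, the common defaults k1 = 1.2 and b = 0.75, and the Lucene-style IDF formula are assumptions chosen for simplicity.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Naive lowercase word tokenizer; real systems add stemming, stop words, etc.
    return re.findall(r"\w+", text.lower())

class BM25:
    """Minimal Okapi BM25 over an in-memory corpus (illustrative only)."""

    def __init__(self, docs: dict[str, str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.tf = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
        self.doc_len = {doc_id: sum(c.values()) for doc_id, c in self.tf.items()}
        self.avgdl = sum(self.doc_len.values()) / len(docs)
        # Document frequency: in how many documents each term occurs.
        df = Counter(term for counts in self.tf.values() for term in counts)
        n = len(docs)
        # Lucene-style IDF, which stays non-negative even for very common terms.
        self.idf = {t: math.log(1 + (n - f + 0.5) / (f + 0.5)) for t, f in df.items()}

    def score(self, query: str, doc_id: str) -> float:
        tf, dl = self.tf[doc_id], self.doc_len[doc_id]
        s = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            f = tf[term]
            # Term-frequency saturation (k1) and length normalization (b).
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf[term] * f * (self.k1 + 1) / norm
        return s

    def search(self, query: str, top_k: int = 10) -> list[str]:
        scored = [(self.score(query, d), d) for d in self.tf]
        ranked = sorted(scored, key=lambda x: x[0], reverse=True)
        return [d for s, d in ranked[:top_k] if s > 0]

docs = {
    "a": "Error code E-4711 on pump model PX-200",
    "b": "Troubleshooting overheating pumps",
    "c": "Maintenance schedule for conveyor belts",
}
index = BM25(docs)
print(index.search("E-4711"))  # → ['a']
```

The query for the code `E-4711` matches only the document containing that exact identifier. This is the kind of lookup where embeddings tend to be unreliable.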

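Vector databases such as Qdrant and Weaviate support hybrid search natively. With pgvector, hybrid search is usually assembled in SQL by combining Postgres full-text search with vector similarity. More generally, a manual build pairs Postgres full-text search or OpenSearch (keyword side) with pgvector or Qdrant (vector side).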

3. Fusion: RRF and weighted combination

How to combine two ranked lists?

Reciprocal Rank Fusion (RRF). A hit scores 1/(k + rank) in every list it appears in, and these contributions are summed; k is a smoothing constant, commonly 60. Hits ranked high in both lists score highest overall. Very robust, no weight tuning needed.
Weighted score combination. Score = α·dense_score + (1−α)·sparse_score. Requires normalization (e.g. min-max scaling), since embedding and BM25 scores live in different ranges. α must be tuned.
Learned fusion. A small model learns the combination from examples. Very precise, more training overhead.
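
The sketch below implements the first two fusion strategies in Python. It assumes each retriever returns either a ranked list of document IDs (for RRF) or a dict of raw scores (for weighted fusion). Min-max scaling is one common normalization choice, not the only one. k = 60 is a widely used RRF default.

```python
from collections import defaultdict

def rrf(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: sum 1/(k + rank) over every list a hit appears in.
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def min_max(scores: dict[str, float]) -> dict[str, float]:
    # Rescale raw scores to [0, 1] so dense and BM25 scores become comparable.
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def weighted_fusion(dense: dict[str, float], sparse: dict[str, float],
                    alpha: float = 0.5) -> list[str]:
    # score = alpha * dense + (1 - alpha) * sparse, after normalization;
    # a hit missing from one list gets 0 for that component.
    d, s = min_max(dense), min_max(sparse)
    combined = {doc: alpha * d.get(doc, 0.0) + (1 - alpha) * s.get(doc, 0.0)
                for doc in d.keys() | s.keys()}
    return sorted(combined, key=combined.get, reverse=True)

dense_hits = ["doc2", "doc1", "doc5"]   # ranking from embedding search
sparse_hits = ["doc1", "doc3", "doc2"]  # ranking from BM25
print(rrf([dense_hits, sparse_hits]))   # → ['doc1', 'doc2', 'doc3', 'doc5']
```

RRF uses only ranks, so it never has to reconcile the incompatible score scales that make weighted fusion depend on normalization and a tuned α.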

RRF is the pragmatic 2026 standard — robust, simple, good results without tuning.

4. Reranking with cross-encoders

Embedding search scales to millions of documents, but its relevance judgment is coarse: it encodes query and document independently (bi-encoder) and compares only the resulting vectors. Cross-encoders process query and document jointly and make far finer relevance decisions, but they are slow.

Solution: two-stage retrieval.

First stage: Embedding or hybrid search, top-100 hits.
Second stage: Cross-encoder reranker re-orders the top-100, returns top-5 or top-10.
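
The two-stage pattern can be sketched in Python as follows. `first_stage` and `pair_scorer` are hypothetical callables. In a real system the first could be hybrid search with RRF. The second could be a batch cross-encoder call such as `CrossEncoder(...).predict(pairs)` from the sentence-transformers library. The toy word-overlap scorer below merely stands in for a real model so the example runs offline.

```python
from typing import Callable

# Hypothetical interfaces:
# first_stage(query, n) -> up to n candidate doc IDs (e.g. hybrid search + RRF)
# pair_scorer(pairs) -> one relevance score per (query, text) pair, like a cross-encoder
FirstStage = Callable[[str, int], list[str]]
PairScorer = Callable[[list[tuple[str, str]]], list[float]]

def two_stage_search(query: str, docs: dict[str, str], first_stage: FirstStage,
                     pair_scorer: PairScorer, candidates: int = 100,
                     final_k: int = 10) -> list[str]:
    # Stage 1: cheap retrieval over the whole corpus.
    candidate_ids = first_stage(query, candidates)
    # Stage 2: expensive joint scoring, but only of the candidate shortlist.
    pairs = [(query, docs[doc_id]) for doc_id in candidate_ids]
    scores = pair_scorer(pairs)
    reranked = sorted(zip(candidate_ids, scores), key=lambda x: x[1], reverse=True)
    return [doc_id for doc_id, _ in reranked[:final_k]]

# Toy stand-ins so the sketch runs offline; NOT a real retriever or reranker.
kb_docs = {
    "d1": "reset the router password",
    "d2": "router firmware update",
    "d3": "printer setup",
}

def toy_first_stage(query: str, n: int) -> list[str]:
    return list(kb_docs)[:n]

def toy_scorer(pairs: list[tuple[str, str]]) -> list[float]:
    # Word overlap between query and text, standing in for a cross-encoder score.
    return [float(len(set(q.split()) & set(t.split()))) for q, t in pairs]

print(two_stage_search("router password reset", kb_docs, toy_first_stage,
                       toy_scorer, final_k=2))  # → ['d1', 'd2']
```

The cross-encoder cost grows with `candidates`, not with corpus size. That is why the first stage's shortlist length is the main latency knob.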

Production-ready rerankers in 2026:

bge-reranker-v2-m3. Open weights, multilingual, good quality.
Cohere Rerank v3. Closed API, very high quality.
Jina Reranker. Open weights and cloud API, multilingual.

A reranking step typically adds 10–20% quality on top of an already optimized hybrid base. It is the standard next lever for enterprise RAG once hybrid search is in place.

5. GraphRAG and knowledge graphs

Some queries demand more than similarity — they demand relationships. Which contracts with suppliers in region X were renewed in 2025? Which clauses conflict with clause 4.2? These are hard for embedding search because the relevant connections must be made explicit.

GraphRAG combines retrieval with a knowledge graph — a structured representation of entities (contracts, clauses, persons, products) and relationships (renewed-by, conflicts-with, supplied-to).
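
The following toy in-memory graph in Python illustrates multi-hop retrieval by answering the supplier question above. The entity names, relation names (`contract_with`, `located_in`, `renewed_in`), and the hand-written traversal are hypothetical. A production setup would use a graph backend and its query language instead. It would then pass the matched entities or their linked text chunks to the LLM.

```python
from collections import defaultdict

class KnowledgeGraph:
    """Toy in-memory store of (subject, relation, object) triples."""

    def __init__(self) -> None:
        self.out = defaultdict(lambda: defaultdict(set))  # node -> relation -> objects
        self.inc = defaultdict(lambda: defaultdict(set))  # node -> relation -> subjects

    def add(self, subj: str, rel: str, obj: str) -> None:
        self.out[subj][rel].add(obj)
        self.inc[obj][rel].add(subj)

    def targets(self, node: str, rel: str) -> set[str]:
        # Follow edges forward: node -rel-> ?
        return self.out.get(node, {}).get(rel, set())

    def sources(self, node: str, rel: str) -> set[str]:
        # Follow edges backward: ? -rel-> node
        return self.inc.get(node, {}).get(rel, set())

def contracts_renewed_in_region(kg: KnowledgeGraph, region: str, year: str) -> list[str]:
    # Hop 1: which suppliers are located in the region?
    suppliers = kg.sources(region, "located_in")
    # Hop 2: which contracts are with those suppliers?
    contracts = {c for s in suppliers for c in kg.sources(s, "contract_with")}
    # Filter: keep contracts renewed in the given year.
    return sorted(c for c in contracts if year in kg.targets(c, "renewed_in"))

kg = KnowledgeGraph()
for triple in [
    ("Contract-17", "contract_with", "Supplier-A"),
    ("Contract-18", "contract_with", "Supplier-B"),
    ("Contract-19", "contract_with", "Supplier-A"),
    ("Supplier-A", "located_in", "Region-X"),
    ("Supplier-B", "located_in", "Region-Y"),
    ("Contract-17", "renewed_in", "2025"),
    ("Contract-18", "renewed_in", "2025"),
    ("Contract-19", "renewed_in", "2023"),
]:
    kg.add(*triple)

print(contracts_renewed_in_region(kg, "Region-X", "2025"))  # → ['Contract-17']
```

No single text chunk has to mention region, supplier, and renewal year together. The answer emerges from following explicit edges, which is exactly what pure similarity search cannot do.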

Architectures in 2026:

Microsoft GraphRAG. Builds knowledge graphs automatically from documents via LLM extraction. Open source.
Cognee. Similar approach, focused on memory graphs for agents.
Hand-curated graphs. If the domain is clearly structured (legal, compliance, engineering), your own graph beats LLM extraction.

Backends: Neo4j, ArangoDB, FalkorDB (Redis-based), Postgres with Apache AGE. Choice depends on volume and existing infrastructure.

GraphRAG is more effort than plain embedding RAG but pays off where relationships matter more than content.

6. Agentic retrieval — adaptive search

An additional lever in 2026: instead of a single search query, an agent issues several adaptive ones.

Query decomposition. Complex questions split into subqueries, each searched separately.
Hypothetical document embeddings (HyDE). The LLM generates a hypothetical answer whose embedding is used for search, which is often more precise than embedding the original question.
Iterative retrieval. Initial results motivate further queries until enough context is gathered.
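
This is a control-flow sketch of query decomposition plus iterative retrieval in Python. All four callbacks (`decompose`, `search`, `needs_more`, `follow_up`) are hypothetical; in practice they would be LLM calls and a hybrid search backend. HyDE would slot in inside `search` by embedding an LLM-drafted hypothetical answer instead of the raw subquery. The toy stubs exist only so the loop runs.

```python
from typing import Callable

def agentic_retrieve(
    question: str,
    decompose: Callable[[str], list[str]],
    search: Callable[[str, int], list[str]],
    needs_more: Callable[[str, list[str]], bool],
    follow_up: Callable[[str, list[str]], list[str]],
    max_rounds: int = 3,
    k: int = 5,
) -> list[str]:
    context: list[str] = []
    # Query decomposition: split the question into subqueries.
    queries = decompose(question)
    for _ in range(max_rounds):
        for q in queries:
            for hit in search(q, k):
                if hit not in context:  # keep first occurrence, drop duplicates
                    context.append(hit)
        # Iterative retrieval: stop once the context looks sufficient ...
        if not needs_more(question, context):
            break
        # ... otherwise let the model propose follow-up queries.
        queries = follow_up(question, context)
        if not queries:
            break
    return context

# Toy stubs (hypothetical) so the loop runs without an LLM or search backend.
corpus = {
    "c1": "Supplier A is located in region X",
    "c2": "Contract 17 with Supplier A was renewed in 2025",
    "c3": "Supplier B is located in region Y",
}

def toy_decompose(q: str) -> list[str]:
    return [part.strip() for part in q.split(" and ")]

def toy_search(q: str, k: int) -> list[str]:
    return [d for d, text in corpus.items() if q.lower() in text.lower()][:k]

def toy_needs_more(q: str, ctx: list[str]) -> bool:
    return len(ctx) < 2

def toy_follow_up(q: str, ctx: list[str]) -> list[str]:
    return []

print(agentic_retrieve("region X and renewed in 2025", toy_decompose,
                       toy_search, toy_needs_more, toy_follow_up))  # → ['c1', 'c2']
```

`max_rounds` caps the extra tokens and latency the section mentions. Without such a cap, an agent that never judges its context sufficient would loop indefinitely.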

These techniques cost extra tokens and latency but pay off on hard questions. Reasoning models (see Reasoning models) are particularly good at this.

7. Practice: building a production pipeline

Recommended steps:

Base pipeline with hybrid search. Embeddings + BM25, RRF fusion. Immediate quality lift over plain vector search.
Eval suite. 50–200 real questions with expected sources. Measure Recall@10, MRR.
Add a reranker. Cross-encoder on the top-100. Re-run the eval.
Iterate chunking. Size, overlap, strategy. Often the biggest lever.
GraphRAG if needed. For strongly networked domains.
Agentic retrieval when queries are complex. With a reasoning backbone.
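
For the chunking step, here is a minimal fixed-size chunker with overlap in Python. It counts whitespace-separated words for simplicity. Many pipelines instead count model tokens or split on sentence and section boundaries. Treat the defaults (200 words, 40 overlap) as placeholders to tune against the eval set.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        # Stop once a window reaches the end, to avoid a tail chunk fully inside the previous one.
        if start + size >= len(words):
            break
    return chunks

print(chunk_words(" ".join(str(i) for i in range(10)), size=4, overlap=1))
# → ['0 1 2 3', '3 4 5 6', '6 7 8 9']
```

Overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of a slightly larger index.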

Modern AI search in 2026 is a discipline with clear best practices and a mature tool landscape. Relying on pure vector search leaves 20–50% quality on the table. Combining hybrid search, reranking, and, where relationships matter, GraphRAG yields RAG systems that hold up in everyday enterprise use. The key discipline remains the eval pipeline: without it, every optimization is gambling; with it, retrieval quality becomes a calculable engineering quantity. More on operations in LLMOps.
