RAG Architecture Patterns 2026

When RAG fails, the failure point is usually retrieval, not generation. Well-built retrieval pipelines are reported to reduce hallucinations by 70-90%, yet most teams under-invest in retrieval quality relative to model selection.

The Foundation: Hybrid Search (Table Stakes)

Hybrid search combines dense retrieval (semantic similarity) with sparse retrieval (BM25 keyword matching). The combination consistently outperforms either method alone. This is not optional; it is the baseline.

Query → [Dense Embedding Search] + [BM25 Keyword Search]
           ↓                            ↓
        Semantic matches          Exact term matches
           ↓                            ↓
        ──────── Merge + Deduplicate ────────
                        ↓
              Reranking (cross-encoder)
                        ↓
                  Final results

The Production Pipeline (Three Stages)

Stage 1: Recall (Broad, Loose Threshold)

Vector search with low threshold (0.25-0.3)
BM25 keyword search in parallel
Merge via Reciprocal Rank Fusion (RRF)
Target: 50 candidates
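
To make the merge step concrete, here is a minimal sketch of Reciprocal Rank Fusion. It assumes each retriever returns an ordered list of document IDs; the function name, the sample IDs, and the constant `k=60` (a commonly used default) are illustrative choices, not part of the pipeline described above.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A document found by both retrievers becomes
    a single entry, so fusion also deduplicates.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]

# Hypothetical results from the two Stage 1 retrievers.
dense_hits = ["doc_a", "doc_b", "doc_c"]
bm25_hits = ["doc_c", "doc_a", "doc_d"]
candidates = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(candidates)
```

RRF uses only ranks, never raw scores, which is why it can combine cosine similarities and BM25 scores without normalizing either.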

Stage 2: Rerank (Precise)

Cross-encoder or ColBERT v2 re-scores all candidates
ColBERT stores per-token embeddings with late interaction scoring — near cross-encoder accuracy at near bi-encoder speed
Highest-ROI improvement most teams can make
Target: Top 10-20
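
The late interaction scoring mentioned above can be sketched in a few lines. This is a toy illustration of the MaxSim idea, not a ColBERT implementation: the 2-dimensional "token embeddings" are made up, and a real system would take unit-normalized vectors from the model, where the dot product equals cosine similarity.

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token, take its best-matching
    document token (max dot product), then sum those maxima."""
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

def rerank(query_vecs, candidates, top_k=10):
    """Score every candidate and keep the top_k.

    candidates maps a doc ID to its list of per-token vectors.
    """
    scored = [(maxsim_score(query_vecs, vecs), doc_id)
              for doc_id, vecs in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_k]]

# Made-up token vectors for illustration only.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "doc_x": [[1.0, 0.0], [0.0, 1.0]],  # matches both query tokens
    "doc_y": [[1.0, 0.0], [1.0, 0.0]],  # matches only the first query token
}
ranked = rerank(query, candidates)
print(ranked)
```

Because document token vectors can be computed offline, only the cheap max-and-sum runs at query time, which is where the speed advantage over a full cross-encoder comes from.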

Stage 3: Select (Strict, Dynamic Threshold)

Gap-based threshold: find natural breakpoint between relevant and irrelevant scores
Return only results above the gap
Target: 3-10 final results

Performance Impact (Cohere Data)

Method	Recall	Accuracy	User Satisfaction
Vector only	65%	78%	3.2/5
+ Keyword (hybrid)	82% (+26%)	81%	3.8/5
+ Reranking	89% (+37%)	91% (+17%)	4.3/5 (+34%)

Percentages in parentheses are relative gains over the vector-only baseline.

Advanced Patterns

Contextual Retrieval (Anthropic)

Prepend chunk-specific context before embedding. Instead of embedding a raw chunk, prepend: “This chunk is from the authentication section of the API docs, discussing JWT token validation.”

Simple to implement, with meaningful retrieval accuracy gains. Each chunk keeps the document context it would otherwise lose once split from its source.
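
A minimal sketch of the prepend step follows. The `Chunk` fields and the template-based `template_context` helper are assumptions for illustration; in practice the situating context is usually generated per chunk by an LLM that sees the whole document.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_title: str
    section: str
    text: str

def template_context(chunk):
    """Hypothetical stand-in for an LLM-written context sentence."""
    return f"This chunk is from the {chunk.section} section of the {chunk.doc_title}."

def contextualize(chunk, context):
    """Return the text to embed: situating context first, then the raw chunk."""
    return f"{context.strip()}\n\n{chunk.text.strip()}"

chunk = Chunk(
    doc_title="API docs",
    section="authentication",
    text="Tokens are rejected when the exp claim is in the past.",
)
to_embed = contextualize(chunk, template_context(chunk))
print(to_embed)
```

The same contextualized text can feed the BM25 index as well as the embedding model, so both halves of hybrid search benefit.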

Agentic RAG

The dominant new enterprise pattern. Specialized agents handle different retrieval stages:

Query Decomposition Agent → breaks complex questions into sub-queries
     ↓
Retrieval Router Agent → directs each sub-query to relevant data subset
     ↓
[Parallel Retrieval Agents] → each searches their assigned data source
     ↓
Validation Agent → checks retrieved chunks for relevance and freshness
     ↓
Synthesis Agent → generates final answer with citations

Already shipping in production. Particularly effective for multi-source retrieval (docs + code + tickets + conversations).
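
The flow above can be sketched with plain functions standing in for agents. Everything here is hypothetical: the in-memory `SOURCES`, the keyword-based router, and the string-splitting decomposer are placeholders for what would be LLM calls and real search backends.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory data sources; real systems would query search indexes.
SOURCES = {
    "docs": ["JWT tokens expire after 15 minutes.", "Rate limits reset hourly."],
    "tickets": ["Ticket 12: JWT validation fails after clock skew."],
}

def decompose(question):
    """Query decomposition: split on ' and ' as a stand-in for an LLM call."""
    parts = (part.strip(" ?") for part in question.split(" and "))
    return [part for part in parts if part]

def route(sub_query):
    """Retrieval router: choose sources by keyword (placeholder logic)."""
    return ["tickets", "docs"] if "ticket" in sub_query.lower() else ["docs"]

def retrieve(source, sub_query):
    """Retrieval agent: naive keyword overlap against one source."""
    terms = {t.lower() for t in sub_query.split() if len(t) > 3}
    return [(source, text) for text in SOURCES[source]
            if terms & {w.strip(".,:").lower() for w in text.split()}]

def validate(hits):
    """Validation agent: drop duplicates, keep order. Freshness checks omitted."""
    seen, kept = set(), []
    for hit in hits:
        if hit not in seen:
            seen.add(hit)
            kept.append(hit)
    return kept

def synthesize(question, hits):
    """Synthesis agent: stand-in for LLM generation; lists cited evidence."""
    cited = "; ".join(f"{text} [{source}]" for source, text in hits)
    return f"Q: {question}\nEvidence: {cited}"

def answer(question):
    jobs = [(src, sq) for sq in decompose(question) for src in route(sq)]
    with ThreadPoolExecutor() as pool:  # retrieval agents run in parallel
        results = pool.map(lambda job: retrieve(*job), jobs)
    hits = validate([hit for result in results for hit in result])
    return synthesize(question, hits)

result = answer("When do JWT tokens expire and which ticket mentions JWT validation?")
print(result)
```

Keeping each stage behind its own function boundary is the point of the sketch: any stage can be swapped for a model-backed agent without touching the others.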

GraphRAG

Uses knowledge graphs as the retrieval layer instead of flat vector stores. Enables multi-hop reasoning across entities and relationships.

"What projects did the team that built Feature X also work on?"
  → Entity: Feature X → Relationship: built_by → Team A
  → Entity: Team A → Relationship: built → [Project Y, Project Z]

Significantly better on complex analytical questions requiring relationship reasoning. Neo4j is the primary vendor. Production adoption is real but limited to use cases that genuinely need relationship reasoning (compliance, research, knowledge management).

Not a replacement for vector search — typically used alongside it.
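
The two-hop example above can be reproduced over a tiny in-memory graph. This is a sketch under simplifying assumptions: the graph is a list of (subject, relation, object) triples, the extra "Team B" triple is invented to show what is excluded, and a production system would issue the traversal as a graph database query instead.

```python
from collections import defaultdict

# Triples mirror the example above; the Team B entry is made up for contrast.
TRIPLES = [
    ("Feature X", "built_by", "Team A"),
    ("Team A", "built", "Project Y"),
    ("Team A", "built", "Project Z"),
    ("Team B", "built", "Project W"),
]

def build_index(triples):
    """Index objects by (subject, relation) for constant-time hop lookups."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index

def follow(index, start, relations):
    """Follow a chain of relations from a start entity, one hop per relation."""
    frontier = [start]
    for relation in relations:
        next_frontier = []
        for entity in frontier:
            for neighbor in index.get((entity, relation), []):
                if neighbor not in next_frontier:
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return frontier

index = build_index(TRIPLES)
projects = follow(index, "Feature X", ["built_by", "built"])
print(projects)
```

A flat vector search has no equivalent of the second hop: nothing in the text of "Project Y" needs to mention Feature X for the traversal to find it.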

Late Chunking

Instead of chunking documents first and then embedding each chunk, embed the full document first (using a long-context embedding model), then derive each chunk's embedding by pooling the token embeddings that fall within that chunk's span. This preserves document-level context in every chunk embedding.
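
A minimal sketch of the pooling step, assuming the per-token embeddings have already been produced by one forward pass over the whole document. The 2-dimensional vectors and the token spans are made up for illustration, and mean pooling is assumed as the pooling method.

```python
def mean_pool(vectors):
    """Average a list of equal-length vectors component by component."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def late_chunk(token_embeddings, chunk_spans):
    """Build one embedding per chunk from full-document token embeddings.

    chunk_spans holds (start, end) token indices, end exclusive. Because
    every token vector was computed with the whole document in view, each
    pooled chunk vector carries document-level context.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]

# Made-up token embeddings for a 4-token document, split into two chunks.
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0]]
chunk_vectors = late_chunk(token_embeddings, [(0, 2), (2, 4)])
print(chunk_vectors)
```

The chunk boundaries are chosen exactly as in ordinary chunking; only the order of chunking and embedding changes.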

Threshold Strategies

Gap-Based (Google Research)

Find the natural “cliff” in similarity scores:

Doc A: 0.85 ┐
Doc B: 0.82 ├─ Relevant cluster
Doc C: 0.78 ┘
─── Gap ───── ← Natural breakpoint (0.78 - 0.45 = 0.33)
Doc D: 0.45 ┐
Doc E: 0.42 ├─ Irrelevant cluster

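Set the threshold at the midpoint of the largest gap; here that is (0.78 + 0.45) / 2 = 0.615. Because the cutoff follows each query's own score distribution, this is far more effective than a fixed threshold.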
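
The gap search can be sketched as follows, using the scores from the diagram above. The function name is illustrative, and the sketch assumes at least two scored results.

```python
def gap_threshold(scores):
    """Return the midpoint of the largest gap between adjacent sorted scores."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        raise ValueError("need at least two scores to find a gap")
    # Index of the largest drop between neighbors (first one wins on ties).
    i = max(range(len(ordered) - 1), key=lambda j: ordered[j] - ordered[j + 1])
    return (ordered[i] + ordered[i + 1]) / 2

scores = {"Doc A": 0.85, "Doc B": 0.82, "Doc C": 0.78, "Doc D": 0.45, "Doc E": 0.42}
threshold = gap_threshold(scores.values())
kept = [doc for doc, score in scores.items() if score > threshold]
print(round(threshold, 3), kept)
```

One caveat worth handling in practice: when every candidate is relevant, the largest gap may be tiny and the cut arbitrary, so a minimum gap size (an assumption, not something the approach above specifies) can be used to fall back to keeping all results.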

Adaptive by Query Type

Query Type	Threshold	Rationale
Definition (“What is X”)	0.6	Needs precision
Enumeration (“List all X”)	0.4	Needs recall
How-to (“How to do X”)	0.7	Needs accuracy
Exploratory (“About X”)	0.3	Needs breadth
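
One way to wire the table into a pipeline is a lookup keyed by query type. The thresholds come from the table above; the prefix-based `classify_query` heuristic is a hypothetical placeholder for a real classifier.

```python
# Thresholds taken from the table above.
THRESHOLDS = {
    "definition": 0.6,
    "enumeration": 0.4,
    "how_to": 0.7,
    "exploratory": 0.3,
}

def classify_query(query):
    """Rough prefix heuristic; a stand-in for a trained or LLM classifier."""
    q = query.strip().lower()
    if q.startswith(("how to", "how do", "how can")):
        return "how_to"
    if q.startswith(("list", "enumerate", "which")):
        return "enumeration"
    if q.startswith(("what is", "what's", "define")):
        return "definition"
    return "exploratory"  # broadest setting as the default

def threshold_for(query):
    return THRESHOLDS[classify_query(query)]

for q in ["What is RRF?", "List all index types", "How to tune HNSW", "About late chunking"]:
    print(q, "->", threshold_for(q))
```

Defaulting to the exploratory threshold errs toward recall, which leaves the later rerank and select stages to restore precision.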

Index Selection (pgvector)

Data Scale (vectors)	Index	Config / Notes
<10K	None (sequential scan)	No index needed; queries stay under 100ms
10K-100K	HNSW	`m=16, ef_construction=64`
>100K	HNSW aggressive	`m=32, ef_construction=128`

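HNSW is the de facto standard approximate nearest-neighbor index. Around 99% recall with query times near 15ms is achievable, but both figures depend on dataset size and on tuning (`m`, `ef_construction`, and the query-time `hnsw.ef_search` setting), not on the index type alone.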
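
The tiers in the table above can be turned into DDL with a small helper. This is a sketch: the table name `items`, the column name `embedding`, and the choice of the `vector_cosine_ops` operator class (cosine distance) are assumptions, and the identifiers are interpolated directly, so they must come from trusted code rather than user input.

```python
def hnsw_index_sql(row_count, table="items", column="embedding"):
    """Return a pgvector CREATE INDEX statement for the given scale,
    or None when a sequential scan is sufficient."""
    if row_count < 10_000:
        return None  # small tables: skip the index entirely
    m, ef_construction = (16, 64) if row_count <= 100_000 else (32, 128)
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

for rows in (5_000, 50_000, 1_000_000):
    print(rows, "->", hnsw_index_sql(rows))
```

Recall can also be traded against latency per session at query time, for example with `SET hnsw.ef_search = 100;`, without rebuilding the index.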

The Critical Insight for 2026

Late chunking, contextual embeddings, and hybrid search are higher-leverage improvements than swapping to a better LLM. When generation hallucinates, the root cause is almost always bad retrieval, not bad generation.
