RAG Architecture Patterns 2026

When a RAG system fails, the failure point is usually retrieval, not generation. Well-built retrieval pipelines are reported to reduce hallucinations by 70-90%, yet most teams under-invest in retrieval quality relative to model selection.

The Foundation: Hybrid Search (Table Stakes)

Hybrid search combines dense retrieval (semantic similarity over embeddings) with sparse retrieval (BM25 keyword matching). The combination consistently outperforms either method alone, so treat it as the baseline rather than an optional extra.

Query → [Dense Embedding Search] + [BM25 Keyword Search]
           ↓                            ↓
        Semantic matches          Exact term matches
           ↓                            ↓
        ──────── Merge + Deduplicate ────────
                        ↓
              Reranking (cross-encoder)
                        ↓
                  Final results

The Production Pipeline (Three Stages)

Stage 1: Recall (Broad, Loose Threshold)

  • Vector search with low threshold (0.25-0.3)
  • BM25 keyword search in parallel
  • Merge via Reciprocal Rank Fusion (RRF)
  • Target: 50 candidates
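The merge step in Stage 1 can be sketched as follows. This is a minimal illustration of Reciprocal Rank Fusion, not a production implementation: the document IDs and the two input rankings are made up, and k=60 is the constant conventionally used for RRF rather than a value taken from this article.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60, top_n=50):
    """Fuse several ranked lists of document IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. A document found by both retrievers ends up as
    a single entry, so fusion also takes care of deduplication.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by doc ID for determinism.
    fused = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in fused[:top_n]]

dense_hits = ["d3", "d1", "d7", "d2"]   # hypothetical vector-search ranking
sparse_hits = ["d1", "d9", "d3", "d4"]  # hypothetical BM25 ranking
candidates = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Because RRF works on ranks rather than raw scores, it avoids having to normalize cosine similarities against BM25 scores, which live on different scales.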

Stage 2: Rerank (Precise)

  • A cross-encoder or ColBERTv2 re-scores all candidates
  • ColBERT stores per-token embeddings and scores them with late interaction, giving near cross-encoder accuracy at near bi-encoder speed
  • Often the highest-ROI improvement a team can make
  • Target: Top 10-20
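To make "late interaction scoring" concrete, here is a small sketch of the MaxSim operation that ColBERT-style models use: each query token is matched to its best document token, and those maxima are summed. The 2-dimensional vectors and document IDs are toy values chosen for illustration; a real system would take per-token embeddings from a trained model and assumes every document has at least one token.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction score: for each query token embedding, take the
    highest dot product against any document token, then sum the maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

def rerank(query_tokens, candidates, top_k=10):
    """Re-score candidates (doc_id -> list of token embeddings), best first."""
    scored = [(maxsim_score(query_tokens, toks), doc_id)
              for doc_id, toks in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]

# Toy per-token embeddings (hypothetical values).
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "a": [[1.0, 0.0], [0.0, 1.0]],   # matches both query tokens
    "b": [[1.0, 0.0], [1.0, 0.0]],   # matches only the first
    "c": [[0.5, 0.25]],              # weak match on both
}
ranked = rerank(query, candidates)
```

Document token embeddings can be computed and stored ahead of time, so only the cheap MaxSim step runs at query time; that is where the speed advantage over a full cross-encoder comes from.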

Stage 3: Select (Strict, Dynamic Threshold)

  • Gap-based threshold: find natural breakpoint between relevant and irrelevant scores
  • Return only results above the gap
  • Target: 3-10 final results

Performance Impact (Cohere Data)

Method               Recall       Accuracy     User Satisfaction
Vector only          65%          78%          3.2/5
+ Keyword (hybrid)   82% (+26%)   81%          3.8/5
+ Reranking          89% (+37%)   91% (+17%)   4.3/5 (+34%)

Advanced Patterns

Contextual Retrieval (Anthropic)

Prepend a short, chunk-specific description to each chunk before embedding it. For example, instead of embedding a raw chunk, embed it with a prefix such as: “This chunk is from the authentication section of the API docs, discussing JWT token validation.”

Simple to implement, with meaningful retrieval accuracy gains. It counters the loss of context that occurs when a chunk is separated from its surrounding document.
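A minimal sketch of the prepend step, under stated assumptions: the Chunk fields and the describe_context helper are hypothetical, and the context sentence is built from metadata here, whereas in practice it is usually written by an LLM that sees the whole document.

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    doc_title: str   # hypothetical metadata field
    section: str     # hypothetical metadata field
    text: str

def describe_context(chunk: Chunk) -> str:
    """Stand-in for an LLM-written context sentence, built from metadata."""
    return f"This chunk is from the {chunk.section} section of {chunk.doc_title}."

def contextualize(chunk: Chunk) -> str:
    """Return the text that should be embedded: context first, then the chunk."""
    return f"{describe_context(chunk)}\n\n{chunk.text}"

chunk = Chunk(
    doc_title="the API docs",
    section="authentication",
    text="Tokens are validated against the signing key on every request.",
)
text_to_embed = contextualize(chunk)
```

The same contextualized text can also be fed to the BM25 index, so keyword search benefits from the added terms as well.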

Agentic RAG

The dominant new enterprise pattern. Specialized agents handle different retrieval stages:

Query Decomposition Agent → breaks complex questions into sub-queries
     ↓
Retrieval Router Agent → directs each sub-query to relevant data subset
     ↓
[Parallel Retrieval Agents] → each searches its assigned data source
     ↓
Validation Agent → checks retrieved chunks for relevance and freshness
     ↓
Synthesis Agent → generates final answer with citations

This pattern is already shipping in production at companies. It is particularly effective for multi-source retrieval (docs + code + tickets + conversations).
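The control flow of the agent chain above can be sketched in plain code. Everything here is a hypothetical stand-in: each "agent" is a simple function rather than an LLM call, the data sources are in-memory lists, and validation checks only word overlap, not freshness.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical in-memory data sources; real ones would be indexes or APIs.
SOURCES = {
    "docs": ["JWT tokens expire after 15 minutes.", "Refresh tokens rotate on use."],
    "tickets": ["Users report JWT expiry errors after the last deploy."],
}

def tokens(text):
    """Lowercased words with surrounding punctuation stripped."""
    return {word.strip(".,:?!").lower() for word in text.split()}

def decompose(question):
    """Stand-in for an LLM decomposer: split the question on ' and '."""
    return [part.strip(" ?") for part in question.split(" and ") if part.strip(" ?")]

def route(sub_query):
    """Stand-in router: ticket-flavored sub-queries go to the ticket source."""
    return "tickets" if {"tickets", "report"} & tokens(sub_query) else "docs"

def retrieve(sub_query):
    """Return (source, text, overlap) for each text sharing a word with the query."""
    source = route(sub_query)
    query_words = tokens(sub_query)
    hits = []
    for text in SOURCES[source]:
        overlap = len(query_words & tokens(text))
        if overlap:
            hits.append((source, text, overlap))
    return hits

def validate(hits, min_overlap=2):
    """Keep hits with enough word overlap; drop duplicates, preserving order."""
    seen, kept = set(), []
    for source, text, overlap in hits:
        if overlap >= min_overlap and text not in seen:
            seen.add(text)
            kept.append((source, text))
    return kept

def synthesize(hits):
    """Stand-in for LLM synthesis: list the evidence with source citations."""
    return "\n".join(f"[{source}] {text}" for source, text in hits)

def answer(question):
    sub_queries = decompose(question)
    with ThreadPoolExecutor() as pool:  # one retrieval task per sub-query
        batches = list(pool.map(retrieve, sub_queries))
    return synthesize(validate([hit for batch in batches for hit in batch]))

result = answer("How long do JWT tokens last and what do users report in tickets?")
```

Keeping each stage as a separate function with a narrow input and output is what makes it possible to swap one stage for an LLM-backed agent without touching the others.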

GraphRAG

Uses knowledge graphs as the retrieval layer instead of flat vector stores. Enables multi-hop reasoning across entities and relationships.

"What projects did the team that built Feature X also work on?"
  → Entity: Feature X → Relationship: built_by → Team A
  → Entity: Team A → Relationship: built → [Project Y, Project Z]

GraphRAG performs significantly better than flat vector search on complex analytical questions that require relationship reasoning. Neo4j is a prominent vendor. Production adoption is real but limited to use cases that genuinely need relationship reasoning (compliance, research, knowledge management).

Not a replacement for vector search — typically used alongside it.
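The Feature X example can be reproduced with a toy traversal over subject-relation-object triples. This is only a sketch of multi-hop lookup, not how any particular graph database works; the Team B / Project W triple is an added distractor, and a real deployment would run a graph query instead of walking Python lists.

```python
from collections import defaultdict

# Toy knowledge graph as (subject, relation, object) triples.
TRIPLES = [
    ("Feature X", "built_by", "Team A"),
    ("Team A", "built", "Project Y"),
    ("Team A", "built", "Project Z"),
    ("Team B", "built", "Project W"),  # hypothetical distractor
]

def build_index(triples):
    """Map (subject, relation) to the list of objects it points to."""
    index = defaultdict(list)
    for subject, relation, obj in triples:
        index[(subject, relation)].append(obj)
    return index

def follow(index, start, relations):
    """Follow a chain of relations from a start entity, one hop per relation."""
    frontier = [start]
    for relation in relations:
        next_frontier = []
        for entity in frontier:
            for obj in index.get((entity, relation), []):
                if obj not in next_frontier:
                    next_frontier.append(obj)
        frontier = next_frontier
    return frontier

index = build_index(TRIPLES)
projects = follow(index, "Feature X", ["built_by", "built"])
```

A pure vector search would have to find a single chunk that mentions Feature X, Team A, and both projects together; the traversal instead composes two separate facts.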

Late Chunking

Instead of chunking a document first and then embedding each chunk, run the full document through a long-context embedding model to get token-level embeddings, then pool the token embeddings within each chunk's span to produce chunk-level embeddings. This preserves document-level context in every chunk embedding.
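The pooling half of late chunking is easy to show in isolation. In this sketch the token embeddings are toy 2-dimensional values standing in for the output of one forward pass over the whole document, and mean pooling is assumed as the pooling method; the chunk spans are hypothetical token index ranges.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors component by component."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def late_chunk(token_embeddings, chunk_spans):
    """Build one embedding per chunk from full-document token embeddings.

    token_embeddings: one vector per token, produced by a single pass over
                      the whole document, so each already reflects context.
    chunk_spans:      (start, end) token index pairs, end exclusive.
    """
    return [mean_pool(token_embeddings[start:end]) for start, end in chunk_spans]

# Toy token embeddings for a 5-token document (hypothetical values).
token_embeddings = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0], [4.0, 0.0]]
spans = [(0, 2), (2, 5)]  # two chunks: tokens 0-1 and tokens 2-4
chunk_vectors = late_chunk(token_embeddings, spans)
```

The difference from conventional chunking is only the order of operations: the model sees the whole document before any boundary is applied, so a pronoun in chunk two can still be influenced by its referent in chunk one.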

Threshold Strategies

Gap-Based (Google Research)

Find the natural “cliff” in similarity scores:

Doc A: 0.85 ┐
Doc B: 0.82 ├─ Relevant cluster
Doc C: 0.78 ┘
─── Gap ───── ← Natural breakpoint (0.78 - 0.45 = 0.33)
Doc D: 0.45 ┐
Doc E: 0.42 ├─ Irrelevant cluster

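Set the threshold at the midpoint of the largest gap; in the example above that is (0.78 + 0.45) / 2 = 0.615. This tends to work far better than a fixed threshold because it adapts to each query's score distribution.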
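A short sketch of the gap-based cut, using the five scores from the diagram above. The function names are illustrative, and the behavior for fewer than two results (return everything) is an assumption rather than something the article specifies.

```python
def gap_threshold(scores):
    """Return the midpoint of the largest gap between adjacent sorted scores,
    or None when there are fewer than two scores to compare."""
    ordered = sorted(scores, reverse=True)
    if len(ordered) < 2:
        return None
    # Pair each gap with the index of the score just above it.
    gaps = [(ordered[i] - ordered[i + 1], i) for i in range(len(ordered) - 1)]
    _, i = max(gaps, key=lambda gap: gap[0])
    return (ordered[i] + ordered[i + 1]) / 2

def select(results):
    """Keep only (doc_id, score) pairs that sit above the largest gap."""
    threshold = gap_threshold([score for _, score in results])
    if threshold is None:
        return list(results)
    return [(doc_id, score) for doc_id, score in results if score > threshold]

results = [("A", 0.85), ("B", 0.82), ("C", 0.78), ("D", 0.45), ("E", 0.42)]
threshold = gap_threshold([score for _, score in results])
kept = select(results)  # documents A, B, and C
```

One caveat worth keeping in mind: when every candidate is irrelevant there is still a largest gap, so a gap-based cut is often paired with a minimum absolute score.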

Adaptive by Query Type

Query Type                   Threshold   Rationale
Definition (“What is X”)     0.6         Needs precision
Enumeration (“List all X”)   0.4         Needs recall
How-to (“How to do X”)       0.7         Needs accuracy
Exploratory (“About X”)      0.3         Needs breadth
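The table above maps directly onto a lookup keyed by query type. The thresholds below come from the table; the regex-based classifier is a deliberately crude, hypothetical stand-in (a production system would more likely use a trained classifier or an LLM), and unmatched queries fall back to the exploratory threshold as an assumption.

```python
import re

# Thresholds from the table above.
THRESHOLDS = {
    "definition": 0.6,
    "enumeration": 0.4,
    "how_to": 0.7,
    "exploratory": 0.3,
}

# Hypothetical keyword patterns, checked in order.
PATTERNS = [
    ("definition", re.compile(r"^\s*what\s+(is|are)\b", re.IGNORECASE)),
    ("enumeration", re.compile(r"^\s*(list|enumerate|which)\b", re.IGNORECASE)),
    ("how_to", re.compile(r"^\s*how\s+(to|do|can)\b", re.IGNORECASE)),
]

def classify(query):
    """Return the first matching query type, defaulting to 'exploratory'."""
    for query_type, pattern in PATTERNS:
        if pattern.search(query):
            return query_type
    return "exploratory"

def threshold_for(query):
    """Look up the similarity threshold for a query's type."""
    return THRESHOLDS[classify(query)]

examples = ["What is RRF?", "List all index types", "How to tune HNSW", "About late chunking"]
chosen = [threshold_for(q) for q in examples]
```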

Index Selection (pgvector)

Data Scale   Index                    Config / Notes
<10K         None (sequential scan)   <100ms queries
10K-100K     HNSW                     m=16, ef_construction=64
>100K        HNSW (aggressive)        m=32, ef_construction=128

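HNSW is the de facto standard index for approximate nearest-neighbor search. With suitable parameters it is commonly cited as reaching about 99% recall with query times around 15ms, though actual recall and latency depend on data size, hardware, and search-time settings such as ef_search.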
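The sizing table can be turned into a small helper that emits the matching pgvector DDL. The table and column names are hypothetical, vector_cosine_ops assumes cosine distance is the metric in use, and the function does no identifier escaping, so treat it as a sketch for trusted inputs only.

```python
def hnsw_index_sql(table, column, row_count):
    """Return a CREATE INDEX statement sized per the table above, or None
    when the data is small enough that a sequential scan is acceptable."""
    if row_count < 10_000:
        return None  # under 10K rows: no index
    if row_count <= 100_000:
        m, ef_construction = 16, 64
    else:
        m, ef_construction = 32, 128
    return (
        f"CREATE INDEX ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

small = hnsw_index_sql("items", "embedding", 5_000)
medium = hnsw_index_sql("items", "embedding", 50_000)
large = hnsw_index_sql("items", "embedding", 1_000_000)
```

Higher m and ef_construction values raise recall at the cost of slower index builds and more memory, which is why the larger setting is reserved for larger collections.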

The Critical Insight for 2026

Late chunking, contextual embeddings, and hybrid search are higher-leverage improvements than swapping to a better LLM. When generation hallucinates, the root cause is almost always bad retrieval, not bad generation.

Sources