Search & Retrieval Patterns

Beyond Basic Similarity Search

The Retrieval Stack

Production search isn't just "find nearest vectors." It's a pipeline with multiple stages, each of which improves result quality.

Pattern 1: Semantic Search (Vector Only)

The simplest pattern. Embed the query, find nearest vectors, return results.
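As a rough illustration of "embed, find nearest, return", here is a minimal brute-force sketch. The three-dimensional vectors and document ids are made up for the example; a real system would get embeddings from a model and would usually query an approximate nearest-neighbor index rather than scanning every vector.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return (doc_id, score) pairs for the top_k most similar documents."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy embeddings (hypothetical values, not produced by any real model).
doc_vecs = {
    "session-expiry": [0.9, 0.1, 0.0],
    "billing-faq": [0.0, 0.2, 0.9],
    "token-refresh": [0.8, 0.3, 0.1],
}
results = semantic_search([1.0, 0.2, 0.0], doc_vecs, top_k=2)
```

Here `results` ranks "session-expiry" first and "token-refresh" second, because their vectors point in nearly the same direction as the query vector.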

When it works: Queries and documents use different words for the same concepts, for example natural-language questions against technical documentation, or cross-language search with a multilingual embedding model.

When it fails: Exact-match queries (error codes, product SKUs, proper nouns). The query "ERR_CONNECTION_REFUSED" should match documents containing that exact string, not semantically similar error descriptions.

Pattern 2: Keyword Search (BM25 / Full-Text)

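Traditional search that ranks documents by term frequency and inverse document frequency. Common implementations include Elasticsearch and Tantivy, which score with BM25 by default, and PostgreSQL full-text search (tsvector), whose built-in ts_rank is not BM25 but fills the same role.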

When it works: Exact-match queries, known-item search, acronyms, codes, names.

When it fails: Vocabulary mismatch. Conceptual queries where the user doesn't know the right terms.
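To make the scoring concrete, here is a small sketch of Okapi BM25 in plain Python. It assumes whitespace tokenization, lowercase matching, and the commonly used parameters k1=1.2 and b=0.75; search engines add analyzers, stemming, and inverted indexes on top of this.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(toks) for toks in tokenized) / len(tokenized)
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Non-negative IDF variant (the "+ 1" inside the log).
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates via k1 and is normalized by document length via b.
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "ERR_CONNECTION_REFUSED means the server rejected the connection",
    "troubleshooting network problems in the browser",
    "how to configure a proxy server",
]
scores = bm25_scores("ERR_CONNECTION_REFUSED", docs)
```

Only the first document contains the exact token, so it is the only one with a non-zero score. That is the exact-match behavior that pure vector search lacks.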

Pattern 3: Hybrid Search

Combine vector and keyword search so that each covers the other's failure mode: vectors handle vocabulary mismatch, keywords handle exact strings. This is the recommended default for production systems.

How Hybrid Search Works

1. Run vector search: get the top N candidates with similarity scores

2. Run keyword search: get the top N candidates with BM25 scores

3. Fuse the two result lists using a combination strategy

Fusion Strategies

Strategy	How	Best For
Reciprocal Rank Fusion (RRF)	Score = sum(1 / (k + rank)) across methods	General purpose, no tuning needed
Weighted sum	Score = α × vector_score + (1-α) × bm25_score, on raw scores	When you know which method matters more and both scores are on comparable scales
Convex combination	Normalize scores to [0,1], then weighted sum	Fairer comparison between methods

RRF is the go-to because it uses only each result's rank, so it doesn't require score normalization and works well without tuning. Use k=60 as a default.
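A minimal sketch of RRF, assuming each retrieval method returns an ordered list of document ids and that ranks start at 1. The function name and the ids are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list of (doc_id, score)."""
    fused = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it contains.
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

In this toy case doc_b wins (ranks 2 and 1) over doc_a (ranks 1 and 3), and documents found by only one method land at the bottom. The constant k dampens the gap between adjacent ranks, so no single list dominates the fused order.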

Hybrid Search Results: Why It Wins

For the query "how to handle authentication timeouts":

Method	Top result	Why
Vector only	"Managing session expiration in web apps"	Semantic match, but not specific enough
Keyword only	"AuthTimeout error code reference"	Exact keyword match, but wrong intent
Hybrid	"Handling auth token refresh and timeout recovery"	Combines semantic understanding with keyword precision

Pattern 4: Re-ranking

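Retrieval (vector or hybrid) is fast but imprecise. Its vector stage uses bi-encoder models that embed query and document independently, so document vectors can be computed ahead of time. Re-ranking uses cross-encoder models that process query and document together. That is much more accurate, but the model has to run once per query-document pair, which makes it far slower per document (often on the order of 100x), so it is applied only to a short candidate list.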

The Two-Stage Pipeline

1. Retrieve 50-100 candidates using vector/hybrid search (fast, ~10ms)

2. Re-rank with a cross-encoder to find the true top 5-10 (slower, ~100ms)

Cross-encoders score the pair (query, document) directly. They see both texts simultaneously and can capture fine-grained relevance signals that bi-encoders miss.
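The two-stage shape can be sketched independently of any particular model. In the sketch below, `retrieve` and `score_pair` are hypothetical callables: the first stands in for vector or hybrid retrieval, the second for a cross-encoder. The toy versions exist only so the example runs without downloading a model.

```python
def two_stage_search(query, retrieve, score_pair, n_candidates=50, top_k=5):
    """Stage 1: cheap candidate retrieval. Stage 2: rescore each (query, doc) pair."""
    candidates = retrieve(query, n_candidates)
    rescored = [(doc, score_pair(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]

CORPUS = [
    "reset your password from the login page",
    "handling auth token refresh and timeout recovery",
    "authentication timeouts and how to handle them",
]

def toy_retrieve(query, n):
    # Stand-in for vector or hybrid retrieval: returns the first n documents.
    return CORPUS[:n]

def toy_score_pair(query, doc):
    # Stand-in for a cross-encoder: counts words shared by query and document.
    return len(set(query.lower().split()) & set(doc.lower().split()))

top = two_stage_search(
    "how to handle authentication timeouts", toy_retrieve, toy_score_pair, top_k=1
)
```

The design point is that the expensive scorer sees only `n_candidates` documents, never the whole corpus, so its cost stays bounded regardless of index size.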

Popular Re-rankers

Model	Provider	Quality	Speed
rerank-v3	Cohere	Excellent	~50ms per 100 docs
bge-reranker-v2	Open source	Very good	~100ms per 100 docs
cross-encoder/ms-marco	Open source	Good	~150ms per 100 docs
Jina Reranker	Jina AI	Very good	~60ms per 100 docs

Re-ranking typically improves nDCG@10 by 5-15% — a significant quality boost for the latency cost.

Pattern 5: Maximal Marginal Relevance (MMR)

Standard search returns the most similar results, which often means redundant results. MMR balances relevance with diversity.

The algorithm:

1. Select the most relevant result

2. For each subsequent pick, score every remaining candidate as λ × similarity_to_query - (1-λ) × max_similarity_to_selected, and select the highest-scoring one

λ controls the trade-off: higher λ = more relevant, lower λ = more diverse

λ = 0.7 is a good default — mostly relevant, with enough diversity to avoid redundancy.
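The selection loop can be sketched as follows, assuming cosine similarity for both the query-to-document and document-to-document terms. The two-dimensional vectors and ids are invented to show a near-duplicate being skipped.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def mmr(query_vec, doc_vecs, top_k=3, lam=0.7):
    """Greedily pick doc ids that are relevant to the query but not redundant."""
    selected = []
    remaining = list(doc_vecs)
    while remaining and len(selected) < top_k:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, doc_vecs[doc_id])
            # Redundancy is 0 for the first pick, so it is the most relevant document.
            redundancy = max(
                (cosine(doc_vecs[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy vectors: the "copy" is almost identical to the guide it duplicates.
doc_vecs = {
    "timeout-guide": [1.0, 0.25],
    "timeout-guide-copy": [1.0, 0.2],
    "token-refresh": [0.2, 1.0],
}
picked = mmr([1.0, 1.0], doc_vecs, top_k=2, lam=0.7)
```

With these vectors, the second pick is "token-refresh" rather than the near-duplicate copy, because the copy's high similarity to the already selected guide outweighs its relevance.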

Use MMR when:

Building RAG systems (diverse context = better answers)

Showing search results that users browse (variety matters)

Summarization (need different perspectives, not restatements)

Pattern 6: Multi-Vector Strategies

Instead of one vector per document, create multiple vectors to capture different aspects:

Late Interaction (ColBERT-style)

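Store one vector per token instead of one per document. At query time, compare each query token vector with every document token vector, keep each query token's best match, and sum those maxima. Quality is typically much higher, but storage grows with token count, often on the order of 100x.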
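A minimal sketch of this token-level scoring (often called MaxSim), assuming token vectors are already L2-normalized so that a dot product acts as cosine similarity. The two-dimensional token vectors are toy values.

```python
def maxsim_score(query_tokens, doc_tokens):
    """Sum, over query token vectors, of the best dot product with any doc token vector."""
    return sum(
        max(sum(q * d for q, d in zip(q_vec, d_vec)) for d_vec in doc_tokens)
        for q_vec in query_tokens
    )

query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # covers both query tokens
doc_b_tokens = [[1.0, 0.0], [0.9, 0.1]]              # covers only the first one

score_a = maxsim_score(query_tokens, doc_a_tokens)
score_b = maxsim_score(query_tokens, doc_b_tokens)
```

Document A scores higher because every query token finds a strong match in it, whereas document B has nothing close to the second query token. A single pooled vector per document cannot express that per-token distinction.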

Multi-representation

Generate multiple embeddings per document:

Title embedding — captures the main topic

Content embedding — captures details

Summary embedding — captures key points

The query is matched against every representation, and each document is scored by its best-matching one.
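One way to read "best match wins" is to take the maximum similarity across a document's representations. The sketch below assumes each document is stored as a dict of named vectors; the names and values are illustrative only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def best_representation_match(query_vec, doc_reprs):
    """Score each document by its best-matching representation."""
    results = []
    for doc_id, reprs in doc_reprs.items():
        name, score = max(
            ((name, cosine(query_vec, vec)) for name, vec in reprs.items()),
            key=lambda pair: pair[1],
        )
        results.append((doc_id, name, score))
    results.sort(key=lambda row: row[2], reverse=True)
    return results

# Hypothetical stored representations for two documents.
doc_reprs = {
    "doc1": {"title": [1.0, 0.0], "content": [0.1, 0.9]},
    "doc2": {"title": [0.7, 0.7], "content": [1.0, 0.1]},
}
results = best_representation_match([0.0, 1.0], doc_reprs)
```

Returning the winning representation's name alongside the score is optional, but it helps when debugging why a document ranked where it did.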

Hypothetical Document Embeddings (HyDE)

1. Ask an LLM to generate a hypothetical answer to the query

2. Embed that hypothetical answer

3. Search using the hypothetical answer's embedding

This works because the hypothetical answer is in document space (it reads like a document, not a question), bridging the asymmetry between queries and documents.
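The HyDE flow can be sketched with three pluggable pieces. Everything below the `hyde_search` function is a stand-in invented for the example: `fake_llm` returns a canned answer instead of calling a model, and `toy_embed` counts words from a tiny fixed vocabulary instead of using a real embedding model.

```python
def hyde_search(query, generate_answer, embed, search, top_k=5):
    """Search with the embedding of a generated answer instead of the raw query."""
    hypothetical = generate_answer(query)
    return search(embed(hypothetical), top_k)

VOCAB = ["token", "refresh", "timeout", "retry", "invoice"]

def toy_embed(text):
    # Stand-in embedding: counts of a few vocabulary words.
    words = text.lower().split()
    return [float(words.count(term)) for term in VOCAB]

def fake_llm(query):
    # Stand-in for an LLM call; a real system would prompt a model here.
    return "refresh the token and retry after a timeout"

DOCS = {
    "auth-guide": "token refresh and timeout retry policy",
    "billing": "invoice schedule",
}

def toy_search(vec, top_k):
    # Dot product against every document's toy embedding.
    scored = [
        (doc_id, sum(a * b for a, b in zip(vec, toy_embed(text))))
        for doc_id, text in DOCS.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

query = "why do my logins keep expiring?"
hits = hyde_search(query, fake_llm, toy_embed, toy_search, top_k=1)
```

In this toy setup the raw query shares no vocabulary with the documents, so its embedding is all zeros, while the hypothetical answer lands on "auth-guide". The trade-off is an extra LLM call per query, plus the risk that a wrong hypothetical answer steers retrieval off course.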

Pattern 7: Filtered Search

Combine vector similarity with structured filters:

"Find similar support tickets from the last 7 days"

"Find relevant products under $50 in category: electronics"

"Find matching candidates with 5+ years experience in New York"

Filter Optimization

Strategy	When to Use
Pre-filter	When filter removes > 90% of data (small result set)
Post-filter	When filter removes < 10% of data (most vectors match)
Integrated	When filter selectivity varies (most vector DBs do this)

Always create metadata indexes on fields you filter frequently. Without them, filters become full scans.
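The difference between pre- and post-filtering shows up clearly on a selective filter. The sketch below uses brute-force dot products and a hypothetical `overfetch` factor for the post-filter path; the items and prices are made up.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pre_filter_search(query_vec, items, predicate, top_k):
    """Apply the metadata filter first, then rank only the survivors."""
    survivors = [item for item in items if predicate(item)]
    survivors.sort(key=lambda item: dot(query_vec, item["vec"]), reverse=True)
    return survivors[:top_k]

def post_filter_search(query_vec, items, predicate, top_k, overfetch=3):
    """Rank everything, keep top_k * overfetch, then drop non-matching items."""
    ranked = sorted(items, key=lambda item: dot(query_vec, item["vec"]), reverse=True)
    candidates = ranked[: top_k * overfetch]
    return [item for item in candidates if predicate(item)][:top_k]

items = [
    {"id": 1, "vec": [1.0, 0.0], "price": 120},
    {"id": 2, "vec": [0.9, 0.1], "price": 80},
    {"id": 3, "vec": [0.8, 0.2], "price": 60},
    {"id": 4, "vec": [0.2, 0.8], "price": 30},
]

def under_50(item):
    return item["price"] < 50

pre = pre_filter_search([1.0, 0.0], items, under_50, top_k=1)
post = post_filter_search([1.0, 0.0], items, under_50, top_k=1)
```

Here the filter removes most items, so post-filtering returns nothing: the three nearest vectors all fail the price check before the one valid item is ever considered. Pre-filtering finds it. When the filter removes very little, the situation reverses and post-filtering avoids the cost of scanning metadata first.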

Evaluation Metrics

How do you know if your search is good?

Metric	What It Measures	Target
Recall@K	Fraction of all relevant docs that appear in the top K	> 0.90
nDCG@10	Ranking quality (higher = better ordered)	> 0.70
MRR	Mean of 1/rank of the first relevant result (1.0 = always ranked first)	> 0.80
Latency P95	95th percentile response time	< 200ms

Build an evaluation set of 50-100 queries, each labeled with the documents that are relevant to it. Run it after every change to your search pipeline.
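The three quality metrics are short enough to compute by hand. This sketch assumes binary relevance (a document is either relevant or not) and a tiny made-up evaluation set of (ranked results, relevant ids) pairs.

```python
import math

def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k."""
    return len(set(ranked[:k]) & relevant) / len(relevant)

def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked, relevant, k):
    """DCG of the ranking divided by the DCG of an ideal ranking (binary gains)."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# Each entry: (results returned by the pipeline, ids labeled relevant).
eval_set = [
    (["d2", "d1", "d3"], {"d1"}),
    (["d5", "d4", "d6"], {"d5", "d6"}),
]
mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
mean_recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in eval_set) / len(eval_set)
mean_ndcg_at_3 = sum(ndcg_at_k(r, rel, 3) for r, rel in eval_set) / len(eval_set)
```

For this toy set, MRR is 0.75 (reciprocal ranks 1/2 and 1) and mean Recall@2 is 0.75 (1.0 and 0.5). Tracking the same numbers before and after a pipeline change turns "it feels better" into a measurable comparison.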

Key Takeaways

Hybrid search (vector + keyword) is the production default

RRF is the simplest, most robust fusion strategy

Re-ranking with cross-encoders typically improves nDCG@10 by 5-15% for ~100ms of added latency

MMR prevents redundant results in RAG and browsing

Filtered search requires metadata indexes for performance

Always measure with an evaluation set — intuition isn't reliable

This is chapter 5 of Vector Databases & Embeddings.
