Search & Retrieval Patterns
Beyond Basic Similarity Search
The Retrieval Stack
Production search isn't just "find nearest vectors." It's a pipeline with multiple stages, each improving result quality. The patterns below are the building blocks of that pipeline.
Pattern 1: Semantic Search (Vector Only)
The simplest pattern. Embed the query, find nearest vectors, return results.
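As a minimal sketch of this pattern, the snippet below ranks documents by cosine similarity to a query vector. The vectors are hand-written toy values standing in for real embeddings, and `semantic_search` and `DOC_VECTORS` are hypothetical names used only for illustration. A real system would get vectors from an embedding model and use an approximate nearest-neighbor index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_search(query_vector, doc_vectors, top_k=3):
    """Brute-force nearest-vector search. doc_vectors maps doc_id -> vector."""
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in doc_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (assumed values, not model output).
DOC_VECTORS = {
    "doc_a": [0.9, 0.1, 0.0],
    "doc_b": [0.1, 0.9, 0.0],
    "doc_c": [0.7, 0.3, 0.1],
}

results = semantic_search([1.0, 0.0, 0.0], DOC_VECTORS, top_k=2)
print(results)
```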
When it works: Queries and documents use different words for the same concepts. Natural language questions against technical documentation. Cross-language search.
When it fails: Exact-match queries (error codes, product SKUs, proper nouns). The query "ERR_CONNECTION_REFUSED" should match documents containing that exact string, not semantically similar error descriptions.
Pattern 2: Keyword Search (BM25 / Full-Text)
Traditional search that scores documents by term frequency and inverse document frequency. Common implementations include PostgreSQL full-text search (tsvector), Elasticsearch, and Tantivy.
When it works: Exact-match queries, known-item search, acronyms, codes, names.
When it fails: Vocabulary mismatch. Conceptual queries where the user doesn't know the right terms.
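To make the scoring concrete, here is a small BM25 sketch in pure Python. It assumes whitespace tokenization, the commonly used defaults `k1=1.5` and `b=0.75`, and the non-negative IDF variant `log(1 + (N - df + 0.5) / (df + 0.5))`. The function name `bm25_scores` and the sample documents are illustrative only. Production engines add stemming, stop words, and inverted indexes.

```python
import math
from collections import Counter


def bm25_scores(query, documents, k1=1.5, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    if not documents:
        return []
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in term_freq:
                continue
            df = doc_freq[term]
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Length normalization: long documents get less credit per occurrence.
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


documents = [
    "ERR_CONNECTION_REFUSED when the server is down",
    "how to fix a refused network connection",
    "session timeout settings",
]
scores = bm25_scores("ERR_CONNECTION_REFUSED", documents)
print(scores)
```

Only the first document contains the exact token, so it is the only one with a non-zero score. The second document describes the same problem in different words and gets nothing, which is the vocabulary-mismatch failure described above.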
Pattern 3: Hybrid Search
Combine vector and keyword search for the best of both worlds. This is the recommended default for production systems.
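How Hybrid Search Works
Run the vector search and the keyword search for the same query, then merge the two ranked lists into one using a fusion strategy.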
Fusion Strategies
| Strategy | How | Best For |
|---|---|---|
| Reciprocal Rank Fusion (RRF) | Score = sum(1 / (k + rank)) across methods | General purpose, no tuning needed |
| Weighted sum | Score = α × vector_score + (1-α) × bm25_score | When you know which method matters more |
| Convex combination | Normalize scores to [0,1], then weighted sum | Fairer comparison between methods |
RRF is the go-to because it doesn't require score normalization and works well without tuning. Use k=60 as a default.
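The RRF formula from the table can be sketched in a few lines. The function name `reciprocal_rank_fusion` and the document IDs are hypothetical. Each input list is assumed to be ordered best-first, and ranks start at 1.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several best-first rankings: score(doc) = sum of 1 / (k + rank)."""
    fused = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)


# Illustrative result lists from the two retrievers (made-up IDs).
vector_hits = ["doc_session", "doc_refresh", "doc_oauth"]
keyword_hits = ["doc_errcode", "doc_refresh", "doc_faq"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

A document that appears in both lists, even at rank 2 in each, outranks documents that top only one list. Only ranks are used, which is why no score normalization is needed.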
Hybrid Search Results: Why It Wins
For the query "how to handle authentication timeouts":
| Method | Top result | Why |
|---|---|---|
| Vector only | "Managing session expiration in web apps" | Semantic match, but not specific enough |
| Keyword only | "AuthTimeout error code reference" | Exact keyword match, but wrong intent |
| Hybrid | "Handling auth token refresh and timeout recovery" | Combines semantic understanding with keyword precision |
Pattern 4: Re-ranking
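Retrieval (vector or hybrid) is fast but imprecise. Vector retrieval uses bi-encoder models that embed query and document independently, so the model never sees the two texts side by side. Re-ranking uses cross-encoder models that process query and document together. This is much more accurate but far slower (on the order of 100x), so it is applied only to a shortlist of candidates.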
The Two-Stage Pipeline
Cross-encoders score the pair (query, document) directly. They see both texts simultaneously and can capture fine-grained relevance signals that bi-encoders miss.
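The two-stage shape can be sketched with pluggable scoring functions. Everything here is a stand-in: `overlap_score` plays the role of the cheap retriever and `toy_pair_score` plays the role of the cross-encoder. Neither is a real model. They only illustrate that stage 2 looks at the query and document together and runs on a shortlist.

```python
def two_stage_search(query, corpus, retrieve_score, rerank_score,
                     candidates=100, top_k=10):
    """Stage 1: cheap score over the whole corpus. Stage 2: costly score on the shortlist."""
    shortlist = sorted(
        corpus, key=lambda doc: retrieve_score(query, doc), reverse=True
    )[:candidates]
    reranked = sorted(
        shortlist, key=lambda doc: rerank_score(query, doc), reverse=True
    )
    return reranked[:top_k]


def overlap_score(query, doc):
    """Stand-in retriever: number of shared words, ignoring order."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def toy_pair_score(query, doc):
    """Stand-in for a cross-encoder: counts query word pairs that appear in order."""
    words = query.lower().split()
    text = doc.lower()
    return sum(1 for a, b in zip(words, words[1:]) if f"{a} {b}" in text)


corpus = [
    "timeouts in authentication handle",
    "how to handle authentication timeouts gracefully",
    "cooking pasta at home",
]
results = two_stage_search(
    "handle authentication timeouts", corpus,
    retrieve_score=overlap_score, rerank_score=toy_pair_score,
    candidates=2, top_k=1,
)
print(results)
```

Stage 1 cannot tell the first two documents apart because both share three words with the query. Stage 2 prefers the one where the words appear in a meaningful order.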
Popular Re-rankers
| Model | Provider | Quality | Speed |
|---|---|---|---|
| rerank-v3 | Cohere | Excellent | ~50ms per 100 docs |
| bge-reranker-v2 | Open source | Very good | ~100ms per 100 docs |
| cross-encoder/ms-marco | Open source | Good | ~150ms per 100 docs |
| Jina Reranker | Jina AI | Very good | ~60ms per 100 docs |
Re-ranking typically improves nDCG@10 by 5-15% — a significant quality boost for the latency cost.
Pattern 5: Maximal Marginal Relevance (MMR)
Standard search returns the most similar results, which often means redundant results. MMR balances relevance with diversity.
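The algorithm: start with the most relevant result, then repeatedly add the candidate that maximizes λ × (similarity to the query) − (1 − λ) × (highest similarity to any result already selected), until you have K results.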
λ = 0.7 is a good default — mostly relevant, with enough diversity to avoid redundancy.
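A greedy MMR selection can be sketched as follows. It assumes cosine similarity for both relevance and redundancy, and the vectors and IDs are toy values chosen so that two documents are near-duplicates. The function name `mmr` and its parameters are illustrative.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr(query_vec, doc_vecs, top_k=3, lam=0.7):
    """Greedy MMR: pick docs maximizing lam * relevance - (1 - lam) * redundancy."""
    remaining = dict(doc_vecs)
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, remaining[doc_id])
            # Redundancy: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(remaining[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


docs = {
    "doc_a": [0.9, 0.4],
    "doc_a_dup": [0.9, 0.41],  # near-duplicate of doc_a
    "doc_b": [0.9, -0.5],      # slightly less relevant, but different
}
query = [1.0, 0.0]

print(mmr(query, docs, top_k=2, lam=0.7))  # diversity-aware selection
print(mmr(query, docs, top_k=2, lam=1.0))  # lam=1.0 reduces to plain similarity ranking
```

With λ = 0.7 the near-duplicate is skipped in favor of the different document. With λ = 1.0 the redundancy term vanishes and the duplicate comes back.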
Use MMR when:
Pattern 6: Multi-Vector Strategies
Instead of one vector per document, create multiple vectors to capture different aspects:
Late Interaction (ColBERT-style)
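Store one vector per token instead of one per document. At query time, compute fine-grained token-level similarity: each query token is matched to its most similar document token, and those per-token maxima are summed. Quality is much higher, but storage grows with the number of tokens per document (on the order of 100x more).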
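The token-level scoring used in ColBERT-style late interaction is often called MaxSim. A minimal sketch, assuming dot-product similarity and hand-written toy token vectors (real token embeddings come from the model, and `maxsim_score` is an illustrative name):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_token_vecs, doc_token_vecs):
    """For each query token take its best-matching doc token, then sum those maxima.

    Assumes doc_token_vecs is non-empty.
    """
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )


# Two query tokens, each a toy 2-dimensional vector.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]

# doc_1 has a strong match for both query tokens; doc_2 only for the first.
doc_1_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
doc_2_tokens = [[1.0, 0.0], [0.7, 0.1]]

score_1 = maxsim_score(query_tokens, doc_1_tokens)
score_2 = maxsim_score(query_tokens, doc_2_tokens)
print(score_1, score_2)
```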
Multi-representation
Generate multiple embeddings per document:
Query matches against all representations; best match wins.
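The "best match wins" rule can be sketched like this. The representation names `title` and `summary` are hypothetical examples, not a list taken from this chapter, and the vectors are toy values.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_representation_search(query_vec, doc_representations, top_k=3):
    """Score each document by its best-matching representation.

    doc_representations maps doc_id -> {representation_name: vector}.
    Returns (doc_id, winning_representation, score) tuples, best first.
    """
    scored = []
    for doc_id, representations in doc_representations.items():
        name, score = max(
            ((name, cosine(query_vec, vec)) for name, vec in representations.items()),
            key=lambda pair: pair[1],
        )
        scored.append((doc_id, name, score))
    scored.sort(key=lambda item: item[2], reverse=True)
    return scored[:top_k]


# Hypothetical representations with toy vectors.
index = {
    "doc_1": {"title": [1.0, 0.0], "summary": [0.0, 1.0]},
    "doc_2": {"title": [0.6, 0.8], "summary": [0.8, 0.6]},
}
hits = best_representation_search([0.0, 1.0], index)
print(hits)
```

Returning the winning representation alongside the score makes it easier to debug why a document matched.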
Hypothetical Document Embeddings (HyDE)
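Instead of embedding the query directly, ask an LLM to write a hypothetical answer to it, embed that answer, and search with the resulting vector. This works because the hypothetical answer is in document space (it reads like a document, not a question), bridging the asymmetry between queries and documents.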
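The flow can be sketched with stand-in components. None of these are real APIs: `fake_llm` returns a canned answer in place of an LLM call, `embed` is a bag-of-words counter over a tiny assumed vocabulary in place of an embedding model, and `make_search` builds a brute-force search over a two-document toy index.

```python
import math
import re

# Assumed toy vocabulary for the stand-in embedder.
VOCAB = ["token", "refresh", "session", "expired", "pasta"]


def embed(text):
    """Stand-in embedder: counts of vocabulary words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [float(words.count(term)) for term in VOCAB]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_search(index):
    """Build a brute-force search function over doc_id -> vector."""
    def search(vector, top_k):
        ranked = sorted(index, key=lambda doc_id: cosine(vector, index[doc_id]),
                        reverse=True)
        return ranked[:top_k]
    return search


def hyde_search(query, generate_answer, embed_fn, search_fn, top_k=5):
    """HyDE: embed a generated hypothetical answer instead of the raw query."""
    hypothetical_doc = generate_answer(query)
    return search_fn(embed_fn(hypothetical_doc), top_k)


def fake_llm(query):
    """Stand-in for an LLM call; returns a canned hypothetical answer."""
    return "When the session expired, refresh the token and retry."


documents = {
    "doc_auth": "refresh the token when the session expired",
    "doc_food": "boil the pasta",
}
search = make_search({doc_id: embed(text) for doc_id, text in documents.items()})

query = "why do users get logged out?"
hits = hyde_search(query, fake_llm, embed, search, top_k=1)
print(hits)
```

In this toy setup the raw query shares no vocabulary with either document, so its own vector is all zeros. The hypothetical answer does overlap, and that overlap is what retrieves the right document.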
Pattern 7: Filtered Search
Combine vector similarity with structured filters:
Filter Optimization
| Strategy | When to Use |
|---|---|
| Pre-filter | When filter removes > 90% of data (small result set) |
| Post-filter | When filter removes < 10% of data (most vectors match) |
| Integrated | When filter selectivity varies (most vector DBs do this) |
Always create metadata indexes on fields you filter frequently. Without them, filters become full scans.
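The difference between pre-filtering and post-filtering can be sketched with a brute-force search. The document layout (`id`, `vector`, `metadata`), the `lang` field, and the `overfetch` parameter are assumptions made for this example. Real vector databases expose their own filter syntax.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_by_similarity(query_vec, docs, top_k):
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d["vector"]), reverse=True)
    return ranked[:top_k]


def pre_filter_search(query_vec, docs, predicate, top_k):
    """Apply the metadata filter first, then rank only the survivors."""
    candidates = [d for d in docs if predicate(d["metadata"])]
    return top_k_by_similarity(query_vec, candidates, top_k)


def post_filter_search(query_vec, docs, predicate, top_k, overfetch=4):
    """Rank everything, over-fetch, then drop results that fail the filter."""
    nearest = top_k_by_similarity(query_vec, docs, top_k * overfetch)
    return [d for d in nearest if predicate(d["metadata"])][:top_k]


docs = [
    {"id": "d1", "vector": [1.0, 0.0], "metadata": {"lang": "en"}},
    {"id": "d2", "vector": [0.9, 0.1], "metadata": {"lang": "en"}},
    {"id": "d3", "vector": [0.8, 0.2], "metadata": {"lang": "en"}},
    {"id": "d4", "vector": [0.1, 0.9], "metadata": {"lang": "de"}},
]


def is_german(metadata):
    return metadata["lang"] == "de"


query = [1.0, 0.0]
pre = pre_filter_search(query, docs, is_german, top_k=1)
post = post_filter_search(query, docs, is_german, top_k=1, overfetch=2)
print([d["id"] for d in pre], [d["id"] for d in post])
```

With a selective filter, post-filtering comes back empty because the only matching document is not among the over-fetched nearest neighbors. That is why pre-filtering suits filters that remove most of the data.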
Evaluation Metrics
How do you know if your search is good?
| Metric | What It Measures | Target |
|---|---|---|
| Recall@K | Fraction of all relevant docs that appear in the top K | > 0.90 |
| nDCG@10 | Ranking quality (higher = better ordered) | > 0.70 |
| MRR | Mean of 1/rank of the first relevant result, averaged over queries | > 0.80 |
| Latency P95 | 95th percentile response time | < 200ms |
Build an evaluation set of 50-100 query-relevance pairs. Run it after every change to your search pipeline.
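The ranking metrics in the table can be computed with a few small functions. This sketch assumes binary relevance (a document is either relevant or not) and an evaluation set stored as `(ranked_results, relevant_ids)` pairs. The function names and the sample data are illustrative.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was returned."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """nDCG with binary gains: DCG of this ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
        if doc_id in relevant
    )
    ideal = sum(
        1.0 / math.log2(rank + 1)
        for rank in range(1, min(len(relevant), k) + 1)
    )
    return dcg / ideal if ideal else 0.0


# Tiny evaluation set: (ranked results from the system, set of relevant IDs).
eval_set = [
    (["a", "b", "c"], {"a", "c"}),
    (["x", "y", "z"], {"y"}),
]

mrr = sum(reciprocal_rank(r, rel) for r, rel in eval_set) / len(eval_set)
mean_recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in eval_set) / len(eval_set)
mean_ndcg_at_3 = sum(ndcg_at_k(r, rel, 3) for r, rel in eval_set) / len(eval_set)
print(mrr, mean_recall_at_2, mean_ndcg_at_3)
```

Running a script like this after every pipeline change turns the targets in the table into a regression check.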
Key Takeaways
This is chapter 5 of Vector Databases & Embeddings.