
Search & Retrieval Patterns

Beyond Basic Similarity Search

The Retrieval Stack

Production search isn't just "find nearest vectors." It's a pipeline with multiple stages, each improving result quality. The patterns below cover those stages one at a time.

Pattern 1: Semantic Search (Vector Only)

The simplest pattern. Embed the query, find nearest vectors, return results.

When it works: Queries and documents use different words for the same concepts. Natural language questions against technical documentation. Cross-language search.

When it fails: Exact-match queries (error codes, product SKUs, proper nouns). The query "ERR_CONNECTION_REFUSED" should match documents containing that exact string, not semantically similar error descriptions.
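The embed-and-rank flow described above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the three-dimensional vectors and document ids are made up for the example, and a real system would get its vectors from an embedding model and use an approximate nearest-neighbor index instead of a full scan.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def semantic_search(query_vec, doc_vecs, top_k=3):
    """Return (doc_id, score) pairs for the top_k most similar documents."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Toy 3-dimensional "embeddings" (hypothetical); real ones come from a model.
doc_vecs = {
    "session-expiry": [0.9, 0.1, 0.0],
    "billing-faq": [0.0, 0.2, 0.9],
    "token-refresh": [0.8, 0.3, 0.1],
}
results = semantic_search([1.0, 0.2, 0.0], doc_vecs, top_k=2)
```

Here `results` holds the two documents whose vectors point in nearly the same direction as the query vector, regardless of which words they contain.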

Pattern 2: Keyword Search (BM25 / Full-Text)

Traditional search using term frequency and inverse document frequency. PostgreSQL's tsvector, Elasticsearch, or Tantivy.

When it works: Exact-match queries, known-item search, acronyms, codes, names.

When it fails: Vocabulary mismatch. Conceptual queries where the user doesn't know the right terms.
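To make the term-frequency and inverse-document-frequency idea concrete, here is a small sketch of Okapi BM25 scoring over pre-tokenized documents. The parameter values `k1=1.5` and `b=0.75` and the non-negative IDF variant are common choices assumed for the example; production engines add analyzers, inverted indexes, and their own defaults.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # Rare terms get a higher weight than common ones.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization: long documents are penalized slightly.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "err_connection_refused when the proxy is down".split(),
    "the server refused the request politely".split(),
    "general networking troubleshooting guide".split(),
]
scores = bm25_scores(["err_connection_refused"], docs)
```

Only the first document contains the exact token, so it is the only one with a non-zero score. The second document mentions "refused" but gets nothing, which is the exact-match behavior that makes keyword search good for codes and identifiers.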

Pattern 3: Hybrid Search

Combine vector and keyword search for the best of both worlds. This is the recommended default for production systems.

How Hybrid Search Works

  • Run vector search: get top N candidates with similarity scores
  • Run keyword search: get top N candidates with BM25 scores
  • Fuse results using a combination strategy

Fusion Strategies

| Strategy | How | Best For |
| --- | --- | --- |
| Reciprocal Rank Fusion (RRF) | Score = sum(1 / (k + rank)) across methods | General purpose, no tuning needed |
| Weighted sum | Score = α × vector_score + (1-α) × bm25_score | When you know which method matters more |
| Convex combination | Normalize scores to [0,1], then weighted sum | Fairer comparison between methods |
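The score-based strategies can be sketched as follows. This example assumes min-max normalization and treats a document missing from one result list as scoring 0 for that method; both are illustrative choices, and the scores and document ids are invented.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} map to [0, 1]; constant inputs map to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}

def convex_combination(vector_scores, bm25_scores, alpha=0.5):
    """alpha * normalized vector score + (1 - alpha) * normalized BM25 score."""
    v = min_max_normalize(vector_scores)
    k = min_max_normalize(bm25_scores)
    fused = {
        # Assumption: a document absent from one list contributes 0 there.
        doc_id: alpha * v.get(doc_id, 0.0) + (1 - alpha) * k.get(doc_id, 0.0)
        for doc_id in set(v) | set(k)
    }
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

vector_scores = {"a": 0.82, "b": 0.79, "c": 0.40}   # cosine similarities
bm25_scores = {"b": 12.0, "c": 7.5, "d": 3.0}       # unbounded BM25 scores
ranked = convex_combination(vector_scores, bm25_scores, alpha=0.6)
```

Skipping the normalization step gives the plain weighted sum, where the unbounded BM25 scores would dominate the cosine similarities. That scale mismatch is the reason normalization makes the comparison fairer.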

    RRF is the go-to because it doesn't require score normalization and works well without tuning. Use k=60 as a default.
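A minimal sketch of RRF, assuming each retrieval method returns an ordered list of document ids with ranks starting at 1. The document ids are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of doc ids; rank is 1-based within each list."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda pair: pair[1], reverse=True)

vector_ranking = ["session-expiry", "token-refresh", "oauth-intro"]
keyword_ranking = ["authtimeout-ref", "token-refresh", "login-faq"]
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Only ranks enter the formula, so cosine similarities and BM25 scores never need to be put on a common scale. In this example `token-refresh` is ranked second by both methods and ends up first overall, ahead of the documents that only one method ranked first.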

    Hybrid Search Results: Why It Wins

    For the query "how to handle authentication timeouts":

| Method | Top result | Why |
| --- | --- | --- |
| Vector only | "Managing session expiration in web apps" | Semantic match, but not specific enough |
| Keyword only | "AuthTimeout error code reference" | Exact keyword match, but wrong intent |
| Hybrid | "Handling auth token refresh and timeout recovery" | Combines semantic understanding with keyword precision |

    Pattern 4: Re-ranking

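Retrieval (vector or hybrid) is fast but imprecise. Its vector stage uses bi-encoder models that embed the query and each document independently, so document vectors can be computed once and indexed ahead of time. Re-ranking uses cross-encoder models that process the query and a document together. This is much more accurate, but the model has to run once per (query, document) pair at query time, which makes it roughly 100x slower and impractical to run over a whole corpus.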

    The Two-Stage Pipeline

  • Retrieve 50-100 candidates using vector/hybrid search (fast, ~10ms)
  • Re-rank with a cross-encoder to find the true top 5-10 (slower, ~100ms)

Cross-encoders score the pair (query, document) directly. They see both texts simultaneously and can capture fine-grained relevance signals that bi-encoders miss.
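The two-stage shape can be sketched independently of any particular model. In the example below, `retrieve` and `cross_score` are hypothetical callables standing in for the first-stage search and the cross-encoder; the toy versions only count overlapping words so that the sketch runs without a model.

```python
def two_stage_search(query, retrieve, cross_score, n_candidates=50, top_k=5):
    """Retrieve a broad candidate set cheaply, then re-rank it precisely.

    retrieve(query, n) -> list of documents (first stage, hypothetical)
    cross_score(query, doc) -> float relevance (cross-encoder stand-in)
    """
    candidates = retrieve(query, n_candidates)
    rescored = [(doc, cross_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_k]

# Toy corpus and stand-ins so the sketch is runnable.
CORPUS = [
    "AuthTimeout error code reference",
    "auth token refresh and timeout recovery",
    "session expiration in web apps",
    "billing and invoices",
]

def toy_retrieve(query, n):
    # First stage: keep any document sharing at least one word with the query.
    words = set(query.lower().split())
    return [d for d in CORPUS if words & set(d.lower().split())][:n]

def toy_cross_score(query, doc):
    # Stand-in for a cross-encoder: fraction of query words found in the doc.
    words = query.lower().split()
    doc_words = set(doc.lower().split())
    return sum(w in doc_words for w in words) / len(words)

top = two_stage_search("auth timeout error", toy_retrieve, toy_cross_score, top_k=2)
```

The first stage returns the error-code reference first because of corpus order, and the second stage moves the token-refresh document above it. The expensive scorer only ever sees the candidates, never the full corpus.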

    Popular Re-rankers

| Model | Provider | Quality | Speed |
| --- | --- | --- | --- |
| rerank-v3 | Cohere | Excellent | ~50ms per 100 docs |
| bge-reranker-v2 | Open source | Very good | ~100ms per 100 docs |
| cross-encoder/ms-marco | Open source | Good | ~150ms per 100 docs |
| Jina Reranker | Jina AI | Very good | ~60ms per 100 docs |

    Re-ranking typically improves nDCG@10 by 5-15% — a significant quality boost for the latency cost.

    Pattern 5: Maximal Marginal Relevance (MMR)

    Standard search returns the most similar results, which often means redundant results. MMR balances relevance with diversity.

    The algorithm:

  • Select the most relevant result
  • For each subsequent result, score every remaining candidate as λ × similarity_to_query - (1-λ) × max_similarity_to_selected, and select the highest-scoring one

Higher λ = more relevant, lower λ = more diverse. λ = 0.7 is a good default: mostly relevant, with enough diversity to avoid redundancy.
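The greedy selection loop can be sketched as below. Cosine similarity is assumed for both the query-to-document and document-to-document comparisons, and the two-dimensional vectors and ids are invented so the effect of λ is easy to see.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mmr(query_vec, doc_vecs, top_k=3, lam=0.7):
    """Greedy MMR over a {doc_id: vector} map; returns doc ids in pick order."""
    remaining = dict(doc_vecs)
    selected = []
    while remaining and len(selected) < top_k:
        def mmr_score(doc_id):
            relevance = cosine(query_vec, remaining[doc_id])
            # Penalty: similarity to the closest already-selected document.
            redundancy = max(
                (cosine(remaining[doc_id], doc_vecs[s]) for s in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected

doc_vecs = {
    "guide": [1.0, 0.2],
    "guide-copy": [1.0, 0.15],   # near-duplicate of "guide"
    "other": [0.5, 0.9],         # less similar, but covers a different angle
}
query = [1.0, 0.3]
pure_relevance = mmr(query, doc_vecs, top_k=2, lam=1.0)
diverse = mmr(query, doc_vecs, top_k=2, lam=0.5)
```

With `lam=1.0` the redundancy term vanishes and the near-duplicate is picked second. With `lam=0.5` the penalty for resembling `guide` outweighs the duplicate's relevance advantage, so `other` is picked instead.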

    Use MMR when:

  • Building RAG systems (diverse context = better answers)
  • Search results that users browse (variety matters)
  • Summarization (need different perspectives, not restatements)

Pattern 6: Multi-Vector Strategies

    Instead of one vector per document, create multiple vectors to capture different aspects:

    Late Interaction (ColBERT-style)

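Store one vector per token instead of one per document. At query time, compute fine-grained token-level similarity: for each query token, take its best match among the document's token vectors, then add those scores up. Quality is much higher, but storage grows with the number of tokens per document, on the order of 100x more than a single vector per document.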
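The token-level scoring used by ColBERT-style models is often called MaxSim. The sketch below assumes token vectors are already unit-normalized, so a dot product acts as cosine similarity; the two-dimensional vectors are invented for the example.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def maxsim_score(query_token_vecs, doc_token_vecs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(dot(q, d) for d in doc_token_vecs)
        for q in query_token_vecs
    )

# Toy 2-d token vectors, assumed unit-normalized by the encoder.
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]   # matches both query tokens
doc_b = [[1.0, 0.0], [0.8, 0.6]]               # matches only the first well
score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
```

Each query token finds its own best partner in the document, so a document is rewarded for covering every part of the query rather than for being close on average. That is where the quality gain over a single pooled vector comes from.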

    Multi-representation

    Generate multiple embeddings per document:

  • Title embedding: captures the main topic
  • Content embedding: captures details
  • Summary embedding: captures key points

Query matches against all representations; best match wins.
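One way to read "best match wins" is to score a document by the maximum similarity over its representations. The sketch below assumes that interpretation; the data layout, the `similarity` callable, and the vectors are all hypothetical.

```python
def best_representation_match(query_vec, doc_reprs, similarity):
    """For each document, keep the score of its best-matching representation.

    doc_reprs: {doc_id: {"title": vec, "content": vec, "summary": vec}}
    similarity: callable(vec, vec) -> float
    Returns (doc_id, representation_name, score) tuples, best first.
    """
    results = []
    for doc_id, reprs in doc_reprs.items():
        best_name, best_score = max(
            ((rep_name, similarity(query_vec, vec)) for rep_name, vec in reprs.items()),
            key=lambda pair: pair[1],
        )
        results.append((doc_id, best_name, best_score))
    results.sort(key=lambda r: r[2], reverse=True)
    return results

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

doc_reprs = {
    "doc-1": {"title": [1.0, 0.0], "content": [0.6, 0.8], "summary": [0.8, 0.6]},
    "doc-2": {"title": [0.5, 0.85], "content": [0.0, 1.0], "summary": [1.0, 0.0]},
}
results = best_representation_match([0.6, 0.8], doc_reprs, dot)
```

Returning the name of the winning representation alongside the score is a design choice that helps with debugging: it shows whether a hit came from the title, the body, or the summary.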

    Hypothetical Document Embeddings (HyDE)

  • Ask an LLM to generate a hypothetical answer to the query
  • Embed that hypothetical answer
  • Search using the hypothetical answer's embedding

This works because the hypothetical answer is in document space (it reads like a document, not a question), bridging the asymmetry between queries and documents.
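The three HyDE steps map onto a very small function. In this sketch `generate_answer`, `embed`, and `search` are hypothetical callables for the LLM, the embedding model, and the vector index; the toy versions use a canned answer and word counts over a four-word vocabulary so the example runs without any external service.

```python
def hyde_search(query, generate_answer, embed, search, top_k=5):
    """HyDE: search with the embedding of an LLM-written hypothetical answer."""
    hypothetical = generate_answer(query)   # step 1: draft an answer
    vector = embed(hypothetical)            # step 2: embed the draft
    return search(vector, top_k)            # step 3: nearest-neighbor search

VOCAB = ["token", "refresh", "timeout", "invoice"]

def toy_embed(text):
    # Stand-in for an embedding model: word counts over a tiny vocabulary.
    words = text.lower().split()
    return [words.count(w) for w in VOCAB]

def toy_generate(query):
    # Stand-in for an LLM call; a real system would prompt a model here.
    return "refresh the token before the timeout and retry on timeout"

INDEX = {
    "auth-recovery": toy_embed("token refresh and timeout recovery"),
    "billing": toy_embed("invoice and payment history"),
}

def toy_search(vector, top_k):
    scored = [(doc_id, sum(a * b for a, b in zip(vector, vec)))
              for doc_id, vec in INDEX.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

question = "how do I stop getting logged out?"
hits = hyde_search(question, toy_generate, toy_embed, toy_search, top_k=1)
```

In this toy setup the question itself shares no vocabulary with the index and embeds to all zeros, while the drafted answer lands next to the relevant document. The drafted answer does not need to be correct; it only needs to resemble the documents worth retrieving.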

    Pattern 7: Filtered Search

    Combine vector similarity with structured filters:

  • "Find similar support tickets from the last 7 days"
  • "Find relevant products under $50 in category: electronics"
  • "Find matching candidates with 5+ years experience in New York"

Filter Optimization

| Strategy | When to Use |
| --- | --- |
| Pre-filter | When filter removes > 90% of data (small result set) |
| Post-filter | When filter removes < 10% of data (most vectors match) |
| Integrated | When filter selectivity varies (most vector DBs do this) |

    Always create metadata indexes on fields you filter frequently. Without them, filters become full scans.
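The difference between pre-filtering and post-filtering can be sketched with brute-force ranking. The item layout, the `overfetch` parameter, and the product data are hypothetical; a real vector database would apply the filter inside its index rather than in application code.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pre_filter_search(query_vec, items, predicate, top_k=3):
    """Apply the metadata filter first, then rank only the survivors."""
    survivors = [it for it in items if predicate(it["meta"])]
    survivors.sort(key=lambda it: dot(query_vec, it["vec"]), reverse=True)
    return [it["id"] for it in survivors[:top_k]]

def post_filter_search(query_vec, items, predicate, top_k=3, overfetch=4):
    """Rank everything, take top_k * overfetch, then drop non-matching items."""
    ranked = sorted(items, key=lambda it: dot(query_vec, it["vec"]), reverse=True)
    candidates = ranked[: top_k * overfetch]
    return [it["id"] for it in candidates if predicate(it["meta"])][:top_k]

items = [
    {"id": "p1", "vec": [1.0, 0.0], "meta": {"price": 80}},
    {"id": "p2", "vec": [0.9, 0.1], "meta": {"price": 120}},
    {"id": "p3", "vec": [0.8, 0.2], "meta": {"price": 30}},
    {"id": "p4", "vec": [0.1, 0.9], "meta": {"price": 20}},
]

def under_50(meta):
    return meta["price"] < 50

query = [1.0, 0.0]
pre = pre_filter_search(query, items, under_50, top_k=2)
post_tight = post_filter_search(query, items, under_50, top_k=2, overfetch=1)
post_wide = post_filter_search(query, items, under_50, top_k=2, overfetch=2)
```

With a selective filter, post-filtering without enough over-fetch returns nothing here, because the two nearest items both fail the filter. That failure mode is why post-filtering suits filters that most vectors pass.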

    Evaluation Metrics

    How do you know if your search is good?

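| Metric | What It Measures | Target |
| --- | --- | --- |
| Recall@K | Fraction of relevant docs that appear in the top K | > 0.90 |
| nDCG@10 | Ranking quality (higher = better ordered) | > 0.70 |
| MRR | Mean of 1/rank of the first relevant result, averaged over queries | > 0.80 |
| Latency P95 | 95th percentile response time | < 200ms |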

    Build an evaluation set of 50-100 query-relevance pairs. Run it after every change to your search pipeline.
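The three quality metrics are short enough to compute without a library. The sketch below assumes each evaluation query comes with a ranked list of returned ids and a set (or graded map) of relevant ids; the ids are invented.

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)

def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant result, or 0.0 if none is returned."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """MRR over (ranked_ids, relevant_ids) pairs, one pair per query."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)

def ndcg_at_k(ranked_ids, relevance, k):
    """nDCG with graded relevance given as {doc_id: gain}; missing ids gain 0."""
    dcg = sum(relevance.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]))
    ideal_gains = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal_gains))
    return dcg / idcg if idcg > 0 else 0.0

ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2", "d5"}
graded = {"d1": 1, "d2": 1, "d5": 1}

recall = recall_at_k(ranked, relevant, k=3)
rr = reciprocal_rank(ranked, relevant)
mrr = mean_reciprocal_rank([(ranked, relevant), (["d5"], relevant)])
ndcg = ndcg_at_k(ranked, graded, k=4)
```

Running these functions over the whole evaluation set after each pipeline change, and comparing the averages against the previous run, is usually enough to catch regressions.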

    Key Takeaways

  • Hybrid search (vector + keyword) is the production default
  • RRF is the simplest, most robust fusion strategy
  • Re-ranking with cross-encoders adds 5-15% quality for ~100ms latency
  • MMR prevents redundant results in RAG and browsing
  • Filtered search requires metadata indexes for performance
  • Always measure with an evaluation set; intuition isn't reliable

This is chapter 5 of Vector Databases & Embeddings.
