Your First Vector Pipeline
End-to-End: Ingest, Embed, Store, Search, Evaluate
The Scenario
You're building a customer support knowledge base for a SaaS company. The system needs to:
This module walks through every decision in the pipeline.
Step 1: Document Ingestion
Key Decisions
Text extraction: HTML articles need boilerplate removal (nav, footer, sidebar). Keep headings as context markers: they carry the section topic into the text that gets embedded.
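A minimal sketch of this extraction step, using only Python's standard `html.parser`. The tag sets and the `# ` heading prefix are assumptions for illustration; real help-center HTML will likely need a tailored list of boilerplate containers.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers and block-level tags; adjust for your site.
SKIP_TAGS = {"nav", "footer", "aside", "script", "style"}
HEADING_TAGS = {"h1", "h2", "h3"}
BLOCK_TAGS = {"p", "li", "div", "section", "article"}


class ArticleTextExtractor(HTMLParser):
    """Collects visible text, drops boilerplate, and marks headings."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0      # > 0 while inside a boilerplate container
        self.in_heading = False
        self.lines = []
        self._buf = []

    def _flush(self, prefix=""):
        if self._buf:
            self.lines.append(prefix + " ".join(self._buf))
            self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in HEADING_TAGS and self.skip_depth == 0:
            self._flush()
            self.in_heading = True
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in HEADING_TAGS and self.in_heading:
            self._flush(prefix="# ")   # keep the heading as a context marker
            self.in_heading = False
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self._buf.append(data.strip())


def extract_text(html: str) -> str:
    parser = ArticleTextExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()
    return "\n".join(parser.lines)
```

Headings come out as their own lines, so a later chunking step can prepend the nearest heading to each chunk if that turns out to help retrieval.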
Metadata extraction: Pull structured fields (category, product area, last_updated) into separate metadata. These become filter fields, not part of the embedded text.
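Chunking choice: For help articles averaging 1,500 tokens each, this pipeline uses recursive chunking with 300-token chunks and a 50-token overlap (the values recorded in the decision memo at the end of this module).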
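A minimal sketch of overlapping chunks, using the 300-token size and 50-token overlap listed in the decision memo. It is a simplification in two ways: it uses a plain sliding window rather than recursive splitting, and it expects a pre-tokenized list (splitting on whitespace only approximates the embedding model's tokenizer). The name `chunk_tokens` is hypothetical.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the article
    return chunks


# Usage: whitespace split stands in for a real tokenizer (an assumption).
article_text = "To reset your password open Settings and choose Security"
chunks = chunk_tokens(article_text.split(), chunk_size=5, overlap=2)
```

With these defaults, a 1,500-token article produces six chunks, the last one shorter than the rest.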
What to Store per Chunk
| Field | Purpose | Example |
|---|---|---|
| chunk_id | Unique identifier | "article-42-chunk-3" |
| embedding | Vector (1536-dim) | [0.02, -0.15, ...] |
| text | Original chunk text | "To reset your password..." |
| article_id | Parent document reference | "article-42" |
| title | Article title for display | "Password Reset Guide" |
| category | Filter field | "account-management" |
| product | Filter field | "web-app" |
| updated_at | Recency filter | "2026-04-15" |
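The table above maps naturally onto a small record type. The sketch below is one possible shape, not a required one: the class name `ChunkRecord`, the helper `make_chunk_id`, and the choice to pass the chunk index in explicitly are all assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class ChunkRecord:
    chunk_id: str            # e.g. "article-42-chunk-3"
    embedding: list[float]   # 1536 floats for the model chosen in Step 2
    text: str                # original chunk text, returned to the user
    article_id: str          # parent document reference
    title: str               # article title for display
    category: str            # filter field
    product: str             # filter field
    updated_at: date         # recency filter


def make_chunk_id(article_id: str, index: int) -> str:
    """Build an ID in the format shown in the table's example column."""
    return f"{article_id}-chunk-{index}"


record = ChunkRecord(
    chunk_id=make_chunk_id("article-42", 3),
    embedding=[0.02, -0.15],  # truncated for illustration
    text="To reset your password...",
    article_id="article-42",
    title="Password Reset Guide",
    category="account-management",
    product="web-app",
    updated_at=date(2026, 4, 15),
)
```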
Step 2: Embedding Generation
Model Selection for This Use Case
Applying the decision framework from Module 5:
| Factor | Requirement | Choice |
|---|---|---|
| Languages | English only | No multilingual needed |
| Quality | High (support accuracy matters) | Top-tier model |
| Scale | 2,500 chunks (small) | Cost isn't a concern |
| Latency | Sub-100ms query time | Any model works |
| Infrastructure | Already using Supabase | pgvector available |
Decision: OpenAI text-embedding-3-small (1536 dims). Excellent quality, negligible cost at this scale (on the order of a few cents to embed all chunks), and well supported.
Embedding Pipeline Details
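Batching: Embed in batches of about 100 chunks per request (the limit varies by provider). Embedding one chunk per request multiplies the number of network round trips by the batch size, which makes the run far slower.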
Error handling: API calls can fail. Implement retries with exponential backoff. Track which chunks succeeded so you can resume.
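The batching, retry, and resume ideas above can be combined in one loop. This is a sketch under stated assumptions: `embed_batch` is a hypothetical callable you supply (it takes a list of texts and returns one vector per text, for example a thin wrapper around your provider's client), and `done` is a plain dict standing in for whatever progress store you persist.

```python
import time


def embed_all(chunks, embed_batch, done=None, batch_size=100,
              max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Embed chunks (chunk_id -> text) in batches, skipping IDs already in `done`.

    `done` is updated in place after every successful batch, so progress
    survives even if a later batch exhausts its retries and raises.
    """
    done = {} if done is None else done
    pending = [cid for cid in chunks if cid not in done]
    for i in range(0, len(pending), batch_size):
        ids = pending[i:i + batch_size]
        texts = [chunks[cid] for cid in ids]
        for attempt in range(max_retries + 1):
            try:
                vectors = embed_batch(texts)
                break
            except Exception:  # narrow this to your client's error types
                if attempt == max_retries:
                    raise
                sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
        if len(vectors) != len(ids):
            raise ValueError("provider returned a different number of vectors")
        done.update(zip(ids, vectors))
    return done
```

Calling `embed_all` again with the same `done` dict only processes chunks that are still missing, which is what makes a failed run resumable.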
Versioning: Store the model name and version alongside vectors. When you upgrade models, you'll need to re-embed everything — knowing which model generated which vectors prevents mixing incompatible embeddings.
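One simple way to use the stored model name is to list the chunks whose vectors came from a different model than the one currently serving queries. The sketch assumes each stored row carries an `embedding_model` field; that field name and the function name are hypothetical.

```python
def ids_needing_reembedding(rows, current_model):
    """Return chunk IDs whose vectors were not produced by current_model."""
    return [
        row["chunk_id"]
        for row in rows
        if row.get("embedding_model") != current_model
    ]


rows = [
    {"chunk_id": "article-42-chunk-1", "embedding_model": "text-embedding-3-small"},
    {"chunk_id": "article-42-chunk-2", "embedding_model": "some-older-model"},
    {"chunk_id": "article-42-chunk-3"},  # model never recorded: treat as stale
]
stale = ids_needing_reembedding(rows, "text-embedding-3-small")
```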
Step 3: Storage Setup
For this scale (2,500 chunks), pgvector in Supabase is the obvious choice:
Schema Design
The table needs: vector column (with HNSW index), text storage, and metadata columns (with B-tree indexes for filtering).
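As a sketch of what that schema could look like, the Python constant below holds pgvector DDL that you would run through your own database client. The table and index names are hypothetical, the `content` column corresponds to the `text` field in the Step 1 table, and the `m` and `ef_construction` values shown are pgvector's documented defaults rather than tuned settings.

```python
EMBEDDING_DIM = 1536  # must match the embedding model chosen in Step 2

SCHEMA_SQL = f"""
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE IF NOT EXISTS support_chunks (
    chunk_id        text PRIMARY KEY,
    article_id      text NOT NULL,
    title           text NOT NULL,
    category        text NOT NULL,
    product         text NOT NULL,
    updated_at      date NOT NULL,
    content         text NOT NULL,
    embedding_model text NOT NULL,
    embedding       vector({EMBEDDING_DIM}) NOT NULL
);

-- HNSW index for cosine-distance search on the vector column.
CREATE INDEX IF NOT EXISTS support_chunks_embedding_idx
    ON support_chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- B-tree indexes (the default index type) for the metadata filters.
CREATE INDEX IF NOT EXISTS support_chunks_category_idx ON support_chunks (category);
CREATE INDEX IF NOT EXISTS support_chunks_product_idx ON support_chunks (product);
CREATE INDEX IF NOT EXISTS support_chunks_updated_at_idx ON support_chunks (updated_at);
"""


def to_vector_literal(vec):
    """Format a Python list as the '[x,y,z]' text form pgvector accepts."""
    return "[" + ",".join(str(float(x)) for x in vec) + "]"
```

Storing `embedding_model` in the same row keeps the versioning advice from Step 2 enforceable at query time.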
Index Tuning
For 2,500 vectors, a flat (brute-force) scan takes < 1ms. You could skip indexing entirely. But for good practice:
At this scale, index build takes seconds. At 1M+ vectors, it takes minutes to hours.
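To make the flat scan concrete, here is a pure-Python sketch of brute-force cosine search. It is meant to show what an index approximates, not to replace the database; the function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def flat_search(query, vectors, k=5):
    """Score every stored vector against the query and return the top k.

    vectors: dict of chunk_id -> vector. Cost grows linearly with the
    number of vectors, which is why it stays cheap at 2,500 chunks.
    """
    scored = [(cosine_similarity(query, vec), cid) for cid, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(cid, score) for score, cid in scored[:k]]


toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
top = flat_search([1.0, 0.0], toy_vectors, k=2)
```

A flat scan is exact, so its results also make a useful ground truth when checking whether an HNSW index is returning the same neighbors.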
Step 4: Search Implementation
Query Pipeline
Hybrid Search Setup
Even at small scale, hybrid search is worth implementing:
This catches exact-match queries ("ERR_EXPORT_FAILED") that pure vector search might miss.
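The decision memo at the end of this module names Reciprocal Rank Fusion (RRF) as the way to merge the vector and keyword result lists. A minimal sketch follows; the constant `k=60` is a commonly used default rather than a value given in this module, and the document IDs are made up for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists: each list adds 1 / (k + rank) per item."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for the query "ERR_EXPORT_FAILED".
vector_hits = ["billing-faq", "export-guide", "api-errors"]
keyword_hits = ["export-troubleshooting", "export-guide"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only uses rank positions, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. An item that appears in both lists, like `export-guide` here, rises to the top.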
Result Enrichment
Don't just return the matching chunk — provide context:
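One possible shape for an enriched result is sketched below. It attaches the article title, parent ID, and update date from the stored metadata, plus the neighboring chunks from the same article. Which context to include is an assumption here, as is the reliance on the `article-42-chunk-3` ID format to locate neighbors.

```python
def enrich(chunk_id, store):
    """Return the hit plus display metadata and adjacent chunk text.

    store: dict of chunk_id -> dict with text, title, article_id, updated_at.
    """
    hit = store[chunk_id]
    prefix, _, index = chunk_id.rpartition("-chunk-")
    position = int(index)

    def neighbor_text(offset):
        neighbor = store.get(f"{prefix}-chunk-{position + offset}")
        return neighbor["text"] if neighbor else None

    return {
        "chunk_id": chunk_id,
        "text": hit["text"],
        "title": hit["title"],
        "article_id": hit["article_id"],
        "updated_at": hit["updated_at"],
        "before": neighbor_text(-1),  # None at the start of an article
        "after": neighbor_text(+1),   # None at the end of an article
    }
```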
Step 5: Evaluation
Building Your Eval Set
Create 50 pairs of a query and its expected result:
| Query | Expected Article(s) | Type |
|---|---|---|
| "how to reset password" | Password Reset Guide | Direct match |
| "account locked after too many attempts" | Password Reset Guide, Account Security | Semantic match |
| "ERR_EXPORT_FAILED" | Data Export Troubleshooting | Exact match |
| "cancel my subscription" | Billing FAQ, Account Deletion | Intent match |
| "GDPR data request" | Privacy Policy, Data Export | Domain-specific |
Metrics to Track
| Metric | Your Score | Target | Action if Below |
|---|---|---|---|
| Recall@5 | ? | > 0.90 | Add hybrid search or re-ranking |
| MRR | ? | > 0.80 | Improve chunking or add title embeddings |
| Latency P95 | ? | < 100ms | Add or tune HNSW index |
| Filter accuracy | ? | 1.00 | Check metadata extraction pipeline |
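Recall@5 and MRR from the table can be computed from the eval set with a few lines. In this sketch, `search` is a hypothetical callable that returns a ranked list of article titles for a query, and per-query recall is defined as the fraction of expected articles that appear in the top k; other definitions exist, so state which one you report.

```python
def recall_at_k(results, expected, k=5):
    """Fraction of expected articles found in the top k results."""
    return len(set(results[:k]) & expected) / len(expected)


def reciprocal_rank(results, expected):
    """1 / rank of the first relevant result, or 0 if none is returned."""
    for rank, item in enumerate(results, start=1):
        if item in expected:
            return 1.0 / rank
    return 0.0


def evaluate(eval_set, search, k=5):
    """eval_set: list of (query, set of expected article titles)."""
    recalls, ranks = [], []
    for query, expected in eval_set:
        results = search(query)
        recalls.append(recall_at_k(results, expected, k))
        ranks.append(reciprocal_rank(results, expected))
    n = len(eval_set)
    return {"recall_at_k": sum(recalls) / n, "mrr": sum(ranks) / n}
```

Running `evaluate` after every change to chunking, search, or the model gives a like-for-like comparison against the targets in the table.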
Common Issues and Fixes
| Problem | Symptom | Fix |
|---|---|---|
| Chunks too small | Results lack context | Increase chunk size to 400-500 tokens |
| Chunks too large | Wrong passages match | Decrease chunk size, add title embedding |
| Missing keywords | Exact queries fail | Add hybrid search (BM25) |
| Redundant results | Top 5 are from same article | Apply MMR (λ=0.7) |
| Stale results | Outdated articles rank high | Add recency boost or filter |
| Cross-topic matches | "billing" matches "building" | Try a larger embedding model |
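The MMR fix in the table (λ=0.7) can be sketched as a greedy re-ranking step. Here `candidates` holds relevance scores from the search stage and `similarity` is a hypothetical callable that compares two chunks, for example cosine similarity between their stored embeddings.

```python
def mmr(candidates, similarity, k=5, lam=0.7):
    """Maximal Marginal Relevance: trade relevance against redundancy.

    candidates: list of (chunk_id, relevance) pairs.
    similarity: callable (chunk_id, chunk_id) -> float.
    lam: 1.0 means pure relevance; lower values favor diversity.
    """
    remaining = dict(candidates)
    selected = []
    while remaining and len(selected) < k:
        def mmr_score(cid):
            # Penalize similarity to the closest already-selected chunk.
            redundancy = max((similarity(cid, s) for s in selected), default=0.0)
            return lam * remaining[cid] - (1 - lam) * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy example: chunks from the same article (same first letter) are near-duplicates.
def same_article(a, b):
    return 0.95 if a[0] == b[0] else 0.1

picked = mmr([("a1", 0.9), ("a2", 0.88), ("b1", 0.7)], same_article, k=2)
```

In the toy example the second slot goes to `b1` rather than the slightly more relevant `a2`, because `a2` is nearly identical to the chunk already selected.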
The Complete Architecture
Decision Memo Template
After completing this pipeline, document your decisions:
Embedding model: OpenAI text-embedding-3-small (1536 dims)
Vector database: pgvector (Supabase)
Search strategy: Hybrid (vector + BM25) with RRF
Chunking: Recursive, 300 tokens, 50 overlap
Key Takeaways
This is chapter 6 of Vector Databases & Embeddings.