Encoding Pipeline
Chunk & Embed
From Documents to Vectors
Raw documents are too large for AI to process directly. A 2,000-word call transcript can't fit into a search query's context window, and comparing whole documents is computationally expensive. The encoding pipeline solves both problems: it splits each document into small, searchable chunks and converts every chunk into a vector that can be compared cheaply.
Key Concepts
Chunking Strategies
Chunking is the art of splitting documents into pieces that are small enough to be useful but large enough to retain meaning.
Fixed-size chunking — Split every N tokens (e.g., 512) with overlap (e.g., 50 tokens). Simple, predictable, works for most content. The overlap ensures you don't lose meaning at chunk boundaries.
Document-aware chunking — Split on natural boundaries: section headers, paragraph breaks, speaker turns in transcripts. Produces chunks of varying size but preserves semantic coherence.
The right strategy depends on your data. CRM records are already small — one per document. Call transcripts benefit from speaker-turn splitting. Product docs split well on headers.
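Both strategies above can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: `fixed_size_chunks` works over a pre-tokenized list, and `header_chunks` splits markdown on section headers; both function names are illustrative.

```python
import re

def fixed_size_chunks(tokens: list, size: int = 512, overlap: int = 50) -> list:
    """Fixed-size chunking: windows of `size` tokens, each overlapping
    the previous one by `overlap` tokens so boundary meaning isn't lost."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def header_chunks(markdown: str) -> list[str]:
    """Document-aware chunking: split on markdown section headers."""
    parts = re.split(r"\n(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]
```

For transcripts, the same document-aware idea applies with a speaker-turn pattern (e.g. splitting on `Speaker:` lines) instead of headers.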
Embeddings
An embedding is a numerical representation of text — typically a vector of 768 to 1536 floating-point numbers. Texts with similar meaning produce vectors that are close together in this high-dimensional space.
"quarterly revenue grew 15%" → [0.23, -0.14, 0.67, ...]
"Q4 sales increased by 15%" → [0.25, -0.12, 0.65, ...] ← very close!
"the weather is sunny today" → [-0.44, 0.91, -0.03, ...] ← far away

Embedding models (OpenAI's text-embedding-3-small, Cohere's embed-v3) are trained on massive text corpora and capture semantic relationships that keyword matching misses entirely.
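"Close together" is usually measured with cosine similarity. A minimal sketch, using the truncated 3-dimensional vectors from the example above (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors:
    1.0 = same direction, 0.0 = orthogonal, negative = opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

revenue = [0.23, -0.14, 0.67]   # "quarterly revenue grew 15%"
sales   = [0.25, -0.12, 0.65]   # "Q4 sales increased by 15%"
weather = [-0.44, 0.91, -0.03]  # "the weather is sunny today"

print(cosine_similarity(revenue, sales))    # close to 1.0
print(cosine_similarity(revenue, weather))  # negative: unrelated meaning
```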
Vector Storage with pgvector
pgvector extends PostgreSQL with vector data types and similarity search operators. Why pgvector over a dedicated vector database?
HNSW vs IVFFlat: HNSW (Hierarchical Navigable Small World) builds a graph structure that's slower to build but faster to query and doesn't require training. IVFFlat clusters vectors and is faster to build but needs a representative sample. For most production systems, HNSW is the better default.
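In SQL, the whole storage layer is a few statements. This is a sketch using pgvector's documented syntax; the table name, columns, and dimension (1536, matching text-embedding-3-small) are assumptions for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)  -- dimension must match your embedding model
);

-- HNSW index for cosine distance: no training step required
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Top-5 nearest chunks; <=> is pgvector's cosine-distance operator
SELECT id, content
FROM chunks
ORDER BY embedding <=> $1
LIMIT 5;
```

The `$1` parameter is the query embedding, passed in as a vector literal from your application code.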
Architecture Pattern
Document[] ──→ Chunker ──→ Chunk[] ──→ Embedder ──→ Vector[] ──→ pgvector
                  │                        │
          Strategy pattern           Batch + retry
          (fixed vs aware)      (rate limit handling)

Metadata Per Chunk
Each chunk inherits its parent document's metadata and extends it with chunk-level fields such as its position in the document.
This metadata is critical for filtering and reranking in Module 3.
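The "batch + retry" box in the pipeline diagram can be sketched as follows. `embed_batch` is a stand-in for a real embedding API call (OpenAI, Cohere); the batching and exponential-backoff logic around it is the point:

```python
import time

def embed_batch(texts):
    """Stand-in for a real embedding API call; returns dummy 8-dim vectors."""
    return [[0.0] * 8 for _ in texts]

def embed_all(texts, batch_size=64, max_retries=3):
    """Embed texts in batches, retrying with exponential backoff
    so transient rate-limit errors don't kill the whole run."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return vectors
```

Batching matters because embedding APIs charge per request overhead and enforce rate limits; one request per chunk is both slower and more likely to be throttled.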
What You'll Build
Glossary
| Term | Meaning |
|---|---|
| Chunk | A smaller piece of a document, optimized for search |
| Embedding | A numerical vector representing text meaning |
| Vector | An array of floats representing a point in semantic space |
| pgvector | PostgreSQL extension for vector storage and search |
| HNSW | Graph-based index for fast approximate nearest-neighbor search |
| Cosine similarity | Measure of the angle between two vectors (1.0 = pointing in the same direction, i.e. identical meaning) |
| Token | A word-piece unit (~0.75 words per token for English) |
This is chapter 2 of AI Sales Companion.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
View course details