Encoding Pipeline
Chunk & Embed
From Documents to Vectors
Raw documents are too large for AI to process directly. A 2,000-word call transcript can't fit into a search query's context window, and comparing whole documents is computationally expensive. The encoding pipeline solves both problems: it splits each document into small, searchable chunks and converts every chunk into a vector that can be compared cheaply.
Key Concepts
Chunking Strategies
Chunking is the art of splitting documents into pieces that are small enough to be useful but large enough to retain meaning.
Fixed-size chunking — Split every N tokens (e.g., 512) with overlap (e.g., 50 tokens). Simple, predictable, works for most content. The overlap ensures you don't lose meaning at chunk boundaries.
Document-aware chunking — Split on natural boundaries: section headers, paragraph breaks, speaker turns in transcripts. Produces chunks of varying size but preserves semantic coherence.
The right strategy depends on your data. CRM records are already small — one per document. Call transcripts benefit from speaker-turn splitting. Product docs split well on headers.
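Both strategies above can be sketched in a few lines. This is a minimal illustration, not a production tokenizer: `fixed_size_chunks` works over a pre-tokenized list, and `header_chunks` splits markdown on section headers; both function names are illustrative.

```python
import re

def fixed_size_chunks(tokens: list, size: int = 512, overlap: int = 50) -> list:
    """Fixed-size chunking: windows of `size` tokens, each overlapping
    the previous one by `overlap` tokens so boundary meaning isn't lost."""
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

def header_chunks(markdown: str) -> list[str]:
    """Document-aware chunking: split on markdown section headers."""
    parts = re.split(r"\n(?=#{1,6} )", markdown)
    return [p.strip() for p in parts if p.strip()]
```

For transcripts, the same document-aware idea applies with a speaker-turn pattern (e.g. splitting on `Speaker:` lines) instead of headers.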
Embeddings
An embedding is a numerical representation of text — typically a vector of 768 to 1536 floating-point numbers. Texts with similar meaning produce vectors that are close together in this high-dimensional space.
"quarterly revenue grew 15%" → [0.23, -0.14, 0.67, ...]
"Q4 sales increased by 15%" → [0.25, -0.12, 0.65, ...] ← very close!
"the weather is sunny today" → [-0.44, 0.91, -0.03, ...] ← far away

Embedding models (OpenAI's text-embedding-3-small, Cohere's embed-v3) are trained on massive text corpora and capture semantic relationships that keyword matching misses entirely.
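"Close together" is usually measured with cosine similarity. A minimal sketch, using the truncated 3-dimensional vectors from the example above (real embeddings have hundreds to thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors:
    1.0 = same direction, 0.0 = orthogonal, negative = opposed."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

revenue = [0.23, -0.14, 0.67]   # "quarterly revenue grew 15%"
sales   = [0.25, -0.12, 0.65]   # "Q4 sales increased by 15%"
weather = [-0.44, 0.91, -0.03]  # "the weather is sunny today"

print(cosine_similarity(revenue, sales))    # close to 1.0
print(cosine_similarity(revenue, weather))  # negative: unrelated meaning
```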
Vector Storage with pgvector
pgvector extends PostgreSQL with vector data types and similarity search operators. Why pgvector over a dedicated vector database?
HNSW vs IVFFlat: HNSW (Hierarchical Navigable Small World) builds a graph structure that's slower to build but faster to query and doesn't require training. IVFFlat clusters vectors and is faster to build but needs a representative sample. For most production systems, HNSW is the better default.
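In SQL, the whole storage layer is a few statements. This is a sketch using pgvector's documented syntax; the table name, columns, and dimension (1536, matching text-embedding-3-small) are assumptions for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)  -- dimension must match your embedding model
);

-- HNSW index for cosine distance: no training step required
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Top-5 nearest chunks; <=> is pgvector's cosine-distance operator
SELECT id, content
FROM chunks
ORDER BY embedding <=> $1
LIMIT 5;
```

The `$1` parameter is the query embedding, passed in as a vector literal from your application code.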
Architecture Pattern
Document[] ──→ Chunker ──→ Chunk[] ──→ Embedder ──→ Vector[] ──→ pgvector
                  │                        │
          Strategy pattern           Batch + retry
          (fixed vs aware)      (rate limit handling)

Metadata Per Chunk
Each chunk inherits its parent document's metadata and extends it with chunk-level fields such as its position in the document.
This metadata is critical for filtering and reranking in Module 3.
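The "batch + retry" box in the pipeline diagram can be sketched as follows. `embed_batch` is a stand-in for a real embedding API call (OpenAI, Cohere); the batching and exponential-backoff logic around it is the point:

```python
import time

def embed_batch(texts):
    """Stand-in for a real embedding API call; returns dummy 8-dim vectors."""
    return [[0.0] * 8 for _ in texts]

def embed_all(texts, batch_size=64, max_retries=3):
    """Embed texts in batches, retrying with exponential backoff
    so transient rate-limit errors don't kill the whole run."""
    vectors = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        for attempt in range(max_retries):
            try:
                vectors.extend(embed_batch(batch))
                break
            except Exception:
                if attempt == max_retries - 1:
                    raise
                time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return vectors
```

Batching matters because embedding APIs charge per request overhead and enforce rate limits; one request per chunk is both slower and more likely to be throttled.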
What You'll Build
Glossary
| Term | Meaning |
|---|---|
| Chunk | A smaller piece of a document, optimized for search |
| Embedding | A numerical vector representing text meaning |
| Vector | An array of floats representing a point in semantic space |
| pgvector | PostgreSQL extension for vector storage and search |
| HNSW | Graph-based index for fast approximate nearest-neighbor search |
| Cosine similarity | Measure of the angle between two vectors (1.0 = pointing in the same direction, i.e. identical meaning) |
| Token | A word-piece unit (~0.75 words per token for English) |
This is chapter 2 of AI Sales Companion.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
View course details