
Encoding Pipeline

Chunk & Embed

From Documents to Vectors

Raw documents are too large for AI to process directly. A 2,000-word call transcript overwhelms the context available for answering a search query, and comparing whole documents against each other is computationally expensive. The encoding pipeline solves both problems:

  • Chunk documents into smaller, meaningful pieces
  • Embed each chunk as a numerical vector
  • Store vectors in a database optimized for similarity search

Key Concepts

    Chunking Strategies

    Chunking is the art of splitting documents into pieces that are small enough to be useful but large enough to retain meaning.

    Fixed-size chunking — Split every N tokens (e.g., 512) with overlap (e.g., 50 tokens). Simple, predictable, works for most content. The overlap ensures you don't lose meaning at chunk boundaries.
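    A fixed-size chunker is only a few lines. The sketch below operates on a pre-tokenized list; the function name and defaults are illustrative, not from a particular library:

```python
def chunk_fixed(tokens: list, size: int = 512, overlap: int = 50) -> list:
    """Slide a window of `size` tokens, advancing size - overlap tokens
    each step, so consecutive chunks share `overlap` tokens at the boundary."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window reached the end; stop before emitting a pure-overlap tail
    return chunks
```

    For quick experiments, `text.split()` gives a rough whitespace-token stream; a production system should tokenize with the embedding model's own tokenizer (e.g. tiktoken for OpenAI models) so chunk sizes match the model's limits.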

    Document-aware chunking — Split on natural boundaries: section headers, paragraph breaks, speaker turns in transcripts. Produces chunks of varying size but preserves semantic coherence.
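    For transcripts, splitting on speaker turns can be done with a regex. This sketch assumes turns start on a new line with a name followed by a colon (an assumption about the transcript format, not a universal rule):

```python
import re

def chunk_by_speaker(transcript: str) -> list:
    """Split a transcript into speaker turns, assuming each turn starts
    on a line like 'Alice:' or 'Sales Rep:'. Continuation lines without
    a 'Name:' prefix stay attached to the previous turn."""
    turns = re.split(r"(?m)^(?=\w[\w ]*:)", transcript)
    return [t.strip() for t in turns if t.strip()]
```

    Note the zero-width lookahead keeps the speaker label inside each chunk; a line like "Note: the following" would also be treated as a turn boundary, so tune the pattern to your data.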

    The right strategy depends on your data. CRM records are already small — one per document. Call transcripts benefit from speaker-turn splitting. Product docs split well on headers.

    Embeddings

    An embedding is a numerical representation of text — typically a vector of 768 to 1536 floating-point numbers. Texts with similar meaning produce vectors that are close together in this high-dimensional space.

    "quarterly revenue grew 15%"  →  [0.23, -0.14, 0.67, ...]
    "Q4 sales increased by 15%"  →  [0.25, -0.12, 0.65, ...]  ← very close!
    "the weather is sunny today" →  [-0.44, 0.91, -0.03, ...] ← far away

    Embedding models (OpenAI's text-embedding-3-small, Cohere's embed-v3) are trained on massive text corpora and capture semantic relationships that keyword matching misses entirely.
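    "Close together" is usually measured with cosine similarity, the measure behind pgvector's cosine-distance operator. A minimal pure-Python version (real systems use numpy or the database operator), applied here to the three visible components of the example vectors above:

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between vectors a and b: 1.0 = same direction,
    0.0 = orthogonal, negative = pointing away from each other."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

    With the truncated vectors from the example, the two revenue sentences score close to 1.0 while the weather sentence scores below zero.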

    Vector Storage with pgvector

    pgvector extends PostgreSQL with vector data types and similarity search operators. Why pgvector over a dedicated vector database?

  • One database for vectors, metadata, and relational data
  • SQL filters combined with vector search (e.g., "similar to X WHERE account = 'Acme'")
  • HNSW indexing for fast approximate nearest-neighbor search
  • Supabase provides managed pgvector out of the box

    HNSW vs IVFFlat — HNSW (Hierarchical Navigable Small World) builds a graph structure that is slower to build but faster to query and doesn't require training. IVFFlat clusters vectors; it is faster to build but needs a representative sample of data before indexing. For most production systems, HNSW is the better default.
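    A minimal pgvector setup might look like the following SQL. The table and column names are illustrative, and `vector(1536)` matches text-embedding-3-small's output dimension:

```sql
-- Enable the extension (pre-installed on Supabase)
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
  id           bigserial PRIMARY KEY,
  content      text NOT NULL,
  account_name text,
  embedding    vector(1536)
);

-- HNSW index using cosine distance
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops);

-- Vector search combined with an ordinary SQL filter;
-- <=> is pgvector's cosine-distance operator, and the
-- bracketed literal stands in for a real query vector.
SELECT content
FROM chunks
WHERE account_name = 'Acme'
ORDER BY embedding <=> '[...]'::vector
LIMIT 5;
```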

    Architecture Pattern

    Document[] ──→ Chunker ──→ Chunk[] ──→ Embedder ──→ Vector[] ──→ pgvector
                     │                        │
              Strategy pattern         Batch + retry
              (fixed vs aware)         (rate limit handling)

    Metadata Per Chunk

    Each chunk inherits and extends its parent document's metadata:

  • source_type — crm, transcript, product, etc.
  • source_id — original document ID
  • account_name — for CRM and transcript data
  • date — when the source was created
  • chunk_index — position within the original document
  • topic — extracted topic or section header

    This metadata is critical for filtering and reranking in Module 3.
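    One way to carry these fields through the pipeline is a small dataclass; the shape below is an illustrative sketch, with field names taken from the list above:

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ChunkMetadata:
    source_type: str                    # "crm", "transcript", "product", ...
    source_id: str                      # original document ID
    chunk_index: int                    # position within the original document
    account_name: Optional[str] = None  # CRM and transcript data only
    date: Optional[str] = None          # when the source was created
    topic: Optional[str] = None         # extracted topic or section header

    def to_row(self) -> dict:
        """Flatten to a plain dict, e.g. for a JSONB metadata column."""
        return asdict(self)
```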

    What You'll Build

  • Design and implement two chunking strategies
  • Generate embeddings via API with batch processing and retry logic
  • Store vectors in pgvector with HNSW indexing
  • Run the full pipeline: 24 documents → ~150 searchable chunks

    Glossary

    Chunk — A smaller piece of a document, optimized for search
    Embedding — A numerical vector representing text meaning
    Vector — An array of floats representing a point in semantic space
    pgvector — PostgreSQL extension for vector storage and search
    HNSW — Graph-based index for fast approximate nearest-neighbor search
    Cosine similarity — Measure of the angle between two vectors (1.0 = identical direction, i.e. near-identical meaning)
    Token — A word-piece unit (~0.75 words per token for English)

    This is chapter 2 of AI Sales Companion.

    Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
