
Encoding Pipeline

Chunk & Embed HR Documents

From Policies to Vectors

A 2,000-word PTO policy can't be processed as a single unit for search. When an employee asks "How many vacation days do I get after 3 years?", the AI needs to find the specific paragraph about accrual rates — not the entire policy including blackout periods and carryover rules.

The encoding pipeline solves this:

  • Chunk documents into smaller, meaningful pieces
  • Embed each chunk as a numerical vector
  • Store vectors in a database optimized for similarity search

Key Concepts

    Chunking Strategies for HR Data

    HR documents are structurally different from generic text. A PTO policy has numbered sections (3.1, 3.2), a handbook has named sections, and an org chart entry is already a single record. One chunking strategy doesn't fit all.

    Fixed-size chunking — Split every ~1,500 characters with 200-character overlap. Simple and predictable. Works well for benefits guides where paragraphs are relatively uniform.
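The fixed-size strategy can be sketched in a few lines. This is an illustrative implementation, not a specific library's API; the defaults match the numbers above (1,500 characters with 200-character overlap).

```typescript
// Fixed-size chunking: slide a window of `size` characters across the text,
// stepping back `overlap` characters each time so boundary sentences appear
// in two adjacent chunks.
function chunkFixed(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // each new chunk starts `overlap` chars before the previous one ends
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk already reached the end
  }
  return chunks;
}
```

Note the overlap is what makes this safe for uniform prose: a sentence that straddles a boundary is fully contained in at least one of the two neighboring chunks.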

    Document-aware chunking — Split on structural boundaries: section headers, numbered clauses, policy subsections. This ensures that "Section 4.3: Reporting Process" stays together as one chunk rather than being split across two chunks at an arbitrary character boundary.
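A minimal document-aware chunker can split on numbered-clause boundaries like "3.1" or "4.3". The regex and function name here are illustrative; a production version would also handle named headers and nested numbering.

```typescript
// Document-aware chunking: split just *before* each line that starts with a
// section number (e.g. "4.3 Reporting Process"), so every section stays
// intact as one chunk.
function chunkBySections(text: string): string[] {
  return text
    .split(/\n(?=\d+\.\d+\s)/) // lookahead keeps the header with its section body
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```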

    Per-record chunking — For org chart entries and PTO records, each record is already small enough to be a single chunk. Using a very large chunk size (5,000 characters) effectively keeps each record intact.

    Why Chunking Strategy Matters for Compliance

    Consider this scenario: the harassment policy says "Reports must be filed within 30 days" in one section and "Investigations will be completed within 15 business days" in another. If a naive chunker merges both requirements into the same chunk, the AI might conflate the two timelines. Document-aware chunking preserves the section boundaries, keeping each requirement in its own retrievable unit.

    Metadata Per Chunk

    Each chunk inherits and extends its parent document's metadata:

    Document metadata              Chunk metadata
    ─────────────────────────      ─────────────────────────
    source_type: "policy"          source_type: "policy"
    category: "leave"              category: "leave"
    effective_date: "2025"         effective_date: "2025"
    applicable_states: ["CA"]      applicable_states: ["CA"]
    confidentiality: "internal"    confidentiality: "internal"
                                   chunk_index: 3
                                   section_heading: "Carryover"

    This metadata is critical for filtering in Module 3. When someone asks "What's the PTO policy in California?", the retrieval system can filter chunks by category: "leave" AND applicable_states: ["CA", "all"] before running vector similarity — dramatically improving precision.
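The pre-filter step can be sketched as a plain predicate over chunk metadata, run before any vector math. The interface and field names mirror the metadata table above; the function itself is illustrative.

```typescript
// Metadata shape per chunk, mirroring the table above.
interface ChunkMeta {
  source_type: string;
  category: string;
  applicable_states: string[];
  chunk_index: number;
}

// Narrow the candidate set with exact-match filters *before* similarity
// search: category must match, and the chunk must apply to the employee's
// state (or to all states).
function preFilter(chunks: ChunkMeta[], category: string, state: string): ChunkMeta[] {
  return chunks.filter(
    (c) =>
      c.category === category &&
      (c.applicable_states.includes(state) || c.applicable_states.includes("all")),
  );
}
```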

    Embeddings

    An embedding is a numerical vector (384 to 1,536 dimensions) that captures the *meaning* of text. Similar concepts produce vectors that are close together:

    "parental leave for new mothers"  →  [0.23, -0.14, 0.67, ...]
    "maternity leave policy"          →  [0.25, -0.12, 0.65, ...]  ← very close!
    "expense reimbursement rates"     →  [-0.44, 0.91, -0.03, ...] ← far away

    This is why semantic search beats keyword search for HR — an employee asking about "maternity leave" finds the "parental leave" policy even though the exact word doesn't appear.
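"Close together" is usually measured with cosine similarity. The 3-dimensional vectors below reuse the toy values from the example above; real embeddings have hundreds of dimensions, but the math is identical.

```typescript
// Cosine similarity: the cosine of the angle between two vectors.
// 1.0 = pointing the same way (same meaning); near 0 or negative = unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const parental = [0.23, -0.14, 0.67];  // "parental leave for new mothers"
const maternity = [0.25, -0.12, 0.65]; // "maternity leave policy"
const expenses = [-0.44, 0.91, -0.03]; // "expense reimbursement rates"
```

Running the comparison, the two leave-related vectors score close to 1.0, while the expenses vector scores far lower, which is exactly the behavior semantic search relies on.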

    Vector Storage with pgvector

    pgvector extends PostgreSQL with vector data types and similarity search. For an HR system, this is ideal because:

  • One database for vectors, metadata, and employee data
  • SQL filters combined with vector search (e.g., "similar to X WHERE state = 'CA'")
  • HNSW indexing for fast approximate nearest-neighbor search
  • Supabase provides managed pgvector out of the box

    HNSW vs IVFFlat: HNSW builds a graph structure that's slower to build but faster to query. For HR data (typically <50K vectors), HNSW is the clear winner — no training step needed, and query latency is sub-millisecond.
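A minimal pgvector setup for this pattern looks like the sketch below. The table name `hr_chunks`, the column names, and the 384-dimension size are assumptions for illustration; only the extension, the `vector` type, the HNSW index syntax, and the `<=>` cosine-distance operator come from pgvector itself.

```sql
-- Sketch: one table holds content, metadata, and the embedding together.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE hr_chunks (
  id bigserial PRIMARY KEY,
  content text NOT NULL,
  category text,
  applicable_states text[],
  embedding vector(384)
);

-- HNSW index using cosine distance for fast approximate nearest-neighbor search.
CREATE INDEX ON hr_chunks USING hnsw (embedding vector_cosine_ops);

-- "Similar to X WHERE state = 'CA'": SQL filter plus vector ordering in one query.
-- $1 is the query embedding; && is Postgres array overlap.
SELECT content
FROM hr_chunks
WHERE category = 'leave'
  AND applicable_states && ARRAY['CA', 'all']
ORDER BY embedding <=> $1
LIMIT 5;
```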

    Architecture Pattern

    Document[] ──→ Chunker ──→ Chunk[] ──→ Embedder ──→ Vector[] ──→ pgvector
                     │                        │
              Strategy pattern         Batch + retry
            (fixed vs document-aware)  (rate limit handling)
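The Strategy-pattern box in the diagram can be sketched as a common interface with the concrete chunker chosen from document metadata. Class and function names here are illustrative.

```typescript
// Common interface: every chunking strategy turns a document into chunks.
interface Chunker {
  chunk(text: string): string[];
}

// Fixed-size strategy with overlap (for uniform prose like benefits guides).
class FixedSizeChunker implements Chunker {
  constructor(private size = 1500, private overlap = 200) {}
  chunk(text: string): string[] {
    const out: string[] = [];
    for (let i = 0; i < text.length; i += this.size - this.overlap) {
      out.push(text.slice(i, i + this.size));
      if (i + this.size >= text.length) break;
    }
    return out;
  }
}

// Document-aware strategy: split before numbered section headers.
class SectionChunker implements Chunker {
  chunk(text: string): string[] {
    return text.split(/\n(?=\d+\.\d+\s)/).filter((s) => s.trim().length > 0);
  }
}

// Select the strategy at runtime from the document's source_type metadata.
function chunkerFor(sourceType: string): Chunker {
  return sourceType === "policy" ? new SectionChunker() : new FixedSizeChunker();
}
```

The payoff is that the rest of the pipeline only sees the `Chunker` interface; adding a per-record strategy for org-chart entries later means adding one class, not touching the embedder or storage code.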

    Design Tradeoffs

    Decision                     Tradeoff
    ─────────────────────────    ─────────────
    Chunk size (1,500 chars)     Larger = more context per chunk, fewer chunks. Smaller = more precise retrieval but risk losing context.
    Overlap (200 chars)          Prevents losing meaning at chunk boundaries. Costs more storage but improves recall.
    Local vs API embeddings      Local (Xenova) = free, private, slower. API (OpenAI) = better quality, costs money, data leaves your network.
    HNSW ef_construction (64)    Higher = better recall, slower index build. 64 is a good default for <100K vectors.

    What You'll Build

  • Explore the two chunking strategies and understand how they handle HR documents
  • Add HR-specific metadata enrichment to chunks
  • Generate embeddings (simulated in sandbox, real with Xenova/OpenAI in production)
  • Run the full pipeline: documents into searchable chunks

Glossary

    Term                 Meaning
    ─────────────────    ─────────────
    Chunk                A smaller piece of a document, optimized for search
    Embedding            A numerical vector representing text meaning
    pgvector             PostgreSQL extension for vector storage and search
    HNSW                 Graph-based index for fast approximate nearest-neighbor search
    Cosine similarity    Measure of the angle between two vectors (1.0 = identical meaning)
    Strategy pattern     Design pattern where the algorithm is selected at runtime

    This is chapter 2 of AI HR Assistant.

    Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
