
Encoding Pipeline

Chunk & Embed HR Documents

From Policies to Vectors

A 2,000-word PTO policy can't be processed as a single unit for search. When an employee asks "How many vacation days do I get after 3 years?", the AI needs to find the specific paragraph about accrual rates — not the entire policy including blackout periods and carryover rules.

The encoding pipeline solves this:

  • Chunk documents into smaller, meaningful pieces
  • Embed each chunk as a numerical vector
  • Store vectors in a database optimized for similarity search

Key Concepts

    Chunking Strategies for HR Data

    HR documents are structurally different from generic text. A PTO policy has numbered sections (3.1, 3.2), a handbook has named sections, and an org chart entry is already a single record. One chunking strategy doesn't fit all.

    Fixed-size chunking — Split every ~1,500 characters with 200-character overlap. Simple and predictable. Works well for benefits guides where paragraphs are relatively uniform.
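The fixed-size strategy can be sketched in a few lines. This is an illustrative implementation, not a specific library's API; the defaults match the numbers above (1,500 characters with 200-character overlap).

```typescript
// Fixed-size chunking: slide a window of `size` characters across the text,
// stepping back `overlap` characters each time so boundary sentences appear
// in two adjacent chunks.
function chunkFixed(text: string, size = 1500, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = size - overlap; // each new chunk starts `overlap` chars before the previous one ends
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last chunk already reached the end
  }
  return chunks;
}
```

Note the overlap is what makes this safe for uniform prose: a sentence that straddles a boundary is fully contained in at least one of the two neighboring chunks.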

    Document-aware chunking — Split on structural boundaries: section headers, numbered clauses, policy subsections. This ensures that "Section 4.3: Reporting Process" stays together as one chunk rather than being split across two chunks at an arbitrary character boundary.
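A minimal document-aware chunker can split on numbered-clause boundaries like "3.1" or "4.3". The regex and function name here are illustrative; a production version would also handle named headers and nested numbering.

```typescript
// Document-aware chunking: split just *before* each line that starts with a
// section number (e.g. "4.3 Reporting Process"), so every section stays
// intact as one chunk.
function chunkBySections(text: string): string[] {
  return text
    .split(/\n(?=\d+\.\d+\s)/) // lookahead keeps the header with its section body
    .map((s) => s.trim())
    .filter((s) => s.length > 0);
}
```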

    Per-record chunking — For org chart entries and PTO records, each record is already small enough to be a single chunk. Using a very large chunk size (5,000 characters) effectively keeps each record intact.

    Why Chunking Strategy Matters for Compliance

    Consider this scenario: the harassment policy says "Reports must be filed within 30 days" in one section and "Investigations will be completed within 15 business days" in another. If a naive chunker merges both requirements into the same chunk, the AI might conflate the two timelines. Document-aware chunking preserves the section boundaries, keeping each requirement in its own retrievable unit.

    Metadata Per Chunk

    Each chunk inherits and extends its parent document's metadata:

    Document metadata              Chunk metadata
    ─────────────────────────      ─────────────────────────
    source_type: "policy"          source_type: "policy"
    category: "leave"              category: "leave"
    effective_date: "2025"         effective_date: "2025"
    applicable_states: ["CA"]      applicable_states: ["CA"]
    confidentiality: "internal"    confidentiality: "internal"
                                   chunk_index: 3
                                   section_heading: "Carryover"

    This metadata is critical for filtering in Module 3. When someone asks "What's the PTO policy in California?", the retrieval system can filter chunks by category: "leave" AND applicable_states: ["CA", "all"] before running vector similarity — dramatically improving precision.
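The pre-filter step can be sketched as a plain predicate over chunk metadata, run before any vector math. The interface and field names mirror the metadata table above; the function itself is illustrative.

```typescript
// Metadata shape per chunk, mirroring the table above.
interface ChunkMeta {
  source_type: string;
  category: string;
  applicable_states: string[];
  chunk_index: number;
}

// Narrow the candidate set with exact-match filters *before* similarity
// search: category must match, and the chunk must apply to the employee's
// state (or to all states).
function preFilter(chunks: ChunkMeta[], category: string, state: string): ChunkMeta[] {
  return chunks.filter(
    (c) =>
      c.category === category &&
      (c.applicable_states.includes(state) || c.applicable_states.includes("all")),
  );
}
```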

    Embeddings

    An embedding is a numerical vector (384 to 1,536 dimensions) that captures the *meaning* of text. Similar concepts produce vectors that are close together:

    "parental leave for new mothers"  →  [0.23, -0.14, 0.67, ...]
    "maternity leave policy"          →  [0.25, -0.12, 0.65, ...]  ← very close!
    "expense reimbursement rates"     →  [-0.44, 0.91, -0.03, ...] ← far away

    This is why semantic search beats keyword search for HR — an employee asking about "maternity leave" finds the "parental leave" policy even though the exact word doesn't appear.
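"Close together" is usually measured with cosine similarity. The 3-dimensional vectors below reuse the toy values from the example above; real embeddings have hundreds of dimensions, but the math is identical.

```typescript
// Cosine similarity: the cosine of the angle between two vectors.
// 1.0 = pointing the same way (same meaning); near 0 or negative = unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const parental = [0.23, -0.14, 0.67];  // "parental leave for new mothers"
const maternity = [0.25, -0.12, 0.65]; // "maternity leave policy"
const expenses = [-0.44, 0.91, -0.03]; // "expense reimbursement rates"
```

Running the comparison, the two leave-related vectors score close to 1.0, while the expenses vector scores far lower, which is exactly the behavior semantic search relies on.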

    Vector Storage with pgvector

    pgvector extends PostgreSQL with vector data types and similarity search. For an HR system, this is ideal because:

  • One database for vectors, metadata, and employee data
  • SQL filters combined with vector search (e.g., "similar to X WHERE state = 'CA'")
  • HNSW indexing for fast approximate nearest-neighbor search
  • Supabase provides managed pgvector out of the box

    HNSW vs IVFFlat: HNSW builds a graph structure that's slower to build but faster to query. For HR data (typically <50K vectors), HNSW is the clear winner — no training step needed, and query latency is sub-millisecond.
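A minimal pgvector setup for this pattern looks like the sketch below. The table name `hr_chunks`, the column names, and the 384-dimension size are assumptions for illustration; only the extension, the `vector` type, the HNSW index syntax, and the `<=>` cosine-distance operator come from pgvector itself.

```sql
-- Sketch: one table holds content, metadata, and the embedding together.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE hr_chunks (
  id bigserial PRIMARY KEY,
  content text NOT NULL,
  category text,
  applicable_states text[],
  embedding vector(384)
);

-- HNSW index using cosine distance for fast approximate nearest-neighbor search.
CREATE INDEX ON hr_chunks USING hnsw (embedding vector_cosine_ops);

-- "Similar to X WHERE state = 'CA'": SQL filter plus vector ordering in one query.
-- $1 is the query embedding; && is Postgres array overlap.
SELECT content
FROM hr_chunks
WHERE category = 'leave'
  AND applicable_states && ARRAY['CA', 'all']
ORDER BY embedding <=> $1
LIMIT 5;
```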

    Architecture Pattern

    Document[] ──→ Chunker ──→ Chunk[] ──→ Embedder ──→ Vector[] ──→ pgvector
                     │                        │
              Strategy pattern         Batch + retry
            (fixed vs document-aware)  (rate limit handling)
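The Strategy-pattern box in the diagram can be sketched as a common interface with the concrete chunker chosen from document metadata. Class and function names here are illustrative.

```typescript
// Common interface: every chunking strategy turns a document into chunks.
interface Chunker {
  chunk(text: string): string[];
}

// Fixed-size strategy with overlap (for uniform prose like benefits guides).
class FixedSizeChunker implements Chunker {
  constructor(private size = 1500, private overlap = 200) {}
  chunk(text: string): string[] {
    const out: string[] = [];
    for (let i = 0; i < text.length; i += this.size - this.overlap) {
      out.push(text.slice(i, i + this.size));
      if (i + this.size >= text.length) break;
    }
    return out;
  }
}

// Document-aware strategy: split before numbered section headers.
class SectionChunker implements Chunker {
  chunk(text: string): string[] {
    return text.split(/\n(?=\d+\.\d+\s)/).filter((s) => s.trim().length > 0);
  }
}

// Select the strategy at runtime from the document's source_type metadata.
function chunkerFor(sourceType: string): Chunker {
  return sourceType === "policy" ? new SectionChunker() : new FixedSizeChunker();
}
```

The payoff is that the rest of the pipeline only sees the `Chunker` interface; adding a per-record strategy for org-chart entries later means adding one class, not touching the embedder or storage code.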

    Design Tradeoffs

    Decision                     Tradeoff
    ─────────────────────────    ─────────────
    Chunk size (1,500 chars)     Larger = more context per chunk, fewer chunks. Smaller = more precise retrieval but risk losing context.
    Overlap (200 chars)          Prevents losing meaning at chunk boundaries. Costs more storage but improves recall.
    Local vs API embeddings      Local (Xenova) = free, private, slower. API (OpenAI) = better quality, costs money, data leaves your network.
    HNSW ef_construction (64)    Higher = better recall, slower index build. 64 is a good default for <100K vectors.

    What You'll Build

  • Explore the two chunking strategies and understand how they handle HR documents
  • Add HR-specific metadata enrichment to chunks
  • Generate embeddings (simulated in sandbox, real with Xenova/OpenAI in production)
  • Run the full pipeline: documents into searchable chunks

Glossary

    Term                 Meaning
    ─────────────────    ─────────────
    Chunk                A smaller piece of a document, optimized for search
    Embedding            A numerical vector representing text meaning
    pgvector             PostgreSQL extension for vector storage and search
    HNSW                 Graph-based index for fast approximate nearest-neighbor search
    Cosine similarity    Measure of the angle between two vectors (1.0 = identical meaning)
    Strategy pattern     Design pattern where the algorithm is selected at runtime

    This is chapter 2 of AI HR Assistant.

    Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
