Encoding Pipeline
Chunk & Embed HR Documents
From Policies to Vectors
A 2,000-word PTO policy can't be processed as a single unit for search. When an employee asks "How many vacation days do I get after 3 years?", the AI needs to find the specific paragraph about accrual rates — not the entire policy including blackout periods and carryover rules.
The encoding pipeline solves this by splitting each document into retrievable chunks, embedding each chunk as a vector, and storing those vectors for similarity search.
Key Concepts
Chunking Strategies for HR Data
HR documents are structurally different from generic text. A PTO policy has numbered sections (3.1, 3.2), a handbook has named sections, and an org chart entry is already a single record. One chunking strategy doesn't fit all.
Fixed-size chunking — Split every ~1,500 characters with 200-character overlap. Simple and predictable. Works well for benefits guides where paragraphs are relatively uniform.
Document-aware chunking — Split on structural boundaries: section headers, numbered clauses, policy subsections. This ensures that "Section 4.3: Reporting Process" stays together as one chunk rather than being split across two chunks at an arbitrary character boundary.
Per-record chunking — For org chart entries and PTO records, each record is already small enough to be a single chunk. Using a very large chunk size (5,000 characters) effectively keeps each record intact.
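The first two strategies can be sketched in a few lines. This is a minimal illustration, not the course's exact API: function names and the defaults (1,500 characters, 200 overlap) follow the text above.

```typescript
interface Chunk {
  text: string;
  chunkIndex: number;
  sectionHeading?: string;
}

// Fixed-size: slide a window of `size` characters, overlapping the
// previous chunk by `overlap` characters so sentences that straddle
// a boundary survive intact in at least one chunk.
function fixedSizeChunks(text: string, size = 1500, overlap = 200): Chunk[] {
  const chunks: Chunk[] = [];
  const step = size - overlap;
  for (let start = 0, i = 0; start < text.length; start += step, i++) {
    chunks.push({ text: text.slice(start, start + size), chunkIndex: i });
  }
  return chunks;
}

// Document-aware: split wherever a line starts with a numbered section
// header ("3.1 Accrual", "4.3 Reporting Process"), so each clause
// stays in its own chunk and carries its heading as metadata.
function documentAwareChunks(text: string): Chunk[] {
  return text
    .split(/(?=^\d+(?:\.\d+)*\s+\S)/m)
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .map((p, i) => ({
      text: p,
      chunkIndex: i,
      sectionHeading: p.split("\n")[0],
    }));
}
```

The strategy pattern comes in by selecting one of these functions per `source_type`: policies get the document-aware splitter, benefits guides the fixed-size one, and org chart or PTO records pass through as single chunks.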
Why Chunking Strategy Matters for Compliance
Consider this scenario: the harassment policy says "Reports must be filed within 30 days" in one section and "Investigations will be completed within 15 business days" in another. If a naive fixed-size chunker lumps both sections into one chunk, the AI might conflate the two timelines. Document-aware chunking preserves the section boundaries, keeping each requirement in its own retrievable unit.
Metadata Per Chunk
Each chunk inherits and extends its parent document's metadata:
| Document metadata | Chunk metadata |
|---|---|
| source_type: "policy" | source_type: "policy" |
| category: "leave" | category: "leave" |
| effective_date: "2025" | effective_date: "2025" |
| applicable_states: ["CA"] | applicable_states: ["CA"] |
| confidentiality: "internal" | confidentiality: "internal" |
| | chunk_index: 3 |
| | section_heading: "Carryover" |

This metadata is critical for filtering in Module 3. When someone asks "What's the PTO policy in California?", the retrieval system can filter chunks by category: "leave" AND applicable_states: ["CA", "all"] before running vector similarity, dramatically improving precision.
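As a sketch, that filter-then-search step can be expressed as a single SQL query against a hypothetical `chunks` table (the table name, a `metadata` jsonb column, and a 384-dimension `embedding` column are assumptions for illustration, not the course's exact schema):

```sql
-- $1 is the query embedding; <=> is pgvector's cosine distance operator.
SELECT content
FROM chunks
WHERE metadata->>'category' = 'leave'
  AND metadata->'applicable_states' ?| ARRAY['CA', 'all']
ORDER BY embedding <=> $1
LIMIT 5;
```

Because the metadata lives in ordinary PostgreSQL columns, the filter runs before the vector comparison, so only relevant chunks compete on similarity.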
Embeddings
An embedding is a numerical vector (typically 384 to 1,536 dimensions) that captures the *meaning* of text. Similar concepts produce vectors that are close together:
"parental leave for new mothers" → [0.23, -0.14, 0.67, ...]
"maternity leave policy"        → [0.25, -0.12, 0.65, ...]  ← very close!
"expense reimbursement rates"   → [-0.44, 0.91, -0.03, ...] ← far away

This is why semantic search beats keyword search for HR: an employee asking about "maternity leave" finds the "parental leave" policy even though the exact word doesn't appear.
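"Close together" is usually measured with cosine similarity, which is simple enough to compute directly. A minimal sketch over plain number arrays:

```typescript
// Cosine similarity: the dot product of two vectors divided by the
// product of their magnitudes. 1.0 means the vectors point in the
// same direction (maximal similarity); values near zero or below
// mean the texts are unrelated.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Real embeddings have hundreds of dimensions, but the math is identical; in practice pgvector's `<=>` operator computes the cosine *distance* (1 - similarity) inside the database, so you rarely write this by hand.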
Vector Storage with pgvector
pgvector extends PostgreSQL with vector data types and similarity search. For an HR system this is a natural fit: the embeddings live in the same database as the relational HR data, and the metadata filters described above become ordinary SQL WHERE clauses alongside the similarity search.
HNSW vs IVFFlat: HNSW builds a graph index that is slower to construct but faster to query; IVFFlat clusters vectors and needs a training step on existing data. For HR data (typically under 50K vectors), HNSW is the clear winner: no training step needed, and query latency is typically sub-millisecond at this scale.
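In pgvector, building the HNSW index is a single statement (assuming an `embedding` column on a `chunks` table; the names are illustrative):

```sql
-- m and ef_construction are pgvector's defaults; higher values
-- improve recall at the cost of build time.
CREATE INDEX ON chunks USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```

The `vector_cosine_ops` operator class tells the index to optimize for the cosine distance operator (`<=>`) used at query time.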
Architecture Pattern
Document[] ──→ Chunker ──→ Chunk[] ──→ Embedder ──→ Vector[] ──→ pgvector
                  │                        │
           Strategy pattern           Batch + retry
       (fixed vs document-aware)   (rate limit handling)

Design Tradeoffs
| Decision | Tradeoff |
|---|---|
| Chunk size (1,500 chars) | Larger = more context per chunk, fewer chunks. Smaller = more precise retrieval but risk losing context |
| Overlap (200 chars) | Prevents losing meaning at chunk boundaries. Costs more storage but improves recall |
| Local vs API embeddings | Local (Xenova) = free, private, slower. API (OpenAI) = better quality, costs money, data leaves your network |
| HNSW ef_construction (64) | Higher = better recall, slower index build. 64 is a good default for <100K vectors |
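The "Batch + retry" stage from the architecture diagram can be sketched as a small wrapper around any embedding call. Here `embedBatch` is a hypothetical stand-in for either a local Xenova model or a hosted API; the names and defaults are illustrative:

```typescript
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Batch texts and retry failed batches (e.g. API rate limits) with
// exponential backoff. The embedding call is injected, so the same
// pipeline works for local or API embeddings.
async function embedAll(
  texts: string[],
  embedBatch: EmbedFn,
  batchSize = 32,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    for (let attempt = 0; ; attempt++) {
      try {
        vectors.push(...(await embedBatch(batch)));
        break;
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        // Back off 1s, 2s, 4s... before retrying the same batch.
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  return vectors;
}
```

Injecting the embedding function is also where the local-vs-API tradeoff from the table surfaces: swapping providers changes one argument, not the pipeline.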
What You'll Build
Glossary
| Term | Meaning |
|---|---|
| Chunk | A smaller piece of a document, optimized for search |
| Embedding | A numerical vector representing text meaning |
| pgvector | PostgreSQL extension for vector storage and search |
| HNSW | Graph-based index for fast approximate nearest-neighbor search |
| Cosine similarity | Measure of the angle between two vectors (1.0 = same direction, i.e. maximal similarity) |
| Strategy pattern | Design pattern where algorithm is selected at runtime |
This is chapter 2 of AI HR Assistant.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.