Data Lake

Ingest & Normalize HR Documents

Why an HR Data Lake?

An HR assistant is only as good as the documents it can access. But HR data is notoriously fragmented: policies live in PDFs, benefits in spreadsheets, org charts in an HRIS, PTO records in payroll databases, and "how things actually work" lives in tribal knowledge and Slack threads.

Before any AI can answer "What's our parental leave policy for employees in California?", you need a unified ingestion pipeline that normalizes all of this into a common format the AI can search.

Key Concepts

The Document Interface

The core abstraction. Every piece of HR data — whether it's a handbook section, a benefit plan, a formal policy, an org chart entry, or a PTO record — becomes a Document with:

id — unique identifier (e.g., policy_POL-001)

content — the full text, formatted for readability

metadata — source type, category, effective date, applicable states, confidentiality level

source_type — which loader produced it

This is the universal contract that every downstream module depends on. The retrieval system doesn't care whether a document came from JSON or CSV — it just searches Document[].
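A minimal TypeScript sketch of this contract, assuming only the four fields listed above. The interface is named `HRDocument` here purely to avoid clashing with the browser's built-in `Document` type. The union members, the optional markers, and the sample values are illustrative assumptions, not the course's actual definitions.

```typescript
// Sketch only: everything beyond id, content, metadata, and source_type is an assumption.
type Confidentiality = "public" | "internal" | "confidential" | "restricted";
type SourceType = "handbook" | "benefits" | "policy" | "org_chart" | "pto";

interface DocumentMetadata {
  category: string;
  confidentiality: Confidentiality;
  effective_date?: string;      // ISO 8601 date, e.g. "2024-01-01"
  applicable_states?: string[]; // e.g. ["CA", "TX"]
  version?: string;
}

interface HRDocument {
  id: string;
  content: string;
  metadata: DocumentMetadata;
  source_type: SourceType;
}

// Placeholder values for illustration only.
const examplePolicy: HRDocument = {
  id: "policy_POL-001",
  content: "Example policy text goes here.",
  metadata: {
    category: "leave",
    confidentiality: "internal",
    effective_date: "2024-01-01",
    applicable_states: ["CA"],
    version: "1",
  },
  source_type: "policy",
};

// Downstream code depends only on the shared shape, never on the original file format.
function summarize(docs: HRDocument[]): string[] {
  return docs.map((d) => `${d.id} [${d.source_type}/${d.metadata.confidentiality}]`);
}

const summary = summarize([examplePolicy]);
```

Because `summarize` accepts `HRDocument[]`, it works unchanged whether the documents came from a JSON loader or a CSV loader.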

HR-Specific Metadata

Unlike generic document systems, HR data carries metadata that's critical for compliance:

| Metadata Field | Why It Matters |
| --- | --- |
| `effective_date` | Always cite the current policy version, not a superseded one |
| `applicable_states` | California PTO rules differ from Texas; filter by state |
| `confidentiality` | PTO balances are confidential; handbook is public |
| `category` | Route "leave" questions to leave policies, not expense policies |
| `version` | Track which policy version was cited (audit trail) |

Getting metadata right at ingestion time saves enormous complexity downstream. If you don't tag a policy with its applicable states during ingestion, you can't filter by state during retrieval.
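To make the payoff concrete, here is a hedged sketch of a retrieval-time filter that relies on those tags. The `PolicyDoc` shape, the `currentPolicy` helper, and the sample records are hypothetical. The sketch also assumes dates are stored as ISO 8601 strings, which compare correctly as plain strings.

```typescript
// Hypothetical shape: only the metadata field names come from the table above.
interface PolicyDoc {
  id: string;
  content: string;
  metadata: {
    category: string;
    effective_date: string; // ISO 8601, assumed
    applicable_states: string[];
    version: string;
  };
}

// Return the most recent policy in a category that applies to a state as of a given date.
function currentPolicy(
  docs: PolicyDoc[],
  category: string,
  state: string,
  asOf: string,
): PolicyDoc | undefined {
  const candidates = docs.filter(
    (d) =>
      d.metadata.category === category &&
      d.metadata.applicable_states.includes(state) &&
      d.metadata.effective_date <= asOf, // drops versions that are not yet in effect
  );
  // Newest effective date first, so superseded versions sort to the back.
  candidates.sort((a, b) => b.metadata.effective_date.localeCompare(a.metadata.effective_date));
  return candidates[0];
}

// Made-up sample data.
const policies: PolicyDoc[] = [
  { id: "policy_A_v1", content: "Older CA leave text.", metadata: { category: "leave", effective_date: "2022-01-01", applicable_states: ["CA"], version: "1" } },
  { id: "policy_A_v2", content: "Newer CA leave text.", metadata: { category: "leave", effective_date: "2024-01-01", applicable_states: ["CA"], version: "2" } },
  { id: "policy_B_v1", content: "TX leave text.", metadata: { category: "leave", effective_date: "2023-06-01", applicable_states: ["TX"], version: "1" } },
];

const caNow = currentPolicy(policies, "leave", "CA", "2024-06-30");  // version 2
const caThen = currentPolicy(policies, "leave", "CA", "2023-03-15"); // version 1
const nyNow = currentPolicy(policies, "leave", "NY", "2024-06-30");  // undefined: no NY-tagged policy
```

If `applicable_states` had never been populated at ingestion, the state check would have nothing to match against and every query would come back empty.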

Loaders

One loader per data source. Each loader knows how to:

Read its specific format (JSON, CSV)

Extract meaningful fields into metadata

Produce a narrative content field useful for RAG retrieval

Classify confidentiality level

The loader for org chart data is particularly interesting — it resolves manager relationships and lists direct reports, turning a flat employee list into a searchable hierarchy.
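The org chart case can be sketched roughly as follows. The row fields (`employee_id`, `manager_id`, and so on), the `loadOrgChart` name, the sentence templates, and the `internal` confidentiality tag are all assumptions for illustration; the course's loader may differ.

```typescript
// Hypothetical input row from a flat employee list.
interface EmployeeRow {
  employee_id: string;
  name: string;
  title: string;
  manager_id: string | null;
}

interface OrgDoc {
  id: string;
  content: string;
  metadata: { category: string; confidentiality: string; manager_id: string | null };
  source_type: "org_chart";
}

function loadOrgChart(rows: EmployeeRow[]): OrgDoc[] {
  // Index employees by id and collect each manager's direct reports in one pass.
  const byId = new Map<string, EmployeeRow>();
  const reports = new Map<string, string[]>();
  for (const row of rows) {
    byId.set(row.employee_id, row);
    if (row.manager_id !== null) {
      const list = reports.get(row.manager_id) ?? [];
      list.push(row.name);
      reports.set(row.manager_id, list);
    }
  }

  return rows.map((row): OrgDoc => {
    const manager = row.manager_id !== null ? byId.get(row.manager_id) : undefined;
    const direct = reports.get(row.employee_id) ?? [];
    // Narrative content gives retrieval something readable to match against.
    const content = [
      `${row.name} is ${row.title}.`,
      manager ? `Reports to ${manager.name} (${manager.title}).` : "Has no manager on record.",
      direct.length > 0 ? `Direct reports: ${direct.join(", ")}.` : "No direct reports.",
    ].join(" ");
    return {
      id: `org_${row.employee_id}`,
      content,
      metadata: { category: "org_chart", confidentiality: "internal", manager_id: row.manager_id }, // tier is assumed
      source_type: "org_chart",
    };
  });
}

// Made-up employees.
const orgDocs = loadOrgChart([
  { employee_id: "E1", name: "Ada Example", title: "CEO", manager_id: null },
  { employee_id: "E2", name: "Ben Example", title: "VP Engineering", manager_id: "E1" },
  { employee_id: "E3", name: "Cy Example", title: "Engineer", manager_id: "E2" },
]);
```

Each output document names both the manager and the direct reports in plain sentences, so a question like "Who reports to Ben?" can match on text rather than needing a join at query time.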

Confidentiality Levels

HR data has four tiers:

Public — Employee handbook (everyone can see it)

Internal — Policies, benefits guides (employees only, not public-facing)

Confidential — PTO balances, performance data (only the employee + HR)

Restricted — Salary data, investigation notes (HR only)

The ingestion pipeline tags each document with its confidentiality level. The retrieval system and AI gateway enforce access control based on these tags.
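One way such a check might look, as a sketch under stated assumptions: the `Viewer` roles, the `subject_employee_id` field, and the `canView` function are hypothetical names introduced here, and the rules simply mirror the four tier descriptions above.

```typescript
type Confidentiality = "public" | "internal" | "confidential" | "restricted";

// Hypothetical viewer model: anonymous visitor, employee, or HR staff.
interface Viewer {
  role: "anonymous" | "employee" | "hr";
  employee_id?: string;
}

// Hypothetical tagged document; subject_employee_id marks whose record it is.
interface TaggedDoc {
  id: string;
  confidentiality: Confidentiality;
  subject_employee_id?: string;
}

function canView(viewer: Viewer, doc: TaggedDoc): boolean {
  switch (doc.confidentiality) {
    case "public":
      return true; // everyone
    case "internal":
      return viewer.role !== "anonymous"; // employees and HR only
    case "confidential":
      // HR, or the employee the record is about
      return (
        viewer.role === "hr" ||
        (viewer.role === "employee" &&
          viewer.employee_id !== undefined &&
          viewer.employee_id === doc.subject_employee_id)
      );
    case "restricted":
      return viewer.role === "hr"; // HR only
  }
}

// Made-up example: a PTO balance record belonging to employee E001.
const ptoRecord: TaggedDoc = { id: "pto_E001", confidentiality: "confidential", subject_employee_id: "E001" };

const ownerCanSee = canView({ role: "employee", employee_id: "E001" }, ptoRecord); // true
const peerCanSee = canView({ role: "employee", employee_id: "E002" }, ptoRecord);  // false
const hrCanSee = canView({ role: "hr" }, ptoRecord);                               // true
```

A retrieval layer could apply a filter like this before ranking, so restricted text never reaches the model's context in the first place.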

Architecture Pattern

JSON ──→ Handbook Loader ──┐
JSON ──→ Benefits Loader ──┤
JSON ──→ Policy Loader ────┤──→ Validate ──→ Document[]
JSON ──→ Org Loader ───────┤
CSV ───→ PTO Loader ───────┘

Each loader is independent. Adding a new data source (training records, incident reports, job descriptions) means writing one new loader — nothing else changes.
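The pattern can be sketched with two toy loaders and a shared validation step. Everything here is an assumption for illustration: the inline sample data, the loader and function names, the validation rules, and the naive comma split, which would not handle quoted CSV fields.

```typescript
interface Doc {
  id: string;
  content: string;
  source_type: string;
  metadata: Record<string, string>;
}
type Loader = () => Doc[];

// Inline stand-ins for files on disk (made-up data).
const handbookJson = '[{"section":"1.1","title":"Working Hours","text":"Core hours are 10:00 to 15:00."}]';
const ptoCsv = "employee_id,balance_hours\nE001,42\nE001,40";

const loadHandbook: Loader = () => {
  const sections = JSON.parse(handbookJson) as { section: string; title: string; text: string }[];
  return sections.map((s) => ({
    id: `handbook_${s.section}`,
    content: `${s.title}\n\n${s.text}`,
    source_type: "handbook",
    metadata: { category: "handbook", confidentiality: "public" },
  }));
};

const loadPto: Loader = () =>
  ptoCsv.trim().split("\n").slice(1).map((line) => {
    const [employeeId = "", hours = ""] = line.split(","); // naive split, no quoted fields
    return {
      id: `pto_${employeeId}`,
      content: `Employee ${employeeId} has ${hours} hours of PTO remaining.`,
      source_type: "pto",
      metadata: { category: "pto", confidentiality: "confidential" },
    };
  });

// Returns a rejection reason, or null when the document passes.
function validate(doc: Doc, seenIds: Set<string>): string | null {
  if (doc.id.trim() === "") return "missing id";
  if (doc.content.trim() === "") return "empty content";
  if (seenIds.has(doc.id)) return "duplicate id";
  return null;
}

function runPipeline(loaders: Loader[]): { documents: Doc[]; rejected: { id: string; reason: string }[] } {
  const documents: Doc[] = [];
  const rejected: { id: string; reason: string }[] = [];
  const seenIds = new Set<string>();
  for (const load of loaders) {
    for (const doc of load()) {
      const reason = validate(doc, seenIds);
      if (reason === null) {
        seenIds.add(doc.id);
        documents.push(doc);
      } else {
        rejected.push({ id: doc.id, reason });
      }
    }
  }
  return { documents, rejected };
}

const result = runPipeline([loadHandbook, loadPto]);

// Adding a source means one new loader and one new list entry.
const loadTraining: Loader = () => [
  { id: "training_T-1", content: "Example training record.", source_type: "training", metadata: { category: "training", confidentiality: "internal" } },
];
const extended = runPipeline([loadHandbook, loadPto, loadTraining]);
```

In this toy run the second PTO row repeats an employee ID, so validation rejects it as a duplicate while the other documents pass through.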

What You'll Build

Run the pre-seeded ingestion pipeline and watch documents flow in from five HR data sources

Explore each data format and understand the HR-specific parsing patterns

Walk through the loader code, the Document interface, and the metadata model

Extend the pipeline with a new data source or improved validation

Glossary

| Term | Meaning |
| --- | --- |
| Document | The universal data unit: text + metadata + source info |
| Loader | A function that reads one data format and returns Documents |
| Confidentiality | Access tier: public, internal, confidential, restricted |
| Effective date | When a policy version took effect (critical for compliance) |
| Ingestion pipeline | The full flow from raw files to validated Documents |

This is chapter 1 of AI HR Assistant.
