Data Lake
Ingest & Normalize HR Documents
Why an HR Data Lake?
An HR assistant is only as good as the documents it can access. But HR data is notoriously fragmented: policies live in PDFs, benefits in spreadsheets, org charts in an HRIS, PTO records in payroll databases, and "how things actually work" lives in tribal knowledge and Slack threads.
Before any AI can answer "What's our parental leave policy for employees in California?", you need a unified ingestion pipeline that normalizes all of this into a common format the AI can search.
Key Concepts
The Document Interface
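The core abstraction. Every piece of HR data, whether it's a handbook section, a benefit plan, a formal policy, an org chart entry, or a PTO record, becomes a Document with:
- a unique ID (e.g., policy_POL-001)
- content: the text the retrieval system searches
- metadata, such as the effective date and confidentiality level
- source information recording where the data came from

This is the universal contract that every downstream module depends on. The retrieval system doesn't care whether a document came from JSON or CSV; it just searches Document[].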
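The contract described above can be sketched in TypeScript. This is a minimal illustration, not the course's actual interface: the exact field names, the `source` shape, the sample policy text, and the `searchDocuments` helper are all assumptions. The type is named `HrDocument` here only to avoid clashing with the browser's global `Document` type; the metadata fields mirror the table in the next section.

```typescript
// Sketch only: field names and shapes are assumptions, not the course's code.
type ConfidentialityLevel = "public" | "internal" | "confidential" | "restricted";

interface HrMetadata {
  category: string;
  confidentiality: ConfidentialityLevel;
  effective_date?: string;      // ISO date, e.g. "2024-01-01"
  applicable_states?: string[]; // e.g. ["CA", "TX"]
  version?: string;
}

interface HrDocument {
  id: string;      // e.g. "policy_POL-001"
  content: string; // the text the retrieval system searches
  metadata: HrMetadata;
  source: { format: "json" | "csv"; path: string }; // hypothetical source info
}

// Made-up sample data for illustration.
const examplePolicy: HrDocument = {
  id: "policy_POL-001",
  content: "Eligible employees receive paid parental leave.",
  metadata: {
    category: "leave",
    confidentiality: "internal",
    effective_date: "2024-01-01",
    applicable_states: ["CA"],
    version: "1",
  },
  source: { format: "json", path: "policies.json" },
};

// Downstream code depends only on the shape, never on the original file format.
function searchDocuments(docs: HrDocument[], term: string): HrDocument[] {
  const needle = term.toLowerCase();
  return docs.filter((d) => d.content.toLowerCase().includes(needle));
}
```

A naive substring search stands in for real retrieval here; the point is that `searchDocuments` accepts any `HrDocument[]` regardless of which loader produced it.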
HR-Specific Metadata
Unlike generic document systems, HR data carries metadata that's critical for compliance:
| Metadata Field | Why It Matters |
|---|---|
| `effective_date` | Always cite the current policy version, not a superseded one |
| `applicable_states` | California PTO rules differ from Texas — filter by state |
| `confidentiality` | PTO balances are confidential; handbook is public |
| `category` | Route "leave" questions to leave policies, not expense policies |
| `version` | Track which policy version was cited (audit trail) |
Getting metadata right at ingestion time saves enormous complexity downstream. If you don't tag a policy with its applicable states during ingestion, you can't filter by state during retrieval.
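To make the payoff concrete, here is a hedged sketch of a retrieval-time filter that uses `applicable_states` and `effective_date` together. The `policyKey` field (a stable key shared by all versions of one policy), the function name, and the rule that an empty state list means "applies everywhere" are assumptions introduced for this example.

```typescript
// Sketch only: policyKey and the empty-list convention are assumptions.
interface PolicyRecord {
  policyKey: string;           // shared by every version of the same policy
  version: string;
  effective_date: string;      // ISO "YYYY-MM-DD", so string comparison orders dates
  applicable_states: string[]; // empty array = applies in every state (assumption)
}

// Return, for one state, the newest version of each policy already in effect.
function currentPoliciesForState(
  policies: PolicyRecord[],
  state: string,
  asOf: string, // ISO "YYYY-MM-DD"
): PolicyRecord[] {
  const latest = new Map<string, PolicyRecord>();
  for (const p of policies) {
    const applies =
      p.applicable_states.length === 0 || p.applicable_states.includes(state);
    // Skip policies for other states and versions that have not taken effect yet.
    if (!applies || p.effective_date > asOf) continue;
    const seen = latest.get(p.policyKey);
    if (!seen || p.effective_date > seen.effective_date) {
      latest.set(p.policyKey, p);
    }
  }
  return Array.from(latest.values());
}
```

Without the two metadata fields this function has nothing to filter on, which is the point of tagging at ingestion time.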
Loaders
One loader per data source. Each loader knows how to:
- read its source format (JSON or CSV)
- map each record to a Document, including its metadata
- build a content field useful for RAG retrieval

The loader for org chart data is particularly interesting: it resolves manager relationships and lists direct reports, turning a flat employee list into a searchable hierarchy.
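The org chart case can be sketched as follows. This is an illustration under assumptions, not the course's loader: the `EmployeeRow` columns, the `org_` ID prefix, the wording of the generated `content`, and the `internal` confidentiality tag are all hypothetical.

```typescript
// Sketch only: row shape, ID prefix, and content wording are assumptions.
interface EmployeeRow {
  id: string;
  name: string;
  title: string;
  manager_id: string | null; // null for the top of the hierarchy
}

interface OrgDoc {
  id: string;
  content: string;
  metadata: { category: string; confidentiality: string };
}

function loadOrgChart(rows: EmployeeRow[]): OrgDoc[] {
  // Index employees by ID so manager lookups are direct.
  const byId = new Map(rows.map((r) => [r.id, r] as const));

  // Invert the manager_id links to collect each manager's direct reports.
  const reports = new Map<string, string[]>();
  for (const r of rows) {
    if (r.manager_id === null) continue;
    const list = reports.get(r.manager_id) ?? [];
    list.push(r.name);
    reports.set(r.manager_id, list);
  }

  return rows.map((r): OrgDoc => {
    const manager = r.manager_id
      ? byId.get(r.manager_id)?.name ?? "unknown"
      : "none";
    const direct = reports.get(r.id) ?? [];
    const directText = direct.length > 0 ? direct.join(", ") : "none";
    return {
      id: `org_${r.id}`,
      // Plain sentences so a text search for a name finds the relationships.
      content: `${r.name} is ${r.title}. Reports to: ${manager}. Direct reports: ${directText}.`,
      metadata: { category: "org_chart", confidentiality: "internal" },
    };
  });
}
```

Writing the relationships out as sentences means a question like "Who reports to Ada?" can be answered from the `content` field alone, without a second lookup at query time.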
Confidentiality Levels
HR data has four tiers:
- public (e.g., the handbook)
- internal
- confidential (e.g., PTO balances)
- restricted

The ingestion pipeline tags each document with its confidentiality level. The retrieval system and AI gateway enforce access control based on these tags.
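One simple way to enforce these tags is an ordered comparison between the document's tier and the requester's clearance. The sketch below assumes the four tiers are strictly ordered from least to most sensitive and that each requester has a single clearance level; both are simplifications, and a real gateway would likely need finer rules (for example, letting employees see their own PTO balance). The names `canAccess` and `visibleTo` are hypothetical.

```typescript
// Sketch only: assumes tiers are strictly ordered, least to most sensitive.
const TIERS = ["public", "internal", "confidential", "restricted"] as const;
type Tier = (typeof TIERS)[number];

interface TaggedDoc {
  id: string;
  confidentiality: Tier;
}

// A requester may read a document at or below their clearance tier.
function canAccess(clearance: Tier, doc: TaggedDoc): boolean {
  return TIERS.indexOf(doc.confidentiality) <= TIERS.indexOf(clearance);
}

// Filter before retrieval results ever reach the model.
function visibleTo(clearance: Tier, docs: TaggedDoc[]): TaggedDoc[] {
  return docs.filter((doc) => canAccess(clearance, doc));
}
```

Filtering before the documents reach the model, rather than asking the model to withhold them, keeps the access decision deterministic and auditable.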
Architecture Pattern
JSON ──→ Handbook Loader ──┐
JSON ──→ Benefits Loader ──┤
JSON ──→ Policy Loader ────┤──→ Validate ──→ Document[]
JSON ──→ Org Loader ───────┤
CSV ───→ PTO Loader ───────┘

Each loader is independent. Adding a new data source (training records, incident reports, job descriptions) means writing one new loader; nothing else changes.
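The pattern in the diagram can be sketched as a list of loader functions feeding one validation step. Everything named here is an assumption for illustration: the `Loader` type, the JSON and CSV column names, the specific validation rules, and the naive CSV split (which does not handle quoted commas).

```typescript
// Sketch only: loader inputs, column names, and validation rules are assumptions.
interface PipelineDoc {
  id: string;
  content: string;
  metadata: { category: string; confidentiality: string };
}
type Loader = () => PipelineDoc[];

const ALLOWED_TIERS = ["public", "internal", "confidential", "restricted"];

// JSON source: an array of { id, title, body } handbook sections (hypothetical shape).
function handbookLoader(json: string): Loader {
  return () => {
    const sections = JSON.parse(json) as { id: string; title: string; body: string }[];
    return sections.map((s) => ({
      id: `handbook_${s.id}`,
      content: `${s.title}\n${s.body}`,
      metadata: { category: "handbook", confidentiality: "public" },
    }));
  };
}

// CSV source with a header row "employee_id,balance_days" (hypothetical columns).
function ptoLoader(csv: string): Loader {
  return () =>
    csv.trim().split("\n").slice(1).map((line) => {
      const [employeeId, balanceDays] = line.split(",").map((cell) => cell.trim());
      return {
        id: `pto_${employeeId}`,
        content: `Employee ${employeeId} has ${balanceDays} PTO days remaining.`,
        metadata: { category: "pto", confidentiality: "confidential" },
      };
    });
}

// Shared validation: reject empty fields, unknown tiers, and duplicate IDs.
function validate(docs: PipelineDoc[]): PipelineDoc[] {
  const seen = new Set<string>();
  for (const d of docs) {
    if (!d.id || !d.content.trim()) throw new Error(`Missing id or content: "${d.id}"`);
    if (!ALLOWED_TIERS.includes(d.metadata.confidentiality)) {
      throw new Error(`Unknown confidentiality on ${d.id}`);
    }
    if (seen.has(d.id)) throw new Error(`Duplicate document id: ${d.id}`);
    seen.add(d.id);
  }
  return docs;
}

// Adding a data source means appending one more Loader to this array.
function runPipeline(loaders: Loader[]): PipelineDoc[] {
  return validate(loaders.flatMap((load) => load()));
}
```

Because `runPipeline` only sees the `Loader` type, a new source such as training records would be one more function in the array, with no change to validation or retrieval.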
What You'll Build
Glossary
| Term | Meaning |
|---|---|
| Document | The universal data unit — text + metadata + source info |
| Loader | A function that reads one data format and returns Documents |
| Confidentiality | Access tier: public, internal, confidential, restricted |
| Effective date | When a policy version took effect (critical for compliance) |
| Ingestion pipeline | The full flow from raw files to validated Documents |
This is chapter 1 of AI HR Assistant.