
Data Lake

Ingest & Normalize HR Documents

Why an HR Data Lake?

An HR assistant is only as good as the documents it can access. But HR data is notoriously fragmented: policies live in PDFs, benefits in spreadsheets, org charts in the HRIS, PTO records in payroll databases, and "how things actually work" lives in tribal knowledge and Slack threads.

Before any AI can answer "What's our parental leave policy for employees in California?", you need a unified ingestion pipeline that normalizes all of this into a common format the AI can search.

Key Concepts

The Document Interface

The core abstraction. Every piece of HR data — whether it's a handbook section, a benefit plan, a formal policy, an org chart entry, or a PTO record — becomes a Document with:

  • id — unique identifier (e.g., policy_POL-001)
  • content — the full text, formatted for readability
  • metadata — source type, category, effective date, applicable states, confidentiality level
  • source_type — which loader produced it

This is the universal contract that every downstream module depends on. The retrieval system doesn't care whether a document came from JSON or CSV; it just searches Document[].
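
A minimal TypeScript sketch of this contract, using the fields listed above. The type name `HrDocument` (chosen here to avoid clashing with the DOM's built-in `Document` type), the exact field types, the `searchDocuments` helper, and the sample values are illustrative assumptions, not the course's actual code.

```typescript
// Hypothetical shape of the Document contract described above.
// Named HrDocument to avoid colliding with the DOM's global Document type.
type SourceType = "handbook" | "benefits" | "policy" | "org_chart" | "pto";
type Confidentiality = "public" | "internal" | "confidential" | "restricted";

interface HrMetadata {
  category: string;
  effective_date?: string;       // ISO date such as "2024-01-01" (assumed format)
  applicable_states?: string[];  // e.g. ["CA"]; omitted means all states (assumption)
  confidentiality: Confidentiality;
  version?: string;
}

interface HrDocument {
  id: string;
  content: string;
  metadata: HrMetadata;
  source_type: SourceType;
}

// Downstream code depends only on the interface, never on the file format.
// A naive substring match stands in for a real retrieval system.
function searchDocuments(docs: HrDocument[], term: string): HrDocument[] {
  const needle = term.toLowerCase();
  return docs.filter((d) => d.content.toLowerCase().indexOf(needle) !== -1);
}

// Placeholder documents; the content strings are made up for illustration.
const sample: HrDocument[] = [
  {
    id: "policy_POL-001",
    content: "Parental leave policy: eligibility and duration (placeholder text).",
    metadata: {
      category: "leave",
      effective_date: "2024-01-01",
      applicable_states: ["CA"],
      confidentiality: "internal",
      version: "2",
    },
    source_type: "policy",
  },
  {
    id: "handbook_001",
    content: "Expense reimbursement overview (placeholder text).",
    metadata: { category: "expenses", confidentiality: "public" },
    source_type: "handbook",
  },
];

const hits = searchDocuments(sample, "parental leave");
```

Here `hits` contains only the `policy_POL-001` document, and `searchDocuments` never needed to know which loader produced it.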

HR-Specific Metadata

Unlike generic document systems, HR data carries metadata that's critical for compliance:

| Metadata Field | Why It Matters |
| --- | --- |
| `effective_date` | Always cite the current policy version, not a superseded one |
| `applicable_states` | California PTO rules differ from Texas, so filter by state |
| `confidentiality` | PTO balances are confidential; handbook is public |
| `category` | Route "leave" questions to leave policies, not expense policies |
| `version` | Track which policy version was cited (audit trail) |

Getting metadata right at ingestion time saves enormous complexity downstream. If you don't tag a policy with its applicable states during ingestion, you can't filter by state during retrieval.
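
To make the payoff concrete, here is a hedged sketch of how retrieval-time code might use `applicable_states` and `effective_date` to pick the policy version currently in force. The `PolicyDoc` type, the helper names, the convention that an empty state list means "applies everywhere", and the sample records are all assumptions made for illustration.

```typescript
interface PolicyDoc {
  id: string;
  content: string;
  metadata: {
    category: string;
    effective_date: string;      // ISO date (assumed), so string comparison orders correctly
    applicable_states: string[]; // assumption: an empty array means "applies everywhere"
    version: string;
  };
}

// Keep documents that apply to the state and are already in effect on `asOf`.
function applicableOn(docs: PolicyDoc[], state: string, asOf: string): PolicyDoc[] {
  return docs.filter((d) => {
    const states = d.metadata.applicable_states;
    const stateOk = states.length === 0 || states.indexOf(state) !== -1;
    return stateOk && d.metadata.effective_date <= asOf;
  });
}

// Among the applicable documents in a category, pick the latest effective date.
function currentPolicy(
  docs: PolicyDoc[],
  category: string,
  state: string,
  asOf: string
): PolicyDoc | undefined {
  const candidates = applicableOn(docs, state, asOf).filter(
    (d) => d.metadata.category === category
  );
  return candidates.reduce<PolicyDoc | undefined>(
    (best, d) =>
      best === undefined || d.metadata.effective_date > best.metadata.effective_date ? d : best,
    undefined
  );
}

// Made-up records: two versions of a California policy and one Texas policy.
const policies: PolicyDoc[] = [
  {
    id: "policy_POL-001_v1",
    content: "Leave policy, version 1 (placeholder).",
    metadata: { category: "leave", effective_date: "2022-01-01", applicable_states: ["CA"], version: "1" },
  },
  {
    id: "policy_POL-001_v2",
    content: "Leave policy, version 2 (placeholder).",
    metadata: { category: "leave", effective_date: "2024-01-01", applicable_states: ["CA"], version: "2" },
  },
  {
    id: "policy_POL-002_v1",
    content: "Texas leave policy (placeholder).",
    metadata: { category: "leave", effective_date: "2023-06-01", applicable_states: ["TX"], version: "1" },
  },
];

const inForce = currentPolicy(policies, "leave", "CA", "2024-06-30");
const earlier = currentPolicy(policies, "leave", "CA", "2023-06-30");
```

With these records, `inForce` is version 2 and `earlier` is version 1, because version 2 had not yet taken effect on the earlier date. Neither lookup would be possible if the loader had not recorded both fields.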

Loaders

One loader per data source. Each loader knows how to:

  • Read its specific format (JSON, CSV)
  • Extract meaningful fields into metadata
  • Produce a narrative content field useful for RAG retrieval
  • Classify confidentiality level

The loader for org chart data is particularly interesting: it resolves manager relationships and lists direct reports, turning a flat employee list into a searchable hierarchy.
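
As a rough illustration of that idea, the sketch below turns a flat list of employee records into documents that name each person's manager and direct reports. The input field names (`employee_id`, `manager_id`, and so on), the `org_` id prefix, the `internal` confidentiality tag, and the sample employees are assumptions; the course's real loader may differ.

```typescript
// Assumed shape of one entry in the org chart JSON.
interface EmployeeRecord {
  employee_id: string;
  name: string;
  title: string;
  manager_id: string | null; // null for the top of the hierarchy
}

interface OrgDocument {
  id: string;
  content: string;
  metadata: {
    category: "org_chart";
    confidentiality: "internal"; // assumption: org data is employees-only
    manager: string | null;
    direct_reports: string[];
  };
  source_type: "org_chart";
}

function loadOrgChart(records: EmployeeRecord[]): OrgDocument[] {
  // Index employees by id and prepare an empty report list for each.
  const byId: { [id: string]: EmployeeRecord } = Object.create(null);
  const reports: { [id: string]: string[] } = Object.create(null);
  for (const r of records) {
    byId[r.employee_id] = r;
    reports[r.employee_id] = [];
  }
  // Second pass: attach each employee to their manager's report list.
  for (const r of records) {
    const list = r.manager_id !== null ? reports[r.manager_id] : undefined;
    if (list !== undefined) list.push(r.name);
  }
  // Third pass: emit one narrative document per employee.
  return records.map((r): OrgDocument => {
    const manager = r.manager_id !== null ? byId[r.manager_id] : undefined;
    const managerName = manager !== undefined ? manager.name : null;
    const directReports = reports[r.employee_id] || [];
    const content =
      `${r.name}, ${r.title}. ` +
      (managerName !== null ? `Reports to ${managerName}. ` : "Has no manager on record. ") +
      (directReports.length > 0
        ? `Direct reports: ${directReports.join(", ")}.`
        : "No direct reports.");
    return {
      id: `org_${r.employee_id}`,
      content,
      metadata: {
        category: "org_chart",
        confidentiality: "internal",
        manager: managerName,
        direct_reports: directReports,
      },
      source_type: "org_chart",
    };
  });
}

// Made-up employees for illustration.
const orgDocs = loadOrgChart([
  { employee_id: "E001", name: "Ada Lopez", title: "Chief People Officer", manager_id: null },
  { employee_id: "E002", name: "Ben Ortiz", title: "HR Business Partner", manager_id: "E001" },
  { employee_id: "E003", name: "Cy Nakamura", title: "Benefits Analyst", manager_id: "E001" },
]);
```

Writing the relationships into `content` as sentences is what makes a question like "Who reports to Ada Lopez?" answerable by text search, while the same facts in `metadata` stay available for exact filtering.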

Confidentiality Levels

HR data has four tiers:

  • Public — Employee handbook (everyone can see it)
  • Internal — Policies, benefits guides (employees only, not public-facing)
  • Confidential — PTO balances, performance data (only the employee + HR)
  • Restricted — Salary data, investigation notes (HR only)

The ingestion pipeline tags each document with its confidentiality level. The retrieval system and AI gateway enforce access control based on these tags.
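
One way the four tiers could translate into an access check is sketched below. The `Requester` and `TaggedDoc` shapes, the `owner_employee_id` field (needed to express "only the employee + HR"), and the sample ids are hypothetical; the actual enforcement in the retrieval system and AI gateway is not shown in this chapter.

```typescript
type Confidentiality = "public" | "internal" | "confidential" | "restricted";

// Hypothetical description of who is asking.
interface Requester {
  employee_id: string | null; // null means not an employee (e.g. a public visitor)
  is_hr: boolean;
}

// Hypothetical view of a document's access-relevant tags.
interface TaggedDoc {
  id: string;
  confidentiality: Confidentiality;
  owner_employee_id?: string; // set on per-employee records such as PTO balances
}

function canAccess(who: Requester, doc: TaggedDoc): boolean {
  switch (doc.confidentiality) {
    case "public":
      return true; // everyone
    case "internal":
      return who.employee_id !== null; // employees only
    case "confidential":
      // only the employee the record belongs to, plus HR
      return who.is_hr || (who.employee_id !== null && who.employee_id === doc.owner_employee_id);
    case "restricted":
      return who.is_hr; // HR only
  }
}

// Made-up requesters and documents for illustration.
const visitor: Requester = { employee_id: null, is_hr: false };
const employee: Requester = { employee_id: "E002", is_hr: false };
const hr: Requester = { employee_id: "E001", is_hr: true };

const handbook: TaggedDoc = { id: "handbook_001", confidentiality: "public" };
const policy: TaggedDoc = { id: "policy_POL-001", confidentiality: "internal" };
const ownPto: TaggedDoc = { id: "pto_E002", confidentiality: "confidential", owner_employee_id: "E002" };
const otherPto: TaggedDoc = { id: "pto_E003", confidentiality: "confidential", owner_employee_id: "E003" };
const salary: TaggedDoc = { id: "salary_E002", confidentiality: "restricted", owner_employee_id: "E002" };

// Filter a result set down to what one requester may see.
const visibleToEmployee = [handbook, policy, ownPto, otherPto, salary]
  .filter((d) => canAccess(employee, d))
  .map((d) => d.id);
```

In this sketch the employee `E002` sees the handbook, the policy, and their own PTO record, but not a colleague's PTO record or any salary data, including their own.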

Architecture Pattern

    JSON ──→ Handbook Loader ──┐
    JSON ──→ Benefits Loader ──┤
    JSON ──→ Policy Loader ────┤──→ Validate ──→ Document[]
    JSON ──→ Org Loader ───────┤
    CSV ───→ PTO Loader ───────┘

Each loader is independent. Adding a new data source (training records, incident reports, job descriptions) means writing one new loader; nothing else changes.
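
The diagram above can be sketched as a list of loader functions feeding a shared validation step. Everything here is an assumption for illustration: the `Loader` type, the validation rules (non-empty id and content, no duplicate ids), and the stub loaders, which return in-memory data where real loaders would read JSON or CSV files.

```typescript
type Confidentiality = "public" | "internal" | "confidential" | "restricted";

interface HrDocument {
  id: string;
  content: string;
  metadata: { category: string; confidentiality: Confidentiality };
  source_type: string;
}

// A loader takes no arguments here; a real one would read its own file.
type Loader = () => HrDocument[];

interface PipelineResult {
  documents: HrDocument[];
  rejected: { id: string; problems: string[] }[];
}

// Assumed validation rules: non-empty id, non-empty content, no duplicate ids.
function validate(doc: HrDocument, seen: { [id: string]: boolean }): string[] {
  const problems: string[] = [];
  if (doc.id.trim() === "") problems.push("missing id");
  if (doc.content.trim() === "") problems.push("empty content");
  if (seen[doc.id] === true) problems.push("duplicate id");
  return problems;
}

function runPipeline(loaders: Loader[]): PipelineResult {
  const result: PipelineResult = { documents: [], rejected: [] };
  const seen: { [id: string]: boolean } = Object.create(null);
  for (const load of loaders) {
    for (const doc of load()) {
      const problems = validate(doc, seen);
      if (problems.length === 0) {
        seen[doc.id] = true;
        result.documents.push(doc);
      } else {
        result.rejected.push({ id: doc.id, problems });
      }
    }
  }
  return result;
}

// Stub loaders standing in for the real file readers.
const handbookLoader: Loader = () => [
  {
    id: "handbook_001",
    content: "Welcome to the company (placeholder text).",
    metadata: { category: "handbook", confidentiality: "public" },
    source_type: "handbook",
  },
];
const ptoLoader: Loader = () => [
  {
    id: "pto_E002",
    content: "", // deliberately empty so validation rejects it
    metadata: { category: "pto", confidentiality: "confidential" },
    source_type: "pto",
  },
];

// Adding a data source means appending one loader to this array.
const pipelineResult = runPipeline([handbookLoader, ptoLoader]);
```

Because `runPipeline` only sees the `Loader` signature, a new source such as training records would be one more function in the array, with no change to validation or to anything downstream.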

What You'll Build

  • Run the pre-seeded ingestion pipeline and see documents flow through from 5 HR data sources
  • Explore each data format and understand the HR-specific parsing patterns
  • Walk through the loader code, the Document interface, and the metadata model
  • Extend the pipeline with a new data source or improved validation

Glossary

| Term | Meaning |
| --- | --- |
| Document | The universal data unit: text + metadata + source info |
| Loader | A function that reads one data format and returns Documents |
| Confidentiality | Access tier: public, internal, confidential, restricted |
| Effective date | When a policy version took effect (critical for compliance) |
| Ingestion pipeline | The full flow from raw files to validated Documents |

This is chapter 1 of AI HR Assistant.
