
Data Lake

Ingest & Normalize

Why a Data Lake?

Every AI system starts with data. But enterprise data is messy — it lives in CSVs, JSON APIs, plain text files, Markdown documents, and databases. Before an AI can use any of it, you need a unified ingestion pipeline that normalizes everything into a common format.

Think of it like this: a sales rep's world is scattered across CRM records, call transcripts, product specs, support tickets, and competitor intel. The Data Lake is the first step in making all of that searchable by AI.

Key Concepts

Document Interface

The core abstraction. Every piece of data — whether it's a CRM row, a call transcript, or a product spec — becomes a Document with:

  • id — unique identifier
  • content — the actual text
  • metadata — source type, date, account name, tags
  • source_type — which loader produced it
This is the universal contract that every downstream module depends on. Get this right and everything else composes cleanly.
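As a concrete sketch, the contract might look like this in TypeScript. The four fields come from the list above; the individual metadata keys are illustrative assumptions, not necessarily the course's exact schema:

```ts
// Sketch of the Document contract. Metadata keys beyond the ones named
// above (source type, date, account, tags) are illustrative.
interface DocumentMetadata {
  date?: string;          // e.g. ISO-8601 timestamp of the source record
  account?: string;       // account name, when the source has one
  tags?: string[];        // free-form labels for filtering
  [key: string]: unknown; // loaders may attach source-specific fields
}

interface Document {
  id: string;          // unique identifier, stable across ingestion runs
  content: string;     // the actual text
  metadata: DocumentMetadata;
  source_type: string; // which loader produced it, e.g. "crm" or "transcript"
}
```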

Loaders

One loader per data source. Each loader knows how to:

  • Read its specific format (CSV parsing, JSON deserialization, text splitting)
  • Extract meaningful fields into metadata
  • Validate that the output matches the Document interface

The pattern is intentionally simple — a function that takes a file path and returns Document[]. No frameworks, no magic. You can read any loader top to bottom and understand it completely.
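Here's a sketch of that shape for a hypothetical CRM loader. The column names (id, account, updated_at, notes) and the naive comma split are assumptions for illustration, not the course's actual parsing code:

```ts
import { readFileSync } from "node:fs";

// Hypothetical CRM loader following the pattern above: file path in, Document[] out.
function loadCrmCsv(path: string): Document[] {
  const [header, ...rows] = readFileSync(path, "utf-8").trim().split("\n");
  const columns = header.split(","); // naive split; real CSV needs quoted-field handling
  return rows.map((row) => {
    const values = row.split(",");
    const record = Object.fromEntries(
      columns.map((col, i) => [col, values[i] ?? ""] as const)
    );
    return {
      id: `crm-${record.id}`,
      content: record.notes,
      metadata: { account: record.account, date: record.updated_at },
      source_type: "crm",
    };
  });
}
```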

    Validation & Deduplication

    Before documents enter the pipeline:

  • Schema validation — does every document have required fields?
  • Content validation — is the content non-empty and reasonable?
  • Deduplication — have we seen this document ID before?
  • These checks catch data quality issues early, before they poison your embeddings and search results downstream.
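A minimal sketch of those three checks, assuming a drop-with-warning policy (failing the whole run would be the stricter alternative):

```ts
// The three checks from the list above, in order: schema, content, dedup.
function validateDocuments(docs: Document[]): Document[] {
  const seen = new Set<string>();
  const drop = (id: string, reason: string): false => {
    console.warn(`Dropping document ${id || "<no id>"}: ${reason}`);
    return false;
  };
  return docs.filter((doc) => {
    if (!doc.id || !doc.source_type) return drop(doc.id, "missing required fields"); // schema validation
    if (!doc.content || !doc.content.trim()) return drop(doc.id, "empty content");   // content validation
    if (seen.has(doc.id)) return drop(doc.id, "duplicate document id");              // deduplication
    seen.add(doc.id);
    return true;
  });
}
```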

Architecture Pattern

CSV  ──→ CRM Loader ───────┐
JSON ──→ Product Loader ───┤
TXT  ──→ Transcript Loader ┼──→ Validate ──→ Document[]
JSON ──→ Ticket Loader ────┤
MD   ──→ Competitor Loader ┘

Each loader is independent. You can add a new data source by writing one new loader — nothing else changes. This is the Open/Closed Principle in practice.
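Building on the sketches above, the whole fan-in can be a list of loader functions feeding one shared validation pass. The file path and the commented loader name are illustrative:

```ts
// The fan-in from the diagram: run every loader, then validate once.
// Adding a data source means appending one entry; nothing downstream changes.
const loaders: Array<() => Document[]> = [
  () => loadCrmCsv("data/crm.csv"),
  // () => loadProductJson("data/products.json"),  // hypothetical: one entry per source
];

const documents: Document[] = validateDocuments(loaders.flatMap((load) => load()));
```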

What You'll Build

  • Run the pre-seeded ingestion pipeline and see 24 documents flow through
  • Explore each data format and understand the parsing patterns
  • Walk through the loader code and the Document interface
  • Extend the pipeline with a new data source or improved validation
Glossary

  • Document — the universal data unit: text + metadata + source info
  • Loader — a function that reads one data format and returns Documents
  • Normalization — converting diverse formats into a single schema
  • Metadata — structured info about a document (source, date, account)
  • Ingestion pipeline — the full flow from raw files to validated Documents

This is chapter 1 of AI Sales Companion.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
