
Data Lake

Ingest & Normalize

Why a Data Lake?

Every AI system starts with data. But enterprise data is messy — it lives in CSVs, JSON APIs, plain text files, Markdown documents, and databases. Before an AI can use any of it, you need a unified ingestion pipeline that normalizes everything into a common format.

Think of it like this: a sales rep's world is scattered across CRM records, call transcripts, product specs, support tickets, and competitor intel. The Data Lake is the first step in making all of that searchable by AI.

Key Concepts

Document Interface

The core abstraction. Every piece of data — whether it's a CRM row, a call transcript, or a product spec — becomes a Document with:

  • id — unique identifier
  • content — the actual text
  • metadata — source type, date, account name, tags
  • source_type — which loader produced it
This is the universal contract that every downstream module depends on. Get this right and everything else composes cleanly.
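As a concrete sketch, the contract might look like this in TypeScript. The four fields come from the list above; the individual metadata keys are illustrative assumptions, not necessarily the course's exact schema:

```ts
// Sketch of the Document contract. Metadata keys beyond the ones named
// above (source type, date, account, tags) are illustrative.
interface DocumentMetadata {
  date?: string;          // e.g. ISO-8601 timestamp of the source record
  account?: string;       // account name, when the source has one
  tags?: string[];        // free-form labels for filtering
  [key: string]: unknown; // loaders may attach source-specific fields
}

interface Document {
  id: string;          // unique identifier, stable across ingestion runs
  content: string;     // the actual text
  metadata: DocumentMetadata;
  source_type: string; // which loader produced it, e.g. "crm" or "transcript"
}
```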

Loaders

One loader per data source. Each loader knows how to:

  • Read its specific format (CSV parsing, JSON deserialization, text splitting)
  • Extract meaningful fields into metadata
  • Validate that the output matches the Document interface

The pattern is intentionally simple — a function that takes a file path and returns Document[]. No frameworks, no magic. You can read any loader top to bottom and understand it completely.
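Here's a sketch of that shape for a hypothetical CRM loader. The column names (id, account, updated_at, notes) and the naive comma split are assumptions for illustration, not the course's actual parsing code:

```ts
import { readFileSync } from "node:fs";

// Hypothetical CRM loader following the pattern above: file path in, Document[] out.
function loadCrmCsv(path: string): Document[] {
  const [header, ...rows] = readFileSync(path, "utf-8").trim().split("\n");
  const columns = header.split(","); // naive split; real CSV needs quoted-field handling
  return rows.map((row) => {
    const values = row.split(",");
    const record = Object.fromEntries(
      columns.map((col, i) => [col, values[i] ?? ""] as const)
    );
    return {
      id: `crm-${record.id}`,
      content: record.notes,
      metadata: { account: record.account, date: record.updated_at },
      source_type: "crm",
    };
  });
}
```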

    Validation & Deduplication

    Before documents enter the pipeline:

  • Schema validation — does every document have required fields?
  • Content validation — is the content non-empty and reasonable?
  • Deduplication — have we seen this document ID before?
  • These checks catch data quality issues early, before they poison your embeddings and search results downstream.
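A minimal sketch of those three checks, assuming a drop-with-warning policy (failing the whole run would be the stricter alternative):

```ts
// The three checks from the list above, in order: schema, content, dedup.
function validateDocuments(docs: Document[]): Document[] {
  const seen = new Set<string>();
  const drop = (id: string, reason: string): false => {
    console.warn(`Dropping document ${id || "<no id>"}: ${reason}`);
    return false;
  };
  return docs.filter((doc) => {
    if (!doc.id || !doc.source_type) return drop(doc.id, "missing required fields"); // schema validation
    if (!doc.content || !doc.content.trim()) return drop(doc.id, "empty content");   // content validation
    if (seen.has(doc.id)) return drop(doc.id, "duplicate document id");              // deduplication
    seen.add(doc.id);
    return true;
  });
}
```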

Architecture Pattern

CSV  ──→ CRM Loader ───────┐
JSON ──→ Product Loader ───┤
TXT  ──→ Transcript Loader ┼──→ Validate ──→ Document[]
JSON ──→ Ticket Loader ────┤
MD   ──→ Competitor Loader ┘

Each loader is independent. You can add a new data source by writing one new loader — nothing else changes. This is the Open/Closed Principle in practice.
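Building on the sketches above, the whole fan-in can be a list of loader functions feeding one shared validation pass. The file path and the commented loader name are illustrative:

```ts
// The fan-in from the diagram: run every loader, then validate once.
// Adding a data source means appending one entry; nothing downstream changes.
const loaders: Array<() => Document[]> = [
  () => loadCrmCsv("data/crm.csv"),
  // () => loadProductJson("data/products.json"),  // hypothetical: one entry per source
];

const documents: Document[] = validateDocuments(loaders.flatMap((load) => load()));
```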

What You'll Build

  • Run the pre-seeded ingestion pipeline and see 24 documents flow through
  • Explore each data format and understand the parsing patterns
  • Walk through the loader code and the Document interface
  • Extend the pipeline with a new data source or improved validation
Glossary

  • Document — the universal data unit: text + metadata + source info
  • Loader — a function that reads one data format and returns Documents
  • Normalization — converting diverse formats into a single schema
  • Metadata — structured info about a document (source, date, account)
  • Ingestion pipeline — the full flow from raw files to validated Documents

This is chapter 1 of AI Sales Companion.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
