
Data Lake

Ingest & Normalize Support Data

Why a Support Data Lake?

An AI support agent is only as good as the data behind it. But support data is notoriously fragmented — tickets live in Zendesk, knowledge base articles in Confluence, product docs in Notion, escalation rules in a spreadsheet, and satisfaction scores in a survey tool.

Before any AI can answer "My account login isn't working after the password reset — can you help?", you need a unified ingestion pipeline that normalizes all of this into a common format the AI can search, classify, and reason over.

Key Concepts

The Document Interface

The core abstraction. Every piece of support data — whether it's a ticket, KB article, product doc, escalation rule, or CSAT survey — becomes a Document with:

  • id — unique identifier (e.g., ticket_TKT-00001, kb_KB-001)
  • content — the full text, formatted for readability
  • metadata — source type, priority, channel, tags, category
  • source_type — which loader produced it

This is the universal contract that every downstream module depends on. The classification system doesn't care whether a document came from Zendesk or Confluence — it just processes Document[].
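
As a rough sketch in TypeScript, that contract might look like the following (the field types and the exact `SourceType` values are illustrative assumptions, not the course's literal schema):

```typescript
// A minimal sketch of the Document contract. The exact field types and the
// SourceType union values are assumptions, not the course's literal schema.
type SourceType =
  | "ticket"
  | "kb_article"
  | "product_doc"
  | "escalation_rule"
  | "csat_survey";

interface Document {
  id: string;                         // e.g. "ticket_TKT-00001", "kb_KB-001"
  content: string;                    // full text, formatted for readability
  metadata: Record<string, unknown>;  // priority, channel, tags, category, ...
  source_type: SourceType;            // which loader produced this document
}
```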

Support-Specific Metadata

Unlike generic document systems, support data carries metadata critical for triage and routing:

| Metadata Field | Why It Matters |
| --- | --- |
| `priority` | Routes urgent tickets ahead of low-priority ones |
| `channel` | Email, chat, phone need different response styles |
| `tags` | Enable topic-based routing and analytics |
| `customer_id` | Links to customer context (plan, history, CSAT) |
| `resolution_time` | Tracks SLA compliance |
| `csat_score` | Measures response quality |

Getting metadata right at ingestion time saves enormous complexity downstream. If you don't tag a ticket with its channel during ingestion, you can't personalize the response style later.
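
For ticket documents in particular, the metadata from the table above might be typed roughly like this (the value types, units, and optional markers are assumptions made for the sketch):

```typescript
// Illustrative shape for ticket metadata. Union values, units, and which
// fields are optional are assumptions, not the course's literal schema.
interface TicketMetadata {
  priority: "urgent" | "high" | "normal" | "low";   // drives routing order
  channel: "email" | "chat" | "phone" | "web_form"; // drives response style
  tags: string[];                                   // topic routing, analytics
  customer_id?: string;           // joins to plan, history, past CSAT
  resolution_time_hours?: number; // set once resolved; used for SLA checks
  csat_score?: number;            // e.g. a 1-5 post-resolution rating
}
```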

Loaders

One loader per data source (see the sketch after this list). Each loader knows how to:

  • Read its specific format (JSON, CSV)
  • Extract meaningful fields into metadata
  • Produce a narrative content field useful for search and classification
  • Handle edge cases (quoted CSV fields, nested JSON)
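
Here is a sketch of what one such loader could look like for tickets, assuming a made-up JSON export shape (`RawTicket`) and the `Document` interface sketched earlier:

```typescript
import { readFileSync } from "node:fs";

// Assumes the Document interface from the earlier sketch is in scope.
// RawTicket is a hypothetical export shape; the real fields will differ.
interface RawTicket {
  id: string;
  subject: string;
  body: string;
  priority: string;
  channel: string;
  tags?: string[];
}

// Reads a JSON ticket export, lifts routing fields into metadata, and builds
// a narrative content string that search and classification can work with.
function loadTickets(path: string): Document[] {
  const raw: RawTicket[] = JSON.parse(readFileSync(path, "utf-8"));
  return raw.map((t): Document => ({
    id: `ticket_${t.id}`,
    content: `Subject: ${t.subject}\nChannel: ${t.channel}\n\n${t.body}`,
    metadata: { priority: t.priority, channel: t.channel, tags: t.tags ?? [] },
    source_type: "ticket",
  }));
}
```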

Multi-Channel Normalization

Tickets arrive via email, chat, phone, and web forms. Each channel has different data shapes:

  • Email: subject line, body, attachments, sender
  • Chat: message history, timestamps, agent assignment
  • Phone: transcript, duration, callback number
  • Web form: structured fields, dropdowns, file uploads

The loader normalizes all of these into the same Document format. Downstream systems see a unified stream.
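
A sketch of that normalization for two of the channels, with invented payload shapes (`EmailPayload`, `ChatPayload`) standing in for whatever the real exports contain:

```typescript
// Assumes the Document interface from the earlier sketch. The payload shapes
// below are invented for illustration.
interface EmailPayload {
  id: string;
  subject: string;
  body: string;
  sender: string;
}

interface ChatPayload {
  id: string;
  messages: { role: "customer" | "agent"; text: string; ts: string }[];
}

function fromEmail(e: EmailPayload): Document {
  return {
    id: `ticket_${e.id}`,
    content: `Subject: ${e.subject}\nFrom: ${e.sender}\n\n${e.body}`,
    metadata: { channel: "email" },
    source_type: "ticket",
  };
}

function fromChat(c: ChatPayload): Document {
  // Flatten the message history into a readable transcript.
  const transcript = c.messages
    .map((m) => `[${m.ts}] ${m.role}: ${m.text}`)
    .join("\n");
  return {
    id: `ticket_${c.id}`,
    content: transcript,
    metadata: { channel: "chat" },
    source_type: "ticket",
  };
}
```

Downstream code never sees `EmailPayload` or `ChatPayload`; it only ever receives `Document` values, which is what keeps the rest of the system channel-agnostic.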

Architecture Pattern

    JSON ──→ Ticket Loader ──────┐
    JSON ──→ KB Loader ──────────┤
    JSON ──→ Product Doc Loader ─┤──→ Validate ──→ Document[]
    JSON ──→ Escalation Loader ──┤
    CSV ───→ CSAT Loader ────────┘

Each loader is independent. Adding a new data source (chat transcripts, internal notes, changelog) means writing one new loader — nothing else changes.
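
Put together, the pipeline amounts to calling each loader and validating the combined result. In the sketch below, the loader names and file paths are placeholders:

```typescript
// Assumes the Document interface and the loadTickets loader from the sketches
// above. The remaining loaders are declared only to show the pipeline's shape;
// their names and the file paths are placeholders.
declare function loadKbArticles(path: string): Document[];
declare function loadProductDocs(path: string): Document[];
declare function loadEscalationRules(path: string): Document[];
declare function loadCsatSurveys(path: string): Document[];

// Drop anything that would break downstream search or classification.
function validate(docs: Document[]): Document[] {
  return docs.filter((d) => {
    const ok = d.id.length > 0 && d.content.trim().length > 0;
    if (!ok) console.warn(`Dropping invalid document: ${d.id || "<no id>"}`);
    return ok;
  });
}

function ingest(): Document[] {
  return validate([
    ...loadTickets("data/tickets.json"),
    ...loadKbArticles("data/kb_articles.json"),
    ...loadProductDocs("data/product_docs.json"),
    ...loadEscalationRules("data/escalation_rules.json"),
    ...loadCsatSurveys("data/csat_surveys.csv"),
  ]);
}
```

The `validate` step is also the natural place to add stricter checks later, for example requiring a priority on every ticket.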

What You'll Build

  • Run the pre-seeded ingestion pipeline and see documents flow from 5 support data sources
  • Explore each data format and understand support-specific parsing patterns
  • Walk through the loader code, Document interface, and metadata model
  • Extend the pipeline with a new data source or improved validation

Glossary

| Term | Meaning |
| --- | --- |
| Document | Normalized unit of data from any support source |
| Loader | Function that reads a specific format and returns Documents |
| Metadata | Structured fields (priority, tags, channel) for filtering and routing |
| Data lake | Unified storage that combines all support data sources |
| SLA | Service Level Agreement — target response/resolution times by priority |

This is chapter 1 of AI Customer Support Agent.

Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
