
Data Lake

Ingest & Normalize Support Data

Why a Support Data Lake?

An AI support agent is only as good as the data behind it. But support data is notoriously fragmented — tickets live in Zendesk, knowledge base articles in Confluence, product docs in Notion, escalation rules in a spreadsheet, and satisfaction scores in a survey tool.

Before any AI can answer "My account login isn't working after the password reset — can you help?", you need a unified ingestion pipeline that normalizes all of this into a common format the AI can search, classify, and reason over.

Key Concepts

The Document Interface

The core abstraction. Every piece of support data — whether it's a ticket, KB article, product doc, escalation rule, or CSAT survey — becomes a Document with:

  • id — unique identifier (e.g., ticket_TKT-00001, kb_KB-001)
  • content — the full text, formatted for readability
  • metadata — source type, priority, channel, tags, category
  • source_type — which loader produced it

This is the universal contract that every downstream module depends on. The classification system doesn't care whether a document came from Zendesk or Confluence — it just processes Document[].
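
As a rough sketch in TypeScript, that contract might look like the following (the field types and the exact `SourceType` values are illustrative assumptions, not the course's literal schema):

```typescript
// A minimal sketch of the Document contract. The exact field types and the
// SourceType union values are assumptions, not the course's literal schema.
type SourceType =
  | "ticket"
  | "kb_article"
  | "product_doc"
  | "escalation_rule"
  | "csat_survey";

interface Document {
  id: string;                         // e.g. "ticket_TKT-00001", "kb_KB-001"
  content: string;                    // full text, formatted for readability
  metadata: Record<string, unknown>;  // priority, channel, tags, category, ...
  source_type: SourceType;            // which loader produced this document
}
```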

Support-Specific Metadata

Unlike generic document systems, support data carries metadata critical for triage and routing:

| Metadata Field | Why It Matters |
| --- | --- |
| `priority` | Routes urgent tickets ahead of low-priority ones |
| `channel` | Email, chat, phone need different response styles |
| `tags` | Enable topic-based routing and analytics |
| `customer_id` | Links to customer context (plan, history, CSAT) |
| `resolution_time` | Tracks SLA compliance |
| `csat_score` | Measures response quality |

Getting metadata right at ingestion time saves enormous complexity downstream. If you don't tag a ticket with its channel during ingestion, you can't personalize the response style later.
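
For ticket documents in particular, the metadata from the table above might be typed roughly like this (the value types, units, and optional markers are assumptions made for the sketch):

```typescript
// Illustrative shape for ticket metadata. Union values, units, and which
// fields are optional are assumptions, not the course's literal schema.
interface TicketMetadata {
  priority: "urgent" | "high" | "normal" | "low";   // drives routing order
  channel: "email" | "chat" | "phone" | "web_form"; // drives response style
  tags: string[];                                   // topic routing, analytics
  customer_id?: string;           // joins to plan, history, past CSAT
  resolution_time_hours?: number; // set once resolved; used for SLA checks
  csat_score?: number;            // e.g. a 1-5 post-resolution rating
}
```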

Loaders

One loader per data source (see the sketch after this list). Each loader knows how to:

  • Read its specific format (JSON, CSV)
  • Extract meaningful fields into metadata
  • Produce a narrative content field useful for search and classification
  • Handle edge cases (quoted CSV fields, nested JSON)
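
Here is a sketch of what one such loader could look like for tickets, assuming a made-up JSON export shape (`RawTicket`) and the `Document` interface sketched earlier:

```typescript
import { readFileSync } from "node:fs";

// Assumes the Document interface from the earlier sketch is in scope.
// RawTicket is a hypothetical export shape; the real fields will differ.
interface RawTicket {
  id: string;
  subject: string;
  body: string;
  priority: string;
  channel: string;
  tags?: string[];
}

// Reads a JSON ticket export, lifts routing fields into metadata, and builds
// a narrative content string that search and classification can work with.
function loadTickets(path: string): Document[] {
  const raw: RawTicket[] = JSON.parse(readFileSync(path, "utf-8"));
  return raw.map((t): Document => ({
    id: `ticket_${t.id}`,
    content: `Subject: ${t.subject}\nChannel: ${t.channel}\n\n${t.body}`,
    metadata: { priority: t.priority, channel: t.channel, tags: t.tags ?? [] },
    source_type: "ticket",
  }));
}
```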

Multi-Channel Normalization

Tickets arrive via email, chat, phone, and web forms. Each channel has different data shapes:

  • Email: subject line, body, attachments, sender
  • Chat: message history, timestamps, agent assignment
  • Phone: transcript, duration, callback number
  • Web form: structured fields, dropdowns, file uploads

The loader normalizes all of these into the same Document format. Downstream systems see a unified stream.
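
A sketch of that normalization for two of the channels, with invented payload shapes (`EmailPayload`, `ChatPayload`) standing in for whatever the real exports contain:

```typescript
// Assumes the Document interface from the earlier sketch. The payload shapes
// below are invented for illustration.
interface EmailPayload {
  id: string;
  subject: string;
  body: string;
  sender: string;
}

interface ChatPayload {
  id: string;
  messages: { role: "customer" | "agent"; text: string; ts: string }[];
}

function fromEmail(e: EmailPayload): Document {
  return {
    id: `ticket_${e.id}`,
    content: `Subject: ${e.subject}\nFrom: ${e.sender}\n\n${e.body}`,
    metadata: { channel: "email" },
    source_type: "ticket",
  };
}

function fromChat(c: ChatPayload): Document {
  // Flatten the message history into a readable transcript.
  const transcript = c.messages
    .map((m) => `[${m.ts}] ${m.role}: ${m.text}`)
    .join("\n");
  return {
    id: `ticket_${c.id}`,
    content: transcript,
    metadata: { channel: "chat" },
    source_type: "ticket",
  };
}
```

Downstream code never sees `EmailPayload` or `ChatPayload`; it only ever receives `Document` values, which is what keeps the rest of the system channel-agnostic.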

Architecture Pattern

    JSON ──→ Ticket Loader ──────┐
    JSON ──→ KB Loader ──────────┤
    JSON ──→ Product Doc Loader ─┤──→ Validate ──→ Document[]
    JSON ──→ Escalation Loader ──┤
    CSV ───→ CSAT Loader ────────┘

Each loader is independent. Adding a new data source (chat transcripts, internal notes, changelog) means writing one new loader — nothing else changes.
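
Put together, the pipeline amounts to calling each loader and validating the combined result. In the sketch below, the loader names and file paths are placeholders:

```typescript
// Assumes the Document interface and the loadTickets loader from the sketches
// above. The remaining loaders are declared only to show the pipeline's shape;
// their names and the file paths are placeholders.
declare function loadKbArticles(path: string): Document[];
declare function loadProductDocs(path: string): Document[];
declare function loadEscalationRules(path: string): Document[];
declare function loadCsatSurveys(path: string): Document[];

// Drop anything that would break downstream search or classification.
function validate(docs: Document[]): Document[] {
  return docs.filter((d) => {
    const ok = d.id.length > 0 && d.content.trim().length > 0;
    if (!ok) console.warn(`Dropping invalid document: ${d.id || "<no id>"}`);
    return ok;
  });
}

function ingest(): Document[] {
  return validate([
    ...loadTickets("data/tickets.json"),
    ...loadKbArticles("data/kb_articles.json"),
    ...loadProductDocs("data/product_docs.json"),
    ...loadEscalationRules("data/escalation_rules.json"),
    ...loadCsatSurveys("data/csat_surveys.csv"),
  ]);
}
```

The `validate` step is also the natural place to add stricter checks later, for example requiring a priority on every ticket.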

What You'll Build

  • Run the pre-seeded ingestion pipeline and see documents flow from 5 support data sources
  • Explore each data format and understand support-specific parsing patterns
  • Walk through the loader code, Document interface, and metadata model
  • Extend the pipeline with a new data source or improved validation

Glossary

| Term | Meaning |
| --- | --- |
| Document | Normalized unit of data from any support source |
| Loader | Function that reads a specific format and returns Documents |
| Metadata | Structured fields (priority, tags, channel) for filtering and routing |
| Data lake | Unified storage that combines all support data sources |
| SLA | Service Level Agreement — target response/resolution times by priority |

This is chapter 1 of AI Customer Support Agent.

Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
