
Data Lake

Ingest Marketing Intelligence

Why a Marketing Data Lake?

Marketing intelligence starts with data — but marketing data is uniquely messy. It lives in competitor websites, social media APIs, campaign dashboards, PDF brand guidelines, and analyst reports. Before an AI can detect trends or draft on-brand content, you need a unified ingestion pipeline that normalizes everything into a common format.

The challenge is that marketing data is inherently temporal. A competitor's positioning statement from January is fundamentally different from their October positioning — even if the words are similar. Your data lake must capture not just *what* was said, but *when* it was observed.

Key Concepts

Document Interface

The core abstraction. Every piece of marketing data — whether it's a competitor profile, a campaign report, or a brand guideline — becomes a Document with:

  • id — unique identifier
  • content — the actual text
  • metadata — source type, date, competitor name, campaign ID, time period
  • source_type — which loader produced it (competitor, social, campaign, brand, industry)

This universal contract is what makes the downstream pipeline composable. The retrieval system doesn't care whether a document came from a social media API or a hand-written brand guide — it just searches Documents.
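The contract described above can be sketched as a small dataclass. The field names follow the bullets; this is an illustration, not any particular library's API:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """One unit of marketing data, regardless of where it came from."""
    id: str           # unique identifier
    content: str      # the actual text
    source_type: str  # competitor | social | campaign | brand | industry
    metadata: dict = field(default_factory=dict)  # dates, competitor name, campaign ID, time period

doc = Document(
    id="comp-acme-2024-q4",
    content="Acme now positions itself as an AI-first analytics suite.",
    source_type="competitor",
    metadata={"competitor": "Acme", "time_period": "2024-Q4"},
)
```

Everything downstream — chunking, embedding, retrieval — operates on this one shape.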

Marketing-Specific Data Types

Beyond the base Document interface, marketing intelligence benefits from typed data structures:

  • CompetitorProfile — positioning, features, pricing tiers, recent strategy changes, threat level. Each profile is a snapshot in time; multiple snapshots enable trend detection.
  • CampaignMetrics — spend, impressions, clicks, conversions, ROI, channel breakdown. The numbers that tell you what's working.
  • SocialMetrics — followers, engagement rate, sentiment score, top posts. The pulse of brand perception.
  • BrandGuidelines — voice, tone, messaging pillars, approved terminology, audience personas. The rules that content must follow.
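Two of these could be sketched as plain dataclasses. The exact fields below are assumptions drawn from the descriptions above, not the course's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class CompetitorProfile:
    name: str
    positioning: str
    pricing_tiers: list[str]
    recent_changes: list[str]
    threat_level: str      # e.g. "low" | "medium" | "high"
    observation_date: str  # each profile is a snapshot in time

@dataclass
class CampaignMetrics:
    campaign_id: str
    spend: float
    impressions: int
    clicks: int
    conversions: int

    @property
    def conversion_rate(self) -> float:
        # Derived on read rather than stored, so it can't drift out of sync.
        return self.conversions / self.clicks if self.clicks else 0.0
```

Typed structures catch field-level mistakes (a missing `threat_level`, a string where a number belongs) before the data reaches the pipeline.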

Loaders

One loader per data source. Each loader knows how to:

  • Read its specific format (JSON from APIs, CSV from exports, markdown from docs)
  • Extract meaningful fields into metadata — including temporal fields
  • Validate the output matches the Document interface

The key design decision: loaders are *stateless* and *idempotent*. You can re-run the pipeline anytime and get the same Documents from the same source files. This makes debugging trivial.
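A campaign loader following these rules might look like this. The file layout and field names (`summary`, `report_date`, `period`) are illustrative:

```python
import json
from pathlib import Path

def load_campaign_reports(path: str) -> list[dict]:
    """Read a JSON campaign export and return Document-shaped dicts.

    Stateless and idempotent: same file in, same Documents out, every run.
    """
    records = json.loads(Path(path).read_text())
    return [
        {
            "id": f"campaign-{rec['campaign_id']}",
            "content": rec["summary"],
            "source_type": "campaign",
            "metadata": {
                "campaign_id": rec["campaign_id"],
                "observation_date": rec["report_date"],  # temporal fields, added at ingestion
                "time_period": rec["period"],
            },
        }
        for rec in records
    ]
```

No caches, no global state: calling it twice on the same file provably yields identical Documents.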

Temporal Metadata

Every document gets temporal metadata at ingestion time:

  • observation_date — when this data was captured or reported
  • time_period — the quarter or month this covers (e.g., "2024-Q4")

This seems like a small detail, but it's the foundation of trend detection. Without temporal metadata, you can't ask "how has competitor X changed?" — you can only ask "what does competitor X look like right now?"
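The `time_period` label can be derived mechanically from the observation date. This helper is a small assumption on my part, not course code:

```python
from datetime import date

def to_time_period(observed: date) -> str:
    """Derive the quarter label from an observation date, e.g. "2024-Q4"."""
    quarter = (observed.month - 1) // 3 + 1
    return f"{observed.year}-Q{quarter}"

to_time_period(date(2024, 11, 3))  # "2024-Q4"
```

Deriving the label at ingestion, rather than at query time, means every downstream component can group and compare documents by period without re-parsing dates.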

Architecture Pattern

    Competitor Pages  ──→ Competitor Loader ────┐
    Social API Export ──→ Social Loader ────────┤
    Campaign CSV      ──→ Campaign Loader ──────┤──→ Validate ──→ Document[]
    Brand Guide PDF   ──→ Brand Loader ─────────┤      (with temporal metadata)
    Industry Reports  ──→ Industry Loader ──────┘

Each loader is independent. Adding a new data source (SEO rankings, email newsletters, pricing page snapshots) is one new loader — nothing else changes.
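The fan-in in the diagram is just function composition: run every registered loader, concatenate, validate. A minimal sketch, with stub loaders standing in for the five real ones:

```python
REQUIRED_FIELDS = {"id", "content", "source_type", "metadata"}

def is_valid(doc: dict) -> bool:
    # Boundary check from the diagram: every Document must carry these fields.
    return REQUIRED_FIELDS <= doc.keys()

# Illustrative stubs — real loaders would read files or API exports.
def load_competitors() -> list[dict]:
    return [{"id": "c1", "content": "Acme repositioned.", "source_type": "competitor", "metadata": {}}]

def load_social() -> list[dict]:
    return [{"id": "s1", "content": "Engagement up 12%.", "source_type": "social", "metadata": {}}]

def ingest(loaders) -> list[dict]:
    docs = [doc for loader in loaders for doc in loader()]
    return [d for d in docs if is_valid(d)]

documents = ingest([load_competitors, load_social])  # adding a source = appending one loader
```

Because `ingest` only depends on the loader signature, a new data source never touches existing code.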

Design Decisions

Why not use a framework like LlamaIndex or LangChain for ingestion? Frameworks add abstraction layers that obscure what's happening. For a learning project, you want to see every line of code that touches your data. In production, many teams end up writing custom loaders anyway because framework loaders don't handle their specific data formats well.

Why JSON for most data files? In production, you'd pull from APIs (social media, CRM, analytics platforms). For a learning sandbox, JSON files simulate API responses while being easy to inspect and modify. The loader interface stays the same — swap file reads for API calls when you go to production.

Why validate at ingestion time? Bad data caught early is cheap to fix. Bad data caught after it's been chunked, embedded, and stored in a vector database is expensive to fix — you have to re-run the entire pipeline. Validation at the boundary is a core enterprise pattern.
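A boundary validator that fails fast might look like the following — an illustrative check, not the course's actual validator:

```python
def validate(doc: dict) -> dict:
    """Reject bad documents at the ingestion boundary, before chunking or embedding."""
    missing = {"id", "content", "source_type", "metadata"} - doc.keys()
    if missing:
        raise ValueError(f"document missing fields: {sorted(missing)}")
    if "observation_date" not in doc["metadata"]:
        # Without temporal metadata, trend detection breaks silently later.
        raise ValueError(f"{doc['id']}: no temporal metadata")
    return doc
```

Raising here costs one re-run of a loader; the same bug discovered in the vector store costs a full pipeline rebuild.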

What You'll Build

  • Run the pre-seeded ingestion pipeline and see documents flow through from 5 marketing data sources
  • Explore competitor profiles, campaign reports, social metrics, brand guidelines, and industry reports
  • Walk through the loader code and the Document interface with temporal metadata
  • Extend the pipeline with a new data source or improved validation

Glossary

    Term                 Meaning
    Document             The universal data unit — text + metadata + source info + temporal data
    Loader               A function that reads one data format and returns Documents
    CompetitorProfile    Structured competitor data — positioning, features, pricing, changes
    CampaignMetrics      Campaign performance data — spend, impressions, conversions, ROI
    Temporal metadata    When data was observed and what period it covers — enables trend detection
    Ingestion pipeline   The full flow from raw files to validated, temporally-tagged Documents

This is chapter 1 of AI Marketing Intelligence.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
