
Data Lake

Ingest Marketing Intelligence

Why a Marketing Data Lake?

Marketing intelligence starts with data — but marketing data is uniquely messy. It lives in competitor websites, social media APIs, campaign dashboards, PDF brand guidelines, and analyst reports. Before an AI can detect trends or draft on-brand content, you need a unified ingestion pipeline that normalizes everything into a common format.

The challenge is that marketing data is inherently temporal. A competitor's positioning statement from January is fundamentally different from their October positioning — even if the words are similar. Your data lake must capture not just *what* was said, but *when* it was observed.

Key Concepts

Document Interface

The core abstraction. Every piece of marketing data — whether it's a competitor profile, a campaign report, or a brand guideline — becomes a Document with:

  • id — unique identifier
  • content — the actual text
  • metadata — source type, date, competitor name, campaign ID, time period
  • source_type — which loader produced it (competitor, social, campaign, brand, industry)

This universal contract is what makes the downstream pipeline composable. The retrieval system doesn't care whether a document came from a social media API or a hand-written brand guide — it just searches Documents.
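The contract described above can be sketched as a small dataclass. The field names follow the bullets; this is an illustration, not any particular library's API:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """One unit of marketing data, regardless of where it came from."""
    id: str           # unique identifier
    content: str      # the actual text
    source_type: str  # competitor | social | campaign | brand | industry
    metadata: dict = field(default_factory=dict)  # dates, competitor name, campaign ID, time period

doc = Document(
    id="comp-acme-2024-q4",
    content="Acme now positions itself as an AI-first analytics suite.",
    source_type="competitor",
    metadata={"competitor": "Acme", "time_period": "2024-Q4"},
)
```

Everything downstream — chunking, embedding, retrieval — operates on this one shape.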

Marketing-Specific Data Types

Beyond the base Document interface, marketing intelligence benefits from typed data structures:

  • CompetitorProfile — positioning, features, pricing tiers, recent strategy changes, threat level. Each profile is a snapshot in time; multiple snapshots enable trend detection.
  • CampaignMetrics — spend, impressions, clicks, conversions, ROI, channel breakdown. The numbers that tell you what's working.
  • SocialMetrics — followers, engagement rate, sentiment score, top posts. The pulse of brand perception.
  • BrandGuidelines — voice, tone, messaging pillars, approved terminology, audience personas. The rules that content must follow.
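Two of these could be sketched as plain dataclasses. The exact fields below are assumptions drawn from the descriptions above, not the course's actual definitions:

```python
from dataclasses import dataclass

@dataclass
class CompetitorProfile:
    name: str
    positioning: str
    pricing_tiers: list[str]
    recent_changes: list[str]
    threat_level: str      # e.g. "low" | "medium" | "high"
    observation_date: str  # each profile is a snapshot in time

@dataclass
class CampaignMetrics:
    campaign_id: str
    spend: float
    impressions: int
    clicks: int
    conversions: int

    @property
    def conversion_rate(self) -> float:
        # Derived on read rather than stored, so it can't drift out of sync.
        return self.conversions / self.clicks if self.clicks else 0.0
```

Typed structures catch field-level mistakes (a missing `threat_level`, a string where a number belongs) before the data reaches the pipeline.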

Loaders

One loader per data source. Each loader knows how to:

  • Read its specific format (JSON from APIs, CSV from exports, markdown from docs)
  • Extract meaningful fields into metadata — including temporal fields
  • Validate the output matches the Document interface

The key design decision: loaders are *stateless* and *idempotent*. You can re-run the pipeline anytime and get the same Documents from the same source files. This makes debugging trivial.
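A campaign loader following these rules might look like this. The file layout and field names (`summary`, `report_date`, `period`) are illustrative:

```python
import json
from pathlib import Path

def load_campaign_reports(path: str) -> list[dict]:
    """Read a JSON campaign export and return Document-shaped dicts.

    Stateless and idempotent: same file in, same Documents out, every run.
    """
    records = json.loads(Path(path).read_text())
    return [
        {
            "id": f"campaign-{rec['campaign_id']}",
            "content": rec["summary"],
            "source_type": "campaign",
            "metadata": {
                "campaign_id": rec["campaign_id"],
                "observation_date": rec["report_date"],  # temporal fields, added at ingestion
                "time_period": rec["period"],
            },
        }
        for rec in records
    ]
```

No caches, no global state: calling it twice on the same file provably yields identical Documents.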

Temporal Metadata

Every document gets temporal metadata at ingestion time:

  • observation_date — when this data was captured or reported
  • time_period — the quarter or month this covers (e.g., "2024-Q4")

This seems like a small detail, but it's the foundation of trend detection. Without temporal metadata, you can't ask "how has competitor X changed?" — you can only ask "what does competitor X look like right now?"
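The `time_period` label can be derived mechanically from the observation date. This helper is a small assumption on my part, not course code:

```python
from datetime import date

def to_time_period(observed: date) -> str:
    """Derive the quarter label from an observation date, e.g. "2024-Q4"."""
    quarter = (observed.month - 1) // 3 + 1
    return f"{observed.year}-Q{quarter}"

to_time_period(date(2024, 11, 3))  # "2024-Q4"
```

Deriving the label at ingestion, rather than at query time, means every downstream component can group and compare documents by period without re-parsing dates.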

Architecture Pattern

    Competitor Pages  ──→ Competitor Loader ────┐
    Social API Export ──→ Social Loader ────────┤
    Campaign CSV      ──→ Campaign Loader ──────┤──→ Validate ──→ Document[]
    Brand Guide PDF   ──→ Brand Loader ─────────┤      (with temporal metadata)
    Industry Reports  ──→ Industry Loader ──────┘

Each loader is independent. Adding a new data source (SEO rankings, email newsletters, pricing page snapshots) is one new loader — nothing else changes.
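The fan-in in the diagram is just function composition: run every registered loader, concatenate, validate. A minimal sketch, with stub loaders standing in for the five real ones:

```python
REQUIRED_FIELDS = {"id", "content", "source_type", "metadata"}

def is_valid(doc: dict) -> bool:
    # Boundary check from the diagram: every Document must carry these fields.
    return REQUIRED_FIELDS <= doc.keys()

# Illustrative stubs — real loaders would read files or API exports.
def load_competitors() -> list[dict]:
    return [{"id": "c1", "content": "Acme repositioned.", "source_type": "competitor", "metadata": {}}]

def load_social() -> list[dict]:
    return [{"id": "s1", "content": "Engagement up 12%.", "source_type": "social", "metadata": {}}]

def ingest(loaders) -> list[dict]:
    docs = [doc for loader in loaders for doc in loader()]
    return [d for d in docs if is_valid(d)]

documents = ingest([load_competitors, load_social])  # adding a source = appending one loader
```

Because `ingest` only depends on the loader signature, a new data source never touches existing code.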

Design Decisions

Why not use a framework like LlamaIndex or LangChain for ingestion? Frameworks add abstraction layers that obscure what's happening. For a learning project, you want to see every line of code that touches your data. In production, many teams end up writing custom loaders anyway because framework loaders don't handle their specific data formats well.

Why JSON for most data files? In production, you'd pull from APIs (social media, CRM, analytics platforms). For a learning sandbox, JSON files simulate API responses while being easy to inspect and modify. The loader interface stays the same — swap file reads for API calls when you go to production.

Why validate at ingestion time? Bad data caught early is cheap to fix. Bad data caught after it's been chunked, embedded, and stored in a vector database is expensive to fix — you have to re-run the entire pipeline. Validation at the boundary is a core enterprise pattern.
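A boundary validator that fails fast might look like the following — an illustrative check, not the course's actual validator:

```python
def validate(doc: dict) -> dict:
    """Reject bad documents at the ingestion boundary, before chunking or embedding."""
    missing = {"id", "content", "source_type", "metadata"} - doc.keys()
    if missing:
        raise ValueError(f"document missing fields: {sorted(missing)}")
    if "observation_date" not in doc["metadata"]:
        # Without temporal metadata, trend detection breaks silently later.
        raise ValueError(f"{doc['id']}: no temporal metadata")
    return doc
```

Raising here costs one re-run of a loader; the same bug discovered in the vector store costs a full pipeline rebuild.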

What You'll Build

  • Run the pre-seeded ingestion pipeline and see documents flow through from 5 marketing data sources
  • Explore competitor profiles, campaign reports, social metrics, brand guidelines, and industry reports
  • Walk through the loader code and the Document interface with temporal metadata
  • Extend the pipeline with a new data source or improved validation

Glossary

    Term                 Meaning
    Document             The universal data unit — text + metadata + source info + temporal data
    Loader               A function that reads one data format and returns Documents
    CompetitorProfile    Structured competitor data — positioning, features, pricing, changes
    CampaignMetrics      Campaign performance data — spend, impressions, conversions, ROI
    Temporal metadata    When data was observed and what period it covers — enables trend detection
    Ingestion pipeline   The full flow from raw files to validated, temporally-tagged Documents

This is chapter 1 of AI Marketing Intelligence.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
