Encoding Pipeline
Time-Aware Embeddings
From Documents to Searchable Vectors
Raw documents are too large and unstructured for an AI to consume efficiently. The encoding pipeline transforms them into small, searchable chunks with vector embeddings — numerical representations of meaning that enable semantic search.
For marketing intelligence, there's an additional requirement: time-awareness. A standard RAG pipeline treats all chunks equally. A marketing intelligence pipeline must know that Q4 competitor data is more relevant than Q1 data when answering "what are competitors doing now?" — and that both are needed when answering "how have competitors changed this year?"
Chunking Strategies
Chunking is the art of splitting documents into pieces that are small enough for precise retrieval but large enough to preserve context. The right strategy depends on the document type.
Fixed-Size Chunking
The simplest approach: split text into windows of N characters with M characters of overlap. The overlap ensures that information at chunk boundaries isn't lost.
For marketing data, 1500 characters with 200-character overlap works well — large enough to capture a complete competitor positioning statement, small enough to avoid mixing unrelated metrics.
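A minimal sketch of this windowing approach (the function name and defaults are illustrative, not from the course code):

```typescript
// Split text into fixed-size windows with overlap, so information that
// straddles a boundary appears complete in at least one chunk.
function fixedSizeChunks(text: string, size = 1500, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  const step = size - overlap; // advance by size minus overlap each window
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk repeats the last 200 characters of its predecessor, which is what keeps a positioning statement intact even when the window boundary lands in the middle of it.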
Document-Aware Chunking
More sophisticated: split on structural boundaries (section headers, paragraph breaks, list boundaries). This preserves logical units of information.
For competitor profiles, this means each section (positioning, features, pricing, recent changes) becomes its own chunk. For campaign reports, each campaign stays together. The AI can then retrieve "CompetitorX's pricing" without also getting their feature list.
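A sketch of header-based splitting, assuming the profiles are markdown (the function name is illustrative):

```typescript
// Split a markdown document on section headers so each logical section
// (e.g. "## Pricing") becomes its own retrievable chunk.
function sectionChunks(markdown: string): { heading: string; body: string }[] {
  const sections: { heading: string; body: string }[] = [];
  let current = { heading: "", body: "" };
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      // A new header starts a new section; flush the previous one.
      if (current.body.trim()) sections.push(current);
      current = { heading: line.replace(/^#+\s*/, ""), body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  if (current.body.trim()) sections.push(current);
  return sections;
}
```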
Source-Specific Strategies
| Source Type | Strategy | Why |
|---|---|---|
| Competitor profiles | Section-aware | Positioning, features, pricing are distinct retrievable units |
| Campaign reports | Metric-grouped | Keep spend/impressions/conversions together for coherent analysis |
| Social metrics | Single chunk per platform | Platform metrics are compact and interdependent |
| Brand guidelines | Section-aware | Tone rules, messaging pillars, personas are distinct topics |
| Industry reports | Finding-based | Each key finding or prediction is an independent insight |
Time-Aware Metadata
This is the critical differentiator between a basic RAG pipeline and a marketing intelligence system.
Observation Date
Every chunk gets an observation_date — when this specific data point was captured or reported. For a competitor profile scraped on November 15, all chunks get that date. For a Q4 campaign report, chunks get the report date.
Time Period
Chunks also get a time_period field indicating what period the data covers. A campaign report covering October gets time_period: "2024-10". An industry report covering Q4 gets time_period: "2024-Q4".
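Both fields can be derived at encoding time. A sketch, assuming the deriveTimePeriod helper (the field names observation_date and time_period match the ones above; everything else is illustrative):

```typescript
// Temporal metadata attached to every chunk at encoding time.
interface TemporalMetadata {
  observation_date: string; // ISO date when the data point was captured
  time_period: string;      // period the data covers, e.g. "2024-10" or "2024-Q4"
}

// Format a date as the period string used in chunk metadata.
function deriveTimePeriod(date: Date, granularity: "month" | "quarter"): string {
  const year = date.getUTCFullYear();
  if (granularity === "month") {
    return `${year}-${String(date.getUTCMonth() + 1).padStart(2, "0")}`;
  }
  const quarter = Math.floor(date.getUTCMonth() / 3) + 1;
  return `${year}-Q${quarter}`;
}
```

A monthly campaign report gets month granularity; a quarterly industry report gets quarter granularity.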
Freshness Score
A computed score (0 to 1) that decays over time. A chunk from yesterday has a freshness score near 1.0. A chunk from 6 months ago has a score near 0.2. The decay function is exponential with a configurable half-life — 90 days works well for most marketing data.
freshness = e^(-0.693 * days_old / half_life)

This isn't stored in the embedding itself — it's metadata that the retrieval system uses during reranking. But it must be computed and stored at encoding time so retrieval can use it efficiently.
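A direct translation of that formula (the function name is illustrative):

```typescript
// Exponential decay with a configurable half-life: a chunk's score
// halves every `halfLifeDays`. The 0.693 in the formula is ln 2.
function freshnessScore(observationDate: Date, now: Date, halfLifeDays = 90): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysOld = Math.max(0, (now.getTime() - observationDate.getTime()) / msPerDay);
  return Math.exp((-Math.LN2 * daysOld) / halfLifeDays);
}
```

With the default 90-day half-life, a chunk observed 90 days ago scores 0.5, and one observed 180 days ago scores 0.25.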
Embedding Generation
Embeddings convert text chunks into dense vectors that capture semantic meaning. Two chunks about "market positioning" will have similar vectors even if they use different words.
For a learning project, local embeddings (via @xenova/transformers with the all-MiniLM-L6-v2 model) are ideal: no API costs, no rate limits, deterministic results. The model produces 384-dimensional vectors — smaller than OpenAI's 1536 dimensions but sufficient for our dataset size.
Why Not Use OpenAI Embeddings?
In production, you might. But for development, local embeddings win on every axis that matters here: no per-request API costs, no rate limits to throttle bulk re-encoding runs, and deterministic output that makes tests reproducible.
pgvector Storage
PostgreSQL with the pgvector extension gives you a production-grade vector database without managing a separate system. Your marketing chunks table needs:
CREATE TABLE marketing_chunks (
id TEXT PRIMARY KEY,
document_id TEXT NOT NULL,
content TEXT NOT NULL,
chunk_index INTEGER NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}',
embedding vector(384),
observation_date TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);

HNSW vs IVFFlat
Two indexing strategies for vector search:

- IVFFlat partitions vectors into clusters and searches only the nearest clusters. It builds quickly and uses less memory, but it must be built after the table already has data, and recall degrades as new vectors are inserted.
- HNSW builds a layered proximity graph. It is slower to build and uses more memory, but it offers a better speed/recall trade-off at query time and handles incremental inserts without retraining.

For marketing intelligence, HNSW is the clear winner. Your data changes constantly (new competitor snapshots, new campaign results), and HNSW handles inserts without degradation.
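Given the marketing_chunks table above, creating the HNSW index is a single statement (the index name is illustrative; the operator class must match the distance operator your queries use — vector_cosine_ops shown here for cosine distance):

```sql
-- HNSW index for cosine-distance search over the 384-dim embeddings.
CREATE INDEX idx_marketing_chunks_embedding
  ON marketing_chunks
  USING hnsw (embedding vector_cosine_ops);
```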
The Full Pipeline
Documents → Chunk (source-aware) → Enrich (temporal metadata)
→ Embed (local model) → Store (pgvector + HNSW)

Running npm run encode executes the entire pipeline: ingest → chunk → embed → store. The output tells you exactly how many documents became how many chunks with what embedding dimensions.
What You'll Build
Glossary
| Term | Meaning |
|---|---|
| Chunk | A slice of a Document, sized for embedding and retrieval |
| Embedding | A dense vector (array of numbers) representing the meaning of text |
| pgvector | PostgreSQL extension for storing and querying vector embeddings |
| HNSW | Hierarchical Navigable Small World — an index for fast approximate nearest neighbor search |
| Freshness score | A time-decaying score (0-1) indicating how recent a chunk's data is |
| Observation date | When the data in a chunk was captured or reported |
This is chapter 2 of AI Marketing Intelligence.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.