Encoding Pipeline
Time-Aware Embeddings
From Documents to Searchable Vectors
Raw documents are too large and unstructured for an AI to consume efficiently. The encoding pipeline transforms them into small, searchable chunks with vector embeddings — numerical representations of meaning that enable semantic search.
For marketing intelligence, there's an additional requirement: time-awareness. A standard RAG pipeline treats all chunks equally. A marketing intelligence pipeline must know that Q4 competitor data is more relevant than Q1 data when answering "what are competitors doing now?" — and that both are needed when answering "how have competitors changed this year?"
Chunking Strategies
Chunking is the art of splitting documents into pieces that are small enough for precise retrieval but large enough to preserve context. The right strategy depends on the document type.
Fixed-Size Chunking
The simplest approach: split text into windows of N characters with M characters of overlap. The overlap ensures that information at chunk boundaries isn't lost.
For marketing data, 1500 characters with 200-character overlap works well — large enough to capture a complete competitor positioning statement, small enough to avoid mixing unrelated metrics.
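A minimal sketch of this windowing approach (the function name and defaults are illustrative, not from the course code):

```typescript
// Split text into fixed-size windows with overlap, so information that
// straddles a boundary appears complete in at least one chunk.
function fixedSizeChunks(text: string, size = 1500, overlap = 200): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  const step = size - overlap; // advance by size minus overlap each window
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk repeats the last 200 characters of its predecessor, which is what keeps a positioning statement intact even when the window boundary lands in the middle of it.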
Document-Aware Chunking
More sophisticated: split on structural boundaries (section headers, paragraph breaks, list boundaries). This preserves logical units of information.
For competitor profiles, this means each section (positioning, features, pricing, recent changes) becomes its own chunk. For campaign reports, each campaign stays together. The AI can then retrieve "CompetitorX's pricing" without also getting their feature list.
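A sketch of header-based splitting, assuming the profiles are markdown (the function name is illustrative):

```typescript
// Split a markdown document on section headers so each logical section
// (e.g. "## Pricing") becomes its own retrievable chunk.
function sectionChunks(markdown: string): { heading: string; body: string }[] {
  const sections: { heading: string; body: string }[] = [];
  let current = { heading: "", body: "" };
  for (const line of markdown.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      // A new header starts a new section; flush the previous one.
      if (current.body.trim()) sections.push(current);
      current = { heading: line.replace(/^#+\s*/, ""), body: "" };
    } else {
      current.body += line + "\n";
    }
  }
  if (current.body.trim()) sections.push(current);
  return sections;
}
```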
Source-Specific Strategies
| Source Type | Strategy | Why |
|---|---|---|
| Competitor profiles | Section-aware | Positioning, features, pricing are distinct retrievable units |
| Campaign reports | Metric-grouped | Keep spend/impressions/conversions together for coherent analysis |
| Social metrics | Single chunk per platform | Platform metrics are compact and interdependent |
| Brand guidelines | Section-aware | Tone rules, messaging pillars, personas are distinct topics |
| Industry reports | Finding-based | Each key finding or prediction is an independent insight |
Time-Aware Metadata
This is the critical differentiator between a basic RAG pipeline and a marketing intelligence system.
Observation Date
Every chunk gets an observation_date — when this specific data point was captured or reported. For a competitor profile scraped on November 15, all chunks get that date. For a Q4 campaign report, chunks get the report date.
Time Period
Chunks also get a time_period field indicating what period the data covers. A campaign report covering October gets time_period: "2024-10". An industry report covering Q4 gets time_period: "2024-Q4".
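Both fields can be derived at encoding time. A sketch, assuming the deriveTimePeriod helper (the field names observation_date and time_period match the ones above; everything else is illustrative):

```typescript
// Temporal metadata attached to every chunk at encoding time.
interface TemporalMetadata {
  observation_date: string; // ISO date when the data point was captured
  time_period: string;      // period the data covers, e.g. "2024-10" or "2024-Q4"
}

// Format a date as the period string used in chunk metadata.
function deriveTimePeriod(date: Date, granularity: "month" | "quarter"): string {
  const year = date.getUTCFullYear();
  if (granularity === "month") {
    return `${year}-${String(date.getUTCMonth() + 1).padStart(2, "0")}`;
  }
  const quarter = Math.floor(date.getUTCMonth() / 3) + 1;
  return `${year}-Q${quarter}`;
}
```

A monthly campaign report gets month granularity; a quarterly industry report gets quarter granularity.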
Freshness Score
A computed score (0 to 1) that decays over time. A chunk from yesterday has a freshness score near 1.0. A chunk from 6 months ago has a score near 0.2. The decay function is exponential with a configurable half-life — 90 days works well for most marketing data.
freshness = e^(-0.693 * days_old / half_life)

This isn't stored in the embedding itself — it's metadata that the retrieval system uses during reranking. But it must be computed and stored at encoding time so retrieval can use it efficiently.
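A direct translation of that formula (the function name is illustrative):

```typescript
// Exponential decay with a configurable half-life: a chunk's score
// halves every `halfLifeDays`. The 0.693 in the formula is ln 2.
function freshnessScore(observationDate: Date, now: Date, halfLifeDays = 90): number {
  const msPerDay = 24 * 60 * 60 * 1000;
  const daysOld = Math.max(0, (now.getTime() - observationDate.getTime()) / msPerDay);
  return Math.exp((-Math.LN2 * daysOld) / halfLifeDays);
}
```

With the default 90-day half-life, a chunk observed 90 days ago scores 0.5, and one observed 180 days ago scores 0.25.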
Embedding Generation
Embeddings convert text chunks into dense vectors that capture semantic meaning. Two chunks about "market positioning" will have similar vectors even if they use different words.
For a learning project, local embeddings (via @xenova/transformers with the all-MiniLM-L6-v2 model) are ideal: no API costs, no rate limits, deterministic results. The model produces 384-dimensional vectors — smaller than OpenAI's 1536 dimensions but sufficient for our dataset size.
Why Not Use OpenAI Embeddings?
In production, you might. But for development, local embeddings win on every axis that matters here: no per-request API costs, no rate limits to throttle bulk re-encoding runs, and deterministic output that makes tests reproducible.
pgvector Storage
PostgreSQL with the pgvector extension gives you a production-grade vector database without managing a separate system. Your marketing chunks table needs:
CREATE TABLE marketing_chunks (
id TEXT PRIMARY KEY,
document_id TEXT NOT NULL,
content TEXT NOT NULL,
chunk_index INTEGER NOT NULL,
metadata JSONB NOT NULL DEFAULT '{}',
embedding vector(384),
observation_date TIMESTAMPTZ,
created_at TIMESTAMPTZ DEFAULT NOW()
);

HNSW vs IVFFlat
Two indexing strategies for vector search:

- IVFFlat partitions vectors into clusters and searches only the nearest clusters. It builds quickly and uses less memory, but it must be built after the table already has data, and recall degrades as new vectors are inserted.
- HNSW builds a layered proximity graph. It is slower to build and uses more memory, but it offers a better speed/recall trade-off at query time and handles incremental inserts without retraining.

For marketing intelligence, HNSW is the clear winner. Your data changes constantly (new competitor snapshots, new campaign results), and HNSW handles inserts without degradation.
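Given the marketing_chunks table above, creating the HNSW index is a single statement (the index name is illustrative; the operator class must match the distance operator your queries use — vector_cosine_ops shown here for cosine distance):

```sql
-- HNSW index for cosine-distance search over the 384-dim embeddings.
CREATE INDEX idx_marketing_chunks_embedding
  ON marketing_chunks
  USING hnsw (embedding vector_cosine_ops);
```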
The Full Pipeline
Documents → Chunk (source-aware) → Enrich (temporal metadata)
→ Embed (local model) → Store (pgvector + HNSW)

Running npm run encode executes the entire pipeline: ingest → chunk → embed → store. The output tells you exactly how many documents became how many chunks with what embedding dimensions.
What You'll Build
Glossary
| Term | Meaning |
|---|---|
| Chunk | A slice of a Document, sized for embedding and retrieval |
| Embedding | A dense vector (array of numbers) representing the meaning of text |
| pgvector | PostgreSQL extension for storing and querying vector embeddings |
| HNSW | Hierarchical Navigable Small World — an index for fast approximate nearest neighbor search |
| Freshness score | A time-decaying score (0-1) indicating how recent a chunk's data is |
| Observation date | When the data in a chunk was captured or reported |
This is chapter 2 of AI Marketing Intelligence.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.