
Encoding Pipeline

Table-Aware Chunking & Embeddings

Why Financial Chunking Is Different

Standard RAG chunking splits text at fixed token boundaries — 512 tokens, slide forward by 50, repeat. This works for blog posts and documentation, but it destroys financial data.

Consider an income statement:

| | Q3 2024 | Q3 2023 | Change |
|---|---|---|---|
| Revenue | $4,210M | $3,890M | +8.2% |
| Cost of Revenue | $2,105M | $1,984M | +6.1% |
| Gross Profit | $2,105M | $1,906M | +10.4% |
| Gross Margin | 50.0% | 49.0% | +100bps |

If your chunker splits this table at row 3, the AI sees "Revenue: $4,210M" in one chunk and "Gross Margin: 50.0%" in another. It cannot verify that the margin is correct (Gross Profit / Revenue = 50.0%). Worse, it might hallucinate a margin by dividing numbers from different chunks.

Rule: Financial tables must be kept intact as single chunks.

Three Chunking Strategies

1. Table-Aware Chunking

Detect financial tables in filings and preserve them as atomic units. A table is detected by:

  • Multiple columns of numbers with $, %, or M/B/K suffixes
  • Row headers matching known financial line items (Revenue, Net Income, EPS, etc.)
  • Consistent column alignment or delimiter patterns

Each table becomes one chunk with metadata:

  • is_table: true
  • table_type: "income_statement" | "balance_sheet" | "cash_flow" | "other"
  • period: The fiscal period(s) covered
  • metrics: Array of metric names found (["revenue", "gross_margin", "ebitda"])
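The detection rules above can be sketched in a few lines of Python. The `LINE_ITEMS` set and the two-row threshold are illustrative assumptions; a production pipeline would use a much larger dictionary of line items and a real table parser.

```python
import re

# Assumption: a short illustrative list -- production would use a much
# larger dictionary of known financial line items.
LINE_ITEMS = {"revenue", "cost of revenue", "gross profit", "gross margin",
              "operating income", "net income", "eps", "ebitda"}

# Matches numeric cells like "$4,210M", "50.0%", "+8.2%".
NUMBER = re.compile(r"[$]?\d[\d,.]*[%MBKmbk]?")

def _label(line):
    """Row header = text before the first column break (2+ spaces, tab, pipe)."""
    return re.split(r"\s{2,}|\t|\|", line.strip())[0].lower()

def looks_like_financial_table(lines):
    """Heuristic: at least two rows whose header is a known line item
    and that carry multiple numeric columns."""
    hits = sum(1 for line in lines
               if _label(line) in LINE_ITEMS and len(NUMBER.findall(line)) >= 2)
    return hits >= 2

def table_chunk(lines, period):
    """Wrap a detected table as one atomic chunk with the metadata fields above."""
    labels = [_label(line) for line in lines]
    return {
        "content": "\n".join(lines),
        "metadata": {
            "is_table": True,
            # Crude type guess for the sketch: presence of a revenue row.
            "table_type": "income_statement" if "revenue" in labels else "other",
            "period": period,
            "metrics": [l.replace(" ", "_") for l in labels if l in LINE_ITEMS],
        },
    }
```

The whole table goes into `content` unmodified, so a retrieved chunk always carries every row needed to verify a derived figure like gross margin.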
2. Narrative Chunking

For non-table content (MD&A, risk factors, executive commentary), use section-aware chunking:

  • Split at section headers (## Risk Factors, ## Liquidity, etc.)
  • If a section exceeds 512 tokens, split at paragraph boundaries within the section
  • Keep a 50-token overlap between chunks to maintain context continuity
  • Preserve the section name in metadata for filtering
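A minimal sketch of this narrative chunker, with two stated simplifications: tokens are approximated by whitespace-split words (a real pipeline would count with the embedding model's tokenizer), and a paragraph longer than the budget is split at word boundaries rather than held intact.

```python
def narrative_chunks(text, max_tokens=512, overlap=50):
    """Section-aware chunking sketch. Tokens are approximated by whitespace
    words; a real pipeline would use the embedding model's tokenizer."""
    chunks, section = [], "preamble"
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("## "):           # section header starts a new section
            section = block.lstrip("# ").strip()
            continue
        words = block.split()
        start = 0
        while start < len(words):
            piece = words[start:start + max_tokens]
            chunks.append({"content": " ".join(piece),
                           "metadata": {"section_name": section}})
            if start + max_tokens >= len(words):
                break
            start += max_tokens - overlap     # keep a 50-token overlap
    return chunks
```

Every chunk carries its `section_name`, so a later query can be restricted to, say, Risk Factors without re-parsing the filing.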
3. Speaker-Turn Chunking

For earnings call Q&A sections:

  • Each analyst question + executive answer = one chunk
  • Preserve speaker names and titles in metadata
  • Tag with topic if detectable (e.g., "margins", "guidance", "competition")

This means a query like "What did the CFO say about margins?" can retrieve the exact exchange where the CFO discussed margin trends.
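A sketch of the speaker-turn pairing, assuming the transcript has already been parsed into ordered `(speaker, title, text)` turns and that analyst turns are recognizable from the title (both assumptions; real call transcripts need more robust parsing):

```python
def speaker_turn_chunks(transcript):
    """Pair each analyst question with the executive answer that follows.
    `transcript` is a list of (speaker, title, text) turns in order."""
    chunks, pending = [], None
    for speaker, title, text in transcript:
        if "analyst" in title.lower():
            pending = (speaker, title, text)   # hold the question
        elif pending:
            q_speaker, _, question = pending
            chunks.append({
                "content": f"Q ({q_speaker}): {question}\nA ({speaker}): {text}",
                "metadata": {"speaker": f"{title} {speaker}",
                             "analyst": q_speaker},
            })
            pending = None
    return chunks
```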

Metadata Enrichment

Financial metadata is the key to precise retrieval. Every chunk gets:

| Field | Example | Why It Matters |
|---|---|---|
| `ticker` | "NVDA" | Filter by company |
| `fiscal_period` | "Q3-2024" | Filter by time period |
| `filing_type` | "10-K" | Distinguish annual vs. quarterly |
| `section_name` | "MD&A" | Filter by document section |
| `is_table` | true | Identify structured vs. narrative data |
| `metrics` | ["revenue", "margin"] | Find chunks with specific financial metrics |
| `speaker` | "CFO John Smith" | Filter earnings calls by speaker |
| `data_date` | "2024-10-28" | Track data freshness |

This metadata powers the structured filters in Module 3. Without it, you'd need to rely entirely on semantic search — which cannot reliably find "NVDA's Q3 2024 gross margin."
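In-memory, the structured filtering amounts to exact matches on these fields before any vector math; in pgvector the same conditions become WHERE clauses on the JSONB column. A minimal sketch (the function name and chunk shape are illustrative assumptions):

```python
def filter_chunks(chunks, **required):
    """Exact-match metadata pre-filter applied before any vector search."""
    return [chunk for chunk in chunks
            if all(chunk["metadata"].get(key) == value
                   for key, value in required.items())]
```

For example, `filter_chunks(chunks, ticker="NVDA", fiscal_period="Q3-2024", is_table=True)` narrows retrieval to exactly the tables that can answer a Q3 2024 margin question.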

Embedding Considerations for Financial Data

Financial text poses unique challenges for embedding models:

Number Blindness

Most embedding models treat numbers as tokens, not quantities. "Revenue of $4.2 billion" and "Revenue of $420 million" produce similar embeddings because the words around them are the same. The model doesn't understand that these figures differ by 10x.

Mitigation: Store raw numbers as structured metadata, not just in the text. Use metadata filters for numerical queries, not just vector similarity.
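Getting numbers into metadata means parsing them out of the prose as real quantities. A sketch of such an extractor (the regex and scale table are assumptions covering the common "$4.2 billion" / "$4,210M" forms, not an exhaustive parser):

```python
import re

_SCALE = {"thousand": 1e3, "k": 1e3, "million": 1e6, "m": 1e6,
          "billion": 1e9, "b": 1e9}

# Dollar amount, optional decimal, optional scale word or M/B/K suffix.
_DOLLARS = re.compile(r"\$([\d,]+(?:\.\d+)?)\s*(billion|million|thousand|[bmk])?\b",
                      re.IGNORECASE)

def extract_dollar_amounts(text):
    """Turn dollar figures in narrative text into real quantities so they can
    be stored in metadata instead of living only in the embedded string."""
    amounts = []
    for match in _DOLLARS.finditer(text):
        value = float(match.group(1).replace(",", ""))
        scale = _SCALE.get((match.group(2) or "").lower(), 1)
        amounts.append(value * scale)
    return amounts
```

With the quantities stored numerically, "revenue above $4B" becomes a metadata comparison the embedding model never has to get right.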

Jargon Density

Financial text is dense with domain-specific terms: EBITDA, basis points, diluted EPS, free cash flow yield. General-purpose embedding models may not distinguish "operating margin" from "gross margin" as effectively as a domain-specific model would.

Mitigation: Use a model fine-tuned on financial text if available. Otherwise, ensure your hybrid search (Module 3) combines embeddings with keyword matching for financial terms.

Table Embeddings

Embedding a financial table as plain text loses the structural relationships between rows and columns. A table flattened to "Revenue: $4.2B | Net Income: $890M" becomes ambiguous.

Mitigation: When embedding tables, prepend a natural-language summary: "Income statement for NVDA Q3 2024 showing revenue of $4.2B, net income of $890M, gross margin of 50.0%." This gives the embedding model semantic context for the numerical data.
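The summary-prepending step can be as simple as string assembly. A sketch, assuming the table rows have already been parsed into `(metric, value)` pairs (the function name and row shape are illustrative):

```python
def table_embedding_text(ticker, period, rows):
    """Prepend a natural-language summary to a table before embedding.
    `rows` is a list of (metric, value) pairs parsed out of the table."""
    summary = ", ".join(f"{metric} of {value}" for metric, value in rows)
    preamble = f"Income statement for {ticker} {period} showing {summary}."
    table_text = "\n".join(f"{metric}: {value}" for metric, value in rows)
    return preamble + "\n" + table_text
```

The returned string, not the raw table, is what goes to the embedding model; the raw table still lives in `content` for the generator to cite.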

pgvector Storage

The finance_chunks table stores everything:

```sql
CREATE TABLE finance_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id TEXT NOT NULL,
  chunk_index INT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(384) NOT NULL,
  metadata JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON finance_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```

Why HNSW over IVFFlat? HNSW delivers consistent sub-millisecond queries without needing periodic rebuilds. IVFFlat is faster to build, but its recall degrades as data is inserted because the cluster centroids are fixed at index-build time. For financial data that's continuously updated (new filings, market data), HNSW's insert-without-rebuild property is essential.
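At query time, the HNSW index and the JSONB metadata combine in a single statement. A sketch, assuming the `finance_chunks` schema above and a 384-dim query embedding bound as parameter `$1`:

```sql
-- Top-5 most similar chunks, restricted to NVDA's Q3-2024 tables.
SELECT content, metadata,
       1 - (embedding <=> $1) AS similarity   -- <=> is cosine distance
FROM finance_chunks
WHERE metadata->>'ticker' = 'NVDA'
  AND metadata->>'fiscal_period' = 'Q3-2024'
  AND (metadata->>'is_table')::boolean = true
ORDER BY embedding <=> $1
LIMIT 5;
```

One caveat: pgvector applies WHERE filters to candidates produced by the index scan, so highly selective filters can return fewer than `LIMIT` rows on older versions; pgvector 0.8+ adds iterative index scans to compensate.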

What You'll Build

  • Table detection heuristic that identifies income statements, balance sheets, and cash flow tables
  • Table-aware chunker that keeps financial tables as atomic chunks
  • Speaker-turn chunker for earnings call Q&A sections
  • Metadata enrichment pipeline with financial fields
  • Embedding generation with table summarization
  • pgvector storage with HNSW indexing

This is chapter 2 of AI Finance Analyst.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.

View course details