Encoding Pipeline
Table-Aware Chunking & Embeddings
Why Financial Chunking Is Different
Standard RAG chunking splits text at fixed token boundaries: 512 tokens per chunk, a 50-token overlap with the next chunk, repeat. This works for blog posts and documentation, but it destroys financial data.
Consider an income statement:
| | Q3 2024 | Q3 2023 | Change |
|---|---|---|---|
| Revenue | $4,210M | $3,890M | +8.2% |
| Cost of Revenue | $2,105M | $1,984M | +6.1% |
| Gross Profit | $2,105M | $1,906M | +10.4% |
| Gross Margin | 50.0% | 49.0% | +100bps |
If your chunker splits this table at row 3, the AI sees "Revenue: $4,210M" in one chunk and "Gross Margin: 50.0%" in another. It cannot verify that the margin is correct (Gross Profit / Revenue = 50.0%). Worse, it might hallucinate a margin by dividing numbers from different chunks.
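The cross-check described above only works when every input sits in the same chunk. As a small worked example, the following sketch recomputes both margins and the basis-point change from the table's own figures; the variable and function names are illustrative, not part of the course code.

```python
# Figures copied from the income statement above, in $ millions.
revenue = {"Q3 2024": 4210, "Q3 2023": 3890}
cost_of_revenue = {"Q3 2024": 2105, "Q3 2023": 1984}


def gross_margin_pct(period: str) -> float:
    """Gross margin = (revenue - cost of revenue) / revenue, as a percentage."""
    gross_profit = revenue[period] - cost_of_revenue[period]
    return 100 * gross_profit / revenue[period]


margin_2024 = gross_margin_pct("Q3 2024")
margin_2023 = gross_margin_pct("Q3 2023")

# One percentage point equals 100 basis points.
change_bps = (margin_2024 - margin_2023) * 100

print(f"{margin_2024:.1f}% vs {margin_2023:.1f}% ({change_bps:+.0f} bps)")
```

Both margins and the +100bps change agree with the table. A model that only sees the Revenue row or only the Gross Margin row cannot run this check.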
Rule: Financial tables must be kept intact as single chunks.
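A minimal sketch of this rule, assuming the filing has already been converted to text in which table rows start with `|` (an assumption about the input format, not something the course prescribes). Narrative text may be split by size; a run of table rows is never split. The function name and the `max_chars` parameter are hypothetical.

```python
def chunk_keeping_tables(text: str, max_chars: int = 400) -> list[dict]:
    """Split text into chunks without ever breaking a pipe-delimited table."""
    chunks: list[dict] = []
    buffer: list[str] = []
    in_table = False

    def flush() -> None:
        # Emit whatever is buffered as one chunk, tagged table or narrative.
        if buffer:
            chunks.append({"content": "\n".join(buffer), "is_table": in_table})
            buffer.clear()

    for line in text.splitlines():
        is_row = line.lstrip().startswith("|")
        if is_row != in_table:
            flush()  # boundary between narrative and table
            in_table = is_row
        elif not in_table:
            buffered = sum(len(s) + 1 for s in buffer)
            if buffered + len(line) > max_chars:
                flush()  # only narrative text is split by size
        if line.strip():
            buffer.append(line)
    flush()
    return chunks


# Made-up sample: two narrative sentences around a table from this chapter.
sample = """Revenue grew year over year.
| | Q3 2024 | Q3 2023 |
|---|---|---|
| Revenue | $4,210M | $3,890M |
| Gross Margin | 50.0% | 49.0% |
Management commentary follows the table."""

chunks = chunk_keeping_tables(sample, max_chars=60)
for chunk in chunks:
    print(chunk["is_table"], len(chunk["content"].splitlines()))
```

Even with a `max_chars` far smaller than the table, the four table lines stay together in one chunk.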
Three Chunking Strategies
1. Table-Aware Chunking
Detect financial tables in filings and preserve them as atomic units. A table is detected by:
- $, %, or M/B/K suffixes

Each table becomes one chunk with metadata:
- `is_table`: true
- `table_type`: "income_statement" | "balance_sheet" | "cash_flow" | "other"
- `period`: The fiscal period(s) covered
- `metrics`: Array of metric names found (["revenue", "gross_margin", "ebitda"])

2. Narrative Chunking
For non-table content (MD&A, risk factors, executive commentary), use section-aware chunking:
3. Speaker-Turn Chunking
For earnings call Q&A sections:
This means a query like "What did the CFO say about margins?" can retrieve the exact exchange where the CFO discussed margin trends.
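One way speaker-turn chunking could look, as a rough sketch. It assumes each turn in the transcript begins with a `Speaker Name: ` prefix, which real transcripts do not always follow; the regular expression, function name, and sample dialogue are all made up for illustration.

```python
import re

# Assumed turn format: a capitalized speaker label, a colon, then the speech.
TURN_START = re.compile(r"^([A-Z][^:]{0,60}):\s+(.*)$")


def chunk_by_speaker(transcript: str) -> list[dict]:
    """Return one chunk per speaker turn, with the speaker kept as metadata."""
    chunks: list[dict] = []
    for raw_line in transcript.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        match = TURN_START.match(line)
        if match:
            # A new speaker label starts a new chunk.
            chunks.append({"speaker": match.group(1), "content": match.group(2)})
        elif chunks:
            # Continuation lines belong to the current speaker's turn.
            chunks[-1]["content"] += " " + line
    return chunks


transcript = """Analyst: How should we think about gross margin next quarter?
CFO John Smith: We expect gross margin to stay near 50%.
That reflects a stable product mix."""

turns = chunk_by_speaker(transcript)
cfo_turns = [t for t in turns if t["speaker"].startswith("CFO")]
print(cfo_turns[0]["content"])
```

Because the speaker is stored alongside the text, a query about what the CFO said can filter on `speaker` first and rank only those turns.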
Metadata Enrichment
Financial metadata is the key to precise retrieval. Every chunk gets:
| Field | Example | Why It Matters |
|---|---|---|
| `ticker` | "NVDA" | Filter by company |
| `fiscal_period` | "Q3-2024" | Filter by time period |
| `filing_type` | "10-K" | Distinguish annual vs quarterly |
| `section_name` | "MD&A" | Filter by document section |
| `is_table` | true | Identify structured vs narrative data |
| `metrics` | ["revenue", "margin"] | Find chunks with specific financial metrics |
| `speaker` | "CFO John Smith" | Filter earnings call by speaker |
| `data_date` | "2024-10-28" | Data freshness tracking |
This metadata powers the structured filters in Module 3. Without it, you'd need to rely entirely on semantic search — which cannot reliably find "NVDA's Q3 2024 gross margin."
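To make the filtering idea concrete, here is a small in-memory sketch. The field names come from the table above; the `matches` helper and the sample records are hypothetical, and a real system would push these filters down to the database rather than loop in Python.

```python
from typing import Any


def matches(metadata: dict[str, Any], **filters: Any) -> bool:
    """True if every filter equals the stored value, or is contained in it
    when the stored value is a list (as with `metrics`)."""
    for key, wanted in filters.items():
        value = metadata.get(key)
        if isinstance(value, list):
            if wanted not in value:
                return False
        elif value != wanted:
            return False
    return True


# Made-up records that follow the metadata fields described above.
chunks = [
    {"content": "income statement table", "metadata": {
        "ticker": "NVDA", "fiscal_period": "Q3-2024", "is_table": True,
        "metrics": ["revenue", "gross_margin"], "data_date": "2024-10-28"}},
    {"content": "risk factor narrative", "metadata": {
        "ticker": "NVDA", "fiscal_period": "Q3-2024", "is_table": False,
        "metrics": [], "data_date": "2024-10-28"}},
    {"content": "older income statement", "metadata": {
        "ticker": "NVDA", "fiscal_period": "Q3-2023", "is_table": True,
        "metrics": ["revenue", "gross_margin"], "data_date": "2023-10-28"}},
]

hits = [
    c for c in chunks
    if matches(c["metadata"], ticker="NVDA", fiscal_period="Q3-2024",
               metrics="gross_margin")
]
print([c["content"] for c in hits])
```

The exact-match filter narrows the candidates to the one relevant table before any similarity ranking happens, which is the step semantic search alone cannot do reliably.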
Embedding Considerations for Financial Data
Financial text poses unique challenges for embedding models:
Number Blindness
Most embedding models treat numbers as tokens, not quantities. "Revenue of $4.2 billion" and "Revenue of $420 million" produce similar embeddings because the words around them are the same. The model doesn't understand that these are 10x different.
Mitigation: Store raw numbers as structured metadata, not just in the text. Use metadata filters for numerical queries, not just vector similarity.
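A minimal sketch of that mitigation: pull dollar amounts out of the text and normalize them to plain numbers that can be stored as metadata and compared directly. The regular expression covers only simple patterns such as "$4.2 billion" or "$4,210M"; it is an illustrative assumption, not a complete financial number parser.

```python
import re

SCALE = {"thousand": 1e3, "k": 1e3, "million": 1e6, "m": 1e6,
         "billion": 1e9, "b": 1e9}

# A dollar sign, digits with optional commas and decimals, optional unit.
AMOUNT = re.compile(
    r"\$\s?([\d,]+(?:\.\d+)?)\s?(thousand|million|billion|[kmb])?\b",
    re.IGNORECASE,
)


def extract_usd_amounts(text: str) -> list[float]:
    """Return every dollar amount found in `text`, scaled to plain dollars."""
    amounts = []
    for number, unit in AMOUNT.findall(text):
        value = float(number.replace(",", ""))
        amounts.append(value * SCALE.get(unit.lower(), 1.0))
    return amounts


big = extract_usd_amounts("Revenue of $4.2 billion")[0]
small = extract_usd_amounts("Revenue of $420 million")[0]
print(big, small, big > 10 * small * 0.99)
```

The two sentences may embed almost identically, but the extracted values differ by a factor of ten, so a numeric filter such as "revenue above $1 billion" separates them where vector similarity would not.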
Jargon Density
Financial text is dense with domain-specific terms: EBITDA, basis points, diluted EPS, free cash flow yield. General-purpose embedding models may not distinguish between "operating margin" and "gross margin" as effectively as a domain-specific model would.
Mitigation: Use a model fine-tuned on financial text if available. Otherwise, ensure your hybrid search (Module 3) combines embeddings with keyword matching for financial terms.
Table Embeddings
Embedding a financial table as plain text loses the structural relationships between rows and columns. A table like "Revenue: $4.2B | Net Income: $890M" flattened to text becomes ambiguous.
Mitigation: When embedding tables, prepend a natural language summary: "Income statement for NVDA Q3 2024 showing revenue of $4.2B, net income of $890M, gross margin of 50.0%." This gives the embedding model semantic context for the numerical data.
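The summary can be generated from metadata the chunk already carries. The sketch below builds the sentence from the example above and prepends it to the table text; the function name, its parameters, and the `highlights` dictionary are hypothetical.

```python
def table_embedding_text(table_text: str, ticker: str, period: str,
                         table_type: str, highlights: dict[str, str]) -> str:
    """Prepend a one-sentence natural language summary to a table chunk."""
    facts = ", ".join(f"{name} of {value}" for name, value in highlights.items())
    label = table_type.replace("_", " ").capitalize()
    summary = f"{label} for {ticker} {period} showing {facts}."
    # The summary goes first so the embedding model reads context before numbers.
    return f"{summary}\n\n{table_text}"


text_to_embed = table_embedding_text(
    table_text="| Revenue | $4.2B |\n| Net Income | $890M |",
    ticker="NVDA",
    period="Q3 2024",
    table_type="income_statement",
    highlights={"revenue": "$4.2B", "net income": "$890M", "gross margin": "50.0%"},
)
print(text_to_embed.splitlines()[0])
```

Only the text passed to the embedding model changes; the original table can still be stored unmodified in `content` so the numbers shown to the model at answer time stay exact.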
pgvector Storage
The finance_chunks table stores everything:
    -- The vector type and HNSW index come from the pgvector extension.
    CREATE EXTENSION IF NOT EXISTS vector;

    CREATE TABLE finance_chunks (
        id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
        document_id TEXT NOT NULL,
        chunk_index INT NOT NULL,
        content TEXT NOT NULL,
        embedding vector(384) NOT NULL,
        metadata JSONB NOT NULL,
        created_at TIMESTAMPTZ DEFAULT now()
    );

    -- Approximate nearest-neighbor index for cosine distance.
    CREATE INDEX ON finance_chunks
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);

Why HNSW over IVFFlat? HNSW generally gives a better speed-recall trade-off at query time and has no training step, so new rows can be inserted without periodic rebuilds. IVFFlat is faster to build and uses less memory, but its clusters are fixed at build time, so recall can degrade as data is inserted. For financial data that's continuously updated (new filings, market data), HNSW's insert-without-rebuild property is essential.
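For orientation, here is a sketch of how a query against `finance_chunks` might combine metadata filters with vector ranking. It only builds the SQL string and parameters; the `%s` placeholders assume a psycopg-style driver, and the function name and choice of filters are hypothetical. `<=>` is pgvector's cosine distance operator, matching the `vector_cosine_ops` index.

```python
def build_search_query(query_embedding: list[float], ticker: str,
                       fiscal_period: str, limit: int = 5) -> tuple[str, tuple]:
    """Return (sql, params) for a filtered cosine-similarity search."""
    # pgvector accepts a vector written as a '[x,y,z]' text literal.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    sql = """
        SELECT id, content, metadata,
               1 - (embedding <=> %s::vector) AS cosine_similarity
        FROM finance_chunks
        WHERE metadata->>'ticker' = %s
          AND metadata->>'fiscal_period' = %s
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """
    params = (vector_literal, ticker, fiscal_period, vector_literal, limit)
    return sql, params


# A 3-dimensional vector keeps the example short; the table expects 384.
sql, params = build_search_query([0.1, 0.2, 0.3], "NVDA", "Q3-2024")
print(params)
```

Passing values as parameters rather than formatting them into the SQL string keeps user-supplied tickers and periods from being interpreted as SQL.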
What You'll Build
This is chapter 2 of AI Finance Analyst.