
Encoding Pipeline

Table-Aware Chunking & Embeddings

Why Financial Chunking Is Different

Standard RAG chunking splits text at fixed token boundaries — 512 tokens, slide forward by 50, repeat. This works for blog posts and documentation, but it destroys financial data.

Consider an income statement:

| | Q3 2024 | Q3 2023 | Change |
|---|---|---|---|
| Revenue | $4,210M | $3,890M | +8.2% |
| Cost of Revenue | $2,105M | $1,984M | +6.1% |
| Gross Profit | $2,105M | $1,906M | +10.4% |
| Gross Margin | 50.0% | 49.0% | +100bps |

If your chunker splits this table at row 3, the AI sees "Revenue: $4,210M" in one chunk and "Gross Margin: 50.0%" in another. It cannot verify that the margin is correct (Gross Profit / Revenue = 50.0%). Worse, it might hallucinate a margin by dividing numbers from different chunks.

Rule: Financial tables must be kept intact as single chunks.

Three Chunking Strategies

1. Table-Aware Chunking

Detect financial tables in filings and preserve them as atomic units. A table is detected by:

  • Multiple columns of numbers with $, %, or M/B/K suffixes
  • Row headers matching known financial line items (Revenue, Net Income, EPS, etc.)
  • Consistent column alignment or delimiter patterns

Each table becomes one chunk with metadata:

  • is_table: true
  • table_type: "income_statement" | "balance_sheet" | "cash_flow" | "other"
  • period: The fiscal period(s) covered
  • metrics: Array of metric names found (["revenue", "gross_margin", "ebitda"])
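The detection rules above can be sketched in a few lines of Python. The `LINE_ITEMS` set and the two-row threshold are illustrative assumptions; a production pipeline would use a much larger dictionary of line items and a real table parser.

```python
import re

# Assumption: a short illustrative list -- production would use a much
# larger dictionary of known financial line items.
LINE_ITEMS = {"revenue", "cost of revenue", "gross profit", "gross margin",
              "operating income", "net income", "eps", "ebitda"}

# Matches numeric cells like "$4,210M", "50.0%", "+8.2%".
NUMBER = re.compile(r"[$]?\d[\d,.]*[%MBKmbk]?")

def _label(line):
    """Row header = text before the first column break (2+ spaces, tab, pipe)."""
    return re.split(r"\s{2,}|\t|\|", line.strip())[0].lower()

def looks_like_financial_table(lines):
    """Heuristic: at least two rows whose header is a known line item
    and that carry multiple numeric columns."""
    hits = sum(1 for line in lines
               if _label(line) in LINE_ITEMS and len(NUMBER.findall(line)) >= 2)
    return hits >= 2

def table_chunk(lines, period):
    """Wrap a detected table as one atomic chunk with the metadata fields above."""
    labels = [_label(line) for line in lines]
    return {
        "content": "\n".join(lines),
        "metadata": {
            "is_table": True,
            # Crude type guess for the sketch: presence of a revenue row.
            "table_type": "income_statement" if "revenue" in labels else "other",
            "period": period,
            "metrics": [l.replace(" ", "_") for l in labels if l in LINE_ITEMS],
        },
    }
```

The whole table goes into `content` unmodified, so a retrieved chunk always carries every row needed to verify a derived figure like gross margin.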
2. Narrative Chunking

For non-table content (MD&A, risk factors, executive commentary), use section-aware chunking:

  • Split at section headers (## Risk Factors, ## Liquidity, etc.)
  • If a section exceeds 512 tokens, split at paragraph boundaries within the section
  • Keep a 50-token overlap between chunks to maintain context continuity
  • Preserve the section name in metadata for filtering
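A minimal sketch of this narrative chunker, with two stated simplifications: tokens are approximated by whitespace-split words (a real pipeline would count with the embedding model's tokenizer), and a paragraph longer than the budget is split at word boundaries rather than held intact.

```python
def narrative_chunks(text, max_tokens=512, overlap=50):
    """Section-aware chunking sketch. Tokens are approximated by whitespace
    words; a real pipeline would use the embedding model's tokenizer."""
    chunks, section = [], "preamble"
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("## "):           # section header starts a new section
            section = block.lstrip("# ").strip()
            continue
        words = block.split()
        start = 0
        while start < len(words):
            piece = words[start:start + max_tokens]
            chunks.append({"content": " ".join(piece),
                           "metadata": {"section_name": section}})
            if start + max_tokens >= len(words):
                break
            start += max_tokens - overlap     # keep a 50-token overlap
    return chunks
```

Every chunk carries its `section_name`, so a later query can be restricted to, say, Risk Factors without re-parsing the filing.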
3. Speaker-Turn Chunking

For earnings call Q&A sections:

  • Each analyst question + executive answer = one chunk
  • Preserve speaker names and titles in metadata
  • Tag with topic if detectable (e.g., "margins", "guidance", "competition")

This means a query like "What did the CFO say about margins?" can retrieve the exact exchange where the CFO discussed margin trends.
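A sketch of the speaker-turn pairing, assuming the transcript has already been parsed into ordered `(speaker, title, text)` turns and that analyst turns are recognizable from the title (both assumptions; real call transcripts need more robust parsing):

```python
def speaker_turn_chunks(transcript):
    """Pair each analyst question with the executive answer that follows.
    `transcript` is a list of (speaker, title, text) turns in order."""
    chunks, pending = [], None
    for speaker, title, text in transcript:
        if "analyst" in title.lower():
            pending = (speaker, title, text)   # hold the question
        elif pending:
            q_speaker, _, question = pending
            chunks.append({
                "content": f"Q ({q_speaker}): {question}\nA ({speaker}): {text}",
                "metadata": {"speaker": f"{title} {speaker}",
                             "analyst": q_speaker},
            })
            pending = None
    return chunks
```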

Metadata Enrichment

Financial metadata is the key to precise retrieval. Every chunk gets:

| Field | Example | Why It Matters |
|---|---|---|
| `ticker` | "NVDA" | Filter by company |
| `fiscal_period` | "Q3-2024" | Filter by time period |
| `filing_type` | "10-K" | Distinguish annual vs. quarterly |
| `section_name` | "MD&A" | Filter by document section |
| `is_table` | true | Identify structured vs. narrative data |
| `metrics` | ["revenue", "margin"] | Find chunks with specific financial metrics |
| `speaker` | "CFO John Smith" | Filter earnings calls by speaker |
| `data_date` | "2024-10-28" | Track data freshness |

This metadata powers the structured filters in Module 3. Without it, you'd need to rely entirely on semantic search — which cannot reliably find "NVDA's Q3 2024 gross margin."
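In-memory, the structured filtering amounts to exact matches on these fields before any vector math; in pgvector the same conditions become WHERE clauses on the JSONB column. A minimal sketch (the function name and chunk shape are illustrative assumptions):

```python
def filter_chunks(chunks, **required):
    """Exact-match metadata pre-filter applied before any vector search."""
    return [chunk for chunk in chunks
            if all(chunk["metadata"].get(key) == value
                   for key, value in required.items())]
```

For example, `filter_chunks(chunks, ticker="NVDA", fiscal_period="Q3-2024", is_table=True)` narrows retrieval to exactly the tables that can answer a Q3 2024 margin question.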

Embedding Considerations for Financial Data

Financial text poses unique challenges for embedding models:

Number Blindness

Most embedding models treat numbers as tokens, not quantities. "Revenue of $4.2 billion" and "Revenue of $420 million" produce similar embeddings because the words around them are the same. The model doesn't understand that these figures differ by 10x.

Mitigation: Store raw numbers as structured metadata, not just in the text. Use metadata filters for numerical queries, not just vector similarity.
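Getting numbers into metadata means parsing them out of the prose as real quantities. A sketch of such an extractor (the regex and scale table are assumptions covering the common "$4.2 billion" / "$4,210M" forms, not an exhaustive parser):

```python
import re

_SCALE = {"thousand": 1e3, "k": 1e3, "million": 1e6, "m": 1e6,
          "billion": 1e9, "b": 1e9}

# Dollar amount, optional decimal, optional scale word or M/B/K suffix.
_DOLLARS = re.compile(r"\$([\d,]+(?:\.\d+)?)\s*(billion|million|thousand|[bmk])?\b",
                      re.IGNORECASE)

def extract_dollar_amounts(text):
    """Turn dollar figures in narrative text into real quantities so they can
    be stored in metadata instead of living only in the embedded string."""
    amounts = []
    for match in _DOLLARS.finditer(text):
        value = float(match.group(1).replace(",", ""))
        scale = _SCALE.get((match.group(2) or "").lower(), 1)
        amounts.append(value * scale)
    return amounts
```

With the quantities stored numerically, "revenue above $4B" becomes a metadata comparison the embedding model never has to get right.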

Jargon Density

Financial text is dense with domain-specific terms: EBITDA, basis points, diluted EPS, free cash flow yield. General-purpose embedding models may not distinguish "operating margin" from "gross margin" as effectively as a domain-specific model would.

Mitigation: Use a model fine-tuned on financial text if available. Otherwise, ensure your hybrid search (Module 3) combines embeddings with keyword matching for financial terms.

Table Embeddings

Embedding a financial table as plain text loses the structural relationships between rows and columns. A table flattened to "Revenue: $4.2B | Net Income: $890M" becomes ambiguous.

Mitigation: When embedding tables, prepend a natural-language summary: "Income statement for NVDA Q3 2024 showing revenue of $4.2B, net income of $890M, gross margin of 50.0%." This gives the embedding model semantic context for the numerical data.
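The summary-prepending step can be as simple as string assembly. A sketch, assuming the table rows have already been parsed into `(metric, value)` pairs (the function name and row shape are illustrative):

```python
def table_embedding_text(ticker, period, rows):
    """Prepend a natural-language summary to a table before embedding.
    `rows` is a list of (metric, value) pairs parsed out of the table."""
    summary = ", ".join(f"{metric} of {value}" for metric, value in rows)
    preamble = f"Income statement for {ticker} {period} showing {summary}."
    table_text = "\n".join(f"{metric}: {value}" for metric, value in rows)
    return preamble + "\n" + table_text
```

The returned string, not the raw table, is what goes to the embedding model; the raw table still lives in `content` for the generator to cite.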

pgvector Storage

The finance_chunks table stores everything:

```sql
CREATE TABLE finance_chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id TEXT NOT NULL,
  chunk_index INT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(384) NOT NULL,
  metadata JSONB NOT NULL,
  created_at TIMESTAMPTZ DEFAULT now()
);

CREATE INDEX ON finance_chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```

Why HNSW over IVFFlat? HNSW delivers consistent sub-millisecond queries without needing periodic rebuilds. IVFFlat is faster to build, but its recall degrades as data is inserted because the cluster centroids are fixed at index-build time. For financial data that's continuously updated (new filings, market data), HNSW's insert-without-rebuild property is essential.
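At query time, the HNSW index and the JSONB metadata combine in a single statement. A sketch, assuming the `finance_chunks` schema above and a 384-dim query embedding bound as parameter `$1`:

```sql
-- Top-5 most similar chunks, restricted to NVDA's Q3-2024 tables.
SELECT content, metadata,
       1 - (embedding <=> $1) AS similarity   -- <=> is cosine distance
FROM finance_chunks
WHERE metadata->>'ticker' = 'NVDA'
  AND metadata->>'fiscal_period' = 'Q3-2024'
  AND (metadata->>'is_table')::boolean = true
ORDER BY embedding <=> $1
LIMIT 5;
```

One caveat: pgvector applies WHERE filters to candidates produced by the index scan, so highly selective filters can return fewer than `LIMIT` rows on older versions; pgvector 0.8+ adds iterative index scans to compensate.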

What You'll Build

  • Table detection heuristic that identifies income statements, balance sheets, and cash flow tables
  • Table-aware chunker that keeps financial tables as atomic chunks
  • Speaker-turn chunker for earnings call Q&A sections
  • Metadata enrichment pipeline with financial fields
  • Embedding generation with table summarization
  • pgvector storage with HNSW indexing

This is chapter 2 of AI Finance Analyst.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.

View course details