Data Lake

Ingest & Normalize Financial Data

Why a Financial Data Lake?

Financial AI systems operate on data that is fundamentally different from typical enterprise data. SEC filings follow strict regulatory formats. Earnings transcripts mix structured Q&A with free-form executive commentary. Market data changes by the second. Analyst notes blend quantitative ratings with qualitative thesis arguments.

Before an AI can analyze any of this, you need a unified ingestion pipeline that normalizes these disparate sources into a common format — while preserving the financial metadata that makes each document useful.

The Five Source Types

SEC Filings (10-K, 10-Q)

The gold standard of financial data. These documents are filed with the Securities and Exchange Commission (SEC) and contain financial statements (audited in the 10-K, unaudited in the 10-Q), management discussion and analysis (MD&A), risk factors, and forward guidance.

Key challenges:

  • Tables everywhere — Income statements, balance sheets, and cash flow statements are dense numerical grids
  • XBRL tagging — SEC requires structured data markup, but quality varies wildly between filers
  • Fiscal year misalignment — Some companies end their fiscal year in June, others in December. Your pipeline must normalize to comparable periods
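
One simple way to approach the fiscal-year misalignment above is to bucket each fiscal period by the calendar quarter in which it ends. The sketch below is only an illustration: the helper name `calendar_quarter` and the `CQ` label format are assumptions, not part of the course code.

```python
from datetime import date


def calendar_quarter(period_end: date) -> str:
    """Return the calendar quarter that contains a fiscal period's end date.

    Hypothetical helper: labels look like "2024-CQ2" so they are not
    confused with a company's own fiscal quarter labels.
    """
    quarter = (period_end.month - 1) // 3 + 1
    return f"{period_end.year}-CQ{quarter}"


# A June fiscal year-end and a December filer's second quarter both end
# on 30 June, so they land in the same comparable bucket.
june_filer_fy_end = calendar_quarter(date(2024, 6, 30))
december_filer_q2 = calendar_quarter(date(2024, 6, 30))
same_bucket = june_filer_fy_end == december_filer_q2
```

Bucketing by end month is coarse: a period that ends in late January is labeled as the first calendar quarter even though most of it falls in the previous one, so a production pipeline would likely need a rule based on period overlap.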

Earnings Call Transcripts

Quarterly calls where executives present results and analysts ask questions. These are uniquely valuable because they capture:

  • Management sentiment — Tone, emphasis, and hedging language that doesn't appear in filings
  • Analyst concerns — The questions analysts ask reveal what the market is focused on
  • Forward guidance — CEOs make verbal commitments that often aren't in the 10-Q

The Q&A section is particularly important. Each question-answer pair is a natural unit of information that should be preserved during chunking.
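
Keeping question-answer pairs together during chunking could look roughly like the following. This is a sketch under assumptions: the transcript is taken to be a list of `(role, text)` turns with roles `"analyst"` and `"executive"`, which is an illustrative format and not a real transcript schema, and `QAPair` is a hypothetical type.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class QAPair:
    question: str
    answer: str


def pair_questions_with_answers(turns: List[Tuple[str, str]]) -> List[QAPair]:
    """Group Q&A turns so each chunk holds one question and its answer."""
    pairs: List[QAPair] = []
    question_parts: List[str] = []
    answers: List[str] = []
    for role, text in turns:
        if role == "analyst":
            if answers:
                # A new question closes the previous pair.
                pairs.append(QAPair(" ".join(question_parts), " ".join(answers)))
                question_parts, answers = [], []
            question_parts.append(text)
        elif question_parts:
            # Executive remarks only count once a question has been asked,
            # so prepared remarks before the Q&A are skipped here.
            answers.append(text)
    if question_parts and answers:
        pairs.append(QAPair(" ".join(question_parts), " ".join(answers)))
    return pairs


turns = [
    ("executive", "Thanks for joining."),
    ("analyst", "How should we think about margins?"),
    ("executive", "We expect them to be stable."),
    ("executive", "Input costs have eased."),
    ("analyst", "Any update on capex?"),
    ("executive", "No change to our plan."),
]
pairs = pair_questions_with_answers(turns)
```

Answers from several executives to the same question are merged into one chunk, so a retrieved answer always arrives with the question that prompted it.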

Market Data

Real-time and historical price data, market capitalization, P/E ratios, beta, and trading volumes. This data is:

  • Highly structured — Clean numerical fields with standard schemas
  • Time-sensitive — A stock price from yesterday is stale for today's analysis
  • Computationally useful — Ratios and derived metrics can be calculated directly
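
The last two properties lend themselves to small helpers: one that flags stale quotes and one that derives a ratio from raw fields. A minimal sketch, where the function names and the 15-minute threshold are assumptions chosen for illustration:

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# How old a quote may be before it counts as stale. The 15-minute value
# is an arbitrary assumption; choose a threshold per use case.
MAX_QUOTE_AGE = timedelta(minutes=15)


def is_stale(as_of: datetime, now: datetime,
             max_age: timedelta = MAX_QUOTE_AGE) -> bool:
    """True when the quote timestamp is older than max_age."""
    return now - as_of > max_age


def price_to_earnings(price: float, eps: float) -> Optional[float]:
    """P/E ratio, or None when earnings per share are zero or negative."""
    if eps <= 0:
        return None
    return price / eps


now = datetime(2024, 6, 3, 15, 0, tzinfo=timezone.utc)
yesterday_quote_is_stale = is_stale(now - timedelta(days=1), now)
pe = price_to_earnings(price=120.0, eps=4.0)
```

Returning `None` for non-positive earnings avoids storing a misleading negative or infinite ratio alongside real ones.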

Internal Reports

Quarterly business reviews, competitive analyses, risk assessments, and strategic memos. These documents provide context that public filings cannot: internal margins by product line, pipeline forecasts, and customer concentration analysis.

Analyst Notes

Research notes from sell-side and buy-side analysts. Each note typically contains a rating (Buy/Hold/Sell), price target, investment thesis, key risks, and catalysts. Normalizing ratings across different analyst firms is a non-trivial data quality problem: one firm's "Overweight" is another's "Buy."
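
A lookup table is one plausible starting point for rating normalization. The mapping below is purely illustrative: firms define their scales differently, so both the table and the helper name `normalize_rating` are assumptions rather than an industry standard.

```python
from typing import Optional

# Illustrative mapping onto the Buy/Hold/Sell scale (an assumption).
RATING_MAP = {
    "buy": "Buy",
    "strong buy": "Buy",
    "overweight": "Buy",
    "outperform": "Buy",
    "hold": "Hold",
    "neutral": "Hold",
    "equal weight": "Hold",
    "market perform": "Hold",
    "sell": "Sell",
    "underweight": "Sell",
    "underperform": "Sell",
}


def normalize_rating(raw: str) -> Optional[str]:
    """Map a firm-specific rating to Buy/Hold/Sell, or None if unknown."""
    # Collapse case, hyphens, and extra whitespace before the lookup.
    key = " ".join(raw.replace("-", " ").lower().split())
    return RATING_MAP.get(key)


normalized = normalize_rating("Overweight")
```

Unknown labels return `None` instead of a guess, so they can be routed to manual review rather than silently mislabeled.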

Key Concepts

The Document Interface

Every piece of financial data, whether it's a 10-K filing or an analyst note, becomes a Document with:

  • id — unique identifier (e.g., "NVDA-10Q-2024-Q3")
  • content — the actual text
  • metadata — ticker, fiscal_period, filing_type, date, section_name, currency
  • source_type — which loader produced it

Financial metadata is richer than general enterprise data. A CRM record needs an account name; a filing needs a ticker, fiscal period, filing type, section name, and filing date. Getting this metadata right is critical because it powers the structured filters in the retrieval system.
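
The four fields listed above could be modeled as a small dataclass. This is a sketch, not the course's actual definition: the choice of Python, the `REQUIRED_KEYS` tuple, the `missing_metadata` helper, and the "ACME" sample values are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Document:
    id: str            # unique identifier
    content: str       # the actual text
    source_type: str   # which loader produced it
    metadata: Dict[str, str] = field(default_factory=dict)


# Which keys are mandatory is an assumption made for this sketch.
REQUIRED_KEYS = ("ticker", "fiscal_period", "filing_type", "date")


def missing_metadata(doc: Document) -> List[str]:
    """Return the required metadata keys that are absent or empty."""
    return [key for key in REQUIRED_KEYS if not doc.metadata.get(key)]


doc = Document(
    id="ACME-10Q-2024-Q3",
    content="Revenue grew on higher volumes.",
    source_type="sec_filing",
    metadata={
        "ticker": "ACME",
        "fiscal_period": "2024-Q3",
        "filing_type": "10-Q",
        "date": "2024-11-05",
        "section_name": "MD&A",
        "currency": "USD",
    },
)
```

Checking for missing keys at ingestion time keeps gaps from surfacing later as retrieval filters that silently match nothing.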

Loaders

One loader per data source. Each financial loader knows how to:

  • Parse its specific format (JSON filings, transcript text, market data snapshots)
  • Extract financial metadata (ticker symbols, fiscal periods, filing dates)
  • Normalize currencies and date formats
  • Validate that financial figures are reasonable (no negative revenue, margins between -100% and 100%)
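
The four responsibilities above might come together in a loader like the one sketched here. The JSON schema, the `FilingLoader` class, the `MM/DD/YYYY` input date format, and the "USD" default are all hypothetical; the course's real loaders may differ.

```python
import json
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List


@dataclass
class Document:
    id: str
    content: str
    source_type: str
    metadata: Dict[str, str] = field(default_factory=dict)


class FilingLoader:
    """Hypothetical loader for filings stored as JSON (schema is assumed)."""

    source_type = "sec_filing"

    def load(self, raw_json: str) -> List[Document]:
        filing = json.loads(raw_json)  # 1. parse the source format

        # 2. extract financial metadata
        ticker = filing["ticker"].strip().upper()
        period = filing["fiscal_period"]

        # 3. normalize dates to ISO 8601 (input assumed to be MM/DD/YYYY);
        #    currency is only labeled here, not converted
        filed = datetime.strptime(filing["filed"], "%m/%d/%Y").date().isoformat()

        # 4. reject figures that cannot be right
        if filing["revenue"] < 0:
            raise ValueError(f"{ticker} {period}: negative revenue")

        return [
            Document(
                id=f"{ticker}-{filing['filing_type']}-{period}-{i}",
                content=section["text"],
                source_type=self.source_type,
                metadata={
                    "ticker": ticker,
                    "fiscal_period": period,
                    "filing_type": filing["filing_type"],
                    "date": filed,
                    "section_name": section["name"],
                    "currency": filing.get("currency", "USD"),
                },
            )
            for i, section in enumerate(filing["sections"])
        ]


sample = json.dumps({
    "ticker": " acme ", "filing_type": "10-Q", "fiscal_period": "2024-Q3",
    "filed": "11/05/2024", "revenue": 1250.0,
    "sections": [{"name": "MD&A", "text": "Revenue grew on higher volumes."}],
})
docs = FilingLoader().load(sample)
```

Each section of the filing becomes its own Document, so section-level metadata such as `section_name` stays attached to the text it describes.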

Financial Validation

Financial data validation goes beyond schema checks:

  • Accounting identity checks — Does Revenue - Expenses = Net Income (approximately)?
  • Temporal consistency — Is Q3 data actually from Q3 dates?
  • Cross-source consistency — Does the revenue in the 10-K match what was reported in the earnings call?
  • Freshness tracking — When was this data last updated? Is it stale?
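
Three of these checks reduce to short comparisons once a tolerance is chosen. The sketch below uses `math.isclose` from the standard library; the function names and the tolerance defaults (1% and 0.5%) are assumptions picked for illustration.

```python
import math
from datetime import date


def identity_holds(revenue: float, expenses: float, net_income: float,
                   rel_tol: float = 0.01) -> bool:
    """Accounting identity: revenue - expenses should roughly equal net income."""
    return math.isclose(revenue - expenses, net_income, rel_tol=rel_tol)


def in_period(observed: date, period_start: date, period_end: date) -> bool:
    """Temporal consistency: the date must fall inside the stated period."""
    return period_start <= observed <= period_end


def sources_agree(filing_value: float, call_value: float,
                  rel_tol: float = 0.005) -> bool:
    """Cross-source consistency, tolerant of figures rounded on a call."""
    return math.isclose(filing_value, call_value, rel_tol=rel_tol)


identity_ok = identity_holds(revenue=1000.0, expenses=800.0, net_income=201.0)
q3_ok = in_period(date(2024, 8, 15), date(2024, 7, 1), date(2024, 9, 30))
rounded_ok = sources_agree(12_345_000_000, 12_300_000_000)
```

A relative tolerance is used because executives often quote rounded figures on calls, so an exact equality check would flag nearly every pair of sources.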

Architecture Pattern

    10-K/10-Q ──→ Filing Loader ────────┐
    Transcripts ─→ Transcript Loader ───┤
    Market Data ─→ Market Loader ───────┤──→ Validate ──→ Document[]
    Reports ─────→ Report Loader ───────┤
    Analyst Notes → Analyst Loader ─────┘

Each loader is independent. Adding a new source (Bloomberg feeds, Reuters data, XBRL structured data) means writing one new loader; nothing else changes.
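
One way to get this independence is a registry that maps each source type to its loader, so the pipeline never names a specific loader. The sketch below is an assumption about how such a pipeline could be wired, not the course's implementation; Documents are reduced to plain dicts and `validate` to a trivial check to keep it short.

```python
from typing import Callable, Dict, List, Tuple

# A Document is reduced to a plain dict here to keep the sketch short.
Document = Dict[str, str]
Loader = Callable[[str], List[Document]]

LOADERS: Dict[str, Loader] = {}


def register(source_type: str) -> Callable[[Loader], Loader]:
    """Decorator that adds a loader to the registry under its source type."""
    def decorator(loader: Loader) -> Loader:
        LOADERS[source_type] = loader
        return loader
    return decorator


@register("transcript")
def load_transcript(raw: str) -> List[Document]:
    return [{"source_type": "transcript", "content": raw}]


@register("analyst_note")
def load_analyst_note(raw: str) -> List[Document]:
    return [{"source_type": "analyst_note", "content": raw}]


def validate(doc: Document) -> bool:
    """Placeholder validation: keep only documents with non-blank content."""
    return bool(doc["content"].strip())


def ingest(items: List[Tuple[str, str]]) -> List[Document]:
    """Route each raw item to its loader, then keep the valid documents."""
    documents: List[Document] = []
    for source_type, raw in items:
        for doc in LOADERS[source_type](raw):
            if validate(doc):
                documents.append(doc)
    return documents


documents = ingest([
    ("transcript", "Operator: Good afternoon."),
    ("analyst_note", "   "),  # blank content, dropped by validate
])
```

Supporting a new source then means adding one decorated function; `ingest` and `validate` stay untouched.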

What You'll Build

  • Run the pre-seeded ingestion pipeline and see 28 documents flow through
  • Explore each financial data format and understand the parsing patterns
  • Walk through the loader code and the Document interface with financial metadata
  • Extend the pipeline with improved validation or a new financial data source

Glossary

Term            Meaning
10-K            Annual SEC filing with audited financial statements
10-Q            Quarterly SEC filing (unaudited)
MD&A            Management Discussion & Analysis section of a filing
Fiscal Period   The reporting period (Q1-Q4, FY); may not align with the calendar year
XBRL            eXtensible Business Reporting Language, a format for structured financial data
Ticker          Stock symbol identifying a publicly traded company

This is chapter 1 of AI Finance Analyst.