
Validation & Enrichment

Catching Errors Before They Become Business Problems

Why Validation Matters

Extraction without validation is like a factory without quality control. The extractor says the invoice total is $12,345.67 — but is that right? Does it match the sum of line items? Is the vendor name in your approved vendor list? Is the date plausible?

In accounts payable, a single wrong extraction can mean an incorrect payment. At scale — 10,000 invoices per month — even a 1% error rate produces 100 wrong payments. Validation is your last line of defense before data enters downstream systems.

Four Validation Layers

Layer 1: Schema Validation

Schema validation checks each field independently against its type definition:

Check            | Example                             | Severity
Required fields  | Invoice must have a total           | Critical
Type checking    | Total must be a number, not "TBD"   | Error
Range checking   | Total must be positive, under $1M   | Warning
Pattern matching | Invoice number must match AAA-99999 | Warning

Schema validation catches obvious errors: missing required fields, wrong data types, values outside expected ranges. It's fast, stateless, and catches about 60% of extraction errors.
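These per-field checks are simple enough to sketch directly. The field names, pattern, and limits below are illustrative assumptions, not a fixed part of the pipeline:

```python
import re

def validate_schema(doc: dict) -> list[dict]:
    """Run independent per-field checks; return a list of findings."""
    findings = []

    # Required field: the invoice must have a total (critical).
    if "total" not in doc:
        findings.append({"check": "required", "field": "total", "severity": "critical"})
        return findings  # nothing else to check without a total

    total = doc["total"]
    # Type check: total must be numeric, not a string like "TBD" (error).
    if not isinstance(total, (int, float)):
        findings.append({"check": "type", "field": "total", "severity": "error"})
    # Range check: positive and under $1M (warning).
    elif not (0 < total < 1_000_000):
        findings.append({"check": "range", "field": "total", "severity": "warning"})

    # Pattern check: invoice number shaped like AAA-99999 (warning).
    if not re.fullmatch(r"[A-Z]{3}-\d{5}", doc.get("invoice_number", "")):
        findings.append({"check": "pattern", "field": "invoice_number", "severity": "warning"})

    return findings
```

Because these checks consult nothing beyond the document itself, they can run on every extraction with no lookups, which is what makes this layer fast and stateless.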

Layer 2: Cross-Field Validation

Cross-field checks validate relationships between fields:

  • Arithmetic: subtotal + tax = total (within rounding tolerance)
  • Date ordering: dueDate >= invoiceDate
  • Entity consistency: partyA != partyB in contracts
  • Reasonableness: contract term under 10 years, receipt under $10,000

These checks catch errors that single-field validation misses. A total of $1,180.00 passes schema validation (it's a positive number). But if the subtotal is $1,000.00 and the tax is $80.00, the expected total is $1,080.00 — the cross-field check catches the $100 discrepancy.
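The arithmetic check can be sketched as a small helper; the $0.01 default tolerance is an assumption, not a value from the chapter:

```python
def check_totals(subtotal: float, tax: float, total: float,
                 tolerance: float = 0.01) -> bool:
    """Cross-field check: subtotal + tax must equal total within a rounding tolerance."""
    return abs((subtotal + tax) - total) <= tolerance

# The example from the text: $1,000.00 + $80.00 should be $1,080.00,
# so an extracted total of $1,180.00 fails the check.
```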

Layer 3: Data Normalization

Normalization isn't validation per se, but it prevents downstream errors:

  • Dates: "March 15, 2024", "03/15/2024", and "2024-03-15" all become "2024-03-15"
  • Currency: "$1,234.56" becomes 1234.56 (number type)
  • Booleans: "Yes", "TRUE", "true" all become "true"
  • Whitespace: leading/trailing spaces stripped

Without normalization, two identical invoices might produce different extracted data because one vendor formats dates as "MM/DD/YYYY" and another as "YYYY-MM-DD." Normalization ensures consistency.
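A minimal normalization sketch along these lines; the accepted date formats and currency symbols are assumptions, and real documents would need a longer list:

```python
from datetime import datetime

# Input formats we expect to see (an illustrative, not exhaustive, list).
DATE_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"]

def normalize_date(raw: str) -> str:
    """Try each known input format; emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def normalize_currency(raw: str) -> float:
    """Strip currency symbols and thousands separators: '$1,234.56' -> 1234.56."""
    return float(raw.strip().lstrip("$€£").replace(",", ""))

def normalize_bool(raw: str) -> str:
    """'Yes', 'TRUE', 'true' all map to the string 'true'."""
    return "true" if raw.strip().lower() in {"yes", "true", "y"} else "false"
```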

Layer 4: Confidence Aggregation

The final layer combines signals from classification, extraction, and validation into a single confidence score:

overall = classification_confidence * 0.20
        + extraction_confidence   * 0.50
        + validation_score        * 0.30

Extraction confidence gets the highest weight because it directly measures data quality. The validation score starts at 1.0 and is penalized for each error (critical: -0.30, error: -0.15, warning: -0.05).
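The aggregation and penalty scheme could be implemented like this (function and field names are illustrative):

```python
# Penalty per validation finding, by severity.
PENALTIES = {"critical": 0.30, "error": 0.15, "warning": 0.05}

def validation_score(findings: list[dict]) -> float:
    """Start at 1.0 and subtract a penalty per finding, floored at 0."""
    score = 1.0
    for f in findings:
        score -= PENALTIES[f["severity"]]
    return max(score, 0.0)

def overall_confidence(classification: float, extraction: float,
                       findings: list[dict]) -> float:
    """Weighted combination; extraction carries the most weight."""
    return (classification * 0.20
            + extraction * 0.50
            + validation_score(findings) * 0.30)
```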

The Routing Decision

The aggregated confidence maps to one of three outcomes:

Confidence | Critical Errors | Action
>= 0.85    | 0               | Auto-accept — data goes to output
>= 0.60    | <= 1            | Review — routed to human queue
< 0.60     | any             | Reject — manual processing

These thresholds are tunable. A conservative pipeline (financial services, healthcare) might set auto-accept at 0.95. An aggressive pipeline (marketing, internal docs) might accept at 0.75.
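A routing function with tunable thresholds might look like this; the defaults follow the table above, and the parameterization is an assumption:

```python
def route(confidence: float, critical_errors: int,
          accept_at: float = 0.85, review_at: float = 0.60) -> str:
    """Map aggregated confidence and critical-error count to an outcome.

    Thresholds are parameters so a conservative pipeline can raise
    accept_at (e.g. to 0.95) without touching the logic.
    """
    if confidence >= accept_at and critical_errors == 0:
        return "auto-accept"
    if confidence >= review_at and critical_errors <= 1:
        return "review"
    return "reject"
```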

Validation Rule Design

Good validation rules follow three principles:

1. Severity Reflects Business Impact

A missing vendor name is "critical" because you can't process a payment without knowing who to pay. A non-standard invoice number format is a "warning" because the payment can still be processed.

2. Rules Are Documented and Testable

Every rule has an ID, a human-readable description, and a formal expression. This makes it possible to audit the validation logic — regulators can review the rules without reading code.

3. Rules Evolve With Data

New vendors, new document formats, and new business rules all require validation updates. The rule set should live in a configuration file (like validation-rules.json), not hard-coded in source.
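Such a file might contain entries like the following; the exact schema shown here is an illustration, not a fixed format:

```json
{
  "rules": [
    {
      "id": "INV-ARITH-001",
      "description": "Subtotal plus tax must equal total within $0.01",
      "expression": "abs(subtotal + tax - total) <= 0.01",
      "severity": "error"
    },
    {
      "id": "INV-DATE-002",
      "description": "Due date must not precede the invoice date",
      "expression": "dueDate >= invoiceDate",
      "severity": "error"
    }
  ]
}
```

Because each rule pairs an ID and description with its expression, an auditor can review the list without reading pipeline code.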

Enrichment

Validation catches errors. Enrichment adds information. Common enrichment steps:

  • Vendor lookup: Match the extracted vendor name against your vendor master list. Add vendor ID, payment terms, and bank details.
  • Currency conversion: If the invoice is in EUR but your system uses USD, apply the exchange rate for the invoice date.
  • Duplicate detection: Check if this invoice number from this vendor was already processed. Duplicate invoices are a common fraud vector.
  • GL coding: Based on the line item descriptions and vendor category, suggest general ledger account codes.

Enrichment turns extracted data into actionable data. The extracted fields are "what the document says." The enriched fields are "what the business needs."
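Duplicate detection, for instance, can be sketched as a keyed lookup; the in-memory set below stands in for whatever store a real pipeline would use:

```python
def is_duplicate(vendor_id: str, invoice_number: str,
                 seen: set[tuple[str, str]]) -> bool:
    """Flag a repeat of the same invoice number from the same vendor.

    Keying on (vendor, invoice number) rather than invoice number alone
    avoids false positives when two vendors reuse the same numbering scheme.
    """
    key = (vendor_id, invoice_number)
    if key in seen:
        return True
    seen.add(key)
    return False
```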

The Feedback Loop

Validation errors are training signals. If the same field fails validation repeatedly (e.g., a vendor's tax calculation is always off by a penny due to rounding), that tells you to update either the extraction template or the validation rule. Module 6 builds the automated feedback loop that uses validation failures to improve the pipeline.

This is chapter 4 of AI Document Processing.
