
Validation & Enrichment

Catching Errors Before They Become Business Problems

Why Validation Matters

Extraction without validation is like a factory without quality control. The extractor says the invoice total is $12,345.67 — but is that right? Does it match the sum of line items? Is the vendor name in your approved vendor list? Is the date plausible?

In accounts payable, a single wrong extraction can mean an incorrect payment. At scale — 10,000 invoices per month — even a 1% error rate produces 100 wrong payments. Validation is your last line of defense before data enters downstream systems.

Four Validation Layers

Layer 1: Schema Validation

Schema validation checks each field independently against its type definition:

Check            | Example                             | Severity
Required fields  | Invoice must have a total           | Critical
Type checking    | Total must be a number, not "TBD"   | Error
Range checking   | Total must be positive, under $1M   | Warning
Pattern matching | Invoice number must match AAA-99999 | Warning

Schema validation catches obvious errors: missing required fields, wrong data types, values outside expected ranges. It's fast, stateless, and catches about 60% of extraction errors.
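These per-field checks are simple enough to sketch directly. The field names, pattern, and limits below are illustrative assumptions, not a fixed part of the pipeline:

```python
import re

def validate_schema(doc: dict) -> list[dict]:
    """Run independent per-field checks; return a list of findings."""
    findings = []

    # Required field: the invoice must have a total (critical).
    if "total" not in doc:
        findings.append({"check": "required", "field": "total", "severity": "critical"})
        return findings  # nothing else to check without a total

    total = doc["total"]
    # Type check: total must be numeric, not a string like "TBD" (error).
    if not isinstance(total, (int, float)):
        findings.append({"check": "type", "field": "total", "severity": "error"})
    # Range check: positive and under $1M (warning).
    elif not (0 < total < 1_000_000):
        findings.append({"check": "range", "field": "total", "severity": "warning"})

    # Pattern check: invoice number shaped like AAA-99999 (warning).
    if not re.fullmatch(r"[A-Z]{3}-\d{5}", doc.get("invoice_number", "")):
        findings.append({"check": "pattern", "field": "invoice_number", "severity": "warning"})

    return findings
```

Because these checks consult nothing beyond the document itself, they can run on every extraction with no lookups, which is what makes this layer fast and stateless.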

Layer 2: Cross-Field Validation

Cross-field checks validate relationships between fields:

  • Arithmetic: subtotal + tax = total (within rounding tolerance)
  • Date ordering: dueDate >= invoiceDate
  • Entity consistency: partyA != partyB in contracts
  • Reasonableness: contract term under 10 years, receipt under $10,000

These checks catch errors that single-field validation misses. A total of $1,180.00 passes schema validation (it's a positive number). But if the subtotal is $1,000.00 and the tax is $80.00, the expected total is $1,080.00 — the cross-field check catches the $100 discrepancy.
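The arithmetic check can be sketched as a small helper; the $0.01 default tolerance is an assumption, not a value from the chapter:

```python
def check_totals(subtotal: float, tax: float, total: float,
                 tolerance: float = 0.01) -> bool:
    """Cross-field check: subtotal + tax must equal total within a rounding tolerance."""
    return abs((subtotal + tax) - total) <= tolerance

# The example from the text: $1,000.00 + $80.00 should be $1,080.00,
# so an extracted total of $1,180.00 fails the check.
```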

Layer 3: Data Normalization

Normalization isn't validation per se, but it prevents downstream errors:

  • Dates: "March 15, 2024", "03/15/2024", and "2024-03-15" all become "2024-03-15"
  • Currency: "$1,234.56" becomes 1234.56 (number type)
  • Booleans: "Yes", "TRUE", "true" all become "true"
  • Whitespace: leading/trailing spaces stripped

Without normalization, two identical invoices might produce different extracted data because one vendor formats dates as "MM/DD/YYYY" and another as "YYYY-MM-DD." Normalization ensures consistency.
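A minimal normalization sketch along these lines; the accepted date formats and currency symbols are assumptions, and real documents would need a longer list:

```python
from datetime import datetime

# Input formats we expect to see (an illustrative, not exhaustive, list).
DATE_FORMATS = ["%B %d, %Y", "%m/%d/%Y", "%Y-%m-%d"]

def normalize_date(raw: str) -> str:
    """Try each known input format; emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {raw!r}")

def normalize_currency(raw: str) -> float:
    """Strip currency symbols and thousands separators: '$1,234.56' -> 1234.56."""
    return float(raw.strip().lstrip("$€£").replace(",", ""))

def normalize_bool(raw: str) -> str:
    """'Yes', 'TRUE', 'true' all map to the string 'true'."""
    return "true" if raw.strip().lower() in {"yes", "true", "y"} else "false"
```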

Layer 4: Confidence Aggregation

The final layer combines signals from classification, extraction, and validation into a single confidence score:

overall = classification_confidence * 0.20
        + extraction_confidence   * 0.50
        + validation_score        * 0.30

Extraction confidence gets the highest weight because it directly measures data quality. The validation score starts at 1.0 and is penalized for each error (critical: -0.30, error: -0.15, warning: -0.05).
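The aggregation and penalty scheme could be implemented like this (function and field names are illustrative):

```python
# Penalty per validation finding, by severity.
PENALTIES = {"critical": 0.30, "error": 0.15, "warning": 0.05}

def validation_score(findings: list[dict]) -> float:
    """Start at 1.0 and subtract a penalty per finding, floored at 0."""
    score = 1.0
    for f in findings:
        score -= PENALTIES[f["severity"]]
    return max(score, 0.0)

def overall_confidence(classification: float, extraction: float,
                       findings: list[dict]) -> float:
    """Weighted combination; extraction carries the most weight."""
    return (classification * 0.20
            + extraction * 0.50
            + validation_score(findings) * 0.30)
```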

The Routing Decision

The aggregated confidence maps to one of three outcomes:

Confidence | Critical Errors | Action
>= 0.85    | 0               | Auto-accept — data goes to output
>= 0.60    | <= 1            | Review — routed to human queue
< 0.60     | any             | Reject — manual processing

These thresholds are tunable. A conservative pipeline (financial services, healthcare) might set auto-accept at 0.95. An aggressive pipeline (marketing, internal docs) might accept at 0.75.
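A routing function with tunable thresholds might look like this; the defaults follow the table above, and the parameterization is an assumption:

```python
def route(confidence: float, critical_errors: int,
          accept_at: float = 0.85, review_at: float = 0.60) -> str:
    """Map aggregated confidence and critical-error count to an outcome.

    Thresholds are parameters so a conservative pipeline can raise
    accept_at (e.g. to 0.95) without touching the logic.
    """
    if confidence >= accept_at and critical_errors == 0:
        return "auto-accept"
    if confidence >= review_at and critical_errors <= 1:
        return "review"
    return "reject"
```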

Validation Rule Design

Good validation rules follow three principles:

1. Severity Reflects Business Impact

A missing vendor name is "critical" because you can't process a payment without knowing who to pay. A non-standard invoice number format is a "warning" because the payment can still be processed.

2. Rules Are Documented and Testable

Every rule has an ID, a human-readable description, and a formal expression. This makes it possible to audit the validation logic — regulators can review the rules without reading code.

3. Rules Evolve With Data

New vendors, new document formats, and new business rules all require validation updates. The rule set should live in a configuration file (like validation-rules.json), not hard-coded in source.
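Such a file might contain entries like the following; the exact schema shown here is an illustration, not a fixed format:

```json
{
  "rules": [
    {
      "id": "INV-ARITH-001",
      "description": "Subtotal plus tax must equal total within $0.01",
      "expression": "abs(subtotal + tax - total) <= 0.01",
      "severity": "error"
    },
    {
      "id": "INV-DATE-002",
      "description": "Due date must not precede the invoice date",
      "expression": "dueDate >= invoiceDate",
      "severity": "error"
    }
  ]
}
```

Because each rule pairs an ID and description with its expression, an auditor can review the list without reading pipeline code.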

Enrichment

Validation catches errors. Enrichment adds information. Common enrichment steps:

  • Vendor lookup: Match the extracted vendor name against your vendor master list. Add vendor ID, payment terms, and bank details.
  • Currency conversion: If the invoice is in EUR but your system uses USD, apply the exchange rate for the invoice date.
  • Duplicate detection: Check if this invoice number from this vendor was already processed. Duplicate invoices are a common fraud vector.
  • GL coding: Based on the line item descriptions and vendor category, suggest general ledger account codes.

Enrichment turns extracted data into actionable data. The extracted fields are "what the document says." The enriched fields are "what the business needs."
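Duplicate detection, for instance, can be sketched as a keyed lookup; the in-memory set below stands in for whatever store a real pipeline would use:

```python
def is_duplicate(vendor_id: str, invoice_number: str,
                 seen: set[tuple[str, str]]) -> bool:
    """Flag a repeat of the same invoice number from the same vendor.

    Keying on (vendor, invoice number) rather than invoice number alone
    avoids false positives when two vendors reuse the same numbering scheme.
    """
    key = (vendor_id, invoice_number)
    if key in seen:
        return True
    seen.add(key)
    return False
```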

The Feedback Loop

Validation errors are training signals. If the same field fails validation repeatedly (e.g., a vendor's tax calculation is always off by a penny due to rounding), that tells you to update either the extraction template or the validation rule. Module 6 builds the automated feedback loop that uses validation failures to improve the pipeline.

This is chapter 4 of AI Document Processing.
