Validation & Enrichment
Catching Errors Before They Become Business Problems
Why Validation Matters
Extraction without validation is like a factory without quality control. The extractor says the invoice total is $12,345.67 — but is that right? Does it match the sum of line items? Is the vendor name in your approved vendor list? Is the date plausible?
In accounts payable, a single wrong extraction can mean an incorrect payment. At scale — 10,000 invoices per month — even a 1% error rate produces 100 wrong payments. Validation is your last line of defense before data enters downstream systems.
Four Validation Layers
Layer 1: Schema Validation
Schema validation checks each field independently against its type definition:
| Check | Example | Severity |
|---|---|---|
| Required fields | Invoice must have a total | Critical |
| Type checking | Total must be a number, not "TBD" | Error |
| Range checking | Total must be positive, under $1M | Warning |
| Pattern matching | Invoice number must match AAA-99999 | Warning |
Schema validation catches obvious errors: missing required fields, wrong data types, values outside expected ranges. It's fast, stateless, and catches about 60% of extraction errors.
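The checks in the table can be sketched as a small function. This is an illustrative sketch, not a real library: the record shape, field names (`total`, `invoice_number`), and the `check_schema` helper are all assumptions for this example.

```python
import re

# Schema validation sketch: each field is checked independently.
# Returns a list of (severity, message) tuples; empty list means all checks pass.

def check_schema(invoice: dict) -> list[tuple[str, str]]:
    issues = []
    # Required field: an invoice must have a total (critical).
    if invoice.get("total") is None:
        issues.append(("critical", "missing required field: total"))
        return issues
    total = invoice["total"]
    # Type check: total must be a number, not a string like "TBD" (error).
    if not isinstance(total, (int, float)):
        issues.append(("error", f"total is not a number: {total!r}"))
        return issues
    # Range check: positive and under $1M (warning).
    if not (0 < total < 1_000_000):
        issues.append(("warning", f"total out of expected range: {total}"))
    # Pattern check: invoice number should match AAA-99999 (warning).
    num = invoice.get("invoice_number")
    if num is not None and not re.fullmatch(r"[A-Z]{3}-\d{5}", str(num)):
        issues.append(("warning", f"invoice number does not match AAA-99999: {num}"))
    return issues

print(check_schema({"total": "TBD"}))  # [('error', "total is not a number: 'TBD'")]
print(check_schema({"total": 123.45, "invoice_number": "ABC-12345"}))  # []
```

Note that the function is stateless: it needs only the record itself, which is what makes this layer fast and easy to run everywhere.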
Layer 2: Cross-Field Validation
Cross-field checks validate relationships between fields:
- `subtotal + tax = total` (within rounding tolerance)
- `dueDate >= invoiceDate`
- `partyA != partyB` in contracts

These catch errors that single-field validation misses. A total of $1,180.00 passes schema validation (it's a positive number). But if subtotal is $1,000.00 and tax is $80.00, the expected total is $1,080.00 — the cross-field check catches the $100 discrepancy.
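A minimal sketch of the first two checks, using the worked example from the text. Field names and the `check_cross_field` helper are illustrative assumptions:

```python
# Cross-field validation sketch: checks relationships between fields.

def check_cross_field(invoice: dict, tolerance: float = 0.01) -> list[str]:
    errors = []
    # subtotal + tax should equal total, within a rounding tolerance.
    expected = invoice["subtotal"] + invoice["tax"]
    if abs(expected - invoice["total"]) > tolerance:
        errors.append(f"total {invoice['total']:.2f} != subtotal + tax = {expected:.2f}")
    # Due date must not precede the invoice date.
    # ISO-format date strings (YYYY-MM-DD) compare correctly as strings.
    if invoice["due_date"] < invoice["invoice_date"]:
        errors.append("dueDate is before invoiceDate")
    return errors

inv = {"subtotal": 1000.00, "tax": 80.00, "total": 1180.00,
       "invoice_date": "2024-03-01", "due_date": "2024-03-31"}
print(check_cross_field(inv))  # ['total 1180.00 != subtotal + tax = 1080.00']
```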
Layer 3: Data Normalization
Normalization isn't validation per se, but it prevents downstream errors:
Normalization converts extracted values to canonical forms: dates to a single standard format, and amounts like "$1,234.56" to 1234.56 (number type). Without normalization, two identical invoices might produce different extracted data because one vendor formats dates as "MM/DD/YYYY" and another as "YYYY-MM-DD." Normalization ensures consistency.
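As an illustration, normalization helpers for amounts and dates might look like this. The accepted input formats and function names are assumptions for the sketch:

```python
from datetime import datetime

# Normalization sketch: map vendor-specific formats to canonical forms.

def normalize_amount(raw: str) -> float:
    """'$1,234.56' -> 1234.56 (number type)."""
    return float(raw.replace("$", "").replace(",", "").strip())

def normalize_date(raw: str) -> str:
    """Accept MM/DD/YYYY or YYYY-MM-DD; emit ISO 8601 (YYYY-MM-DD)."""
    for fmt in ("%m/%d/%Y", "%Y-%m-%d"):
        try:
            return datetime.strptime(raw, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {raw!r}")

print(normalize_amount("$1,234.56"))  # 1234.56
print(normalize_date("03/15/2024"))   # 2024-03-15
```

With this in place, both vendors' invoices produce the same canonical values downstream.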
Layer 4: Confidence Aggregation
The final layer combines signals from classification, extraction, and validation into a single confidence score:
```
overall = classification_confidence * 0.20
        + extraction_confidence * 0.50
        + validation_score * 0.30
```

Extraction confidence gets the highest weight because it directly measures data quality. The validation score starts at 1.0 and is penalized for each error (critical: -0.30, error: -0.15, warning: -0.05).
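Putting the weights and penalties together, the aggregation can be sketched as below. The function names are illustrative; the weights and penalty values come from the text:

```python
# Confidence aggregation sketch using the weights and penalties from the text.

PENALTIES = {"critical": 0.30, "error": 0.15, "warning": 0.05}

def validation_score(severities: list[str]) -> float:
    """Start at 1.0, subtract a penalty per issue; floor at 0."""
    return max(1.0 - sum(PENALTIES[s] for s in severities), 0.0)

def overall_confidence(classification: float, extraction: float,
                       severities: list[str]) -> float:
    return (classification * 0.20
            + extraction * 0.50
            + validation_score(severities) * 0.30)

# One warning: validation score 0.95;
# overall = 0.90*0.20 + 0.85*0.50 + 0.95*0.30
print(round(overall_confidence(0.90, 0.85, ["warning"]), 2))  # 0.89
```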
The Routing Decision
The aggregated confidence maps to three outcomes:
| Confidence | Critical Errors | Action |
|---|---|---|
| >= 0.85 | 0 | **Auto-accept** — data goes to output |
| >= 0.60 | <= 1 | **Review** — routed to human queue |
| < 0.60 | any | **Reject** — manual processing |
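The routing table reduces to a few lines of code. A sketch, assuming the default thresholds from the table (the `route` function name is illustrative):

```python
# Routing sketch implementing the default thresholds from the table.

def route(confidence: float, critical_errors: int) -> str:
    if confidence >= 0.85 and critical_errors == 0:
        return "auto-accept"
    if confidence >= 0.60 and critical_errors <= 1:
        return "review"
    return "reject"

print(route(0.92, 0))  # auto-accept
print(route(0.70, 1))  # review
print(route(0.92, 2))  # reject
```

Making the thresholds parameters (rather than constants) is what lets a conservative deployment raise auto-accept to 0.95 without code changes.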
These thresholds are tunable. A conservative pipeline (financial services, healthcare) might set auto-accept at 0.95. An aggressive pipeline (marketing, internal docs) might accept at 0.75.
Validation Rule Design
Good validation rules follow three principles:
1. Severity Reflects Business Impact
A missing vendor name is "critical" because you can't process a payment without knowing who to pay. A non-standard invoice number format is a "warning" because the payment can still be processed.
2. Rules Are Documented and Testable
Every rule has an ID, a human-readable description, and a formal expression. This makes it possible to audit the validation logic — regulators can review the rules without reading code.
3. Rules Evolve With Data
New vendors, new document formats, and new business rules all require validation updates. The rule set should live in a configuration file (like validation-rules.json), not hard-coded in source.
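A rule entry in such a configuration file might look like the following. This is a hypothetical shape, not a prescribed schema; the field names and the rule ID are assumptions:

```json
{
  "id": "XFV-001",
  "description": "Subtotal plus tax must equal total within $0.01",
  "severity": "error",
  "expression": "abs(subtotal + tax - total) <= 0.01"
}
```

Each entry carries the ID, human-readable description, and formal expression described above, so an auditor can review the rule set without touching source code.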
Enrichment
Validation catches errors. Enrichment adds information the document itself doesn't contain, such as matching the extracted vendor name against your approved vendor list.
Enrichment turns extracted data into actionable data. The extracted fields are "what the document says." The enriched fields are "what the business needs."
The Feedback Loop
Validation errors are training signals. If the same field fails validation repeatedly (e.g., a vendor's tax calculation is always off by a penny due to rounding), that tells you to update either the extraction template or the validation rule. Module 6 builds the automated feedback loop that uses validation failures to improve the pipeline.
This is chapter 4 of AI Document Processing.