Field Extraction
Turning Text into Structured Data
The Core Value
Classification tells you WHAT a document is. Field extraction tells you what's IN it. This is where a PDF becomes structured data — the transformation that makes document processing valuable.
INPUT: "INVOICE: ACM-56234 | Vendor: Acme Industrial Supply | Total: $12,345.67"
OUTPUT: { invoiceNumber: "ACM-56234", vendor: "Acme Industrial Supply", total: 12345.67 }
The challenge: documents don't follow a single format. One vendor's invoice says "Invoice Number:" and another says "Inv#:". One puts the total at the bottom, another puts it at the top. Your extractor needs to handle all variations.
Three Extraction Strategies
Production systems use multiple strategies in priority order, falling back from highest confidence to lowest:
1. Template Matching (Highest Confidence)
Template matching uses regex patterns specific to each document type. The invoice template has patterns like:
invoiceNumber: /(?:INVOICE|Inv[#:]?)\s*[:#]?\s*([A-Z]{2,5}-\d{4,6})/i
vendor: /^(?:Vendor|From|Supplier):\s*(.+)/im
total: /(?:TOTAL|Amount Due):\s*\$([\d,]+\.\d{2})/i
Each pattern captures a specific format. Template matching is fast and accurate when the document follows the expected format, typically 95%+ accuracy on known templates.
The weakness: Templates are brittle. If a vendor changes their format from "Vendor:" to "Billed From:", the template stops matching. You need to maintain regex patterns for every vendor variation you encounter.
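As a minimal sketch of how such a template could be applied, the snippet below runs each pattern against the document text and records the first capture group. The names `INVOICE_TEMPLATE` and `extractWithTemplate`, the field object shape, and the 0.95 score are assumptions made for illustration. The patterns follow the ones above, with two adjustments: the vendor pattern is anchored to the start of a line so that a label such as "Billed From:" cannot match through its "From:" suffix, and the total pattern gains a leading `\b` so that "Subtotal:" is not mistaken for "Total:".

```javascript
// Hypothetical invoice template: field name -> regex with one capture group.
const INVOICE_TEMPLATE = {
  invoiceNumber: /(?:INVOICE|Inv[#:]?)\s*[:#]?\s*([A-Z]{2,5}-\d{4,6})/i,
  vendor: /^(?:Vendor|From|Supplier):\s*(.+)/im,
  total: /\b(?:TOTAL|Amount Due):\s*\$([\d,]+\.\d{2})/i,
};

// Returns one field object per pattern that matched.
// The default confidence of 0.95 is an illustrative assumption.
function extractWithTemplate(text, template, confidence = 0.95) {
  const fields = [];
  for (const [name, pattern] of Object.entries(template)) {
    const match = pattern.exec(text);
    if (match === null) continue; // no match: leave this field to a fallback strategy
    fields.push({ name, value: match[1].trim(), confidence, source: "template" });
  }
  return fields;
}

const sample = [
  "INVOICE: ACM-56234",
  "Vendor: Acme Industrial Supply",
  "Total: $12,345.67",
].join("\n");

const templateFields = extractWithTemplate(sample, INVOICE_TEMPLATE);
console.log(templateFields);
```

Values come back as strings here; converting "12,345.67" into a number is left to a later normalization step.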
2. Key-Value Detection (Medium Confidence)
Key-value detection is format-agnostic. It finds any line matching "Label: Value" regardless of what the label says. Three patterns:
"Key: Value", the most common format (confidence 0.85)
"Key = Value", seen in configuration-style documents (confidence 0.80)
"LABEL value", fixed-width formatting (confidence 0.70)
Key-value detection catches fields that templates miss. If a vendor uses "Billed From:" (not in the template), key-value detection still finds it, with the normalized key "billedFrom".
The weakness: No filtering. "Note: Please pay by Friday" matches the colon pattern, producing a "note" field that isn't useful. In production, you'd filter key-value results against known field names.
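The following sketch shows one way the three line patterns and the known-field filter could fit together. The regexes, the function names (`normalizeKey`, `detectKeyValues`), and the field object shape are assumptions for illustration; only the confidence values come from the list above.

```javascript
// Line patterns in priority order. The regexes are illustrative assumptions.
const KV_PATTERNS = [
  { regex: /^\s*([A-Za-z][A-Za-z0-9 #]*?)\s*:\s*(\S.*)$/, confidence: 0.85 }, // Key: Value
  { regex: /^\s*([A-Za-z][A-Za-z0-9 #]*?)\s*=\s*(\S.*)$/, confidence: 0.8 },  // Key = Value
  { regex: /^\s*([A-Z][A-Z ]*?)\s{2,}(\S.*)$/, confidence: 0.7 },             // LABEL   value
];

// "Billed From" -> "billedFrom"
function normalizeKey(label) {
  const words = label.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return words
    .map((word, i) => (i === 0 ? word : word[0].toUpperCase() + word.slice(1)))
    .join("");
}

// knownFields is an optional Set of normalized names; null disables filtering.
function detectKeyValues(text, knownFields = null) {
  const fields = [];
  for (const line of text.split(/\r?\n/)) {
    for (const { regex, confidence } of KV_PATTERNS) {
      const match = regex.exec(line);
      if (match === null) continue;
      const name = normalizeKey(match[1]);
      if (knownFields === null || knownFields.has(name)) {
        fields.push({ name, value: match[2].trim(), confidence, source: "keyValue" });
      }
      break; // the first (highest-confidence) pattern wins for this line
    }
  }
  return fields;
}

const kvSample = [
  "Billed From: Acme Industrial Supply",
  "Note: Please pay by Friday",
  "currency = USD",
  "TERMS     Net 30",
].join("\n");

const unfiltered = detectKeyValues(kvSample);
const filtered = detectKeyValues(kvSample, new Set(["billedFrom", "currency", "terms"]));
console.log(unfiltered.map((f) => f.name)); // includes "note"
console.log(filtered.map((f) => f.name));   // "note" is dropped
```

Passing a set of known names is the filtering step described above; without it, every labelled line becomes a field.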
3. Table Parsing (High Confidence for Structured Data)
Tables detected during ingestion retain their structure — headers and rows. The table parser converts these into named fields:
Table: [Description, Quantity, Unit Price, Line Total]
Row: ["Widget A-100", "5", "$49.99", "$249.95"]
→ lineItem_1_description: "Widget A-100"
→ lineItem_1_quantity: 5
→ lineItem_1_unitPrice: "49.99"
→ lineItem_1_lineTotal: "249.95"
Table parsing is highly reliable because the structure was preserved during ingestion. The parser knows which cell is "Quantity" because it reads the header row.
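A rough sketch of that conversion is shown below. It assumes a table arrives as `{ headers, rows, confidence }`, which is a hypothetical shape, and it mirrors the example above by turning whole-number cells into numbers and stripping a leading dollar sign. The function names are illustrative.

```javascript
// "Unit Price" -> "unitPrice"
function headerToKey(header) {
  const words = header.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);
  return words
    .map((word, i) => (i === 0 ? word : word[0].toUpperCase() + word.slice(1)))
    .join("");
}

// Mirrors the example: "5" -> 5, "$49.99" -> "49.99", other text unchanged.
function normalizeCell(cell) {
  const text = cell.trim();
  if (/^\d+$/.test(text)) return Number(text);
  if (text.startsWith("$")) return text.slice(1);
  return text;
}

// Assumed input shape: { headers: string[], rows: string[][], confidence: number }
function parseTable(table) {
  const keys = table.headers.map(headerToKey);
  const fields = {};
  table.rows.forEach((row, rowIndex) => {
    row.forEach((cell, colIndex) => {
      if (colIndex >= keys.length) return; // ignore cells with no header
      fields[`lineItem_${rowIndex + 1}_${keys[colIndex]}`] = {
        value: normalizeCell(cell),
        confidence: table.confidence, // inherited from table detection
      };
    });
  });
  return fields;
}

const lineItems = parseTable({
  headers: ["Description", "Quantity", "Unit Price", "Line Total"],
  rows: [["Widget A-100", "5", "$49.99", "$249.95"]],
  confidence: 0.9, // illustrative value
});
console.log(lineItems);
```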
The Orchestration Pattern
The field extractor runs all three strategies and merges the results by field name, keeping the highest-confidence value for each field.
This prevents duplicates: if both template matching and key-value detection find the vendor name, only the template match (higher confidence) is kept.
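The merge step can be sketched as a single pass that keeps the best-scoring entry per field name. The function name `mergeFields` and the `{ name, value, confidence, source }` shape are assumptions for illustration, not the course's actual code.

```javascript
// Merge the outputs of several strategies, keeping the highest-confidence
// entry for each field name. On a tie, the earlier strategy's entry is kept.
function mergeFields(...strategyResults) {
  const best = new Map();
  for (const fields of strategyResults) {
    for (const field of fields) {
      const current = best.get(field.name);
      if (current === undefined || field.confidence > current.confidence) {
        best.set(field.name, field);
      }
    }
  }
  return [...best.values()];
}

const fromTemplate = [
  { name: "vendor", value: "Acme Industrial Supply", confidence: 0.95, source: "template" },
];
const fromKeyValue = [
  { name: "vendor", value: "Acme Industrial Supply", confidence: 0.85, source: "keyValue" },
  { name: "dueDate", value: "2024-03-01", confidence: 0.85, source: "keyValue" },
];

const mergedFields = mergeFields(fromTemplate, fromKeyValue);
console.log(mergedFields.map((f) => `${f.name} <- ${f.source}`));
// [ 'vendor <- template', 'dueDate <- keyValue' ]
```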
Confidence Per Field
Every extracted field carries a confidence score from its source strategy:
| Source | Base Confidence | Why |
|---|---|---|
| Template match | 0.90-0.95 | Matched a known pattern |
| Key-value | 0.70-0.85 | Found a structural pattern |
| Table cell | Table's confidence | Inherited from table detection |
These per-field confidences feed into the validation layer (Module 4) and ultimately determine whether the document is auto-accepted or sent to human review.
Named Entity Extraction
Beyond template fields, some documents contain entities that don't have labels, such as company names, dates, and monetary amounts mentioned in running text.
Named entity recognition (NER) handles these. For production pipelines, LLM-based NER works well: send a paragraph to Claude with "Extract all company names, dates, and monetary amounts from this text." The results get lower confidence (0.60-0.75) because they lack structural context.
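The sketch below covers only the two pieces around the model call: building the prompt and turning the reply into fields. The call itself is deliberately left out, since it depends on the client library you use. The JSON reply format, the function names, and the default confidence of 0.65 (a value picked from the 0.60-0.75 range above) are all assumptions.

```javascript
const ENTITY_TYPES = ["company", "date", "amount"];

// Build the instruction sent to the model. Asking for a JSON array is an
// assumption that makes the reply easy to parse.
function buildNerPrompt(paragraph) {
  return [
    "Extract all company names, dates, and monetary amounts from this text.",
    'Respond with only a JSON array of objects of the form {"type": "company" | "date" | "amount", "text": "..."}.',
    "",
    paragraph,
  ].join("\n");
}

// Convert the model's reply into field objects. Anything that is not valid
// JSON, or not one of the expected entity types, is dropped.
function parseNerResponse(responseText, confidence = 0.65) {
  let items;
  try {
    items = JSON.parse(responseText);
  } catch {
    return []; // unparseable reply: treat as "no entities found"
  }
  if (!Array.isArray(items)) return [];
  return items
    .filter(
      (item) =>
        item !== null &&
        typeof item === "object" &&
        ENTITY_TYPES.includes(item.type) &&
        typeof item.text === "string"
    )
    .map((item) => ({ type: item.type, value: item.text, confidence, source: "ner" }));
}

// Example with a hand-written reply standing in for real model output.
const exampleReply = '[{"type":"company","text":"Acme Industrial Supply"},{"type":"amount","text":"$12,345.67"}]';
console.log(parseNerResponse(exampleReply));
```

Validating the reply before trusting it matters here because the model's output is free text, unlike a regex capture.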
Common Pitfalls
Currency Format Ambiguity
"1,234" means $1,234.00 in the US, but in locales that use the comma as the decimal separator (much of continental Europe) it means 1.234, i.e. one dollar and 23.4 cents. Always check the document's locale before parsing currencies.
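To make the ambiguity concrete, here is a small sketch of a parser that takes the decimal separator as an explicit argument instead of guessing it. The function name `parseAmount` is hypothetical, and how you determine the separator (document locale, vendor profile) is outside this snippet.

```javascript
// Parse a currency string given the locale's decimal separator ("." or ",").
// Returns a number, or null when the text does not contain a usable amount.
function parseAmount(raw, decimalSeparator) {
  if (decimalSeparator !== "." && decimalSeparator !== ",") {
    throw new Error('decimalSeparator must be "." or ","');
  }
  const groupSeparator = decimalSeparator === "." ? "," : ".";
  const cleaned = raw
    .replace(/[^\d.,-]/g, "")        // drop currency symbols, letters, spaces
    .split(groupSeparator).join("")  // remove grouping separators
    .replace(decimalSeparator, "."); // normalize the decimal mark
  if (cleaned === "") return null;
  const value = Number(cleaned);
  return Number.isNaN(value) ? null : value;
}

console.log(parseAmount("1,234", "."));      // 1234
console.log(parseAmount("1,234", ","));      // 1.234
console.log(parseAmount("$12,345.67", ".")); // 12345.67
```

The same input string yields two different amounts, which is why the separator has to come from context and not from the string itself.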
Date Format Ambiguity
"01/02/2024" is January 2 (US) or February 1 (European). Without locale context, dates are ambiguous. The safest approach: look for unambiguous dates in the document (e.g., where the day is > 12) to infer the format, then apply it consistently.
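The inference step described above might look like the following sketch. It only handles slash-separated dates with a four-digit year, and the function names and the "DMY"/"MDY" labels are assumptions for illustration. When the document contains no unambiguous date, or contains conflicting evidence, it returns null so the caller can fall back to locale or human review.

```javascript
// Scan the text for slash dates and look for one whose first or second
// number is greater than 12, which fixes the day position.
function inferDateOrder(text) {
  const pattern = /\b(\d{1,2})\/(\d{1,2})\/(\d{4})\b/g;
  let sawDayFirst = false;
  let sawMonthFirst = false;
  for (const match of text.matchAll(pattern)) {
    const first = Number(match[1]);
    const second = Number(match[2]);
    if (first > 12 && second <= 12) sawDayFirst = true;
    if (second > 12 && first <= 12) sawMonthFirst = true;
  }
  if (sawDayFirst && !sawMonthFirst) return "DMY";
  if (sawMonthFirst && !sawDayFirst) return "MDY";
  return null; // no evidence, or conflicting evidence
}

// Apply the inferred order and return an ISO-style YYYY-MM-DD string.
function parseSlashDate(raw, order) {
  const match = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(raw.trim());
  if (match === null || order === null) return null;
  const [a, b, year] = match.slice(1).map(Number);
  const [day, month] = order === "DMY" ? [a, b] : [b, a];
  if (month < 1 || month > 12 || day < 1 || day > 31) return null;
  return `${year}-${String(month).padStart(2, "0")}-${String(day).padStart(2, "0")}`;
}

const dateOrder = inferDateOrder("Issued 25/01/2024, due 01/02/2024");
console.log(dateOrder);                               // DMY
console.log(parseSlashDate("01/02/2024", dateOrder)); // 2024-02-01
```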
Multi-Page Fields
A line items table might span two pages. If your table detector works per-page, it produces two separate tables instead of one. The fix: merge tables from adjacent pages when the headers match.
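One possible shape for that merge is sketched below. It assumes each detected table carries a `page` number and that tables arrive sorted by page; both are assumptions about the ingestion output, as are the function names. It only covers continuation pages that repeat the header row; a continuation without headers would need a different check, such as matching column counts.

```javascript
// Headers match when they have the same labels, ignoring case and padding.
function sameHeaders(a, b) {
  return (
    a.length === b.length &&
    a.every((header, i) => header.trim().toLowerCase() === b[i].trim().toLowerCase())
  );
}

// Assumed input: tables sorted by page, each { page, headers, rows }.
// A table is appended to the previous one when it sits on the next page
// and has matching headers.
function mergeAdjacentTables(tables) {
  const merged = [];
  for (const table of tables) {
    const previous = merged[merged.length - 1];
    const continues =
      previous !== undefined &&
      table.page === previous.lastPage + 1 &&
      sameHeaders(table.headers, previous.headers);
    if (continues) {
      previous.rows.push(...table.rows);
      previous.lastPage = table.page;
    } else {
      merged.push({
        headers: table.headers,
        rows: [...table.rows],
        page: table.page,
        lastPage: table.page,
      });
    }
  }
  return merged;
}

const headers = ["Description", "Quantity", "Unit Price", "Line Total"];
const mergedTables = mergeAdjacentTables([
  { page: 1, headers, rows: [["Widget A-100", "5", "$49.99", "$249.95"]] },
  { page: 2, headers, rows: [["Widget B-200", "2", "$10.00", "$20.00"]] },
  { page: 4, headers, rows: [["Freight", "1", "$15.00", "$15.00"]] },
]);
console.log(mergedTables.map((t) => `${t.page}-${t.lastPage}: ${t.rows.length} rows`));
// [ '1-2: 2 rows', '4-4: 1 rows' ]
```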
Nested Tables
Some invoices have tables within tables (e.g., a summary table containing a detailed breakdown). Simple table detection misses the nesting. For complex layouts, use a hierarchical table detector or an LLM.
This is chapter 3 of AI Document Processing.