
Field Extraction

Turning Text into Structured Data

The Core Value

Classification tells you WHAT a document is. Field extraction tells you what's IN it. This is where a PDF becomes structured data — the transformation that makes document processing valuable.

INPUT:  "INVOICE: ACM-56234 | Vendor: Acme Industrial Supply | Total: $12,345.67"
OUTPUT: { invoiceNumber: "ACM-56234", vendor: "Acme Industrial Supply", total: 12345.67 }

The challenge: documents don't follow a single format. One vendor's invoice says "Invoice Number:" and another says "Inv#:". One puts the total at the bottom, another puts it at the top. Your extractor needs to handle all variations.

Three Extraction Strategies

Production systems combine several strategies and apply them in a fixed priority order, so a field found by an earlier strategy is not overwritten by a later one:

1. Template Matching (Highest Confidence)

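Template matching uses regex patterns specific to each document type. Labels are anchored (to the start of a line, a "|" delimiter, or a word boundary) so that "Billed From:" does not match as "From:" and "Subtotal:" does not match as "Total:". The invoice template has patterns like:

invoiceNumber: /(?:INVOICE|Inv[#:]?)\s*[:#]?\s*([A-Z]{2,5}-\d{4,6})/i
vendor:        /(?:^|\|)\s*(?:Vendor|From|Supplier):\s*([^|\n]+)/im
total:         /\b(?:TOTAL|Amount Due):\s*\$([\d,]+\.\d{2})/i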

Each pattern captures a specific format. Template matching is fast and accurate when the document follows the expected format — typically 95%+ accuracy on known templates.
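
As a rough sketch of how a template could be applied, the snippet below runs each pattern against the document text and records a field for every match. The `ExtractedField` shape, the `extractWithTemplate` helper, and the flat 0.95 confidence are illustrative assumptions, not the course's implementation. The vendor and total patterns anchor their labels so that "Billed From:" and "Subtotal:" do not match.

```typescript
// Illustrative shape for an extracted field (an assumption for this sketch).
interface ExtractedField {
  value: string | number;
  confidence: number;
  source: string;
}

// Patterns adapted from the invoice template. The vendor label must start a
// line or follow "|", and \b keeps "Subtotal:" from matching as "Total:".
const INVOICE_TEMPLATE: Record<string, RegExp> = {
  invoiceNumber: /(?:INVOICE|Inv[#:]?)\s*[:#]?\s*([A-Z]{2,5}-\d{4,6})/i,
  vendor: /(?:^|\|)\s*(?:Vendor|From|Supplier):\s*([^|\n]+)/im,
  total: /\b(?:TOTAL|Amount Due):\s*\$([\d,]+\.\d{2})/i,
};

function extractWithTemplate(
  text: string,
  template: Record<string, RegExp>
): Record<string, ExtractedField> {
  const fields: Record<string, ExtractedField> = {};
  for (const fieldName of Object.keys(template)) {
    const pattern = template[fieldName];
    const match = pattern ? pattern.exec(text) : null;
    if (!match || match[1] === undefined) continue; // leave it for a fallback strategy
    const raw = match[1].trim();
    // Assumption: "total" is the only numeric field and uses US separators.
    const value = fieldName === "total" ? parseFloat(raw.replace(/,/g, "")) : raw;
    fields[fieldName] = { value, confidence: 0.95, source: "template" };
  }
  return fields;
}

const sampleInvoice =
  "INVOICE: ACM-56234 | Vendor: Acme Industrial Supply | Total: $12,345.67";
console.log(extractWithTemplate(sampleInvoice, INVOICE_TEMPLATE));
```

A field whose pattern does not match is simply absent from the result, which is what lets a lower-priority strategy fill it in later.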

The weakness: Templates are brittle. If a vendor changes their format from "Vendor:" to "Billed From:", the template stops matching. You need to maintain regex patterns for every vendor variation you encounter.

2. Key-Value Detection (Medium Confidence)

Key-value detection is format-agnostic. It finds any line matching "Label: Value" regardless of what the label says. Three patterns:

Key: Value — The most common format (confidence 0.85)

Key = Value — Seen in configuration-style documents (confidence 0.80)

LABEL value — Fixed-width formatting (confidence 0.70)

Key-value detection catches fields that templates miss. If a vendor uses "Billed From:" (not in the template), key-value detection still finds it — with the normalized key "billedFrom".

The weakness: No filtering. "Note: Please pay by Friday" matches the colon pattern, producing a "note" field that isn't useful. In production, you'd filter key-value results against known field names.
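
A minimal sketch of the three patterns, with the filtering step included, might look like this. The regexes, the `normalizeKey` helper, and the optional `knownFields` allowlist are assumptions made for illustration; only the three confidence values come from the list above.

```typescript
interface KeyValueField {
  key: string;
  value: string;
  confidence: number;
}

// Tried in order for each line; the first pattern that matches wins.
const KV_PATTERNS: { regex: RegExp; confidence: number }[] = [
  { regex: /^\s*([A-Za-z][A-Za-z0-9 #]*?)\s*:\s*(\S.*)$/, confidence: 0.85 }, // Key: Value
  { regex: /^\s*([A-Za-z][A-Za-z0-9 ]*?)\s*=\s*(\S.*)$/, confidence: 0.8 }, // Key = Value
  { regex: /^\s*([A-Z][A-Z ]*?[A-Z])\s{2,}(\S.*)$/, confidence: 0.7 }, // LABEL   value
];

// "Billed From" -> "billedFrom"
function normalizeKey(label: string): string {
  const words = label.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  return words
    .map((w, i) => (i === 0 ? w : w.charAt(0).toUpperCase() + w.slice(1)))
    .join("");
}

// If knownFields is given, keys outside that list are dropped.
function detectKeyValues(text: string, knownFields?: string[]): KeyValueField[] {
  const results: KeyValueField[] = [];
  for (const line of text.split(/\r?\n/)) {
    for (const pattern of KV_PATTERNS) {
      const match = pattern.regex.exec(line);
      if (!match || match[1] === undefined || match[2] === undefined) continue;
      const key = normalizeKey(match[1]);
      if (key.length > 0 && (!knownFields || knownFields.indexOf(key) !== -1)) {
        results.push({ key, value: match[2].trim(), confidence: pattern.confidence });
      }
      break; // do not try lower-confidence patterns on the same line
    }
  }
  return results;
}

const kvSample = [
  "Billed From: Acme Industrial Supply",
  "Note: Please pay by Friday",
  "terms = NET30",
  "TOTAL DUE     $500.00",
].join("\n");

console.log(detectKeyValues(kvSample)); // includes the unhelpful "note" field
console.log(detectKeyValues(kvSample, ["billedFrom", "terms", "totalDue"])); // "note" is filtered out
```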

3. Table Parsing (High Confidence for Structured Data)

Tables detected during ingestion retain their structure — headers and rows. The table parser converts these into named fields:

Table: [Description, Quantity, Unit Price, Line Total]
Row:   ["Widget A-100", "5", "$49.99", "$249.95"]

→ lineItem_1_description: "Widget A-100"
→ lineItem_1_quantity: 5
→ lineItem_1_unitPrice: 49.99
→ lineItem_1_lineTotal: 249.95

Table parsing is highly reliable because the structure was preserved during ingestion. The parser knows which cell is "Quantity" because it reads the header row.
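
The header-driven mapping can be sketched in a few lines. The `DetectedTable` shape and the helper names are hypothetical, and this version converts every numeric-looking cell to a number (stripping "$" and US thousands separators), which is one possible choice rather than the only one.

```typescript
interface DetectedTable {
  headers: string[];
  rows: string[][];
  confidence: number; // assigned by table detection during ingestion
}

interface TableField {
  value: string | number;
  confidence: number;
}

// "Unit Price" -> "unitPrice"
function headerToKey(header: string): string {
  const words = header.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
  return words
    .map((w, i) => (i === 0 ? w : w.charAt(0).toUpperCase() + w.slice(1)))
    .join("");
}

// Assumption: US-style numbers; "$" and "," are stripped before the numeric check.
function parseCell(raw: string): string | number {
  const cleaned = raw.replace(/[$,]/g, "").trim();
  return /^-?\d+(\.\d+)?$/.test(cleaned) ? parseFloat(cleaned) : raw.trim();
}

function parseLineItems(table: DetectedTable): Record<string, TableField> {
  const fields: Record<string, TableField> = {};
  table.rows.forEach((row, rowIndex) => {
    table.headers.forEach((header, columnIndex) => {
      const cell = row[columnIndex];
      if (cell === undefined) return; // short row: skip the missing cell
      const key = "lineItem_" + (rowIndex + 1) + "_" + headerToKey(header);
      // Each cell inherits the confidence of the table it came from.
      fields[key] = { value: parseCell(cell), confidence: table.confidence };
    });
  });
  return fields;
}

const lineItemTable: DetectedTable = {
  headers: ["Description", "Quantity", "Unit Price", "Line Total"],
  rows: [["Widget A-100", "5", "$49.99", "$249.95"]],
  confidence: 0.9, // illustrative value
};
console.log(parseLineItems(lineItemTable));
```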

The Orchestration Pattern

The field extractor runs all three strategies and merges results:

Template matching runs first and "claims" fields

Key-value detection runs second, only adding fields not already claimed

Table parsing runs third, adding structured data from tables

This prevents duplicates: if both template matching and key-value detection find the vendor name, only the template match (higher confidence) is kept.
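
The "claiming" behavior amounts to a first-writer-wins merge. The sketch below assumes each strategy returns a map from field name to field; the `Field` shape and `mergeByPriority` name are made up for illustration.

```typescript
interface Field {
  value: string | number;
  confidence: number;
  source: string;
}

type FieldMap = Record<string, Field>;

// Strategy results are passed in priority order. A field name claimed by an
// earlier strategy is never overwritten by a later one.
function mergeByPriority(strategyResults: FieldMap[]): FieldMap {
  const merged: FieldMap = {};
  for (const result of strategyResults) {
    for (const fieldName of Object.keys(result)) {
      const field = result[fieldName];
      const alreadyClaimed = Object.prototype.hasOwnProperty.call(merged, fieldName);
      if (field !== undefined && !alreadyClaimed) {
        merged[fieldName] = field;
      }
    }
  }
  return merged;
}

const fromTemplate: FieldMap = {
  vendor: { value: "Acme Industrial Supply", confidence: 0.95, source: "template" },
};
const fromKeyValue: FieldMap = {
  vendor: { value: "Acme Industrial Supply", confidence: 0.85, source: "keyValue" },
  terms: { value: "NET30", confidence: 0.8, source: "keyValue" },
};
const fromTable: FieldMap = {
  lineItem_1_quantity: { value: 5, confidence: 0.9, source: "table" },
};

// vendor keeps the template match; terms and the line item are added.
console.log(mergeByPriority([fromTemplate, fromKeyValue, fromTable]));
```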

Confidence Per Field

Every extracted field carries a confidence score from its source strategy:

Source	Base Confidence	Why
Template match	0.90-0.95	Matched a known pattern
Key-value	0.70-0.85	Found a structural pattern
Table cell	Table's confidence	Inherited from table detection

These per-field confidences feed into the validation layer (Module 4) and ultimately determine whether the document is auto-accepted or sent to human review.

Named Entity Extraction

Beyond template fields, some documents contain entities that don't have labels:

Company names mentioned in contract body text

Dates embedded in paragraphs ("The parties agree that by March 15, 2024...")

Dollar amounts in prose ("not to exceed $50,000")

Named entity recognition (NER) handles these. For production pipelines, LLM-based NER works well: send a paragraph to Claude with "Extract all company names, dates, and monetary amounts from this text." The results get lower confidence (0.60-0.75) because they lack structural context.

Common Pitfalls

Currency Format Ambiguity

"1,234" means 1,234.00 in US formatting but 1.234 (one and 234 thousandths) in locales that use the comma as the decimal separator, as many European ones do. Always check the document's locale before parsing currencies.

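One way to make the locale dependence explicit is to have the caller pass the decimal separator instead of guessing it. This is a simplified sketch: `parseAmount` is a hypothetical helper, and it assumes locale detection has already happened elsewhere.

```typescript
type DecimalSeparator = "." | ",";

// Parses an amount string given the locale's decimal separator.
// The other of "." and "," is treated as the thousands separator.
function parseAmount(raw: string, decimalSeparator: DecimalSeparator): number {
  const groupSeparator = decimalSeparator === "." ? "," : ".";
  const normalized = raw
    .replace(/[^\d.,-]/g, "") // drop currency symbols and whitespace
    .split(groupSeparator)
    .join("") // drop thousands separators
    .replace(",", "."); // normalize the decimal separator for parseFloat
  return parseFloat(normalized);
}

console.log(parseAmount("1,234", ".")); // 1234
console.log(parseAmount("1,234", ",")); // 1.234
console.log(parseAmount("$12,345.67", ".")); // 12345.67
console.log(parseAmount("1.234,56 €", ",")); // 1234.56
```
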
Date Format Ambiguity

"01/02/2024" is January 2 (US) or February 1 (European). Without locale context, dates are ambiguous. The safest approach: look for unambiguous dates in the document (e.g., where the day is > 12) to infer the format, then apply it consistently.

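The inference step could look roughly like the following. It only handles slash-separated dates with four-digit years, and the function names and the "unknown" fallback are assumptions of this sketch.

```typescript
type DateOrder = "MDY" | "DMY" | "unknown";

// Scans the text for dates where one of the first two numbers exceeds 12,
// since such a number can only be the day.
function inferDateOrder(text: string): DateOrder {
  const pattern = /\b(\d{1,2})\/(\d{1,2})\/(\d{4})\b/g;
  let sawDayFirst = false;
  let sawMonthFirst = false;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const first = parseInt(match[1] || "0", 10);
    const second = parseInt(match[2] || "0", 10);
    if (first > 12 && second <= 12) sawDayFirst = true;
    if (second > 12 && first <= 12) sawMonthFirst = true;
  }
  // No evidence, or conflicting evidence: refuse to guess.
  if (sawDayFirst === sawMonthFirst) return "unknown";
  return sawDayFirst ? "DMY" : "MDY";
}

// Converts one date to ISO format (YYYY-MM-DD) using the inferred order.
function toIsoDate(raw: string, order: "MDY" | "DMY"): string | null {
  const m = /^(\d{1,2})\/(\d{1,2})\/(\d{4})$/.exec(raw.trim());
  if (!m) return null;
  const a = parseInt(m[1] || "0", 10);
  const b = parseInt(m[2] || "0", 10);
  const month = order === "MDY" ? a : b;
  const day = order === "MDY" ? b : a;
  if (month < 1 || month > 12 || day < 1 || day > 31) return null;
  const pad = (n: number): string => (n < 10 ? "0" + n : String(n));
  return (m[3] || "") + "-" + pad(month) + "-" + pad(day);
}

const invoiceText = "Issued 25/01/2024, due 01/02/2024";
const dateOrder = inferDateOrder(invoiceText); // "DMY", because 25 cannot be a month
if (dateOrder !== "unknown") {
  console.log(toIsoDate("01/02/2024", dateOrder)); // "2024-02-01"
}
```

When the result is "unknown", the document contains no disambiguating date, so the date fields are better flagged for review than parsed on a guess.
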
Multi-Page Fields

A line items table might span two pages. If your table detector works per-page, it produces two separate tables instead of one. The fix: merge tables from adjacent pages when the headers match.
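
A sketch of that merge, assuming each detected table records its page number and that continuation pages repeat the header row (the `PageTable` and `MergedTable` shapes are hypothetical):

```typescript
interface PageTable {
  page: number;
  headers: string[];
  rows: string[][];
}

interface MergedTable {
  startPage: number;
  endPage: number;
  headers: string[];
  rows: string[][];
}

// Headers match if they are equal after trimming and lowercasing.
function sameHeaders(a: string[], b: string[]): boolean {
  if (a.length !== b.length) return false;
  return a.every((header, i) => {
    const other = b[i];
    return other !== undefined && header.trim().toLowerCase() === other.trim().toLowerCase();
  });
}

// Joins a table onto the previous one when it sits on the very next page
// and has the same headers; otherwise it starts a new table.
function mergeAdjacentTables(tables: PageTable[]): MergedTable[] {
  const sorted = tables.slice().sort((x, y) => x.page - y.page);
  const merged: MergedTable[] = [];
  for (const table of sorted) {
    const previous = merged.length > 0 ? merged[merged.length - 1] : undefined;
    if (
      previous !== undefined &&
      table.page === previous.endPage + 1 &&
      sameHeaders(previous.headers, table.headers)
    ) {
      previous.rows = previous.rows.concat(table.rows);
      previous.endPage = table.page;
    } else {
      merged.push({
        startPage: table.page,
        endPage: table.page,
        headers: table.headers,
        rows: table.rows.slice(),
      });
    }
  }
  return merged;
}
```

Tables whose continuation pages omit the header row would need a different signal, such as matching column counts, which this sketch does not attempt.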

Nested Tables

Some invoices have tables within tables (e.g., a summary table containing a detailed breakdown). Simple table detection misses the nesting. For complex layouts, use a hierarchical table detector or an LLM.

This is chapter 3 of AI Document Processing.
