
Classification

Teaching Your Pipeline to Recognize Documents

Why Classify Before Extracting

You could skip classification entirely and try to extract every possible field from every document. But this creates two problems:

  • Noise. A contract has no "invoice number" field, but a generic extractor might find something that looks like one (a reference number in a clause). Without classification, that false positive enters your data.
  • Missing context. The word "term" means contract duration in a contract and payment deadline in an invoice. Without knowing the document type, the extractor can't resolve this ambiguity.

Classification tells the extractor what to look for and what to ignore. It's the routing layer that directs each document to the right extraction template.

    Keyword Scoring

    The simplest classification approach: count indicative keywords.

    | Document Type | Strong Keywords                                   | Weak Keywords                |
    |---------------|---------------------------------------------------|------------------------------|
    | Invoice       | "invoice number", "bill to", "payment terms"      | "total", "tax", "date"       |
    | Contract      | "indemnification", "governing law", "termination" | "agreement", "party", "term" |
    | Receipt       | "receipt", "merchant", "card ending"              | "total", "tax"               |
    | Form          | "required fields", "applicant", "submission"      | "form", "template"           |

    Strong keywords (longer, more specific) get double weight because they rarely appear outside their document type. "Indemnification" is almost exclusively a contract word. "Total" appears in invoices, receipts, and sometimes contracts.

    The classifier sums keyword weights per type and picks the highest scorer. Simple, fast, and surprisingly effective for well-formatted business documents.
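A minimal sketch of that scorer, using the keyword table above (the lists and weights are illustrative, not a fixed spec):

```python
# Illustrative keyword lists; strong keywords get double weight.
KEYWORDS = {
    "invoice":  {"strong": ["invoice number", "bill to", "payment terms"],
                 "weak":   ["total", "tax", "date"]},
    "contract": {"strong": ["indemnification", "governing law", "termination"],
                 "weak":   ["agreement", "party", "term"]},
    "receipt":  {"strong": ["receipt", "merchant", "card ending"],
                 "weak":   ["total", "tax"]},
    "form":     {"strong": ["required fields", "applicant", "submission"],
                 "weak":   ["form", "template"]},
}

def keyword_scores(text: str) -> dict[str, int]:
    """Sum keyword weights per document type: strong = 2 points, weak = 1."""
    lowered = text.lower()
    scores = {}
    for doc_type, kw in KEYWORDS.items():
        score = sum(2 for k in kw["strong"] if k in lowered)
        score += sum(1 for k in kw["weak"] if k in lowered)
        scores[doc_type] = score
    return scores

def classify(text: str) -> str:
    """Pick the highest-scoring type."""
    scores = keyword_scores(text)
    return max(scores, key=scores.get)
```

For example, a document containing "invoice number", "bill to", "payment terms", and "total" scores 7 for invoice, which easily beats the other types.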

    Structural Features

    Keywords tell you what the document talks about. Structural features tell you what it looks like.

    Table Presence

    Invoices almost always have a table of line items. Contracts rarely have data tables (they might have a table of defined terms, but that's less common). Detecting table structure — pipe-delimited rows, consistent column alignment — is a strong invoice signal.

    Currency Values

    Dollar amounts like "$1,234.56" appear in invoices and receipts but less frequently in contracts (which often spell amounts out in words: "not to exceed fifty thousand dollars"). A regex such as \$[\d,]+\.\d{2} provides a useful signal.

    Date Pairs

    One date could be anything. Two dates suggest a time range — a contract period (start date, end date) or an invoice with a due date. Matching dates against multiple format patterns (ISO, US, written) catches most of them.

    Signatory Blocks

    The words "signatory," "signature," "witness," and "executed" appear almost exclusively in contracts. This is one of the strongest single features for contract detection, worth +4 points.
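The four structural signals can be collected with a handful of regexes. This is a sketch: the +4 signatory bonus follows the text, but the other point values, patterns, and the pipe-count table heuristic are assumptions:

```python
import re

CURRENCY = re.compile(r"\$[\d,]+\.\d{2}")
DATES = re.compile(
    r"\b\d{4}-\d{2}-\d{2}\b"             # ISO: 2024-01-31
    r"|\b\d{1,2}/\d{1,2}/\d{2,4}\b"      # US: 1/31/2024
    r"|\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},?\s+\d{4}\b"  # written
)
SIGNATORY = re.compile(r"\b(?:signatory|signature|witness|executed)\b", re.I)

def structural_features(text: str) -> dict[str, int]:
    """Score each structural feature; point values here are illustrative."""
    features = {}
    # Table presence: pipe-delimited rows as a crude table detector.
    features["table"] = 2 if text.count("|") >= 4 else 0
    # Currency values suggest invoice/receipt (capped so one feature can't dominate).
    features["currency"] = min(len(CURRENCY.findall(text)), 3)
    # Two or more dates suggest a time range.
    features["date_pair"] = 2 if len(DATES.findall(text)) >= 2 else 0
    # Signatory language is a strong contract signal (+4 per the text).
    features["signatory"] = 4 if SIGNATORY.search(text) else 0
    return features
```

These feature points would be added to the keyword scores of the types they indicate (currency to invoice and receipt, signatory to contract, and so on).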

    Confidence Scoring

    A classification without confidence is dangerous. The classifier says "invoice" — but is it 95% sure or 52% sure?

    The Margin Method

    Confidence comes from the gap between the winner and the runner-up:

  • Invoice scored 12, Receipt scored 3 → margin = (12-3)/12 = 0.75 → high confidence
  • Invoice scored 12, Receipt scored 10 → margin = (12-10)/12 = 0.17 → low confidence

    A large margin means the evidence clearly points to one type. A small margin means the document could plausibly be either type.

    Feature Richness Bonus

    Finding more matching keywords adds confidence — it means the document has multiple independent signals pointing to the same type, not just one lucky keyword match.

    Length Factor

    Short documents get penalized. A 5-word document classified as "invoice" should be less trusted than a 500-word document classified as "invoice." There's simply more evidence in longer documents.
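The three factors can be combined into a single score. In this sketch the margin formula matches the examples above, but the richness cap, length cutoff, and how the factors are blended are assumptions:

```python
def confidence(scores: dict[str, int], keyword_hits: int, word_count: int) -> float:
    """Blend margin, feature richness, and document length into a 0..1 score."""
    ranked = sorted(scores.values(), reverse=True)
    winner = ranked[0]
    runner_up = ranked[1] if len(ranked) > 1 else 0
    if winner == 0:
        return 0.0
    # Margin method: gap between winner and runner-up, relative to the winner.
    margin = (winner - runner_up) / winner
    # Richness bonus: more independent keyword matches add confidence (capped).
    richness = min(keyword_hits * 0.05, 0.2)
    # Length factor: documents under ~100 words are proportionally penalized.
    length = min(word_count / 100, 1.0)
    return min(margin + richness, 1.0) * length
```

Running the two margin examples through this function reproduces the numbers above: 12 vs. 3 gives 0.75, while 12 vs. 10 stays below 0.2.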

    When Classification Fails

    Ambiguous Documents

    A Statement of Work (SOW) has contract language AND pricing tables. It might score similarly for both "contract" and "invoice." The low confidence score routes it to human review — the correct behavior.

    New Document Types

    When someone uploads a purchase order (a type not in your taxonomy), the classifier force-fits it into the closest known type. The result has low confidence because the keyword matches are sparse. This is another correct behavior — low confidence triggers review, and the reviewer can flag it as a new type.

    The LLM Alternative

    Keyword + feature classification tops out around 95% accuracy on clean documents. For higher accuracy or more document types, you'd use an LLM classifier: send the first 500 characters to Claude with a prompt like "Classify this document as one of: invoice, contract, receipt, form, purchase order, unknown." LLM classification handles ambiguity better but costs money per document. The hybrid approach: use keyword classification first, fall back to LLM only for low-confidence documents.
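The hybrid routing might look like the sketch below. Here `llm_classify` is a hypothetical stand-in for the API call, and the 0.6 threshold is an assumption you would tune against your review queue:

```python
from typing import Callable

def hybrid_classify(text: str,
                    keyword_classify: Callable[[str], tuple[str, float]],
                    llm_classify: Callable[[str], str],
                    threshold: float = 0.6) -> str:
    """Run the cheap keyword classifier first; pay for the LLM only when unsure."""
    doc_type, conf = keyword_classify(text)
    if conf >= threshold:
        return doc_type
    # Low confidence: send only the first 500 characters to the LLM,
    # e.g. with a prompt like "Classify this document as one of: ...".
    return llm_classify(text[:500])
```

Because most well-formatted documents clear the threshold, the per-document LLM cost applies only to the ambiguous minority.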

    This is chapter 2 of AI Document Processing.

    Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
