Classification
Teaching Your Pipeline to Recognize Documents
Why Classify Before Extracting
You could skip classification entirely and try to extract every possible field from every document. But this creates two problems:
Classification tells the extractor what to look for and what to ignore. It's the routing layer that directs each document to the right extraction template.
Keyword Scoring
The simplest classification approach: count indicative keywords.
| Document Type | Strong Keywords | Weak Keywords |
|---|---|---|
| Invoice | "invoice number", "bill to", "payment terms" | "total", "tax", "date" |
| Contract | "indemnification", "governing law", "termination" | "agreement", "party", "term" |
| Receipt | "receipt", "merchant", "card ending" | "total", "tax" |
| Form | "required fields", "applicant", "submission" | "form", "template" |
Strong keywords (longer, more specific) get double weight because they rarely appear outside their document type. "Indemnification" is almost exclusively a contract word. "Total" appears in invoices, receipts, and sometimes contracts.
The classifier sums keyword weights per type and picks the highest scorer. Simple, fast, and surprisingly effective for well-formatted business documents.
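Here is a minimal sketch of that scoring loop. The keyword lists come straight from the table above. The function names, the presence-based counting (each keyword counts once no matter how often it appears), and the plain substring matching are assumptions made for illustration.

```python
# Keyword lists copied from the table above. Strong keywords count double.
KEYWORDS = {
    "invoice": {
        "strong": ["invoice number", "bill to", "payment terms"],
        "weak": ["total", "tax", "date"],
    },
    "contract": {
        "strong": ["indemnification", "governing law", "termination"],
        "weak": ["agreement", "party", "term"],
    },
    "receipt": {
        "strong": ["receipt", "merchant", "card ending"],
        "weak": ["total", "tax"],
    },
    "form": {
        "strong": ["required fields", "applicant", "submission"],
        "weak": ["form", "template"],
    },
}

STRONG_WEIGHT = 2
WEAK_WEIGHT = 1


def keyword_scores(text: str) -> dict[str, int]:
    """Sum keyword weights per document type (case-insensitive, presence-based)."""
    lowered = text.lower()
    scores = {}
    for doc_type, groups in KEYWORDS.items():
        strong_hits = sum(1 for kw in groups["strong"] if kw in lowered)
        weak_hits = sum(1 for kw in groups["weak"] if kw in lowered)
        scores[doc_type] = strong_hits * STRONG_WEIGHT + weak_hits * WEAK_WEIGHT
    return scores


def classify(text: str) -> str:
    """Return the highest-scoring type. Ties go to the type listed first."""
    scores = keyword_scores(text)
    return max(scores, key=scores.get)
```

One thing to watch with substring matching: the weak contract keyword "term" also matches inside "payment terms", so an invoice picks up a stray contract point. Word-boundary matching would tighten that up if it becomes a problem.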
Structural Features
Keywords tell you what the document talks about. Structural features tell you what it looks like.
Table Presence
Invoices almost always have a table of line items. Contracts rarely have data tables (they might have a table of defined terms, but that's less common). Detecting table structure — pipe-delimited rows, consistent column alignment — is a strong invoice signal.
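A rough sketch of the pipe-delimited check might look like this. The function name and the rule used here (at least two consecutive lines with the same number of pipes) are assumptions. Column-alignment detection for whitespace-separated tables is not covered.

```python
def has_pipe_table(text: str, min_rows: int = 2) -> bool:
    """True if min_rows consecutive lines share the same pipe count (2 or more)."""
    run_length = 0
    previous_pipes = 0
    for line in text.splitlines():
        pipes = line.count("|")
        if pipes >= 2 and pipes == previous_pipes:
            run_length += 1   # same column count as the line before
        elif pipes >= 2:
            run_length = 1    # a new candidate table starts here
        else:
            run_length = 0    # not a table row
        previous_pipes = pipes
        if run_length >= min_rows:
            return True
    return False
```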
Currency Values
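Dollar amounts like "$1,234.56" appear in invoices and receipts but less frequently in contracts (which often spell amounts out in words: "not to exceed fifty thousand dollars"). A regex such as `\$[\d,]+\.\d{2}` provides a useful signal. Note the escaped dollar sign: an unescaped `$` is the end-of-line anchor and would never match a literal currency symbol.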
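As a small sketch, the currency signal can be a simple match count. The dollar sign is escaped so it matches a literal "$" instead of acting as the end-of-line anchor. The function name is hypothetical.

```python
import re

# Literal "$", then digits and commas, then a two-digit cents part.
CURRENCY_RE = re.compile(r"\$[\d,]+\.\d{2}")


def count_currency_values(text: str) -> int:
    """Count dollar amounts written like $1,234.56."""
    return len(CURRENCY_RE.findall(text))
```

Amounts spelled out in words, or written without cents, are deliberately not counted here.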
Date Pairs
One date could be anything. Two dates suggest a time range: a contract period (start date, end date) or an invoice with an issue date and a due date. Matching several patterns (ISO, US, written) catches most date formats.
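A sketch of the date counter, assuming three illustrative patterns. The exact formats your documents use may differ, and the function names are hypothetical.

```python
import re

DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),      # ISO: 2024-03-15
    re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),  # US: 3/15/2024
    re.compile(                                # written: March 15, 2024
        r"\b(?:January|February|March|April|May|June|July|August|"
        r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
    ),
]


def count_dates(text: str) -> int:
    """Count dates across all supported formats."""
    return sum(len(pattern.findall(text)) for pattern in DATE_PATTERNS)


def has_date_pair(text: str) -> bool:
    """Two or more dates suggest a time range."""
    return count_dates(text) >= 2
```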
Signatory Blocks
The words "signatory," "signature," "witness," and "executed" appear almost exclusively in contracts. This is one of the strongest single features for contract detection, worth +4 points.
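In code, this feature can be a single lookup. The +4 value comes from the text above. Awarding it once as a flat bonus (not once per matching word) is an assumption, as is the function name.

```python
SIGNATORY_TERMS = ("signatory", "signature", "witness", "executed")
SIGNATORY_BONUS = 4  # points added to the contract score


def signatory_bonus(text: str) -> int:
    """Return the contract bonus if any signatory term appears, else 0."""
    lowered = text.lower()
    if any(term in lowered for term in SIGNATORY_TERMS):
        return SIGNATORY_BONUS
    return 0
```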
Confidence Scoring
A classification without confidence is dangerous. The classifier says "invoice" — but is it 95% sure or 52% sure?
The Margin Method
Confidence comes from the gap between the winner and the runner-up:
A large margin means the evidence clearly points to one type. A small margin means the document could plausibly be either type.
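One way to turn that gap into a number between 0 and 1 is to divide it by the winning score. This particular normalization is an assumption made for illustration, not necessarily the formula the course pipeline uses, and the function name is hypothetical.

```python
def margin_confidence(scores: dict[str, float]) -> float:
    """Gap between the top two scores, as a fraction of the top score."""
    ranked = sorted(scores.values(), reverse=True)
    if not ranked or ranked[0] <= 0:
        return 0.0  # no evidence for any type
    runner_up = ranked[1] if len(ranked) > 1 else 0.0
    return (ranked[0] - runner_up) / ranked[0]
```

With scores of 8 for the winner and 2 for the runner-up, this gives 0.75. A tie gives 0.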
Feature Richness Bonus
Finding more matching keywords adds confidence — it means the document has multiple independent signals pointing to the same type, not just one lucky keyword match.
Length Factor
Short documents get penalized. A 5-word document classified as "invoice" should be less trusted than a 500-word document classified as "invoice." There's simply more evidence in longer documents.
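The sketch below shows one way the richness bonus and the length factor could adjust a margin-based confidence. Every constant here (0.02 per matched keyword, a 0.10 cap, full trust at 50 words) is a hypothetical placeholder to tune on your own documents. `margin` is assumed to be a value between 0 and 1 derived from the winner/runner-up gap.

```python
def adjusted_confidence(margin: float, matched_keywords: int, word_count: int) -> float:
    """Apply a feature-richness bonus and a short-document penalty to a margin."""
    # More independent keyword matches add a little confidence, up to a cap.
    richness_bonus = min(matched_keywords * 0.02, 0.10)
    # Documents under 50 words are scaled down in proportion to their length.
    length_factor = min(word_count / 50, 1.0)
    confidence = (margin + richness_bonus) * length_factor
    return max(0.0, min(1.0, confidence))  # clamp to [0, 1]
```

With these placeholder values, a 5-word document keeps only a tenth of the confidence that a 500-word document with the same margin and matches would get.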
When Classification Fails
Ambiguous Documents
A Statement of Work (SOW) has contract language AND pricing tables. It might score similarly for both "contract" and "invoice." The low confidence score routes it to human review — the correct behavior.
New Document Types
When someone uploads a purchase order (a type not in your taxonomy), the classifier force-fits it into the closest known type. The result has low confidence because the keyword matches are sparse. This is another correct behavior — low confidence triggers review, and the reviewer can flag it as a new type.
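Both failure modes end the same way: a low score sends the document to a person. A minimal routing sketch, where the 0.6 threshold and the returned route labels are hypothetical:

```python
REVIEW_THRESHOLD = 0.6  # hypothetical cutoff; tune against reviewed samples


def route(doc_type: str, confidence: float) -> str:
    """Send low-confidence results to human review, the rest to extraction."""
    if confidence < REVIEW_THRESHOLD:
        return "human_review"
    return f"extract:{doc_type}"
```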
The LLM Alternative
Keyword + feature classification tops out around 95% accuracy on clean documents. For higher accuracy or more document types, you'd use an LLM classifier: send the first 500 characters to Claude with a prompt like "Classify this document as one of: invoice, contract, receipt, form, purchase order, unknown." LLM classification handles ambiguity better but costs money per document. The hybrid approach: use keyword classification first, fall back to LLM only for low-confidence documents.
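The hybrid flow can be sketched without committing to a particular LLM client. Both classifiers are passed in as plain callables here: `keyword_classifier` returns a type and a confidence, and `llm_classifier` is a hypothetical wrapper you would write around your LLM API of choice. The 0.6 threshold is also an assumption.

```python
from typing import Callable

LLM_PROMPT = (
    "Classify this document as one of: invoice, contract, receipt, "
    "form, purchase order, unknown."
)


def hybrid_classify(
    text: str,
    keyword_classifier: Callable[[str], tuple[str, float]],
    llm_classifier: Callable[[str, str], str],
    threshold: float = 0.6,
    excerpt_chars: int = 500,
) -> tuple[str, str]:
    """Return (doc_type, source), where source is 'keyword' or 'llm'."""
    doc_type, confidence = keyword_classifier(text)
    if confidence >= threshold:
        return doc_type, "keyword"  # cheap path: no LLM call made
    # Low confidence: pay for an LLM call on the first excerpt_chars characters.
    return llm_classifier(LLM_PROMPT, text[:excerpt_chars]), "llm"
```

Returning the source alongside the type makes it easy to track how often the paid path is used.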
This is chapter 2 of AI Document Processing.