
Document Ingestion

From Raw Files to Structured Text

Why Ingestion Is the Foundation

Every document processing pipeline starts with the same problem: you have files (PDFs, images, scans) and you need text. This sounds trivial — just read the file. But in practice, document ingestion is where most pipelines fail.

A PDF is not a text file. It's a collection of positioned glyphs, vector paths, and embedded images. The "text" you see in a PDF viewer is reconstructed from glyph positions — and that reconstruction can go wrong. Columns merge. Table cells lose their alignment. Headers get mixed with body text. OCR introduces character substitutions ("0" vs "O", "1" vs "l").

The rule of thumb: If your parser produces garbage, every downstream step inherits that garbage. A classifier that receives "Invo1ce" instead of "Invoice" might misclassify the document. An extractor that receives merged table columns will extract wrong amounts. Ingestion quality sets the ceiling for pipeline accuracy.

PDF Parsing Strategies

There are three levels of PDF parsing, each with different trade-offs:

Strategy        | When to Use                                    | Accuracy | Speed
Text extraction | Native (digital) PDFs with selectable text     | High     | Fast
OCR             | Scanned documents, images, photos of documents | Medium   | Slow
Layout analysis | Complex layouts with tables, columns, headers  | High     | Medium

For the Document Processing Pipeline, we use text extraction as the primary strategy and OCR as a fallback. Layout analysis (detecting columns, reading order, table boundaries) is a separate concern handled by the table detector.
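The primary/fallback flow can be sketched as below. `extractText`, `runOcr`, and the stub bodies are hypothetical stand-ins for a real PDF parser and OCR engine; only the control flow is the point:

```typescript
// Sketch of text extraction first, OCR as fallback.
// The helper names and stub bodies are illustrative, not the pipeline's API.

function ingest(pdfBytes: Uint8Array): string {
  const text = extractText(pdfBytes); // try the native text layer first
  return needsOcr(text) ? runOcr(pdfBytes) : text; // fall back to the slower OCR path
}

// Stubs so the sketch runs on its own:
function extractText(_: Uint8Array): string {
  return ""; // pretend this PDF has no extractable text layer
}

function runOcr(_: Uint8Array): string {
  return "OCR text";
}

function needsOcr(text: string): boolean {
  return text.trim().length === 0; // simplified; see "Embedded Images" below
}
```

Keeping OCR behind a check like this matters because OCR is both slower and less accurate than reading a native text layer, per the trade-offs in the table above.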

Table Detection

Tables are the most valuable and most fragile part of business documents. An invoice's line items, a contract's clause list, a receipt's item breakdown — these are all tables. Losing table structure during parsing means your extractor has to re-discover it from flat text, which is error-prone.

The table detector uses three heuristics:

  • Pipe-delimited rows — Lines with | separators (common in formatted text output)
  • Tab-delimited rows — Lines with consistent tab stops
  • Key-value pairs — Repeated "Label: Value" patterns that form a logical table

Each heuristic produces an ExtractedTable with headers, rows, and a confidence score. Tables from different heuristics can be compared — a pipe-detected table at 0.90 confidence beats a key-value table at 0.75.
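A minimal sketch of the pipe-delimited heuristic, assuming an `ExtractedTable` shape like the one described above. The confidence formula (fraction of rows matching the header's column count) is an assumption, not the pipeline's exact scoring:

```typescript
// Hypothetical shape of a detected table.
interface ExtractedTable {
  headers: string[];
  rows: string[][];
  confidence: number; // 0..1
}

// Detect a pipe-delimited table in a block of text lines.
function detectPipeTable(lines: string[]): ExtractedTable | null {
  const pipeLines = lines.filter((l) => l.includes("|"));
  if (pipeLines.length < 2) return null; // need a header row plus data

  const split = (l: string) =>
    l.split("|").map((c) => c.trim()).filter((c) => c.length > 0);

  const headers = split(pipeLines[0]);
  const rows = pipeLines.slice(1).map(split);

  // Confidence: fraction of rows whose column count matches the header row.
  const consistent = rows.filter((r) => r.length === headers.length).length;
  return { headers, rows, confidence: consistent / rows.length };
}
```

A well-formed table where every row matches the header scores 1.0; ragged rows drag the confidence down, which is what lets a cleaner detection from another heuristic win the comparison.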

The OCR Fallback

Not every document has extractable text. Scanned documents, photos of receipts, and faxed contracts are all images. The OCR engine converts these to text, but with uncertainty.

The critical metric is word-level confidence. An OCR engine doesn't just produce text — it produces text with confidence scores per word. "Invoice" at 0.98 confidence is reliable. "Inv0ice" at 0.72 confidence tells you the OCR struggled with that word. These per-word confidences propagate to field extraction: a vendor name assembled from low-confidence words should have lower overall confidence.
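One way to propagate per-word confidence to an assembled field is to take the minimum, a conservative choice (the product or the mean are alternatives). The `OcrWord` shape here is an assumption:

```typescript
// A single OCR-recognized word with its engine-reported confidence.
interface OcrWord {
  text: string;
  confidence: number; // 0..1
}

// Confidence of a field assembled from several OCR words:
// the field is only as trustworthy as its weakest word.
function fieldConfidence(words: OcrWord[]): number {
  if (words.length === 0) return 0;
  return Math.min(...words.map((w) => w.confidence));
}
```

So a vendor name built from "ACME" at 0.98 and "C0rp" at 0.72 carries 0.72 overall, flagging it for review even though most of its words were read cleanly.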

Format Normalization

The ingestion layer's output must be consistent regardless of input format. Whether the document came from a native PDF, a scanned image, or a JSON data feed, the output is always a RawDocument with:

  • A unique ID
  • The full extracted text
  • An array of detected tables
  • A metadata map with source information

This normalization is what makes the rest of the pipeline format-agnostic. The classifier doesn't know or care whether it's looking at text from a PDF parser or an OCR engine. It just sees text.
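One possible TypeScript shape for this record; the concrete field names and the counter-based ID are assumptions (a real pipeline would likely use a UUID):

```typescript
// Hypothetical shape of a detected table (see Table Detection above).
interface ExtractedTable {
  headers: string[];
  rows: string[][];
  confidence: number;
}

// The format-agnostic output of the ingestion layer.
interface RawDocument {
  id: string;
  text: string;
  tables: ExtractedTable[];
  metadata: Record<string, string>; // e.g. source format, page count
}

let nextId = 0;

// Wrap any parser's output in the common RawDocument shape.
function normalize(
  text: string,
  tables: ExtractedTable[],
  source: string
): RawDocument {
  return {
    id: `doc-${++nextId}`, // illustrative; use a UUID in practice
    text,
    tables,
    metadata: { source },
  };
}
```

Whether `text` came from a PDF text layer or an OCR engine, downstream stages only ever see this one shape.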

Common Pitfalls

Character Encoding

PDFs can use dozens of character encodings. A parser that assumes UTF-8 will produce garbled text from older PDFs using Windows-1252 or ISO-8859-1. Always detect encoding before parsing.
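A minimal version of that detection: try strict UTF-8 first and fall back to Windows-1252. Real detectors use statistical models (chardet-style libraries); this two-step sketch only covers the common case described above:

```typescript
// Decode bytes as UTF-8 if valid, otherwise fall back to Windows-1252.
// { fatal: true } makes the UTF-8 decoder throw on invalid sequences
// instead of silently emitting replacement characters.
function decodeBytes(bytes: Uint8Array): string {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder("windows-1252").decode(bytes);
  }
}
```

Without the `fatal` flag, invalid bytes would become U+FFFD replacement characters and the garbling would slip through undetected.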

Multi-Column Layouts

A two-column PDF looks like a single column when naively extracting text line by line. The parser reads "left column line 1 right column line 1" as a single line. Layout-aware parsing detects columns and reads them separately.
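A sketch of the layout-aware approach, assuming the parser exposes positioned text spans (the `Span` type and the split-at-midpoint rule are simplifications; real column detection clusters x-coordinates):

```typescript
// A positioned piece of text as a layout-aware PDF parser might report it.
interface Span {
  text: string;
  x: number; // left edge on the page
  y: number; // top edge on the page
}

// Read a two-column page: left column top-to-bottom, then right column.
function readTwoColumns(spans: Span[], pageWidth: number): string {
  const mid = pageWidth / 2;
  const byPosition = (a: Span, b: Span) => a.y - b.y || a.x - b.x;
  const left = spans.filter((s) => s.x < mid).sort(byPosition);
  const right = spans.filter((s) => s.x >= mid).sort(byPosition);
  return [...left, ...right].map((s) => s.text).join("\n");
}
```

Sorting purely by `y` (the naive approach) would interleave the columns, producing exactly the "left line 1 right line 1" merging described above.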

Header/Footer Contamination

Page numbers, company logos, and legal disclaimers appear on every page. If your parser includes these in the extracted text, your classifier and extractor have to filter them out. Better to strip headers and footers during parsing.
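A simple heuristic for that stripping: any line that repeats verbatim on most pages is treated as a header or footer. The 0.8 threshold is an assumption, and note the limitation that varying lines like "Page 1" / "Page 2" escape an exact-match filter; real strippers also match patterns:

```typescript
// Remove lines that appear verbatim on >= 80% of pages.
// Input: one string[] of lines per page. Output: same shape, filtered.
function stripRepeatedLines(pages: string[][]): string[][] {
  const counts = new Map<string, number>();
  for (const page of pages) {
    for (const line of new Set(page)) {
      // Count each distinct line once per page it appears on.
      counts.set(line, (counts.get(line) ?? 0) + 1);
    }
  }
  const threshold = pages.length * 0.8;
  return pages.map((page) =>
    page.filter((line) => (counts.get(line) ?? 0) < threshold)
  );
}
```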

Embedded Images

Some PDFs embed text as images (intentionally or due to poor PDF generation). Text extraction returns nothing; OCR is required. The needsOcr() function detects this by checking the ratio of printable characters in the extracted content.
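A sketch of that check, assuming an ASCII-printable definition and a 0.5 threshold (both are illustrative choices, not the pipeline's exact values):

```typescript
// Decide whether extracted text is usable or OCR is required,
// based on the ratio of printable characters.
function needsOcr(extracted: string): boolean {
  if (extracted.trim().length === 0) return true; // nothing extracted at all
  const printable = [...extracted].filter((c) => c >= " " && c <= "~").length;
  return printable / extracted.length < 0.5;
}
```

An empty or whitespace-only result and a result dominated by control bytes both route the document to the OCR path; normal business text sails through.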

This is chapter 1 of AI Document Processing.
