
Document Ingestion

From Raw Files to Structured Text

Why Ingestion Is the Foundation

Every document processing pipeline starts with the same problem: you have files (PDFs, images, scans) and you need text. This sounds trivial — just read the file. But in practice, document ingestion is where most pipelines fail.

A PDF is not a text file. It's a collection of positioned glyphs, vector paths, and embedded images. The "text" you see in a PDF viewer is reconstructed from glyph positions — and that reconstruction can go wrong. Columns merge. Table cells lose their alignment. Headers get mixed with body text. OCR introduces character substitutions ("0" vs "O", "1" vs "l").

The rule of thumb: If your parser produces garbage, every downstream step inherits that garbage. A classifier that receives "Invo1ce" instead of "Invoice" might misclassify the document. An extractor that receives merged table columns will extract wrong amounts. Ingestion quality sets the ceiling for pipeline accuracy.

PDF Parsing Strategies

There are three levels of PDF parsing, each with different trade-offs:

Strategy        | When to Use                                    | Accuracy | Speed
Text extraction | Native (digital) PDFs with selectable text     | High     | Fast
OCR             | Scanned documents, images, photos of documents | Medium   | Slow
Layout analysis | Complex layouts with tables, columns, headers  | High     | Medium

For the Document Processing Pipeline, we use text extraction as the primary strategy and OCR as a fallback. Layout analysis (detecting columns, reading order, table boundaries) is a separate concern handled by the table detector.
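The primary/fallback flow can be sketched as below. `extractText`, `runOcr`, and the stub bodies are hypothetical stand-ins for a real PDF parser and OCR engine; only the control flow is the point:

```typescript
// Sketch of text extraction first, OCR as fallback.
// The helper names and stub bodies are illustrative, not the pipeline's API.

function ingest(pdfBytes: Uint8Array): string {
  const text = extractText(pdfBytes); // try the native text layer first
  return needsOcr(text) ? runOcr(pdfBytes) : text; // fall back to the slower OCR path
}

// Stubs so the sketch runs on its own:
function extractText(_: Uint8Array): string {
  return ""; // pretend this PDF has no extractable text layer
}

function runOcr(_: Uint8Array): string {
  return "OCR text";
}

function needsOcr(text: string): boolean {
  return text.trim().length === 0; // simplified; see "Embedded Images" below
}
```

Keeping OCR behind a check like this matters because OCR is both slower and less accurate than reading a native text layer, per the trade-offs in the table above.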

Table Detection

Tables are the most valuable and most fragile part of business documents. An invoice's line items, a contract's clause list, a receipt's item breakdown — these are all tables. Losing table structure during parsing means your extractor has to re-discover it from flat text, which is error-prone.

The table detector uses three heuristics:

  • Pipe-delimited rows — Lines with | separators (common in formatted text output)
  • Tab-delimited rows — Lines with consistent tab stops
  • Key-value pairs — Repeated "Label: Value" patterns that form a logical table

Each heuristic produces an ExtractedTable with headers, rows, and a confidence score. Tables from different heuristics can be compared — a pipe-detected table at 0.90 confidence beats a key-value table at 0.75.
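A minimal sketch of the pipe-delimited heuristic, assuming an `ExtractedTable` shape like the one described above. The confidence formula (fraction of rows matching the header's column count) is an assumption, not the pipeline's exact scoring:

```typescript
// Hypothetical shape of a detected table.
interface ExtractedTable {
  headers: string[];
  rows: string[][];
  confidence: number; // 0..1
}

// Detect a pipe-delimited table in a block of text lines.
function detectPipeTable(lines: string[]): ExtractedTable | null {
  const pipeLines = lines.filter((l) => l.includes("|"));
  if (pipeLines.length < 2) return null; // need a header row plus data

  const split = (l: string) =>
    l.split("|").map((c) => c.trim()).filter((c) => c.length > 0);

  const headers = split(pipeLines[0]);
  const rows = pipeLines.slice(1).map(split);

  // Confidence: fraction of rows whose column count matches the header row.
  const consistent = rows.filter((r) => r.length === headers.length).length;
  return { headers, rows, confidence: consistent / rows.length };
}
```

A well-formed table where every row matches the header scores 1.0; ragged rows drag the confidence down, which is what lets a cleaner detection from another heuristic win the comparison.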

The OCR Fallback

Not every document has extractable text. Scanned documents, photos of receipts, and faxed contracts are all images. The OCR engine converts these to text, but with uncertainty.

The critical metric is word-level confidence. An OCR engine doesn't just produce text — it produces text with confidence scores per word. "Invoice" at 0.98 confidence is reliable. "Inv0ice" at 0.72 confidence tells you the OCR struggled with that word. These per-word confidences propagate to field extraction: a vendor name assembled from low-confidence words should have lower overall confidence.
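One way to propagate per-word confidence to an assembled field is to take the minimum, a conservative choice (the product or the mean are alternatives). The `OcrWord` shape here is an assumption:

```typescript
// A single OCR-recognized word with its engine-reported confidence.
interface OcrWord {
  text: string;
  confidence: number; // 0..1
}

// Confidence of a field assembled from several OCR words:
// the field is only as trustworthy as its weakest word.
function fieldConfidence(words: OcrWord[]): number {
  if (words.length === 0) return 0;
  return Math.min(...words.map((w) => w.confidence));
}
```

So a vendor name built from "ACME" at 0.98 and "C0rp" at 0.72 carries 0.72 overall, flagging it for review even though most of its words were read cleanly.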

Format Normalization

The ingestion layer's output must be consistent regardless of input format. Whether the document came from a native PDF, a scanned image, or a JSON data feed, the output is always a RawDocument with:

  • A unique ID
  • The full extracted text
  • An array of detected tables
  • A metadata map with source information

This normalization is what makes the rest of the pipeline format-agnostic. The classifier doesn't know or care whether it's looking at text from a PDF parser or an OCR engine. It just sees text.
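One possible TypeScript shape for this record; the concrete field names and the counter-based ID are assumptions (a real pipeline would likely use a UUID):

```typescript
// Hypothetical shape of a detected table (see Table Detection above).
interface ExtractedTable {
  headers: string[];
  rows: string[][];
  confidence: number;
}

// The format-agnostic output of the ingestion layer.
interface RawDocument {
  id: string;
  text: string;
  tables: ExtractedTable[];
  metadata: Record<string, string>; // e.g. source format, page count
}

let nextId = 0;

// Wrap any parser's output in the common RawDocument shape.
function normalize(
  text: string,
  tables: ExtractedTable[],
  source: string
): RawDocument {
  return {
    id: `doc-${++nextId}`, // illustrative; use a UUID in practice
    text,
    tables,
    metadata: { source },
  };
}
```

Whether `text` came from a PDF text layer or an OCR engine, downstream stages only ever see this one shape.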

Common Pitfalls

Character Encoding

PDFs can use dozens of character encodings. A parser that assumes UTF-8 will produce garbled text from older PDFs using Windows-1252 or ISO-8859-1. Always detect encoding before parsing.
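A minimal version of that detection: try strict UTF-8 first and fall back to Windows-1252. Real detectors use statistical models (chardet-style libraries); this two-step sketch only covers the common case described above:

```typescript
// Decode bytes as UTF-8 if valid, otherwise fall back to Windows-1252.
// { fatal: true } makes the UTF-8 decoder throw on invalid sequences
// instead of silently emitting replacement characters.
function decodeBytes(bytes: Uint8Array): string {
  try {
    return new TextDecoder("utf-8", { fatal: true }).decode(bytes);
  } catch {
    return new TextDecoder("windows-1252").decode(bytes);
  }
}
```

Without the `fatal` flag, invalid bytes would become U+FFFD replacement characters and the garbling would slip through undetected.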

Multi-Column Layouts

A two-column PDF looks like a single column when naively extracting text line by line. The parser reads "left column line 1 right column line 1" as a single line. Layout-aware parsing detects columns and reads them separately.
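A sketch of the layout-aware approach, assuming the parser exposes positioned text spans (the `Span` type and the split-at-midpoint rule are simplifications; real column detection clusters x-coordinates):

```typescript
// A positioned piece of text as a layout-aware PDF parser might report it.
interface Span {
  text: string;
  x: number; // left edge on the page
  y: number; // top edge on the page
}

// Read a two-column page: left column top-to-bottom, then right column.
function readTwoColumns(spans: Span[], pageWidth: number): string {
  const mid = pageWidth / 2;
  const byPosition = (a: Span, b: Span) => a.y - b.y || a.x - b.x;
  const left = spans.filter((s) => s.x < mid).sort(byPosition);
  const right = spans.filter((s) => s.x >= mid).sort(byPosition);
  return [...left, ...right].map((s) => s.text).join("\n");
}
```

Sorting purely by `y` (the naive approach) would interleave the columns, producing exactly the "left line 1 right line 1" merging described above.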

Header/Footer Contamination

Page numbers, company logos, and legal disclaimers appear on every page. If your parser includes these in the extracted text, your classifier and extractor have to filter them out. Better to strip headers and footers during parsing.
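A simple heuristic for that stripping: any line that repeats verbatim on most pages is treated as a header or footer. The 0.8 threshold is an assumption, and note the limitation that varying lines like "Page 1" / "Page 2" escape an exact-match filter; real strippers also match patterns:

```typescript
// Remove lines that appear verbatim on >= 80% of pages.
// Input: one string[] of lines per page. Output: same shape, filtered.
function stripRepeatedLines(pages: string[][]): string[][] {
  const counts = new Map<string, number>();
  for (const page of pages) {
    for (const line of new Set(page)) {
      // Count each distinct line once per page it appears on.
      counts.set(line, (counts.get(line) ?? 0) + 1);
    }
  }
  const threshold = pages.length * 0.8;
  return pages.map((page) =>
    page.filter((line) => (counts.get(line) ?? 0) < threshold)
  );
}
```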

Embedded Images

Some PDFs embed text as images (intentionally or due to poor PDF generation). Text extraction returns nothing; OCR is required. The needsOcr() function detects this by checking the ratio of printable characters in the extracted content.
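A sketch of that check, assuming an ASCII-printable definition and a 0.5 threshold (both are illustrative choices, not the pipeline's exact values):

```typescript
// Decide whether extracted text is usable or OCR is required,
// based on the ratio of printable characters.
function needsOcr(extracted: string): boolean {
  if (extracted.trim().length === 0) return true; // nothing extracted at all
  const printable = [...extracted].filter((c) => c >= " " && c <= "~").length;
  return printable / extracted.length < 0.5;
}
```

An empty or whitespace-only result and a result dominated by control bytes both route the document to the OCR path; normal business text sails through.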

This is chapter 1 of AI Document Processing.
