
Production Pipeline

Monitoring, Retraining, and Scale

The Production Gap

A demo processes 5 documents and works perfectly. Production processes 50,000 per month and breaks in ways you never anticipated. New vendors, changed formats, scanned documents with coffee stains, multi-language invoices, and documents that are technically two stapled-together documents.

Production engineering bridges this gap with four systems: accuracy monitoring, human review, retraining triggers, and throughput optimization.

Accuracy Monitoring

The Metrics That Matter

Three metrics define extraction quality:

  • Precision — Of all fields the pipeline extracted, how many were correct? Low precision means the pipeline is hallucinating fields.
  • Recall — Of all fields that should have been extracted, how many were found? Low recall means the pipeline is missing fields.
  • F1 Score — The harmonic mean of precision and recall. F1 of 0.95 is good. Below 0.90, investigate.
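
To make these concrete, here is a minimal sketch of the computation for one document, assuming extraction results and hand-labeled ground truth are both plain field-to-value dictionaries (the shape and the exact-match comparison are illustrative assumptions):

    def field_metrics(predicted: dict, ground_truth: dict) -> dict:
        """Precision, recall, and F1 for one document's extracted fields.

        Both arguments map field names to values, e.g.
        {"vendor_name": "Acme Corp", "total": "1240.00"}.
        """
        # True positives: extracted fields whose value matches ground truth.
        tp = sum(1 for field, value in predicted.items()
                 if ground_truth.get(field) == value)
        precision = tp / len(predicted) if predicted else 0.0     # correct / extracted
        recall = tp / len(ground_truth) if ground_truth else 0.0  # found / expected
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)                   # harmonic mean
        return {"precision": precision, "recall": recall, "f1": f1}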

Per-Field Tracking

Overall F1 can hide problems. If vendor name extraction is at 0.99 but tax extraction is at 0.70, the average might look acceptable (0.85) while tax processing is severely broken. Track F1 per field to catch localized degradation.

Drift Detection

Accuracy doesn't degrade overnight — it drifts. A vendor changes their template. A new document type starts arriving. The scanner gets misaligned. Drift detection compares current F1 to historical baselines and flags fields that have degraded:

Severity | Trigger                | Action
Low      | 1 field below 0.80 F1  | Log and monitor
Medium   | 2 fields degraded      | Schedule retraining
High     | 3+ fields degraded     | Immediate intervention
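
A sketch of that check, assuming per-field F1 scores are already tracked; the severity rules mirror the table above, and the function shape is a hypothetical:

    def detect_drift(current_f1: dict[str, float],
                     baseline_f1: dict[str, float],
                     floor: float = 0.80) -> dict:
        """Compare current per-field F1 against a historical baseline."""
        degraded = [field for field, score in current_f1.items()
                    if score < floor and score < baseline_f1.get(field, 1.0)]
        if len(degraded) >= 3:
            severity, action = "high", "immediate intervention"
        elif len(degraded) == 2:
            severity, action = "medium", "schedule retraining"
        elif len(degraded) == 1:
            severity, action = "low", "log and monitor"
        else:
            severity, action = "none", "no action"
        return {"degraded_fields": degraded, "severity": severity, "action": action}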

Human Review Queue

Routing Logic

Every processed document gets one of three destinations:

  • Auto-accept (confidence >= 0.95, no critical errors) — Straight to output. This should be 85-95% of documents.
  • Review queue (confidence 0.70-0.95, or validation errors) — A human verifies and corrects.
  • Manual processing (confidence < 0.70) — The pipeline's output is unreliable; process from scratch.
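
In code, the routing is a short threshold cascade. A sketch, assuming the error flags come from the validation stage:

    def route(confidence: float, has_critical_error: bool,
              has_validation_error: bool) -> str:
        """Pick one of the three destinations for a processed document."""
        if confidence >= 0.95 and not (has_critical_error or has_validation_error):
            return "auto_accept"       # straight to output; 85-95% of volume
        if confidence >= 0.70:
            return "review_queue"      # a human verifies and corrects
        return "manual_processing"     # output unreliable; redo from scratch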

Priority Tiers

Not all review items are equal. The queue assigns priority:

  • High — Critical validation errors or very low confidence. Process immediately.
  • Medium — Non-critical errors. Process within the day.
  • Low — Moderate confidence, no errors. Spot-check when time allows.
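
Priority assignment reads off the same signals. A sketch; the 0.75 cutoff standing in for "very low confidence" is an assumption:

    def review_priority(confidence: float, has_critical_error: bool,
                        has_noncritical_error: bool) -> str:
        """Assign a priority tier to an item entering the review queue."""
        if has_critical_error or confidence < 0.75:  # assumed "very low" cutoff
            return "high"    # process immediately
        if has_noncritical_error:
            return "medium"  # process within the day
        return "low"         # spot-check when time allows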

The Dual Purpose

The review queue serves two functions:

  • Quality gate — Prevents bad data from reaching downstream systems
  • Training data factory — Every correction is a labeled example that improves the pipeline

This dual purpose is what makes the human-in-the-loop pattern economically viable. You're not just paying reviewers to fix errors — you're paying them to generate training data that reduces future errors.

Retraining Triggers

When to Retrain

The retrain trigger evaluates four signals:

  • F1 below threshold — Overall F1 drops below 0.90
  • Drift detected — Specific fields degrading
  • F1 regression — Current F1 is 5%+ lower than previous period
  • High review rate — More than 20% of documents need human review

If any signal fires, the system recommends retraining. Urgency depends on severity:

  • Immediate — F1 dropped 10%+ or high-severity drift. Usually means a major format change.
  • Scheduled — Quality is degrading but not critically. Schedule for the next training cycle.
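
Putting the four signals, the urgency rule, and the sample-size guard (described below) together, a sketch of the trigger evaluation; reading the 5% and 10% drops as relative changes is an interpretation:

    def evaluate_retrain(overall_f1: float, previous_f1: float,
                         drift_severity: str, review_rate: float,
                         sample_size: int, min_samples: int = 50) -> dict:
        """Evaluate the four retraining signals and recommend an urgency."""
        if sample_size < min_samples:
            # Safety guard: too few documents to trust the metrics at all.
            return {"retrain": False, "reason": "insufficient samples"}

        drop = (previous_f1 - overall_f1) / previous_f1 if previous_f1 else 0.0
        signals = {
            "f1_below_threshold": overall_f1 < 0.90,
            "drift_detected": drift_severity in ("low", "medium", "high"),
            "f1_regression": drop >= 0.05,           # 5%+ lower than last period
            "high_review_rate": review_rate > 0.20,  # >20% of documents reviewed
        }
        if not any(signals.values()):
            return {"retrain": False, "signals": signals}

        # Immediate if F1 dropped 10%+ or drift is high severity.
        urgency = ("immediate" if drop >= 0.10 or drift_severity == "high"
                   else "scheduled")
        return {"retrain": True, "urgency": urgency, "signals": signals}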

What Retraining Means

For template-based extraction: update regex patterns to handle new formats. Review the corrections from the human review queue, identify pattern failures, and add new regex variants.
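
For example, if reviewer corrections show a date layout the current pattern misses, the fix is often one more alternative in the pattern (a hypothetical illustration):

    import re

    # Existing variant handled "01/31/2025"; review corrections reveal a
    # vendor now sending "2025-01-31", so a second alternative is added.
    DATE_PATTERN = re.compile(
        r"(\d{2}/\d{2}/\d{4}"    # MM/DD/YYYY (existing variant)
        r"|\d{4}-\d{2}-\d{2})"   # YYYY-MM-DD (new variant)
    )

    assert DATE_PATTERN.search("Invoice date: 2025-01-31")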

For LLM-based extraction: fine-tune on the new ground truth data. The corrections from reviewers become the training examples.

The Safety Guard

Never retrain on too few samples. If you processed 3 documents over a holiday weekend and 1 had an error, that's a 33% error rate — but it's meaningless. The system requires a minimum sample size (50+ documents) before triggering retraining decisions.

Throughput Optimization

The Scale Problem

Each document takes ~200ms to process through the full pipeline. Sequentially processing 50,000 documents takes 10,000 seconds (2.8 hours). With 10x concurrency, it's 17 minutes. Throughput optimization makes the difference between a pipeline that blocks your workflow and one that finishes before your coffee goes cold.

Three Optimization Levers

  • Batching — Split documents into groups (e.g., 25 per batch). This controls memory: processing 50,000 simultaneously would exhaust RAM.
  • Concurrency — Within each batch, process multiple documents simultaneously. The concurrency limit prevents overwhelming databases, APIs, and OCR engines.
  • Retry with backoff — Transient failures (network timeouts, service blips) get retried automatically. Persistent failures (corrupt files) fail after 3 attempts. This turns flaky infrastructure into reliable throughput.
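
A sketch of the three levers working together with asyncio; the constants are illustrative, and process_document stands in for the real pipeline call:

    import asyncio

    BATCH_SIZE = 25    # documents per batch, to bound memory
    CONCURRENCY = 10   # simultaneous documents within a batch
    MAX_ATTEMPTS = 3   # give up on a document after three tries

    class TransientError(Exception):
        """Network timeouts and service blips: worth retrying."""

    async def process_with_retry(doc, process_document, sem):
        async with sem:                                # concurrency limit
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    return await process_document(doc)
                except TransientError:
                    if attempt == MAX_ATTEMPTS:
                        raise                          # persistent failure
                    await asyncio.sleep(2 ** attempt)  # backoff: 2s, then 4s

    async def run_pipeline(docs, process_document):
        sem = asyncio.Semaphore(CONCURRENCY)
        results = []
        for i in range(0, len(docs), BATCH_SIZE):      # batching
            batch = docs[i:i + BATCH_SIZE]
            results += await asyncio.gather(
                *(process_with_retry(d, process_document, sem) for d in batch),
                return_exceptions=True,                # failures don't kill the run
            )
        return results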

Metrics to Watch

  • Documents per second — Your headline number. Target: 50+ docs/sec for simple documents, 5-10 docs/sec for complex documents requiring OCR.
  • P95 latency — 95th percentile processing time. If average is 200ms but P95 is 5,000ms, some documents are much slower (likely OCR or complex tables).
  • Error rate — Percentage of documents that failed all retries. Should be under 0.1% for a healthy pipeline.
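
All three fall out of a simple latency log; a sketch with a nearest-rank P95 and no external dependencies:

    def run_metrics(latencies_ms: list[float], failures: int,
                    wall_clock_s: float) -> dict:
        """Summarize one run: docs/sec, P95 latency, error rate."""
        succeeded = len(latencies_ms)
        total = succeeded + failures
        ranked = sorted(latencies_ms)
        p95 = ranked[int(0.95 * (succeeded - 1))] if ranked else 0.0
        return {
            "docs_per_second": succeeded / wall_clock_s,
            "p95_latency_ms": p95,
            "error_rate": failures / total if total else 0.0,
        }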

The Data Flywheel

The complete production loop:

    Documents arrive
      → Throughput optimizer batches and processes
      → Pipeline: ingest → classify → extract → validate
      → High confidence: auto-accept → output
      → Low confidence: review queue → human correction → output
      → Accuracy monitor tracks per-field F1
      → Drift detected → retrain trigger fires
      → Template update or model fine-tune
      → Improved extraction → fewer documents need review
      → Loop continues

This is the data flywheel. More documents → more corrections → better templates → fewer errors → less review needed → lower cost per document. The pipeline gets better the more you use it.

The Economic Argument

Document processing pricing: manual processing costs $2-5 per document. Automated processing with human review costs $0.10-0.50 per document (amortized over infrastructure and reviewer time). At 50,000 documents per month, that's the difference between $100K-250K per month (manual) and $5K-25K per month (automated with review), i.e., roughly $1.2M-3M versus $60K-300K per year. The ROI of the pipeline — and of this course — is measured in millions of dollars annually for a mid-size organization.

This is chapter 6 of AI Document Processing.

Get the full hands-on course — free during early access. Build the complete system. Your projects become your portfolio.
