
Production Pipeline

The Data Flywheel

From One-shot to Continuous

Fine-tuning your Sales Companion once is a project. Keeping it sharp over time is a system. This chapter covers how to build that system — the infrastructure that turns user feedback into model improvements automatically.

The Data Flywheel

The most powerful concept in applied ML is the data flywheel: a self-reinforcing loop where your product generates the data that makes the product better.

  More users ──> More queries + feedback
       ^                    |
       |                    v
  Better model <── Curated training data

For the Sales Companion, this flywheel looks like:

  • Reps use the tool — generating queries, receiving responses
  • Reps give feedback — thumbs up/down, edits to responses, choosing one suggestion over another
  • Feedback becomes training data — thumbs-up responses are positive examples, edited responses become improved completions
  • Retrain the model — periodically fine-tune on the accumulated feedback
  • Better model attracts more usage — reps trust it more, use it for harder tasks, generating richer feedback
Logging for the Flywheel

    Every interaction should log:

    interface InteractionLog {
      timestamp: string;
      userId: string;
      query: string;
      retrievedDocs: string[];      // What RAG surfaced
      modelResponse: string;        // What the model said
      modelId: string;              // Which model version
      feedback?: {
        rating: "positive" | "negative";
        editedResponse?: string;    // If the rep corrected the response
        reason?: string;            // Optional: why it was bad
      };
      latencyMs: number;
      tokenCount: number;
    }

    The editedResponse field is gold. When a rep takes the model's output and fixes it, you get a perfect training pair: the original query is the instruction, and the rep's edit is the ideal completion.

    Drift Detection

    Models don't degrade because they change — they degrade because the *world* changes around them. This is concept drift, and it's the silent killer of production ML systems.

    What Causes Drift in a Sales Companion

  • New products launched — The model doesn't know about them
  • Pricing changes — Confidently quotes last quarter's prices
  • Competitor moves — Battlecard advice is outdated
  • Team changes — New reps, new territories, different question patterns
  • Market shifts — Economic conditions change how reps sell
Detecting Drift

    Monitor these signals weekly:

    Signal              How to Measure                                  Alert Threshold
    Feedback ratio      thumbs-down / total feedback                    > 15% (was < 10%)
    Edit rate           responses edited / total responses              > 25% increase from baseline
    Eval harness score  run your eval harness on a schedule             > 5% drop from best score
    Query distribution  cluster new queries, compare to training data   new cluster > 10% of traffic

    A minimal weekly check against a stored baseline:

    def check_drift(recent_metrics: dict, baseline_metrics: dict) -> list[str]:
        """Return list of drift alerts."""
        alerts = []
        if recent_metrics["negative_rate"] > baseline_metrics["negative_rate"] * 1.5:
            alerts.append("Negative feedback rate increased 50%+")
        if recent_metrics["edit_rate"] > baseline_metrics["edit_rate"] * 1.25:
            alerts.append("Response edit rate increased 25%+")
        if recent_metrics["eval_score"] < baseline_metrics["eval_score"] * 0.95:
            alerts.append("Eval harness score dropped 5%+")
        return alerts
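The recent and baseline metric dicts can be computed straight from the interaction logs. A minimal sketch — compute_metrics is a hypothetical helper, field names follow the InteractionLog interface above, and the eval score has to be merged in separately since it comes from the eval harness, not the logs:

```python
def compute_metrics(logs: list[dict]) -> dict:
    """Aggregate InteractionLog records into the rates check_drift expects.

    "eval_score" is intentionally absent: it comes from running the eval
    harness, and should be merged into the dict before calling check_drift.
    """
    total = len(logs)
    rated = [l["feedback"] for l in logs if l.get("feedback")]
    negative = sum(1 for f in rated if f.get("rating") == "negative")
    edited = sum(1 for f in rated if f.get("editedResponse"))
    return {
        "negative_rate": negative / max(len(rated), 1),  # share of rated interactions
        "edit_rate": edited / max(total, 1),             # share of all interactions
    }
```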

    Retraining Triggers

    Don't retrain on a fixed schedule ("every month"). Retrain when the data tells you to:

    Trigger 1: Accuracy Drop

    Your eval harness score drops below the threshold. This is the clearest signal — the model is getting worse at tasks it used to handle well.

    Trigger 2: Sufficient New Data

You've accumulated 100+ new high-quality training pairs from user feedback. Retraining on a handful of examples wastes money and risks overfitting to noise; wait until the batch is large enough to actually move the model.

    Trigger 3: New Data Source

    The company adopted a new CRM, launched a new product line, or entered a new market. The model's training data doesn't cover this domain. Create training pairs for the new domain and retrain.

    Trigger 4: Concept Drift Detected

    The drift detection system fires alerts. Investigate whether the drift is due to stale knowledge (add new data) or changed behavior expectations (update training examples).
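The four triggers can be combined into a single gate that runs alongside the weekly drift check. A sketch — should_retrain and its thresholds are illustrative assumptions, not a fixed recipe:

```python
def should_retrain(
    eval_score: float,
    eval_threshold: float,
    new_pairs: int,
    new_data_source: bool,
    drift_alerts: list[str],
    min_new_pairs: int = 100,
) -> tuple[bool, list[str]]:
    """Evaluate the four retraining triggers; return (decision, reasons)."""
    reasons = []
    if eval_score < eval_threshold:
        reasons.append("accuracy drop")        # Trigger 1
    if new_pairs >= min_new_pairs:
        reasons.append("sufficient new data")  # Trigger 2
    if new_data_source:
        reasons.append("new data source")      # Trigger 3
    if drift_alerts:
        reasons.append("concept drift")        # Trigger 4
    return (bool(reasons), reasons)
```

Logging the reasons, not just the decision, makes each retraining run auditable after the fact.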

    CI/CD for Models

    Treat model versions like software releases. The same principles apply:

    Version Control

    models/
      sales-companion-v1/
        training_data.jsonl      # 500 examples
        eval_results.json        # Accuracy: 0.82
        model_card.md            # What changed, who approved
      sales-companion-v2/
        training_data.jsonl      # 650 examples (v1 + 150 feedback)
        eval_results.json        # Accuracy: 0.87
        model_card.md
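Versioning the eval results next to each model makes promotion decisions mechanical. A sketch of a gate comparing a candidate directory against the current one — it assumes eval_results.json contains an "accuracy" field, as in the layout above:

```python
import json
from pathlib import Path


def promote_candidate(candidate_dir: str, current_dir: str,
                      min_gain: float = 0.0) -> bool:
    """Promote only if the candidate's eval accuracy meets or beats the
    current version's (plus an optional required margin)."""
    cand = json.loads(Path(candidate_dir, "eval_results.json").read_text())
    curr = json.loads(Path(current_dir, "eval_results.json").read_text())
    return cand["accuracy"] >= curr["accuracy"] + min_gain
```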

    Staged Rollout Pipeline

    New data collected
          |
          v
    Retrain candidate model
          |
          v
    Run eval harness ──> Fails? ──> Investigate, don't deploy
          |
          v (passes)
    Deploy to canary (5%)
          |
          v (7 days, no regressions)
    Expand to 50%
          |
          v (14 days, metrics stable)
    Full rollout (100%)
          |
          v
    Previous model kept as rollback target
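The canary stage needs stable assignment: a rep should hit the same model on every request while the experiment runs. One common approach is hashing the user ID into a percentage bucket — a sketch, with the model IDs as placeholders:

```python
import hashlib


def pick_model(user_id: str, canary_model: str, stable_model: str,
               canary_pct: int = 5) -> str:
    """Deterministically route a percentage of users to the canary model.

    Hashing the user ID (rather than sampling per request) keeps each rep
    on the same model for the whole experiment.
    """
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return canary_model if bucket < canary_pct else stable_model
```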

    Rollback

    Always keep the previous model version available. If the new model causes issues in production, rolling back should be a single config change:

    // In your feature flag system
    const MODEL_CONFIG = {
      "sales-companion": {
        active: "ft:gpt-4o-mini:acme:sales-v3:def456",
        rollback: "ft:gpt-4o-mini:acme:sales-v2:abc123",
      }
    };

    Cost Optimization

    As you iterate, costs add up. Here are the levers you can pull:

    Fine-tune a Smaller Model

If your fine-tuned GPT-4o mini performs as well as base GPT-4o on your specific tasks, you cut inference costs by roughly 6x (per the table below). Fine-tuning *specializes* a smaller model, closing the gap with larger general-purpose models.

    Approach                 Quality              Cost per Query
    Base GPT-4o              High (general)       $0.03
    Base GPT-4o mini         Medium               $0.003
    Fine-tuned GPT-4o mini   High (your tasks)    $0.005

The fine-tuned small model gives you most of the quality at about a sixth of the cost. For a Sales Companion handling a thousand queries a day, that's the difference between a $900/month AI bill and a $150/month one.
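The arithmetic behind those figures, as a quick sanity check — per-query costs from the table, assuming a flat rate and roughly a thousand queries a day:

```python
def monthly_cost(cost_per_query: float, queries_per_day: int = 1000,
                 days: int = 30) -> float:
    """Monthly inference bill at a flat per-query cost."""
    return cost_per_query * queries_per_day * days


# Base GPT-4o at $0.03/query: ~$900/month
# Fine-tuned GPT-4o mini at $0.005/query: ~$150/month
```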

    Batch Retraining

    Don't retrain after every 10 new examples. Accumulate feedback, then retrain in batches. Each retraining job has a fixed overhead — batching amortizes it.

    Distillation

    Use your best (expensive) model to generate training data, then fine-tune a cheaper model on those outputs. This is distillation — transferring the knowledge of a large model into a small one. It's one of the most cost-effective techniques in production ML.

    GPT-4o (expensive, high quality)
          |
          | Generate 1000 responses
          v
    Training data
          |
          | Fine-tune
          v
    GPT-4o mini (cheap, now also high quality on your tasks)

    Putting It All Together

    The full production pipeline for your Sales Companion:

  • Serve — Fine-tuned model handles rep queries via RAG + fine-tuned generation
  • Log — Every interaction logged with query, response, model version, latency
  • Collect feedback — Thumbs up/down, edits, and implicit signals (did the rep use the response?)
  • Monitor — Drift detection runs weekly, eval harness runs on every candidate model
  • Retrain — When triggers fire, curate new data, retrain, evaluate, staged rollout
  • Repeat — The flywheel turns. Each cycle, the model gets better and the data gets richer.

    This is how enterprise AI teams operate: not a one-time project, but a continuous system that compounds over time.

    This is chapter 6 of Fine-tuning for Enterprise AI.

    Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
