Production Pipeline
From One-shot to Continuous
Fine-tuning your Sales Companion once is a project. Keeping it sharp over time is a system. This chapter covers how to build that system — the infrastructure that turns user feedback into model improvements automatically.
The Data Flywheel
The most powerful concept in applied ML is the data flywheel: a self-reinforcing loop where your product generates the data that makes the product better.
```
   More users ──────> More queries + feedback
       ^                          |
       |                          v
  Better model <────── Curated training data
```

For the Sales Companion, this flywheel looks like: reps ask questions, they rate and edit the answers, those edits become curated training pairs, the next fine-tune answers better, and better answers keep reps coming back.
Logging for the Flywheel
Every interaction should log:
```typescript
interface InteractionLog {
  timestamp: string;
  userId: string;
  query: string;
  retrievedDocs: string[];    // What RAG surfaced
  modelResponse: string;      // What the model said
  modelId: string;            // Which model version
  feedback?: {
    rating: "positive" | "negative";
    editedResponse?: string;  // If the rep corrected the response
    reason?: string;          // Optional: why it was bad
  };
  latencyMs: number;
  tokenCount: number;
}
```

The `editedResponse` field is gold. When a rep takes the model's output and fixes it, you get a perfect training pair: the original query is the instruction, and the rep's edit is the ideal completion.
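Turning those edits into fine-tuning examples is a small transformation. A minimal sketch, assuming the logs above are exported as one JSON object per line and using OpenAI's chat fine-tuning format (the system prompt and file paths are placeholders):

```python
import json

SYSTEM_PROMPT = "You are the Acme Sales Companion."  # placeholder: use your production prompt

def logs_to_training_pairs(log_path: str, out_path: str) -> int:
    """Convert rep-edited responses into chat-format fine-tuning examples."""
    count = 0
    with open(log_path) as logs, open(out_path, "w") as out:
        for line in logs:
            record = json.loads(line)
            edited = (record.get("feedback") or {}).get("editedResponse")
            if not edited:
                continue  # only rep-corrected responses become training pairs
            example = {
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": record["query"]},
                    {"role": "assistant", "content": edited},
                ]
            }
            out.write(json.dumps(example) + "\n")
            count += 1
    return count
```

Run it over last week's logs and append the output to the next version's training_data.jsonl.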
Drift Detection
Models don't degrade because they change — they degrade because the *world* changes around them. This is concept drift, and it's the silent killer of production ML systems.
What Causes Drift in a Sales Companion
- New products, pricing, or packaging that the training data never mentions
- New markets or competitors that change the questions reps ask
- Process and tooling changes (a new CRM, a new sales methodology) that change what a good response looks like
- Gradual shifts in how reps phrase queries as they learn what the Companion can and can't do
Detecting Drift
Monitor these signals weekly:
| Signal | How to Measure | Alert Threshold |
|---|---|---|
| Negative feedback rate | thumbs-down / total feedback | > 15% (baseline was < 10%) |
| Edit rate | responses edited / total responses | > 25% increase from baseline |
| Eval harness score | Run your eval harness on a schedule | > 5% drop from best score |
| Query distribution | Cluster new queries, compare to training data | New cluster > 10% of traffic |
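The first three signals are simple counters, and the combined check for them follows below. The last one needs embeddings. Here is a rough sketch using a nearest-neighbor novelty check instead of full clustering (the embedding model and the 0.75 similarity cutoff are assumptions, not recommendations):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of queries and L2-normalize the vectors."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def novel_query_share(recent_queries: list[str], training_queries: list[str],
                      similarity_cutoff: float = 0.75) -> float:
    """Fraction of recent queries with no close neighbor in the training data."""
    recent = embed(recent_queries)
    known = embed(training_queries)
    best_match = (recent @ known.T).max(axis=1)  # cosine similarity (vectors are normalized)
    return float((best_match < similarity_cutoff).mean())

# Alert when the share exceeds 0.10, matching the "new cluster > 10% of traffic" threshold above.
```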
```python
def check_drift(recent_metrics: dict, baseline_metrics: dict) -> list[str]:
    """Return list of drift alerts."""
    alerts = []
    if recent_metrics["negative_rate"] > baseline_metrics["negative_rate"] * 1.5:
        alerts.append("Negative feedback rate increased 50%+")
    if recent_metrics["edit_rate"] > baseline_metrics["edit_rate"] * 1.25:
        alerts.append("Response edit rate increased 25%+")
    if recent_metrics["eval_score"] < baseline_metrics["eval_score"] * 0.95:
        alerts.append("Eval harness score dropped 5%+")
    return alerts
```

Retraining Triggers
Don't retrain on a fixed schedule ("every month"). Retrain when the data tells you to:
Trigger 1: Accuracy Drop
Your eval harness score drops below the threshold. This is the clearest signal — the model is getting worse at tasks it used to handle well.
Trigger 2: Sufficient New Data
You've accumulated 100+ new high-quality training pairs from user feedback. Retraining on a handful of new examples wastes money for little gain; a batch this size is enough to measurably move the model.
Trigger 3: New Data Source
The company adopted a new CRM, launched a new product line, or entered a new market. The model's training data doesn't cover this domain. Create training pairs for the new domain and retrain.
Trigger 4: Concept Drift Detected
The drift detection system fires alerts. Investigate whether the drift is due to stale knowledge (add new data) or changed behavior expectations (update training examples).
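In practice these triggers can live in the same weekly job as the drift check. A minimal sketch of the gate, with hypothetical inputs (the 100-pair threshold comes from Trigger 2; the eval-score and drift conditions arrive as alerts from `check_drift` above):

```python
def should_retrain(new_pair_count: int,
                   drift_alerts: list[str],
                   new_data_source: bool = False,
                   min_new_pairs: int = 100) -> tuple[bool, str]:
    """Decide whether to kick off a retraining run, and say why."""
    if drift_alerts:  # covers Triggers 1 and 4: eval drops and drift both surface as alerts
        return True, "Drift detected: " + "; ".join(drift_alerts)
    if new_data_source:  # Trigger 3: new CRM, product line, or market
        return True, "New data source needs coverage"
    if new_pair_count >= min_new_pairs:  # Trigger 2: enough accumulated feedback
        return True, f"{new_pair_count} new training pairs accumulated"
    return False, "No trigger fired; keep collecting feedback"
```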
CI/CD for Models
Treat model versions like software releases. The same principles apply:
Version Control
```
models/
  sales-companion-v1/
    training_data.jsonl   # 500 examples
    eval_results.json     # Accuracy: 0.82
    model_card.md         # What changed, who approved
  sales-companion-v2/
    training_data.jsonl   # 650 examples (v1 + 150 feedback)
    eval_results.json     # Accuracy: 0.87
    model_card.md
```

Staged Rollout Pipeline
```
New data collected
        |
        v
Retrain candidate model
        |
        v
Run eval harness ──> Fails? ──> Investigate, don't deploy
        |
        v (passes)
Deploy to canary (5%)
        |
        v (7 days, no regressions)
Expand to 50%
        |
        v (14 days, metrics stable)
Full rollout (100%)
        |
        v
Previous model kept as rollback target
```

Rollback
Always keep the previous model version available. If the new model causes issues in production, rolling back should be a single config change:
```typescript
// In your feature flag system
const MODEL_CONFIG = {
  "sales-companion": {
    active: "ft:gpt-4o-mini:acme:sales-v3:def456",
    rollback: "ft:gpt-4o-mini:acme:sales-v2:abc123",
  }
};
```

Cost Optimization
As you iterate, costs add up. Here are the levers you can pull:
Fine-tune a Smaller Model
If your fine-tuned GPT-4o mini performs as well as base GPT-4o on your specific tasks, you save 10-20x on inference. Fine-tuning *specializes* a smaller model, closing the gap with larger general-purpose models.
| Approach | Quality | Cost per Query |
|---|---|---|
| Base GPT-4o | High (general) | $0.03 |
| Base GPT-4o mini | Medium | $0.003 |
| Fine-tuned GPT-4o mini | High (your tasks) | $0.005 |
The fine-tuned small model gives you 90% of the quality at roughly a sixth of the cost. For a Sales Companion handling a thousand queries a day, that's the difference between a $900/month AI bill and a $150/month AI bill.
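If you want to plug in your own volume, the arithmetic is simple (per-query costs are taken from the table above; the daily volume is only an example):

```python
COST_PER_QUERY = {
    "base-gpt-4o": 0.03,
    "base-gpt-4o-mini": 0.003,
    "fine-tuned-gpt-4o-mini": 0.005,
}

def monthly_bill(queries_per_day: int, approach: str) -> float:
    """Rough monthly inference cost for one approach."""
    return queries_per_day * 30 * COST_PER_QUERY[approach]

print(monthly_bill(1_000, "base-gpt-4o"))             # 900.0
print(monthly_bill(1_000, "fine-tuned-gpt-4o-mini"))  # 150.0
```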
Batch Retraining
Don't retrain after every 10 new examples. Accumulate feedback, then retrain in batches. Each retraining job has a fixed overhead — batching amortizes it.
Distillation
Use your best (expensive) model to generate training data, then fine-tune a cheaper model on those outputs. This is distillation — transferring the knowledge of a large model into a small one. It's one of the most cost-effective techniques in production ML.
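A minimal sketch of this flow with the OpenAI SDK, also diagrammed below (the prompt, file names, and model snapshot are assumptions; your eval harness should still gate the result):

```python
import json
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are the Acme Sales Companion."  # placeholder

# 1. Use the expensive model to answer a batch of representative queries.
queries = [line.strip() for line in open("distillation_queries.txt")]  # assumed input file
with open("distilled_training_data.jsonl", "w") as out:
    for query in queries:
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
            ],
        )
        out.write(json.dumps({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
                {"role": "assistant", "content": completion.choices[0].message.content},
            ]
        }) + "\n")

# 2. Fine-tune the cheap model on the expensive model's outputs.
training_file = client.files.create(
    file=open("distilled_training_data.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-4o-mini-2024-07-18"
)
print(job.id)
```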
```
GPT-4o (expensive, high quality)
        |
        |  Generate 1000 responses
        v
Training data
        |
        |  Fine-tune
        v
GPT-4o mini (cheap, now also high quality on your tasks)
```

Putting It All Together
The full production pipeline for your Sales Companion:

- Log every interaction, including ratings and rep edits.
- Monitor the drift signals weekly against a baseline.
- Retrain when a trigger fires, not on a calendar.
- Gate every candidate model with the eval harness.
- Roll out in stages (canary, 50%, 100%) and keep the previous version one config change away.
- Fold the new feedback into the next version's training data, and the flywheel keeps turning.
This is how enterprise AI teams operate. Not a one-time project, but a continuous system that compounds over time.
This is chapter 6 of Fine-tuning for Enterprise AI.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
View course details