Production Pipeline
From One-shot to Continuous
Fine-tuning your Sales Companion once is a project. Keeping it sharp over time is a system. This chapter covers how to build that system — the infrastructure that turns user feedback into model improvements automatically.
The Data Flywheel
The most powerful concept in applied ML is the data flywheel: a self-reinforcing loop where your product generates the data that makes the product better.
```
   More users ──────> More queries + feedback
       ^                          |
       |                          v
  Better model <────── Curated training data
```

For the Sales Companion, this flywheel looks like: reps ask questions, they rate and edit the answers, those edits become curated training pairs, the next fine-tune answers better, and better answers keep reps coming back.
Logging for the Flywheel
Every interaction should log:
```typescript
interface InteractionLog {
  timestamp: string;
  userId: string;
  query: string;
  retrievedDocs: string[];    // What RAG surfaced
  modelResponse: string;      // What the model said
  modelId: string;            // Which model version
  feedback?: {
    rating: "positive" | "negative";
    editedResponse?: string;  // If the rep corrected the response
    reason?: string;          // Optional: why it was bad
  };
  latencyMs: number;
  tokenCount: number;
}
```

The `editedResponse` field is gold. When a rep takes the model's output and fixes it, you get a perfect training pair: the original query is the instruction, and the rep's edit is the ideal completion.
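Turning those edits into fine-tuning examples is a small transformation. A minimal sketch, assuming the logs above are exported as one JSON object per line and using OpenAI's chat fine-tuning format (the system prompt and file paths are placeholders):

```python
import json

SYSTEM_PROMPT = "You are the Acme Sales Companion."  # placeholder: use your production prompt

def logs_to_training_pairs(log_path: str, out_path: str) -> int:
    """Convert rep-edited responses into chat-format fine-tuning examples."""
    count = 0
    with open(log_path) as logs, open(out_path, "w") as out:
        for line in logs:
            record = json.loads(line)
            edited = (record.get("feedback") or {}).get("editedResponse")
            if not edited:
                continue  # only rep-corrected responses become training pairs
            example = {
                "messages": [
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": record["query"]},
                    {"role": "assistant", "content": edited},
                ]
            }
            out.write(json.dumps(example) + "\n")
            count += 1
    return count
```

Run it over last week's logs and append the output to the next version's training_data.jsonl.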
Drift Detection
Models don't degrade because they change — they degrade because the *world* changes around them. This is concept drift, and it's the silent killer of production ML systems.
What Causes Drift in a Sales Companion
- New products, pricing, or packaging that the training data never mentions
- New markets or competitors that change the questions reps ask
- Process and tooling changes (a new CRM, a new sales methodology) that change what a good response looks like
- Gradual shifts in how reps phrase queries as they learn what the Companion can and can't do
Detecting Drift
Monitor these signals weekly:
| Signal | How to Measure | Alert Threshold |
|---|---|---|
| Negative feedback rate | thumbs-down / total feedback | > 15% (baseline was < 10%) |
| Edit rate | responses edited / total responses | > 25% increase from baseline |
| Eval harness score | Run your eval harness on a schedule | > 5% drop from best score |
| Query distribution | Cluster new queries, compare to training data | New cluster > 10% of traffic |
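The first three signals are simple counters, and the combined check for them follows below. The last one needs embeddings. Here is a rough sketch using a nearest-neighbor novelty check instead of full clustering (the embedding model and the 0.75 similarity cutoff are assumptions, not recommendations):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of queries and L2-normalize the vectors."""
    response = client.embeddings.create(model="text-embedding-3-small", input=texts)
    vectors = np.array([item.embedding for item in response.data])
    return vectors / np.linalg.norm(vectors, axis=1, keepdims=True)

def novel_query_share(recent_queries: list[str], training_queries: list[str],
                      similarity_cutoff: float = 0.75) -> float:
    """Fraction of recent queries with no close neighbor in the training data."""
    recent = embed(recent_queries)
    known = embed(training_queries)
    best_match = (recent @ known.T).max(axis=1)  # cosine similarity (vectors are normalized)
    return float((best_match < similarity_cutoff).mean())

# Alert when the share exceeds 0.10, matching the "new cluster > 10% of traffic" threshold above.
```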
```python
def check_drift(recent_metrics: dict, baseline_metrics: dict) -> list[str]:
    """Return list of drift alerts."""
    alerts = []
    if recent_metrics["negative_rate"] > baseline_metrics["negative_rate"] * 1.5:
        alerts.append("Negative feedback rate increased 50%+")
    if recent_metrics["edit_rate"] > baseline_metrics["edit_rate"] * 1.25:
        alerts.append("Response edit rate increased 25%+")
    if recent_metrics["eval_score"] < baseline_metrics["eval_score"] * 0.95:
        alerts.append("Eval harness score dropped 5%+")
    return alerts
```

Retraining Triggers
Don't retrain on a fixed schedule ("every month"). Retrain when the data tells you to:
Trigger 1: Accuracy Drop
Your eval harness score drops below the threshold. This is the clearest signal — the model is getting worse at tasks it used to handle well.
Trigger 2: Sufficient New Data
You've accumulated 100+ new high-quality training pairs from user feedback. Retraining on a handful of new examples wastes money for little gain; a batch this size is enough to measurably move the model.
Trigger 3: New Data Source
The company adopted a new CRM, launched a new product line, or entered a new market. The model's training data doesn't cover this domain. Create training pairs for the new domain and retrain.
Trigger 4: Concept Drift Detected
The drift detection system fires alerts. Investigate whether the drift is due to stale knowledge (add new data) or changed behavior expectations (update training examples).
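In practice these triggers can live in the same weekly job as the drift check. A minimal sketch of the gate, with hypothetical inputs (the 100-pair threshold comes from Trigger 2; the eval-score and drift conditions arrive as alerts from `check_drift` above):

```python
def should_retrain(new_pair_count: int,
                   drift_alerts: list[str],
                   new_data_source: bool = False,
                   min_new_pairs: int = 100) -> tuple[bool, str]:
    """Decide whether to kick off a retraining run, and say why."""
    if drift_alerts:  # covers Triggers 1 and 4: eval drops and drift both surface as alerts
        return True, "Drift detected: " + "; ".join(drift_alerts)
    if new_data_source:  # Trigger 3: new CRM, product line, or market
        return True, "New data source needs coverage"
    if new_pair_count >= min_new_pairs:  # Trigger 2: enough accumulated feedback
        return True, f"{new_pair_count} new training pairs accumulated"
    return False, "No trigger fired; keep collecting feedback"
```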
CI/CD for Models
Treat model versions like software releases. The same principles apply:
Version Control
```
models/
  sales-companion-v1/
    training_data.jsonl   # 500 examples
    eval_results.json     # Accuracy: 0.82
    model_card.md         # What changed, who approved
  sales-companion-v2/
    training_data.jsonl   # 650 examples (v1 + 150 feedback)
    eval_results.json     # Accuracy: 0.87
    model_card.md
```

Staged Rollout Pipeline
```
New data collected
        |
        v
Retrain candidate model
        |
        v
Run eval harness ──> Fails? ──> Investigate, don't deploy
        |
        v (passes)
Deploy to canary (5%)
        |
        v (7 days, no regressions)
Expand to 50%
        |
        v (14 days, metrics stable)
Full rollout (100%)
        |
        v
Previous model kept as rollback target
```

Rollback
Always keep the previous model version available. If the new model causes issues in production, rolling back should be a single config change:
```typescript
// In your feature flag system
const MODEL_CONFIG = {
  "sales-companion": {
    active: "ft:gpt-4o-mini:acme:sales-v3:def456",
    rollback: "ft:gpt-4o-mini:acme:sales-v2:abc123",
  }
};
```

Cost Optimization
As you iterate, costs add up. Here are the levers you can pull:
Fine-tune a Smaller Model
If your fine-tuned GPT-4o mini performs as well as base GPT-4o on your specific tasks, you save 10-20x on inference. Fine-tuning *specializes* a smaller model, closing the gap with larger general-purpose models.
| Approach | Quality | Cost per Query |
|---|---|---|
| Base GPT-4o | High (general) | $0.03 |
| Base GPT-4o mini | Medium | $0.003 |
| Fine-tuned GPT-4o mini | High (your tasks) | $0.005 |
The fine-tuned small model gives you 90% of the quality at roughly a sixth of the cost. For a Sales Companion handling a thousand queries a day, that's the difference between a $900/month AI bill and a $150/month AI bill.
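If you want to plug in your own volume, the arithmetic is simple (per-query costs are taken from the table above; the daily volume is only an example):

```python
COST_PER_QUERY = {
    "base-gpt-4o": 0.03,
    "base-gpt-4o-mini": 0.003,
    "fine-tuned-gpt-4o-mini": 0.005,
}

def monthly_bill(queries_per_day: int, approach: str) -> float:
    """Rough monthly inference cost for one approach."""
    return queries_per_day * 30 * COST_PER_QUERY[approach]

print(monthly_bill(1_000, "base-gpt-4o"))             # 900.0
print(monthly_bill(1_000, "fine-tuned-gpt-4o-mini"))  # 150.0
```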
Batch Retraining
Don't retrain after every 10 new examples. Accumulate feedback, then retrain in batches. Each retraining job has a fixed overhead — batching amortizes it.
Distillation
Use your best (expensive) model to generate training data, then fine-tune a cheaper model on those outputs. This is distillation — transferring the knowledge of a large model into a small one. It's one of the most cost-effective techniques in production ML.
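A minimal sketch of this flow with the OpenAI SDK, also diagrammed below (the prompt, file names, and model snapshot are assumptions; your eval harness should still gate the result):

```python
import json
from openai import OpenAI

client = OpenAI()
SYSTEM_PROMPT = "You are the Acme Sales Companion."  # placeholder

# 1. Use the expensive model to answer a batch of representative queries.
queries = [line.strip() for line in open("distillation_queries.txt")]  # assumed input file
with open("distilled_training_data.jsonl", "w") as out:
    for query in queries:
        completion = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
            ],
        )
        out.write(json.dumps({
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": query},
                {"role": "assistant", "content": completion.choices[0].message.content},
            ]
        }) + "\n")

# 2. Fine-tune the cheap model on the expensive model's outputs.
training_file = client.files.create(
    file=open("distilled_training_data.jsonl", "rb"), purpose="fine-tune"
)
job = client.fine_tuning.jobs.create(
    training_file=training_file.id, model="gpt-4o-mini-2024-07-18"
)
print(job.id)
```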
```
GPT-4o (expensive, high quality)
        |
        |  Generate 1000 responses
        v
Training data
        |
        |  Fine-tune
        v
GPT-4o mini (cheap, now also high quality on your tasks)
```

Putting It All Together
The full production pipeline for your Sales Companion:

- Log every interaction, including ratings and rep edits.
- Monitor the drift signals weekly against a baseline.
- Retrain when a trigger fires, not on a calendar.
- Gate every candidate model with the eval harness.
- Roll out in stages (canary, 50%, 100%) and keep the previous version one config change away.
- Fold the new feedback into the next version's training data, and the flywheel keeps turning.
This is how enterprise AI teams operate. Not a one-time project, but a continuous system that compounds over time.
This is chapter 6 of Fine-tuning for Enterprise AI.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
View course details