A/B Testing App
Side-by-Side Comparison
Beyond Automated Metrics
Your eval harness gives you numbers. But numbers don't capture everything — does the fine-tuned model *feel* better to use? Does it match the company voice? Is its tone right for a tense negotiation vs a friendly check-in?
This is where human evaluation comes in. You build a simple A/B testing app that lets real users (sales reps, managers, ops) compare outputs blind and tell you which one they prefer.
Blind Comparison UI
The key word is blind. If evaluators know which response came from the fine-tuned model, they'll unconsciously prefer it (you spent weeks on this, so of course it's better, right?). Your comparison UI must hide the model labels and randomize which side each model appears on:
```typescript
interface Comparison {
  id: string;
  prompt: string;
  responseA: { text: string; model: "base" | "finetuned" };
  responseB: { text: string; model: "base" | "finetuned" };
  // Randomized: sometimes base is A, sometimes B
}

function renderComparison(comparison: Comparison) {
  // Show prompt
  // Show Response A and Response B (no model labels!)
  // Buttons: "A is better" | "Tie" | "B is better"
  // Optional: "Why?" free-text field
}
```

For the Sales Companion, run this with 5-10 reps evaluating 30-50 comparisons each. That gives you 150-500 data points, enough to draw real conclusions.
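The randomization is worth making explicit in code. Here's a minimal sketch of building one blind comparison; `buildComparison` and its argument names are illustrative, and it assumes you've already collected both models' responses for the prompt and have `crypto.randomUUID()` (or any other ID generator) available:

```typescript
// Build one blind comparison from the two models' responses to the same prompt.
function buildComparison(
  prompt: string,
  baseText: string,
  finetunedText: string
): Comparison {
  // Coin flip so the fine-tuned model doesn't always land in the same slot.
  const finetunedIsA = Math.random() < 0.5;
  const base = { text: baseText, model: "base" as const };
  const finetuned = { text: finetunedText, model: "finetuned" as const };

  return {
    id: crypto.randomUUID(),
    prompt,
    responseA: finetunedIsA ? finetuned : base,
    responseB: finetunedIsA ? base : finetuned,
  };
}
```

The model assignment stays in the data model but never reaches the UI, which is what keeps the comparison blind.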
Elo Rating System
Instead of simple win percentages, use an Elo rating system (the same system used in chess rankings). Elo handles transitive comparisons well: if Model A beats Model B, and Model B beats Model C, Elo correctly ranks A > B > C even if A and C never faced each other directly.
```python
def update_elo(rating_a: float, rating_b: float, winner: str, k: int = 32) -> tuple[float, float]:
    """Update Elo ratings after a comparison."""
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    if winner == "A":
        score_a, score_b = 1.0, 0.0
    elif winner == "B":
        score_a, score_b = 0.0, 1.0
    else:  # tie
        score_a, score_b = 0.5, 0.5
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * (score_b - expected_b)
    return new_a, new_b
```

Start both models at 1500 Elo. After 200+ comparisons, the ratings stabilize and give you a clear ranking. This becomes especially useful when you're comparing multiple fine-tuned versions against each other.
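On the app side, you can fold the recorded votes into ratings with the same update rule. Below is a TypeScript sketch mirroring `update_elo` above; `VoteRecord` and its fields are assumptions about how you store results, and keying the ratings by model name is what lets you rank several fine-tuned versions at once:

```typescript
interface VoteRecord {
  modelA: string; // e.g. "base" or "sales-v2"
  modelB: string;
  winner: "A" | "Tie" | "B";
}

const K = 32;

// Fold every recorded vote into a ratings table, starting everyone at 1500.
function computeRatings(votes: VoteRecord[], startingRating = 1500): Map<string, number> {
  const ratings = new Map<string, number>();

  for (const vote of votes) {
    const ratingA = ratings.get(vote.modelA) ?? startingRating;
    const ratingB = ratings.get(vote.modelB) ?? startingRating;

    // Same expected-score and update formulas as the Python version above.
    const expectedA = 1 / (1 + 10 ** ((ratingB - ratingA) / 400));
    const scoreA = vote.winner === "A" ? 1 : vote.winner === "B" ? 0 : 0.5;

    ratings.set(vote.modelA, ratingA + K * (scoreA - expectedA));
    ratings.set(vote.modelB, ratingB + K * ((1 - scoreA) - (1 - expectedA)));
  }

  return ratings;
}
```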
Human Evaluation Protocols
Bad evaluation protocols produce bad data. Set clear guidelines:
For Sales Companion Evaluators
Give evaluators a rubric, not just "which is better." The rubric makes their judgments consistent and lets you analyze *why* one model wins — maybe the fine-tuned model is more accurate but less complete.
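One possible shape for that rubric, expressed as data the app can store. The dimensions here are illustrative only, echoing the concerns already raised (accuracy, completeness, tone); swap in whatever your team actually cares about:

```typescript
// Illustrative rubric: 1-5 scores per dimension, per response.
interface RubricScores {
  accuracy: 1 | 2 | 3 | 4 | 5;     // Are the facts, numbers, and claims correct?
  completeness: 1 | 2 | 3 | 4 | 5; // Does it cover what the rep needs to act?
  tone: 1 | 2 | 3 | 4 | 5;         // Does it match the company voice for the situation?
}

interface RubricJudgment {
  comparisonId: string;
  evaluatorId: string;
  scoresA: RubricScores;
  scoresB: RubricScores;
  preference: "A" | "Tie" | "B";
  reason?: string; // the optional "Why?" free-text field
}
```

Scoring per dimension is what lets you say "the fine-tuned model wins on tone but loses on completeness" instead of just counting wins.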
Inter-annotator Agreement
Have at least 20% of comparisons evaluated by multiple people and measure agreement with Cohen's kappa, which corrects the raw agreement rate for agreement you'd expect by chance. Values above roughly 0.6 are commonly read as substantial agreement; much lower than that usually means the rubric or guidelines need tightening.
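Here is a sketch of computing kappa over the doubly-judged subset. The `Vote` type and the parallel-array layout are assumptions about how you store the votes (one entry per shared comparison, in the same order for both raters):

```typescript
type Vote = "A" | "Tie" | "B";

// Cohen's kappa for two raters judging the same comparisons:
//   kappa = (observedAgreement - chanceAgreement) / (1 - chanceAgreement)
function cohensKappa(rater1: Vote[], rater2: Vote[]): number {
  const n = rater1.length;
  const labels: Vote[] = ["A", "Tie", "B"];

  // Observed agreement: share of comparisons where both raters picked the same label.
  let agreements = 0;
  for (let i = 0; i < n; i++) {
    if (rater1[i] === rater2[i]) agreements++;
  }
  const observed = agreements / n;

  // Chance agreement: for each label, the probability both raters pick it independently.
  let chance = 0;
  for (const label of labels) {
    const p1 = rater1.filter((v) => v === label).length / n;
    const p2 = rater2.filter((v) => v === label).length / n;
    chance += p1 * p2;
  }

  // Degenerate case: both raters always used the same single label.
  if (chance === 1) return 1;
  return (observed - chance) / (1 - chance);
}
```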
Traffic Splitting: Gradual Rollout
You've validated the model offline. Now you need to validate it in production, where real reps use it with real stakes. Don't flip a switch — roll out gradually:
| Phase | Traffic to Fine-tuned | Duration | Gate to Next Phase |
|---|---|---|---|
| Canary | 5-10% | 1 week | No regressions in error rate or latency |
| Beta | 25-50% | 2 weeks | User satisfaction >= base model |
| GA | 100% | Ongoing | Continuous monitoring |
Implementation
```typescript
function getModelForRequest(userId: string): string {
  const rolloutPercentage = getFeatureFlag("finetuned_model_rollout"); // 0-100
  const userBucket = hashUserId(userId) % 100;
  if (userBucket < rolloutPercentage) {
    return "ft:gpt-4o-mini:acme:sales-v2:abc123";
  }
  return "gpt-4o-mini"; // base model
}
```

Log which model served each request so you can compare production metrics between the two groups.
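A sketch of the kind of per-request record that makes that comparison possible later; the field names and `logCompletion` are placeholders for whatever logging pipeline you already run:

```typescript
interface CompletionLogEntry {
  requestId: string;
  userId: string;
  model: string;            // which model actually served this request
  latencyMs: number;        // feeds the p50/p99 comparison below
  completionTokens: number; // feeds the tokens-per-response and cost comparison
  timestamp: string;
}

// Placeholder: send this wherever your existing request logs go.
function logCompletion(entry: CompletionLogEntry): void {
  console.log(JSON.stringify(entry));
}
```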
Latency and Cost Comparison
Fine-tuning can change both latency and cost. Measure them head-to-head:
| Metric | Base Model | Fine-tuned | Notes |
|---|---|---|---|
| Median latency (p50) | 1.2s | 1.3s | Fine-tuned is slightly slower (expected) |
| Tail latency (p99) | 3.8s | 4.1s | Watch for outliers |
| Tokens per response | 280 | 220 | Fine-tuned is more concise (good!) |
| Cost per query | $0.004 | $0.005 | 1.5x inference price, but fewer tokens |
A fine-tuned model that produces shorter, more targeted responses can actually cost *less* per query despite the higher per-token rate. Track total cost, not just unit price.
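To make that arithmetic concrete, here is a small sketch using the illustrative figures from the table above (not real price-list numbers). With a 1.5x price premium, the fine-tuned model breaks even once its responses shrink below roughly 1/1.5 of the base model's token count:

```typescript
// Illustrative only: unit price implied by the table above, not a provider price list.
const basePricePerToken = 0.004 / 280;
const finetunedPricePerToken = basePricePerToken * 1.5;

function costPerQuery(tokens: number, pricePerToken: number): number {
  return tokens * pricePerToken;
}

// Break-even point: token savings exactly cancel the 1.5x premium.
const breakEvenTokens = 280 / 1.5; // ~187 tokens

console.log(costPerQuery(220, finetunedPricePerToken).toFixed(4)); // ~0.0047, still above the base 0.0040
console.log(costPerQuery(187, finetunedPricePerToken).toFixed(4)); // ~0.0040, roughly break-even
```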
When NOT to Fine-tune
Fine-tuning is not always the answer. Save yourself weeks of work by checking these first:
Prompting Might Be Enough When...
RAG Might Be Enough When...
Fine-tuning Is the Right Call When...
This is chapter 5 of Fine-tuning for Enterprise AI.
Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.