
A/B Testing App

Side-by-Side Comparison

Beyond Automated Metrics

Your eval harness gives you numbers. But numbers don't capture everything — does the fine-tuned model *feel* better to use? Does it match the company voice? Is its tone right for a tense negotiation vs a friendly check-in?

This is where human evaluation comes in. You build a simple A/B testing app that lets real users (sales reps, managers, ops) compare outputs blind and tell you which one they prefer.

Blind Comparison UI

The key word is blind. If evaluators know which response came from the fine-tuned model, they'll unconsciously prefer it (you spent weeks on this — of course it's better, right?). Your comparison UI must:

  • Show the same prompt/query at the top
  • Display two responses side-by-side, labeled only Response A and Response B
  • Randomly assign which model is A vs B for each comparison (see the sketch below the interface)
  • Ask the evaluator: "Which response is better?" with options: A is better, B is better, Tie
    interface Comparison {
      id: string;
      prompt: string;
      responseA: { text: string; model: "base" | "finetuned" };
      responseB: { text: string; model: "base" | "finetuned" };
      // Randomized: sometimes base is A, sometimes B
    }

    function renderComparison(comparison: Comparison) {
      // Show prompt
      // Show Response A and Response B (no model labels!)
      // Buttons: "A is better" | "Tie" | "B is better"
      // Optional: "Why?" free-text field
    }
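
The randomization itself can be a per-comparison coin flip when the pair is assembled. Here is a minimal sketch under that assumption (buildComparison is a hypothetical helper, not part of any framework):

    function buildComparison(
      prompt: string,
      baseText: string,
      finetunedText: string
    ): Comparison {
      // Coin flip per comparison so neither model is always "Response A"
      const baseFirst = Math.random() < 0.5;
      const base = { text: baseText, model: "base" as const };
      const finetuned = { text: finetunedText, model: "finetuned" as const };
      return {
        id: crypto.randomUUID(), // any unique id works here
        prompt,
        responseA: baseFirst ? base : finetuned,
        responseB: baseFirst ? finetuned : base,
      };
    }

Keep the model-to-letter mapping server-side; the UI only ever sees Response A and Response B.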

    For the Sales Companion, run this with 5-10 reps evaluating 30-50 comparisons each. That gives you 150-500 data points — enough to draw real conclusions.

    ELO Rating System

    Instead of simple win percentages, use an ELO rating system (the same system used in chess rankings). ELO aggregates pairwise results into a single rating scale, so it handles transitive comparisons well: if Model A consistently beats Model B, and Model B consistently beats Model C, the ratings place A above B above C even if A and C never face each other directly.

    def update_elo(rating_a: float, rating_b: float, winner: str, k: int = 32) -> tuple:
        """Update ELO ratings after a comparison."""
        expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
        expected_b = 1 - expected_a
    
        if winner == "A":
            score_a, score_b = 1.0, 0.0
        elif winner == "B":
            score_a, score_b = 0.0, 1.0
        else:  # tie
            score_a, score_b = 0.5, 0.5
    
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b + k * (score_b - expected_b)
        return new_a, new_b

    Start both models at 1500 ELO. After 200+ comparisons, the ratings stabilize and give you a clear ranking. This becomes especially useful when you're comparing multiple fine-tuned versions against each other.
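
Once votes are logged, producing the ratings is just a replay of that same update across every recorded comparison, which also works when several model versions are in the pool. A minimal TypeScript sketch mirroring the Python function above (Vote and rateModels are illustrative names):

    interface Vote {
      modelA: string; // e.g. "base", "sales-v1", "sales-v2"
      modelB: string;
      winner: "A" | "B" | "tie";
    }

    function rateModels(votes: Vote[], k = 32): Map<string, number> {
      const ratings = new Map<string, number>();
      const get = (m: string) => ratings.get(m) ?? 1500; // every model starts at 1500

      for (const v of votes) {
        const ra = get(v.modelA);
        const rb = get(v.modelB);
        const expectedA = 1 / (1 + 10 ** ((rb - ra) / 400));
        const scoreA = v.winner === "A" ? 1 : v.winner === "B" ? 0 : 0.5;
        ratings.set(v.modelA, ra + k * (scoreA - expectedA));
        ratings.set(v.modelB, rb + k * ((1 - scoreA) - (1 - expectedA)));
      }
      return ratings;
    }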

    Human Evaluation Protocols

    Bad evaluation protocols produce bad data. Set clear guidelines:

    For Sales Companion Evaluators

  • Accuracy: Does the response contain correct information?
  • Helpfulness: Would this response actually help you in a sales conversation?
  • Tone: Does it sound like something your company would say?
  • Completeness: Does it cover everything the question asked for?
    Give evaluators a rubric, not just "which is better." The rubric makes their judgments consistent and lets you analyze *why* one model wins: maybe the fine-tuned model is more accurate but less complete.

    Inter-annotator Agreement

    Have at least 20% of comparisons evaluated by multiple people. Measure agreement with Cohen's kappa (a small computation sketch follows the thresholds below):

  • kappa > 0.8 — Strong agreement. Your rubric is clear.
  • kappa 0.6-0.8 — Moderate. Some ambiguous cases. Acceptable.
  • kappa < 0.6 — Weak. Your rubric needs work, or the task is genuinely subjective.
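
Cohen's kappa is straightforward to compute once each annotator's verdicts are aligned by comparison id. A minimal sketch (cohensKappa is an illustrative name):

    type Verdict = "A" | "B" | "tie";

    function cohensKappa(rater1: Verdict[], rater2: Verdict[]): number {
      const n = rater1.length;
      const categories: Verdict[] = ["A", "B", "tie"];

      // Observed agreement: fraction of comparisons where the raters match.
      let matches = 0;
      for (let i = 0; i < n; i++) {
        if (rater1[i] === rater2[i]) matches++;
      }
      const po = matches / n;

      // Chance agreement: sum over categories of the raters' marginal rates.
      let pe = 0;
      for (const c of categories) {
        const p1 = rater1.filter((v) => v === c).length / n;
        const p2 = rater2.filter((v) => v === c).length / n;
        pe += p1 * p2;
      }

      return (po - pe) / (1 - pe);
    }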
    Traffic Splitting: Gradual Rollout

    You've validated the model offline. Now you need to validate it in production, where real reps use it with real stakes. Don't flip a switch — roll out gradually:

    Phase     Traffic to Fine-tuned   Duration   Gate to Next Phase
    Canary    5-10%                   1 week     No regressions in error rate or latency
    Beta      25-50%                  2 weeks    User satisfaction >= base model
    GA        100%                    Ongoing    Continuous monitoring

    Implementation

    function getModelForRequest(userId: string): string {
      const rolloutPercentage = getFeatureFlag("finetuned_model_rollout"); // 0-100
      // Hash the user id into a stable 0-99 bucket so the same user always
      // sees the same model across requests.
      const userBucket = hashUserId(userId) % 100;

      if (userBucket < rolloutPercentage) {
        return "ft:gpt-4o-mini:acme:sales-v2:abc123"; // fine-tuned model
      }
      return "gpt-4o-mini"; // base model
    }

    Log which model served each request so you can compare production metrics between the two groups.
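
One way to wire that in is to wrap the inference call and emit one event per request. A sketch under that assumption; callModel and logEvent stand in for your inference client and analytics sink:

    type ModelResponse = { text: string; tokens: number };

    // Hypothetical stand-ins for your inference client and analytics sink.
    declare function callModel(model: string, prompt: string): Promise<ModelResponse>;
    declare function logEvent(name: string, fields: Record<string, unknown>): void;

    async function handleQuery(userId: string, prompt: string): Promise<string> {
      const model = getModelForRequest(userId); // from the rollout logic above
      const start = Date.now();
      const response = await callModel(model, prompt);
      logEvent("sales_companion_response", {
        userId,
        model,                         // base model name or fine-tuned model id
        latencyMs: Date.now() - start,
        outputTokens: response.tokens,
      });
      return response.text;
    }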

    Latency and Cost Comparison

    Fine-tuning can change both latency and cost. Measure them head-to-head:

    Metric                 Base Model   Fine-tuned   Notes
    Median latency (p50)   1.2s         1.3s         Fine-tuned is slightly slower (expected)
    Tail latency (p99)     3.8s         4.1s         Watch for outliers
    Tokens per response    280          220          Fine-tuned is more concise (good!)
    Cost per query         $0.004       $0.005       1.5x inference price, but fewer tokens

    A fine-tuned model that produces shorter, more targeted responses can actually cost *less* per query despite the higher per-token rate. Track total cost, not just unit price.
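
The arithmetic behind that claim is simple: cost per query is tokens times the per-token rate, so a higher rate can be offset by shorter output. A toy calculation with made-up prices and token counts (not real list prices):

    function costPerQuery(outputTokens: number, pricePerMillionTokens: number): number {
      return (outputTokens / 1_000_000) * pricePerMillionTokens;
    }

    // If the fine-tuned rate is 1.5x the base rate, the fine-tuned model breaks
    // even once it is roughly a third more concise.
    const baseCost = costPerQuery(280, 10);      // 280 tokens at an assumed $10 / 1M tokens
    const finetunedCost = costPerQuery(180, 15); // 180 tokens at an assumed $15 / 1M tokens
    console.log(finetunedCost < baseCost);       // true: shorter output wins here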

    When NOT to Fine-tune

    Fine-tuning is not always the answer. Save yourself weeks of work by checking these first:

    Prompting Might Be Enough When...

  • You need the model to follow a specific format (try a clear system prompt with examples first)
  • You have < 50 training examples (not enough data to fine-tune reliably)
  • The task changes frequently (retraining for every change is expensive)
    RAG Might Be Enough When...

  • The problem is *knowledge*, not *behavior* (the model says the right things, just doesn't know company-specific facts)
  • Your data changes daily (fine-tuned knowledge is frozen at training time)
  • You need citations and source attribution (RAG naturally provides these)
    Fine-tuning Is the Right Call When...

  • Prompting works but requires 2,000 tokens of instructions every call (fine-tuning bakes those instructions in, saving tokens and cost)
  • You need consistent output formatting that the base model struggles with despite detailed prompts
  • Domain-specific tone or terminology is critical and can't be captured in a system prompt
  • You have 200+ high-quality training examples and a clear evaluation metric
    This is chapter 5 of Fine-tuning for Enterprise AI.

    Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.

    View course details