
A/B Testing App

Side-by-Side Comparison

Beyond Automated Metrics

Your eval harness gives you numbers. But numbers don't capture everything — does the fine-tuned model *feel* better to use? Does it match the company voice? Is its tone right for a tense negotiation vs a friendly check-in?

This is where human evaluation comes in. You build a simple A/B testing app that lets real users (sales reps, managers, ops) compare outputs blind and tell you which one they prefer.

Blind Comparison UI

The key word is blind. If evaluators know which response came from the fine-tuned model, they'll unconsciously prefer it (you spent weeks on this — of course it's better, right?). Your comparison UI must:

  • Show the same prompt/query at the top
  • Display two responses side-by-side, labeled only Response A and Response B
  • Randomly assign which model is A vs B for each comparison (see the sketch below the interface)
  • Ask the evaluator: "Which response is better?" with options: A is better, B is better, Tie
    interface Comparison {
      id: string;
      prompt: string;
      responseA: { text: string; model: "base" | "finetuned" };
      responseB: { text: string; model: "base" | "finetuned" };
      // Randomized: sometimes base is A, sometimes B
    }

    function renderComparison(comparison: Comparison) {
      // Show prompt
      // Show Response A and Response B (no model labels!)
      // Buttons: "A is better" | "Tie" | "B is better"
      // Optional: "Why?" free-text field
    }
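
The randomization itself can be a per-comparison coin flip when the pair is assembled. Here is a minimal sketch under that assumption (buildComparison is a hypothetical helper, not part of any framework):

    function buildComparison(
      prompt: string,
      baseText: string,
      finetunedText: string
    ): Comparison {
      // Coin flip per comparison so neither model is always "Response A"
      const baseFirst = Math.random() < 0.5;
      const base = { text: baseText, model: "base" as const };
      const finetuned = { text: finetunedText, model: "finetuned" as const };
      return {
        id: crypto.randomUUID(), // any unique id works here
        prompt,
        responseA: baseFirst ? base : finetuned,
        responseB: baseFirst ? finetuned : base,
      };
    }

Keep the model-to-letter mapping server-side; the UI only ever sees Response A and Response B.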

    For the Sales Companion, run this with 5-10 reps evaluating 30-50 comparisons each. That gives you 150-500 data points — enough to draw real conclusions.

    ELO Rating System

    Instead of simple win percentages, use an ELO rating system (the same system used in chess rankings). ELO aggregates pairwise results into a single rating scale, so it handles transitive comparisons well: if Model A consistently beats Model B, and Model B consistently beats Model C, the ratings place A above B above C even if A and C never face each other directly.

    def update_elo(rating_a: float, rating_b: float, winner: str, k: int = 32) -> tuple:
        """Update ELO ratings after a comparison."""
        expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
        expected_b = 1 - expected_a
    
        if winner == "A":
            score_a, score_b = 1.0, 0.0
        elif winner == "B":
            score_a, score_b = 0.0, 1.0
        else:  # tie
            score_a, score_b = 0.5, 0.5
    
        new_a = rating_a + k * (score_a - expected_a)
        new_b = rating_b + k * (score_b - expected_b)
        return new_a, new_b

    Start both models at 1500 ELO. After 200+ comparisons, the ratings stabilize and give you a clear ranking. This becomes especially useful when you're comparing multiple fine-tuned versions against each other.
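
Once votes are logged, producing the ratings is just a replay of that same update across every recorded comparison, which also works when several model versions are in the pool. A minimal TypeScript sketch mirroring the Python function above (Vote and rateModels are illustrative names):

    interface Vote {
      modelA: string; // e.g. "base", "sales-v1", "sales-v2"
      modelB: string;
      winner: "A" | "B" | "tie";
    }

    function rateModels(votes: Vote[], k = 32): Map<string, number> {
      const ratings = new Map<string, number>();
      const get = (m: string) => ratings.get(m) ?? 1500; // every model starts at 1500

      for (const v of votes) {
        const ra = get(v.modelA);
        const rb = get(v.modelB);
        const expectedA = 1 / (1 + 10 ** ((rb - ra) / 400));
        const scoreA = v.winner === "A" ? 1 : v.winner === "B" ? 0 : 0.5;
        ratings.set(v.modelA, ra + k * (scoreA - expectedA));
        ratings.set(v.modelB, rb + k * ((1 - scoreA) - (1 - expectedA)));
      }
      return ratings;
    }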

    Human Evaluation Protocols

    Bad evaluation protocols produce bad data. Set clear guidelines:

    For Sales Companion Evaluators

  • Accuracy: Does the response contain correct information?
  • Helpfulness: Would this response actually help you in a sales conversation?
  • Tone: Does it sound like something your company would say?
  • Completeness: Does it cover everything the question asked for?
    Give evaluators a rubric, not just "which is better." The rubric makes their judgments consistent and lets you analyze *why* one model wins: maybe the fine-tuned model is more accurate but less complete.

    Inter-annotator Agreement

    Have at least 20% of comparisons evaluated by multiple people. Measure agreement with Cohen's kappa (a small computation sketch follows the thresholds below):

  • kappa > 0.8 — Strong agreement. Your rubric is clear.
  • kappa 0.6-0.8 — Moderate. Some ambiguous cases. Acceptable.
  • kappa < 0.6 — Weak. Your rubric needs work, or the task is genuinely subjective.
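
Cohen's kappa is straightforward to compute once each annotator's verdicts are aligned by comparison id. A minimal sketch (cohensKappa is an illustrative name):

    type Verdict = "A" | "B" | "tie";

    function cohensKappa(rater1: Verdict[], rater2: Verdict[]): number {
      const n = rater1.length;
      const categories: Verdict[] = ["A", "B", "tie"];

      // Observed agreement: fraction of comparisons where the raters match.
      let matches = 0;
      for (let i = 0; i < n; i++) {
        if (rater1[i] === rater2[i]) matches++;
      }
      const po = matches / n;

      // Chance agreement: sum over categories of the raters' marginal rates.
      let pe = 0;
      for (const c of categories) {
        const p1 = rater1.filter((v) => v === c).length / n;
        const p2 = rater2.filter((v) => v === c).length / n;
        pe += p1 * p2;
      }

      return (po - pe) / (1 - pe);
    }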
    Traffic Splitting: Gradual Rollout

    You've validated the model offline. Now you need to validate it in production, where real reps use it with real stakes. Don't flip a switch — roll out gradually:

    Phase     Traffic to Fine-tuned   Duration   Gate to Next Phase
    Canary    5-10%                   1 week     No regressions in error rate or latency
    Beta      25-50%                  2 weeks    User satisfaction >= base model
    GA        100%                    Ongoing    Continuous monitoring

    Implementation

    function getModelForRequest(userId: string): string {
      const rolloutPercentage = getFeatureFlag("finetuned_model_rollout"); // 0-100
      // Hash the user id into a stable 0-99 bucket so the same user always
      // sees the same model across requests.
      const userBucket = hashUserId(userId) % 100;

      if (userBucket < rolloutPercentage) {
        return "ft:gpt-4o-mini:acme:sales-v2:abc123"; // fine-tuned model
      }
      return "gpt-4o-mini"; // base model
    }

    Log which model served each request so you can compare production metrics between the two groups.
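
One way to wire that in is to wrap the inference call and emit one event per request. A sketch under that assumption; callModel and logEvent stand in for your inference client and analytics sink:

    type ModelResponse = { text: string; tokens: number };

    // Hypothetical stand-ins for your inference client and analytics sink.
    declare function callModel(model: string, prompt: string): Promise<ModelResponse>;
    declare function logEvent(name: string, fields: Record<string, unknown>): void;

    async function handleQuery(userId: string, prompt: string): Promise<string> {
      const model = getModelForRequest(userId); // from the rollout logic above
      const start = Date.now();
      const response = await callModel(model, prompt);
      logEvent("sales_companion_response", {
        userId,
        model,                         // base model name or fine-tuned model id
        latencyMs: Date.now() - start,
        outputTokens: response.tokens,
      });
      return response.text;
    }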

    Latency and Cost Comparison

    Fine-tuning can change both latency and cost. Measure them head-to-head:

    Metric                 Base Model   Fine-tuned   Notes
    Median latency (p50)   1.2s         1.3s         Fine-tuned is slightly slower (expected)
    Tail latency (p99)     3.8s         4.1s         Watch for outliers
    Tokens per response    280          220          Fine-tuned is more concise (good!)
    Cost per query         $0.004       $0.005       1.5x inference price, but fewer tokens

    A fine-tuned model that produces shorter, more targeted responses can actually cost *less* per query despite the higher per-token rate. Track total cost, not just unit price.
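
The arithmetic behind that claim is simple: cost per query is tokens times the per-token rate, so a higher rate can be offset by shorter output. A toy calculation with made-up prices and token counts (not real list prices):

    function costPerQuery(outputTokens: number, pricePerMillionTokens: number): number {
      return (outputTokens / 1_000_000) * pricePerMillionTokens;
    }

    // If the fine-tuned rate is 1.5x the base rate, the fine-tuned model breaks
    // even once it is roughly a third more concise.
    const baseCost = costPerQuery(280, 10);      // 280 tokens at an assumed $10 / 1M tokens
    const finetunedCost = costPerQuery(180, 15); // 180 tokens at an assumed $15 / 1M tokens
    console.log(finetunedCost < baseCost);       // true: shorter output wins here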

    When NOT to Fine-tune

    Fine-tuning is not always the answer. Save yourself weeks of work by checking these first:

    Prompting Might Be Enough When...

  • You need the model to follow a specific format (try a clear system prompt with examples first)
  • You have < 50 training examples (not enough data to fine-tune reliably)
  • The task changes frequently (retraining for every change is expensive)
    RAG Might Be Enough When...

  • The problem is *knowledge*, not *behavior* (the model says the right things, just doesn't know company-specific facts)
  • Your data changes daily (fine-tuned knowledge is frozen at training time)
  • You need citations and source attribution (RAG naturally provides these)
    Fine-tuning Is the Right Call When...

  • Prompting works but requires 2,000 tokens of instructions every call (fine-tuning bakes those instructions in, saving tokens and cost)
  • You need consistent output formatting that the base model struggles with despite detailed prompts
  • Domain-specific tone or terminology is critical and can't be captured in a system prompt
  • You have 200+ high-quality training examples and a clear evaluation metric
    This is chapter 5 of Fine-tuning for Enterprise AI.

    Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.

    View course details