
Evaluation Harness

Measuring What Matters

Build an Eval, Not a Vibe Check

"It looks better" is not a metric. After spending time and money on fine-tuning, you need to *prove* the model improved — to yourself, to your team, and to the stakeholders funding this work.

An evaluation harness is a repeatable pipeline that runs your model against a fixed test set and produces numerical scores. Every time you retrain, you run the same harness and compare. No guessing, no cherry-picking examples.

The Sales Companion Eval Pipeline

Test dataset (100+ examples)
        |
        v
Run base model ──────> Base predictions
Run fine-tuned model ──> FT predictions
        |
        v
Score both against gold labels
        |
        v
Compare: FT accuracy vs Base accuracy
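
In code, that pipeline is just a loop over a fixed test file. A minimal sketch, assuming hypothetical generate_base / generate_finetuned wrappers around your two models and a SCORERS dict mapping each task type to one of the metrics covered below:

import json

# `generate_base`, `generate_finetuned`, and `SCORERS` are placeholders for
# your own model clients and task-specific metric functions.
def run_harness(test_path: str) -> dict:
    """Score both models on the same fixed test set and return mean scores."""
    with open(test_path) as f:
        examples = [json.loads(line) for line in f]   # one JSON example per line

    results = {"base": [], "finetuned": []}
    for ex in examples:
        score_fn = SCORERS[ex["task_type"]]           # e.g. F1, ROUGE-L, field accuracy
        results["base"].append(score_fn(ex["gold"], generate_base(ex["prompt"])))
        results["finetuned"].append(score_fn(ex["gold"], generate_finetuned(ex["prompt"])))

    return {model: sum(scores) / len(scores) for model, scores in results.items()}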

Task-specific Metrics

Different tasks need different metrics. The Sales Companion handles multiple task types, so your eval harness needs multiple scorers:

Classification Tasks

Metric: Accuracy, F1 Score

For tasks like "classify this support ticket" or "identify the deal stage":

from sklearn.metrics import classification_report

# Gold labels from your test set
gold = ["billing", "technical", "billing", "feature_request", "technical"]
# Model predictions
predicted = ["billing", "technical", "billing", "billing", "technical"]

print(classification_report(gold, predicted))
#               precision  recall  f1-score
# billing          0.67    1.00     0.80
# feature_request  0.00    0.00     0.00
# technical        1.00    1.00     1.00

F1 is better than raw accuracy when classes are imbalanced — if 90% of tickets are "billing," a model that always guesses "billing" gets 90% accuracy but is useless.
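
A quick way to see that in numbers: score a degenerate always-"billing" model on a 90/10 split. Macro F1 averages the per-class F1 scores, so the ignored class drags it down.

from sklearn.metrics import accuracy_score, f1_score

gold = ["billing"] * 90 + ["technical"] * 10   # imbalanced test set
predicted = ["billing"] * 100                  # model that always guesses "billing"

print(accuracy_score(gold, predicted))                              # 0.90, looks great
print(f1_score(gold, predicted, average="macro", zero_division=0))  # ~0.47, reveals the problem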

Summarization Tasks

Metric: ROUGE, BERTScore

For call summaries, deal recaps, meeting notes:

  • ROUGE-L measures the longest common subsequence between the model's summary and the reference. Simple, fast, but misses semantic similarity.
  • BERTScore uses embeddings to measure semantic overlap. Catches cases where the model says the same thing with different words.

Computing ROUGE-L with the rouge_score package:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

reference = "Globex is at risk due to pricing pressure. Next step: ROI presentation."
candidate = "The Globex account faces pricing competition. Recommend presenting ROI data."

scores = scorer.score(reference, candidate)
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")  # ~0.27: low, even though the meaning matches

For the Sales Companion, BERTScore is usually more informative — reps don't need verbatim matches, they need semantically correct summaries.
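
For a rough sketch of the same comparison with BERTScore, one common implementation is the bert-score package (it downloads a model on first use, and exact scores vary with the model it loads):

from bert_score import score

references = ["Globex is at risk due to pricing pressure. Next step: ROI presentation."]
candidates = ["The Globex account faces pricing competition. Recommend presenting ROI data."]

# Precision/recall/F1 come back as tensors, one value per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")  # substantially higher than the ROUGE-L score above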

Structured Output Tasks

Metric: Exact Match, Field-level Accuracy

For battlecards, JSON outputs, structured deal summaries:

import json

def field_accuracy(gold_json: dict, predicted_json: dict) -> float:
    """What percentage of fields match exactly?"""
    fields = gold_json.keys()
    matches = sum(1 for f in fields if gold_json[f] == predicted_json.get(f))
    return matches / len(fields)

gold = {"deal_stage": "negotiation", "risk": "high", "next_step": "send proposal"}
pred = {"deal_stage": "negotiation", "risk": "medium", "next_step": "send proposal"}

print(f"Field accuracy: {field_accuracy(gold, pred):.0%}")  # 67%

Hallucination Detection

The most dangerous failure mode for a Sales Companion: confidently stating something that isn't in the source documents.

Claim Verification

Extract factual claims from the model's output, then check each claim against the source documents:

  • Extract claims: "Our enterprise plan costs $500/month" is a verifiable claim. "I'd recommend focusing on value" is not.
  • Search sources: Does any source document contain this pricing information?
  • Verdict: Supported, contradicted, or not found in sources.

One way to implement all three steps is a single judge-model call:

def check_hallucination(response: str, source_docs: list[str]) -> list[dict]:
    """Use a judge model to verify claims against sources."""
    prompt = f"""Given these source documents:
{chr(10).join(source_docs)}

And this model response:
{response}

List each factual claim in the response and whether it is:
- SUPPORTED: clearly stated in the sources
- CONTRADICTED: conflicts with the sources
- UNVERIFIABLE: not mentioned in the sources

Return as JSON array."""

    # Run through a strong judge model (e.g., GPT-4o or Claude).
    # `judge_model` and `parse_json` stand in for your own LLM client and
    # JSON-parsing helper.
    result = judge_model.complete(prompt)
    return parse_json(result)

    Target: < 5% hallucination rate. If your fine-tuned model hallucinates more than the base model, something went wrong in training — likely noisy or incorrect training data.

Building Eval Datasets

Your eval dataset is the most important asset in this entire process. It must be:

  • Human-labeled — Not generated by AI. Real humans (ideally domain experts) writing gold-standard outputs.
  • Representative — Covers all task types in proportion to real usage.
  • Fixed — Never changes between evaluations. If you improve the eval set, version it and re-run previous models against the new version.
  • Large enough — At least 50 examples per task type. 100+ is better.

Building It for the Sales Companion

  • Sample 200 real queries from production logs
  • Have 2 sales ops people independently write ideal responses
  • Reconcile disagreements (this surfaces ambiguous cases — good!)
  • Lock the dataset. Version it. Never train on it.
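
One lightweight way to enforce that last step, sketched here assuming the locked set lives in a file called eval_v1.jsonl: record a content hash when you lock it, and refuse to run if the file has drifted.

import hashlib

EXPECTED_SHA256 = "3fa9..."  # hypothetical hash recorded when eval_v1.jsonl was locked

def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the eval file; store it alongside every set of results."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

if dataset_fingerprint("eval_v1.jsonl") != EXPECTED_SHA256:
    raise RuntimeError("Eval set changed: bump the version instead of editing in place")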

Base vs Fine-tuned Comparison

Always compare against the base model, not against nothing. Structure your results like this:

Task Type                            Base Model   Fine-tuned   Delta
Classification (F1)                        0.72         0.89   +0.17
Summarization (BERTScore)                  0.81         0.86   +0.05
Structured output (field accuracy)         0.65         0.91   +0.26
Hallucination rate                         8.2%         3.1%   -5.1%

This table tells the full story: where fine-tuning helped, where it didn't, and whether the investment was worth it.

Statistical Significance

With 50 test examples, a 2% accuracy improvement could easily be noise. Use bootstrapped confidence intervals to check:

import numpy as np

def bootstrap_ci(scores, n_bootstrap=1000, ci=0.95):
    """Return a bootstrap confidence interval for the mean (95% by default)."""
    means = [np.mean(np.random.choice(scores, size=len(scores), replace=True))
             for _ in range(n_bootstrap)]
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return lower, upper

base_scores = [0.8, 0.7, 0.9, ...]    # per-example scores for the base model
ft_scores = [0.85, 0.82, 0.91, ...]   # per-example scores for the fine-tuned model

base_ci = bootstrap_ci(base_scores)
ft_ci = bootstrap_ci(ft_scores)

print(f"Base: {np.mean(base_scores):.3f} ({base_ci[0]:.3f}-{base_ci[1]:.3f})")
print(f"FT:   {np.mean(ft_scores):.3f} ({ft_ci[0]:.3f}-{ft_ci[1]:.3f})")

If the two confidence intervals overlap, don't claim a win: either you need more test data, or the fine-tuning didn't help enough on that task.
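
Because both models are scored on the same test examples, a slightly stronger check is to bootstrap the per-example differences directly (this assumes base_scores and ft_scores are aligned, one pair per test example); if that interval excludes zero, the gain is unlikely to be noise.

# Paired comparison: resample per-example deltas instead of the two means separately
deltas = [ft - base for ft, base in zip(ft_scores, base_scores)]
delta_ci = bootstrap_ci(deltas)

print(f"FT - Base: {np.mean(deltas):.3f} ({delta_ci[0]:.3f}-{delta_ci[1]:.3f})")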

This is chapter 4 of Fine-tuning for Enterprise AI.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
