
Evaluation Harness

Measuring What Matters

Build an Eval, Not a Vibe Check

"It looks better" is not a metric. After spending time and money on fine-tuning, you need to *prove* the model improved — to yourself, to your team, and to the stakeholders funding this work.

An evaluation harness is a repeatable pipeline that runs your model against a fixed test set and produces numerical scores. Every time you retrain, you run the same harness and compare. No guessing, no cherry-picking examples.

The Sales Companion Eval Pipeline

Test dataset (100+ examples)
        |
        v
Run base model ──────> Base predictions
Run fine-tuned model ──> FT predictions
        |
        v
Score both against gold labels
        |
        v
Compare: FT accuracy vs Base accuracy
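
In code, that pipeline is just a loop over a fixed test file. A minimal sketch, assuming hypothetical generate_base / generate_finetuned wrappers around your two models and a SCORERS dict mapping each task type to one of the metrics covered below:

import json

# `generate_base`, `generate_finetuned`, and `SCORERS` are placeholders for
# your own model clients and task-specific metric functions.
def run_harness(test_path: str) -> dict:
    """Score both models on the same fixed test set and return mean scores."""
    with open(test_path) as f:
        examples = [json.loads(line) for line in f]   # one JSON example per line

    results = {"base": [], "finetuned": []}
    for ex in examples:
        score_fn = SCORERS[ex["task_type"]]           # e.g. F1, ROUGE-L, field accuracy
        results["base"].append(score_fn(ex["gold"], generate_base(ex["prompt"])))
        results["finetuned"].append(score_fn(ex["gold"], generate_finetuned(ex["prompt"])))

    return {model: sum(scores) / len(scores) for model, scores in results.items()}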

Task-specific Metrics

Different tasks need different metrics. The Sales Companion handles multiple task types, so your eval harness needs multiple scorers:

Classification Tasks

Metric: Accuracy, F1 Score

For tasks like "classify this support ticket" or "identify the deal stage":

from sklearn.metrics import classification_report

# Gold labels from your test set
gold = ["billing", "technical", "billing", "feature_request", "technical"]
# Model predictions
predicted = ["billing", "technical", "billing", "billing", "technical"]

print(classification_report(gold, predicted))
#               precision  recall  f1-score
# billing          0.67    1.00     0.80
# feature_request  0.00    0.00     0.00
# technical        1.00    1.00     1.00

F1 is better than raw accuracy when classes are imbalanced — if 90% of tickets are "billing," a model that always guesses "billing" gets 90% accuracy but is useless.
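
A quick way to see that in numbers: score a degenerate always-"billing" model on a 90/10 split. Macro F1 averages the per-class F1 scores, so the ignored class drags it down.

from sklearn.metrics import accuracy_score, f1_score

gold = ["billing"] * 90 + ["technical"] * 10   # imbalanced test set
predicted = ["billing"] * 100                  # model that always guesses "billing"

print(accuracy_score(gold, predicted))                              # 0.90, looks great
print(f1_score(gold, predicted, average="macro", zero_division=0))  # ~0.47, reveals the problem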

Summarization Tasks

Metric: ROUGE, BERTScore

For call summaries, deal recaps, meeting notes:

  • ROUGE-L measures the longest common subsequence between the model's summary and the reference. Simple, fast, but misses semantic similarity.
  • BERTScore uses embeddings to measure semantic overlap. Catches cases where the model says the same thing with different words.

Computing ROUGE-L with the rouge_score package:

from rouge_score import rouge_scorer
scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

reference = "Globex is at risk due to pricing pressure. Next step: ROI presentation."
candidate = "The Globex account faces pricing competition. Recommend presenting ROI data."

scores = scorer.score(reference, candidate)
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")  # ~0.27: low, even though the meaning matches

For the Sales Companion, BERTScore is usually more informative — reps don't need verbatim matches, they need semantically correct summaries.
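
For a rough sketch of the same comparison with BERTScore, one common implementation is the bert-score package (it downloads a model on first use, and exact scores vary with the model it loads):

from bert_score import score

references = ["Globex is at risk due to pricing pressure. Next step: ROI presentation."]
candidates = ["The Globex account faces pricing competition. Recommend presenting ROI data."]

# Precision/recall/F1 come back as tensors, one value per candidate/reference pair
P, R, F1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {F1.mean().item():.3f}")  # substantially higher than the ROUGE-L score above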

Structured Output Tasks

Metric: Exact Match, Field-level Accuracy

For battlecards, JSON outputs, structured deal summaries:

import json

def field_accuracy(gold_json: dict, predicted_json: dict) -> float:
    """What percentage of fields match exactly?"""
    fields = gold_json.keys()
    matches = sum(1 for f in fields if gold_json[f] == predicted_json.get(f))
    return matches / len(fields)

gold = {"deal_stage": "negotiation", "risk": "high", "next_step": "send proposal"}
pred = {"deal_stage": "negotiation", "risk": "medium", "next_step": "send proposal"}

print(f"Field accuracy: {field_accuracy(gold, pred):.0%}")  # 67%

Hallucination Detection

The most dangerous failure mode for a Sales Companion: confidently stating something that isn't in the source documents.

Claim Verification

Extract factual claims from the model's output, then check each claim against the source documents:

  • Extract claims: "Our enterprise plan costs $500/month" is a verifiable claim. "I'd recommend focusing on value" is not.
  • Search sources: Does any source document contain this pricing information?
  • Verdict: Supported, contradicted, or not found in sources.

One way to implement all three steps is a single judge-model call:

def check_hallucination(response: str, source_docs: list[str]) -> list[dict]:
    """Use a judge model to verify claims against sources."""
    prompt = f"""Given these source documents:
{chr(10).join(source_docs)}

And this model response:
{response}

List each factual claim in the response and whether it is:
- SUPPORTED: clearly stated in the sources
- CONTRADICTED: conflicts with the sources
- UNVERIFIABLE: not mentioned in the sources

Return as JSON array."""

    # Run through a strong judge model (e.g., GPT-4o or Claude).
    # `judge_model` and `parse_json` stand in for your own LLM client and
    # JSON-parsing helper.
    result = judge_model.complete(prompt)
    return parse_json(result)

    Target: < 5% hallucination rate. If your fine-tuned model hallucinates more than the base model, something went wrong in training — likely noisy or incorrect training data.

Building Eval Datasets

Your eval dataset is the most important asset in this entire process. It must be:

  • Human-labeled — Not generated by AI. Real humans (ideally domain experts) writing gold-standard outputs.
  • Representative — Covers all task types in proportion to real usage.
  • Fixed — Never changes between evaluations. If you improve the eval set, version it and re-run previous models against the new version.
  • Large enough — At least 50 examples per task type. 100+ is better.

Building It for the Sales Companion

  • Sample 200 real queries from production logs
  • Have 2 sales ops people independently write ideal responses
  • Reconcile disagreements (this surfaces ambiguous cases — good!)
  • Lock the dataset. Version it. Never train on it.
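
One lightweight way to enforce that last step, sketched here assuming the locked set lives in a file called eval_v1.jsonl: record a content hash when you lock it, and refuse to run if the file has drifted.

import hashlib

EXPECTED_SHA256 = "3fa9..."  # hypothetical hash recorded when eval_v1.jsonl was locked

def dataset_fingerprint(path: str) -> str:
    """SHA-256 of the eval file; store it alongside every set of results."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

if dataset_fingerprint("eval_v1.jsonl") != EXPECTED_SHA256:
    raise RuntimeError("Eval set changed: bump the version instead of editing in place")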

Base vs Fine-tuned Comparison

Always compare against the base model, not against nothing. Structure your results like this:

Task Type                            Base Model   Fine-tuned   Delta
Classification (F1)                        0.72         0.89   +0.17
Summarization (BERTScore)                  0.81         0.86   +0.05
Structured output (field accuracy)         0.65         0.91   +0.26
Hallucination rate                         8.2%         3.1%   -5.1%

This table tells the full story: where fine-tuning helped, where it didn't, and whether the investment was worth it.

Statistical Significance

With 50 test examples, a 2% accuracy improvement could easily be noise. Use bootstrapped confidence intervals to check:

import numpy as np

def bootstrap_ci(scores, n_bootstrap=1000, ci=0.95):
    """Return a bootstrap confidence interval for the mean (95% by default)."""
    means = [np.mean(np.random.choice(scores, size=len(scores), replace=True))
             for _ in range(n_bootstrap)]
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return lower, upper

base_scores = [0.8, 0.7, 0.9, ...]    # per-example scores for the base model
ft_scores = [0.85, 0.82, 0.91, ...]   # per-example scores for the fine-tuned model

base_ci = bootstrap_ci(base_scores)
ft_ci = bootstrap_ci(ft_scores)

print(f"Base: {np.mean(base_scores):.3f} ({base_ci[0]:.3f}-{base_ci[1]:.3f})")
print(f"FT:   {np.mean(ft_scores):.3f} ({ft_ci[0]:.3f}-{ft_ci[1]:.3f})")

If the two confidence intervals overlap, don't claim a win: either you need more test data, or the fine-tuning didn't help enough on that task.
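
Because both models are scored on the same test examples, a slightly stronger check is to bootstrap the per-example differences directly (this assumes base_scores and ft_scores are aligned, one pair per test example); if that interval excludes zero, the gain is unlikely to be noise.

# Paired comparison: resample per-example deltas instead of the two means separately
deltas = [ft - base for ft, base in zip(ft_scores, base_scores)]
delta_ci = bootstrap_ci(deltas)

print(f"FT - Base: {np.mean(deltas):.3f} ({delta_ci[0]:.3f}-{delta_ci[1]:.3f})")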

This is chapter 4 of Fine-tuning for Enterprise AI.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
