Evaluation Harness
Measuring What Matters
Build an Eval, Not a Vibe Check
"It looks better" is not a metric. After spending time and money on fine-tuning, you need to *prove* the model improved — to yourself, to your team, and to the stakeholders funding this work.
An evaluation harness is a repeatable pipeline that runs your model against a fixed test set and produces numerical scores. Every time you retrain, you run the same harness and compare. No guessing, no cherry-picking examples.
The Sales Companion Eval Pipeline

```
Test dataset (100+ examples)
            |
            v
Run base model ────────> Base predictions
Run fine-tuned model ──> FT predictions
            |
            v
Score both against gold labels
            |
            v
Compare: FT accuracy vs Base accuracy
```
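In code, the harness is little more than a loop over a frozen test file. A minimal sketch; `generate`, `exact_match`, `base_model`, `ft_model`, and the `eval_set.jsonl` schema are all placeholders you would supply:

```python
import json

def run_eval(model, test_path: str, score_fn) -> float:
    """Score one model on a frozen test set; return the mean score."""
    scores = []
    with open(test_path) as f:
        for line in f:
            example = json.loads(line)  # e.g. {"prompt": ..., "gold": ...}
            prediction = generate(model, example["prompt"])  # your inference call
            scores.append(score_fn(example["gold"], prediction))
    return sum(scores) / len(scores)

# Same data, same scorer, both models -- the only fair comparison
base_score = run_eval(base_model, "eval_set.jsonl", score_fn=exact_match)
ft_score = run_eval(ft_model, "eval_set.jsonl", score_fn=exact_match)
print(f"Base: {base_score:.3f}  Fine-tuned: {ft_score:.3f}")
```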
Task-specific Metrics
Different tasks need different metrics. The Sales Companion handles multiple task types, so your eval harness needs multiple scorers:
Classification Tasks
Metric: Accuracy, F1 Score
For tasks like "classify this support ticket" or "identify the deal stage":
```python
from sklearn.metrics import classification_report

# Gold labels from your test set
gold = ["billing", "technical", "billing", "feature_request", "technical"]

# Model predictions
predicted = ["billing", "technical", "billing", "billing", "technical"]

print(classification_report(gold, predicted))
#                  precision    recall  f1-score
# billing               0.67      1.00      0.80
# feature_request       0.00      0.00      0.00
# technical             1.00      1.00      1.00
```

F1 is better than raw accuracy when classes are imbalanced: if 90% of tickets are "billing," a model that always guesses "billing" gets 90% accuracy but is useless.
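A quick illustration with synthetic numbers: a degenerate model that always predicts the majority class looks strong on accuracy and collapses under macro F1.

```python
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical 90/10 class split; the model always guesses "billing"
gold = ["billing"] * 90 + ["technical"] * 10
predicted = ["billing"] * 100

print(f"Accuracy: {accuracy_score(gold, predicted):.2f}")             # 0.90
print(f"Macro F1: {f1_score(gold, predicted, average='macro'):.2f}")  # ~0.47
```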
Summarization Tasks
Metric: ROUGE, BERTScore
For call summaries, deal recaps, meeting notes:
```python
from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)

reference = "Globex is at risk due to pricing pressure. Next step: ROI presentation."
candidate = "The Globex account faces pricing competition. Recommend presenting ROI data."

scores = scorer.score(reference, candidate)
print(f"ROUGE-L: {scores['rougeL'].fmeasure:.3f}")  # ~0.45
```

For the Sales Companion, BERTScore is usually more informative: reps don't need verbatim matches, they need semantically correct summaries.
Structured Output Tasks
Metric: Exact Match, Field-level Accuracy
For battlecards, JSON outputs, structured deal summaries:
```python
def field_accuracy(gold_json: dict, predicted_json: dict) -> float:
    """What percentage of fields match exactly?"""
    fields = gold_json.keys()
    matches = sum(1 for f in fields if gold_json[f] == predicted_json.get(f))
    return matches / len(fields)

gold = {"deal_stage": "negotiation", "risk": "high", "next_step": "send proposal"}
pred = {"deal_stage": "negotiation", "risk": "medium", "next_step": "send proposal"}

print(f"Field accuracy: {field_accuracy(gold, pred):.0%}")  # 67%
```
Hallucination Detection
The most dangerous failure mode for a Sales Companion: confidently stating something that isn't in the source documents.
Claim Verification
Extract factual claims from the model's output, then check each claim against the source documents:
```python
def check_hallucination(response: str, source_docs: list[str]) -> list[dict]:
    """Use a judge model to verify claims against sources."""
    prompt = f"""Given these source documents:
{chr(10).join(source_docs)}

And this model response:
{response}

List each factual claim in the response and whether it is:
- SUPPORTED: clearly stated in the sources
- CONTRADICTED: conflicts with the sources
- UNVERIFIABLE: not mentioned in the sources

Return as JSON array."""
    # Run through a strong judge model (e.g., GPT-4o or Claude).
    # judge_model and parse_json are placeholders for your API client
    # and a JSON-extraction helper.
    result = judge_model.complete(prompt)
    return parse_json(result)
```

Target: a hallucination rate below 5%. If your fine-tuned model hallucinates more than the base model, something went wrong in training, most likely noisy or incorrect training data.
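To compute the headline rate, aggregate per-claim verdicts across the whole test set. A minimal sketch, assuming each judge call returns a list of claim objects with a `verdict` field as prompted above; counting UNVERIFIABLE claims as hallucinations is a deliberately conservative choice:

```python
def hallucination_rate(responses: list[str], sources: list[list[str]]) -> float:
    """Fraction of all extracted claims not supported by the sources."""
    flagged = total = 0
    for response, docs in zip(responses, sources):
        claims = check_hallucination(response, docs)
        total += len(claims)
        # Conservative: anything not SUPPORTED counts against the model
        flagged += sum(1 for c in claims if c["verdict"] != "SUPPORTED")
    return flagged / total if total else 0.0
```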
Building Eval Datasets
Your eval dataset is the most important asset in this entire process. It must be:
- Representative of the prompts your users actually send
- Strictly held out from training data, with no leakage
- Labeled with carefully reviewed gold answers
- Frozen and versioned, so scores stay comparable across retrains
Building It for the Sales Companion
For the Sales Companion, that means sampling held-out examples from the same sources as your training data (real call transcripts, CRM notes, support tickets) across every task type the model handles, labeling them once with care, and never training on them.
Base vs Fine-tuned Comparison
Always compare against the base model, not against nothing. Structure your results like this:
| Task Type | Base Model | Fine-tuned | Delta |
|---|---|---|---|
| Classification (F1) | 0.72 | 0.89 | +0.17 |
| Summarization (BERTScore) | 0.81 | 0.86 | +0.05 |
| Structured output (field accuracy) | 0.65 | 0.91 | +0.26 |
| Hallucination rate | 8.2% | 3.1% | -5.1% |
This table tells the full story: where fine-tuning helped, where it didn't, and whether the investment was worth it.
Statistical Significance
With 50 test examples, a 2% accuracy improvement could easily be noise. Use bootstrapped confidence intervals to check:
```python
import numpy as np

def bootstrap_ci(scores, n_bootstrap=1000, ci=0.95):
    """Bootstrap a confidence interval for the mean (default 95%)."""
    means = [np.mean(np.random.choice(scores, size=len(scores), replace=True))
             for _ in range(n_bootstrap)]
    lower = np.percentile(means, (1 - ci) / 2 * 100)
    upper = np.percentile(means, (1 + ci) / 2 * 100)
    return lower, upper

base_scores = [0.8, 0.7, 0.9, ...]  # per-example scores
ft_scores = [0.85, 0.82, 0.91, ...]

base_ci = bootstrap_ci(base_scores)
ft_ci = bootstrap_ci(ft_scores)
print(f"Base: {np.mean(base_scores):.3f} ({base_ci[0]:.3f}-{base_ci[1]:.3f})")
print(f"FT:   {np.mean(ft_scores):.3f} ({ft_ci[0]:.3f}-{ft_ci[1]:.3f})")
```

If the confidence intervals overlap heavily, the improvement may just be noise: you either need more test data, or the fine-tuning didn't help enough on that task.