# Dataset Curation
*From Raw Data to Training Pairs*
## Why Fine-tune on Top of RAG?
Your Sales Companion already retrieves relevant documents and feeds them to the LLM. That works well for factual recall ("What's the pricing for Plan X?") but breaks down in three specific scenarios: when responses need a consistent voice and tone, when outputs must follow a strict structure (deal summaries, battlecards, call prep docs), and when a task requires behavior the base model won't reliably produce from retrieved context alone.
The rule of thumb: RAG handles *what* the model knows. Fine-tuning handles *how* the model communicates.
## Training Pair Anatomy
Every fine-tuning example is a pair:
| Component | What It Is | Sales Companion Example |
|---|---|---|
| Instruction | The input the model receives | "Summarize this call transcript for the deal review meeting" |
| Ideal completion | The output you want the model to produce | A structured summary with key objections, next steps, and sentiment |
The instruction includes system context, user query, and any retrieved documents. The completion is the gold-standard response you want the model to learn.
{"messages": [
{"role": "system", "content": "You are a sales assistant for Acme Corp..."},
{"role": "user", "content": "Prep me for the Globex renewal call tomorrow."},
{"role": "assistant", "content": "## Call Prep: Globex Renewal\n\n**Account health:** At risk..."}
]}Mining Training Data from Enterprise Sources
You already have training data — it's hiding in your existing systems. Here's where to look:
### CRM Notes to Instruction Pairs
Sales reps write deal notes after every call. Pair the raw transcript with the rep's polished summary:
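A pair assembled this way might look like the following; the account name and content are illustrative placeholders, not real data:

```json
{"messages": [
  {"role": "user", "content": "Summarize this call transcript for the deal record:\n\n[transcript: 40-minute renewal call with Initech...]"},
  {"role": "assistant", "content": "## Deal Notes: Initech Renewal\n\n**Key objections:** 12% price increase over last term.\n**Next steps:** Send revised quote by Friday; loop in legal on SLA language.\n**Sentiment:** Cautiously positive."}
]}
```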
### Support Tickets to Classification Pairs
Support tickets have categories, priorities, and resolutions assigned by humans:
{"category": "billing", "urgency": "high", "suggested_action": "escalate to AM"}Transcripts to Summarization Pairs
Call recordings get transcribed. The meeting recap email that followed is your completion:
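For example, a transcript excerpt paired with the recap email that went out afterward (all content illustrative):

```json
{"messages": [
  {"role": "user", "content": "Write a recap email for this call:\n\n[transcript: discovery call covering security review timeline and pilot scope...]"},
  {"role": "assistant", "content": "Subject: Recap + next steps\n\nHi team,\n\nThanks for the time today. We agreed to a two-week pilot starting on the 1st, pending your security review. I'll send the questionnaire responses by Thursday.\n\nBest,\nJordan"}
]}
```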
### Product Docs to Q&A Pairs
Take your FAQ docs and product specs, generate natural questions a rep might ask, and pair each question with the expert answer:
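For instance, with an illustrative question and answer:

```json
{"messages": [
  {"role": "user", "content": "Does Plan X support SSO, and is it included or an add-on?"},
  {"role": "assistant", "content": "Plan X includes SAML-based SSO at no extra cost. SCIM provisioning is available as an add-on on annual contracts."}
]}
```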
## Quality Scoring
Not all training pairs are created equal. Score each one on four dimensions:
| Dimension | What to Check | Red Flag |
|---|---|---|
| Relevance | Does this reflect a real task the Sales Companion handles? | Generic examples that don't match actual usage |
| Diversity | Does your dataset cover the full range of tasks? | 80% of pairs are the same task type |
| Difficulty | Mix of easy, medium, and hard examples? | All trivial "lookup" questions, no reasoning |
| Correctness | Is the completion actually right? | Outdated pricing, wrong product names |
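One way to put these dimensions to work is a per-pair scorecard that gates each example before it enters the training set. A minimal sketch, where the field values come from your own review process (all names here are illustrative):

```python
from dataclasses import dataclass

@dataclass
class PairScore:
    relevance: bool     # reflects a real Sales Companion task
    task_type: str      # used later to check diversity across the dataset
    difficulty: str     # "easy" | "medium" | "hard"
    correct: bool       # completion verified against current facts

def keep(score: PairScore) -> bool:
    # Relevance and correctness are hard gates per pair; diversity and
    # difficulty are balanced at the dataset level, not per example.
    return score.relevance and score.correct
```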
Aim for at least 200 high-quality pairs to see meaningful improvement. 500-1000 is the sweet spot for most enterprise fine-tuning jobs.
## Common Pitfalls
### Data Leakage
If your training data contains the same examples as your eval set, your metrics will look incredible and mean nothing. Always split data *before* any augmentation or cleaning.
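A minimal sketch of the safe ordering, assuming `pairs` is a list of raw, un-augmented examples (function names are illustrative):

```python
import random

def split_raw(pairs, eval_frac=0.1, seed=42):
    """Split into train/eval BEFORE augmentation or cleaning, so that
    augmented variants of an eval example can never reach training."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * eval_frac)
    return shuffled[cut:], shuffled[:cut]  # (train, eval)

# Correct order: split first, then augment only the training side.
# train, eval_set = split_raw(raw_pairs)
# train = augment(train)   # augment() is a hypothetical cleaning/augmentation step
```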
### Label Noise
CRM notes written by tired reps at 6 PM on Friday are not gold-standard completions. Have a second person review at least a random 10% sample. If more than 15% of your labels are wrong, fix the labeling process before training.
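Drawing that review sample is straightforward; a sketch, with the reviewer's judgment left as a placeholder:

```python
import random

def review_sample(pairs, frac=0.10, seed=0):
    """Random sample of pairs for second-person review."""
    k = max(1, int(len(pairs) * frac))
    return random.Random(seed).sample(pairs, k)

# If reviewers mark more than 15% of the sample as wrong, fix the
# labeling process before training. is_mislabeled() is hypothetical:
# error_rate = sum(map(is_mislabeled, sample)) / len(sample)
```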
### Class Imbalance
If 90% of your training pairs are "summarize this call" and 5% are "generate a battlecard," the model will be great at summaries and terrible at battlecards. Oversample rare task types or cap common ones.
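A sketch of both fixes in one pass, assuming each pair dict carries a `task_type` field (an assumed piece of metadata, not part of the chat format shown earlier):

```python
import random
from collections import defaultdict

def rebalance(pairs, cap=200, floor=50, seed=0):
    """Cap overrepresented task types and oversample rare ones."""
    rng = random.Random(seed)
    by_type = defaultdict(list)
    for p in pairs:
        by_type[p["task_type"]].append(p)  # 'task_type' is assumed metadata
    out = []
    for items in by_type.values():
        if len(items) > cap:
            out.extend(rng.sample(items, cap))   # cap common task types
        else:
            out.extend(items)
            if len(items) < floor:               # oversample rare task types
                out.extend(rng.choices(items, k=floor - len(items)))
    rng.shuffle(out)
    return out
```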
### Stale Data
Training on last year's pricing, org chart, or product features teaches the model to be confidently wrong. Timestamp your training pairs and exclude anything older than your freshness threshold.
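A sketch of the freshness filter, assuming each pair carries a timezone-aware ISO-8601 `created_at` timestamp (an assumed field):

```python
from datetime import datetime, timedelta, timezone

def drop_stale(pairs, max_age_days=365):
    """Exclude pairs older than the freshness threshold."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=max_age_days)
    return [
        p for p in pairs
        # expects offsets like '2025-06-01T09:30:00+00:00'
        if datetime.fromisoformat(p["created_at"]) >= cutoff
    ]
```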
This is chapter 1 of Fine-tuning for Enterprise AI.