
Dataset Curation

From Raw Data to Training Pairs

Why Fine-tune on Top of RAG?

Your Sales Companion already retrieves relevant documents and feeds them to the LLM. That works well for factual recall — "What's the pricing for Plan X?" — but breaks down in three specific scenarios:

  • Domain terminology — Your company calls it "ARR uplift," not "revenue increase." RAG retrieves the right doc, but the model still phrases its answer like a Wikipedia article instead of a sales deck.
  • Company tone and voice — Every org has a way of communicating. Formal? Challenger-sale aggressive? Consultative? Prompting can nudge this, but fine-tuning *bakes it in*.
  • Structured output formats — When reps need a competitive battlecard, a call prep brief, or a deal summary in a specific JSON schema, fine-tuning teaches the model the exact output shape so it nails it every time without elaborate prompt engineering.
The rule of thumb: RAG handles *what* the model knows. Fine-tuning handles *how* the model communicates.

    Training Pair Anatomy

    Every fine-tuning example is a pair:

    | Component | What It Is | Sales Companion Example |
    | --- | --- | --- |
    | Instruction | The input the model receives | "Summarize this call transcript for the deal review meeting" |
    | Ideal completion | The output you want the model to produce | A structured summary with key objections, next steps, and sentiment |

    The instruction includes system context, user query, and any retrieved documents. The completion is the gold-standard response you want the model to learn.

    {"messages": [
      {"role": "system", "content": "You are a sales assistant for Acme Corp..."},
      {"role": "user", "content": "Prep me for the Globex renewal call tomorrow."},
      {"role": "assistant", "content": "## Call Prep: Globex Renewal\n\n**Account health:** At risk..."}
    ]}

    Mining Training Data from Enterprise Sources

    You already have training data — it's hiding in your existing systems. Here's where to look:

    CRM Notes to Instruction Pairs

    Sales reps write deal notes after every call. Pair the raw transcript with the rep's polished summary (see the sketch after this list):

  • Instruction: "Summarize this call transcript, highlighting objections and next steps."
  • Completion: The rep's actual CRM note (cleaned up)
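
    A minimal conversion sketch, assuming a CRM export where each record carries a `transcript` and a `rep_summary` field (the field names are illustrative, not a CRM standard):

    import json

    # Hypothetical CRM export rows; field names are illustrative.
    crm_records = [
        {"transcript": "Rep: Thanks for joining today...",
         "rep_summary": "**Objections:** budget timing. **Next steps:** send ROI deck."},
    ]

    INSTRUCTION = "Summarize this call transcript, highlighting objections and next steps."

    with open("crm_pairs.jsonl", "w") as f:
        for record in crm_records:
            pair = {"messages": [
                {"role": "system", "content": "You are a sales assistant for Acme Corp."},
                {"role": "user", "content": f"{INSTRUCTION}\n\n{record['transcript']}"},
                # The rep's cleaned-up note is the gold-standard completion.
                {"role": "assistant", "content": record["rep_summary"]},
            ]}
            f.write(json.dumps(pair) + "\n")
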
    Support Tickets to Classification Pairs

    Support tickets have categories, priorities, and resolutions assigned by humans (see the serialization sketch after this list):

  • Instruction: "Classify this customer inquiry by category and urgency."
  • Completion: {"category": "billing", "urgency": "high", "suggested_action": "escalate to AM"}
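
    When the completion is structured, serialize it with `json.dumps` rather than hand-writing the string, so every pair in the dataset is guaranteed-valid JSON (the ticket fields here are illustrative):

    import json

    # Illustrative ticket record exported from your support system.
    ticket = {"text": "We were double-billed this month...", "category": "billing", "priority": "high"}

    completion = json.dumps({
        "category": ticket["category"],
        "urgency": ticket["priority"],
        "suggested_action": "escalate to AM",
    })
    # `completion` becomes the assistant message, exactly as in the CRM sketch above.
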
    Transcripts to Summarization Pairs

    Call recordings get transcribed. The meeting recap email that followed is your completion:

  • Instruction: "Write a meeting recap from this transcript."
  • Completion: The actual recap email the rep sent
    Product Docs to Q&A Pairs

    Take your FAQ docs and product specs, generate the natural questions a rep might ask, and pair each with the expert answer (a generation sketch follows the list):

  • Instruction: "How does our enterprise SSO integration work?"
  • Completion: The answer from your product team, in your company's voice
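
    One way to produce those questions is to ask an LLM itself. A sketch using the official `openai` Python client; the model name is an assumption, and the paired completion should still come from your product team, not from the generator:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def questions_for_chunk(doc_chunk: str, n: int = 3) -> list[str]:
        """Ask a chat model for natural rep questions this doc chunk answers."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model works here
            messages=[
                {"role": "system", "content": "You write questions a sales rep would ask."},
                {"role": "user", "content": f"Write {n} questions, one per line, that this document answers:\n\n{doc_chunk}"},
            ],
        )
        lines = resp.choices[0].message.content.splitlines()
        return [line.lstrip("-*0123456789. ").strip() for line in lines if line.strip()]
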
    Quality Scoring

    Not all training pairs are created equal. Score each one on four dimensions:

    | Dimension | What to Check | Red Flag |
    | --- | --- | --- |
    | Relevance | Does this reflect a real task the Sales Companion handles? | Generic examples that don't match actual usage |
    | Diversity | Does your dataset cover the full range of tasks? | 80% of pairs are the same task type |
    | Difficulty | Mix of easy, medium, and hard examples? | All trivial "lookup" questions, no reasoning |
    | Correctness | Is the completion actually right? | Outdated pricing, wrong product names |

    Aim for at least 200 high-quality pairs to see meaningful improvement; 500–1,000 is the sweet spot for most enterprise fine-tuning jobs.
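
    A small audit sketch, assuming each pair carries reviewer `scores` (0–2 per dimension) and a `task_type` tag; both fields are assumptions, not a standard schema:

    from collections import Counter

    def audit(pairs: list[dict], min_score: int = 6) -> list[dict]:
        """Drop low-scoring pairs, then report the task-type mix of what's left."""
        kept = [p for p in pairs if sum(p["scores"].values()) >= min_score]
        if not kept:
            return kept
        mix = Counter(p["task_type"] for p in kept)
        for task, n in mix.most_common():
            share = n / len(kept)
            flag = "  <-- over-represented" if share > 0.5 else ""
            print(f"{task}: {n} pairs ({share:.0%}){flag}")
        return kept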

    Common Pitfalls

    Data Leakage

    If your training data contains the same examples as your eval set, your metrics will look incredible and mean nothing. Always split data *before* any augmentation or cleaning.
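
    A leakage-safe split sketch: key the assignment on the *source* record ID, so every augmented or cleaned variant of a record lands on the same side of the boundary (the function name and percentage are illustrative):

    import hashlib

    def assign_split(source_id: str, eval_pct: int = 10) -> str:
        """Deterministic split keyed on the source record, not the derived pair."""
        digest = hashlib.sha256(source_id.encode()).hexdigest()
        return "eval" if int(digest, 16) % 100 < eval_pct else "train"

    # All pairs derived from the same source record hash to the same bucket.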

    Label Noise

    CRM notes written by tired reps at 6 PM on Friday are not gold-standard completions. Have a second person review at least a random 10% sample. If more than 15% of your labels are wrong, fix the labeling process before training.
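
    A reproducible way to draw that review sample (the fractions mirror the 10% and 15% guidelines above):

    import random

    def review_sample(pairs: list[dict], fraction: float = 0.10, seed: int = 0) -> list[dict]:
        """Draw a fixed-seed random sample for a second reviewer."""
        rng = random.Random(seed)
        return rng.sample(pairs, max(1, int(len(pairs) * fraction)))

    # If the reviewer flags more than 15% of the sampled labels as wrong,
    # fix the labeling process before training.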

    Class Imbalance

    If 90% of your training pairs are "summarize this call" and 5% are "generate a battlecard," the model will be great at summaries and terrible at battlecards. Oversample rare task types or cap common ones.
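
    One way to rebalance, sketched with illustrative `cap` and `floor` counts and the same assumed `task_type` tag:

    import random
    from collections import defaultdict

    def rebalance(pairs: list[dict], cap: int = 300, floor: int = 50, seed: int = 0) -> list[dict]:
        """Cap over-represented task types; oversample rare ones with replacement."""
        rng = random.Random(seed)
        by_task: dict[str, list[dict]] = defaultdict(list)
        for p in pairs:
            by_task[p["task_type"]].append(p)
        out: list[dict] = []
        for group in by_task.values():
            if len(group) > cap:
                out.extend(rng.sample(group, cap))  # cap the common type
            elif len(group) < floor:
                out.extend(group + rng.choices(group, k=floor - len(group)))  # oversample the rare type
            else:
                out.extend(group)
        return out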

    Stale Data

    Training on last year's pricing, org chart, or product features teaches the model to be confidently wrong. Timestamp your training pairs and exclude anything older than your freshness threshold.
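
    A freshness filter sketch, assuming each pair stores an ISO-8601 `source_timestamp` with a timezone; the field name and the one-year threshold are both assumptions:

    from datetime import datetime, timedelta, timezone

    FRESHNESS = timedelta(days=365)  # tune the threshold to your domain

    def is_fresh(pair: dict) -> bool:
        """Keep only pairs whose source material beats the freshness threshold."""
        # Assumes timestamps like "2025-04-01T09:30:00+00:00".
        created = datetime.fromisoformat(pair["source_timestamp"])
        return datetime.now(timezone.utc) - created < FRESHNESS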

    This is chapter 1 of *Fine-tuning for Enterprise AI*.
