
Dataset Curation

From Raw Data to Training Pairs

Why Fine-tune on Top of RAG?

Your Sales Companion already retrieves relevant documents and feeds them to the LLM. That works well for factual recall — "What's the pricing for Plan X?" — but breaks down in three specific scenarios:

  • Domain terminology — Your company calls it "ARR uplift," not "revenue increase." RAG retrieves the right doc, but the model still phrases its answer like a Wikipedia article instead of a sales deck.
  • Company tone and voice — Every org has a way of communicating. Formal? Challenger-sale aggressive? Consultative? Prompting can nudge this, but fine-tuning *bakes it in*.
  • Structured output formats — When reps need a competitive battlecard, a call prep brief, or a deal summary in a specific JSON schema, fine-tuning teaches the model the exact output shape so it nails it every time without elaborate prompt engineering.
The rule of thumb: RAG handles *what* the model knows. Fine-tuning handles *how* the model communicates.

    Training Pair Anatomy

    Every fine-tuning example is a pair:

    | Component | What It Is | Sales Companion Example |
    | --- | --- | --- |
    | Instruction | The input the model receives | "Summarize this call transcript for the deal review meeting" |
    | Ideal completion | The output you want the model to produce | A structured summary with key objections, next steps, and sentiment |

    The instruction includes system context, user query, and any retrieved documents. The completion is the gold-standard response you want the model to learn.

    {"messages": [
      {"role": "system", "content": "You are a sales assistant for Acme Corp..."},
      {"role": "user", "content": "Prep me for the Globex renewal call tomorrow."},
      {"role": "assistant", "content": "## Call Prep: Globex Renewal\n\n**Account health:** At risk..."}
    ]}

    Mining Training Data from Enterprise Sources

    You already have training data — it's hiding in your existing systems. Here's where to look:

    CRM Notes to Instruction Pairs

    Sales reps write deal notes after every call. Pair the raw transcript with the rep's polished summary (see the sketch after this list):

  • Instruction: "Summarize this call transcript, highlighting objections and next steps."
  • Completion: The rep's actual CRM note (cleaned up)
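
    A minimal conversion sketch, assuming a CRM export where each record carries a `transcript` and a `rep_summary` field (the field names are illustrative, not a CRM standard):

    import json

    # Hypothetical CRM export rows; field names are illustrative.
    crm_records = [
        {"transcript": "Rep: Thanks for joining today...",
         "rep_summary": "**Objections:** budget timing. **Next steps:** send ROI deck."},
    ]

    INSTRUCTION = "Summarize this call transcript, highlighting objections and next steps."

    with open("crm_pairs.jsonl", "w") as f:
        for record in crm_records:
            pair = {"messages": [
                {"role": "system", "content": "You are a sales assistant for Acme Corp."},
                {"role": "user", "content": f"{INSTRUCTION}\n\n{record['transcript']}"},
                # The rep's cleaned-up note is the gold-standard completion.
                {"role": "assistant", "content": record["rep_summary"]},
            ]}
            f.write(json.dumps(pair) + "\n")
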
    Support Tickets to Classification Pairs

    Support tickets have categories, priorities, and resolutions assigned by humans (see the serialization sketch after this list):

  • Instruction: "Classify this customer inquiry by category and urgency."
  • Completion: {"category": "billing", "urgency": "high", "suggested_action": "escalate to AM"}
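
    When the completion is structured, serialize it with `json.dumps` rather than hand-writing the string, so every pair in the dataset is guaranteed-valid JSON (the ticket fields here are illustrative):

    import json

    # Illustrative ticket record exported from your support system.
    ticket = {"text": "We were double-billed this month...", "category": "billing", "priority": "high"}

    completion = json.dumps({
        "category": ticket["category"],
        "urgency": ticket["priority"],
        "suggested_action": "escalate to AM",
    })
    # `completion` becomes the assistant message, exactly as in the CRM sketch above.
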
    Transcripts to Summarization Pairs

    Call recordings get transcribed. The meeting recap email that followed is your completion:

  • Instruction: "Write a meeting recap from this transcript."
  • Completion: The actual recap email the rep sent
    Product Docs to Q&A Pairs

    Take your FAQ docs and product specs, generate the natural questions a rep might ask, and pair each with the expert answer (a generation sketch follows the list):

  • Instruction: "How does our enterprise SSO integration work?"
  • Completion: The answer from your product team, in your company's voice
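
    One way to produce those questions is to ask an LLM itself. A sketch using the official `openai` Python client; the model name is an assumption, and the paired completion should still come from your product team, not from the generator:

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def questions_for_chunk(doc_chunk: str, n: int = 3) -> list[str]:
        """Ask a chat model for natural rep questions this doc chunk answers."""
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # assumption: any capable chat model works here
            messages=[
                {"role": "system", "content": "You write questions a sales rep would ask."},
                {"role": "user", "content": f"Write {n} questions, one per line, that this document answers:\n\n{doc_chunk}"},
            ],
        )
        lines = resp.choices[0].message.content.splitlines()
        return [line.lstrip("-*0123456789. ").strip() for line in lines if line.strip()]
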
    Quality Scoring

    Not all training pairs are created equal. Score each one on four dimensions:

    | Dimension | What to Check | Red Flag |
    | --- | --- | --- |
    | Relevance | Does this reflect a real task the Sales Companion handles? | Generic examples that don't match actual usage |
    | Diversity | Does your dataset cover the full range of tasks? | 80% of pairs are the same task type |
    | Difficulty | Mix of easy, medium, and hard examples? | All trivial "lookup" questions, no reasoning |
    | Correctness | Is the completion actually right? | Outdated pricing, wrong product names |

    Aim for at least 200 high-quality pairs to see meaningful improvement; 500–1,000 is the sweet spot for most enterprise fine-tuning jobs.
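
    A small audit sketch, assuming each pair carries reviewer `scores` (0–2 per dimension) and a `task_type` tag; both fields are assumptions, not a standard schema:

    from collections import Counter

    def audit(pairs: list[dict], min_score: int = 6) -> list[dict]:
        """Drop low-scoring pairs, then report the task-type mix of what's left."""
        kept = [p for p in pairs if sum(p["scores"].values()) >= min_score]
        if not kept:
            return kept
        mix = Counter(p["task_type"] for p in kept)
        for task, n in mix.most_common():
            share = n / len(kept)
            flag = "  <-- over-represented" if share > 0.5 else ""
            print(f"{task}: {n} pairs ({share:.0%}){flag}")
        return kept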

    Common Pitfalls

    Data Leakage

    If your training data contains the same examples as your eval set, your metrics will look incredible and mean nothing. Always split data *before* any augmentation or cleaning.
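
    A leakage-safe split sketch: key the assignment on the *source* record ID, so every augmented or cleaned variant of a record lands on the same side of the boundary (the function name and percentage are illustrative):

    import hashlib

    def assign_split(source_id: str, eval_pct: int = 10) -> str:
        """Deterministic split keyed on the source record, not the derived pair."""
        digest = hashlib.sha256(source_id.encode()).hexdigest()
        return "eval" if int(digest, 16) % 100 < eval_pct else "train"

    # All pairs derived from the same source record hash to the same bucket.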

    Label Noise

    CRM notes written by tired reps at 6 PM on Friday are not gold-standard completions. Have a second person review at least a random 10% sample. If more than 15% of your labels are wrong, fix the labeling process before training.
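
    A reproducible way to draw that review sample (the fractions mirror the 10% and 15% guidelines above):

    import random

    def review_sample(pairs: list[dict], fraction: float = 0.10, seed: int = 0) -> list[dict]:
        """Draw a fixed-seed random sample for a second reviewer."""
        rng = random.Random(seed)
        return rng.sample(pairs, max(1, int(len(pairs) * fraction)))

    # If the reviewer flags more than 15% of the sampled labels as wrong,
    # fix the labeling process before training.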

    Class Imbalance

    If 90% of your training pairs are "summarize this call" and 5% are "generate a battlecard," the model will be great at summaries and terrible at battlecards. Oversample rare task types or cap common ones.
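
    One way to rebalance, sketched with illustrative `cap` and `floor` counts and the same assumed `task_type` tag:

    import random
    from collections import defaultdict

    def rebalance(pairs: list[dict], cap: int = 300, floor: int = 50, seed: int = 0) -> list[dict]:
        """Cap over-represented task types; oversample rare ones with replacement."""
        rng = random.Random(seed)
        by_task: dict[str, list[dict]] = defaultdict(list)
        for p in pairs:
            by_task[p["task_type"]].append(p)
        out: list[dict] = []
        for group in by_task.values():
            if len(group) > cap:
                out.extend(rng.sample(group, cap))  # cap the common type
            elif len(group) < floor:
                out.extend(group + rng.choices(group, k=floor - len(group)))  # oversample the rare type
            else:
                out.extend(group)
        return out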

    Stale Data

    Training on last year's pricing, org chart, or product features teaches the model to be confidently wrong. Timestamp your training pairs and exclude anything older than your freshness threshold.
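
    A freshness filter sketch, assuming each pair stores an ISO-8601 `source_timestamp` with a timezone; the field name and the one-year threshold are both assumptions:

    from datetime import datetime, timedelta, timezone

    FRESHNESS = timedelta(days=365)  # tune the threshold to your domain

    def is_fresh(pair: dict) -> bool:
        """Keep only pairs whose source material beats the freshness threshold."""
        # Assumes timestamps like "2025-04-01T09:30:00+00:00".
        created = datetime.fromisoformat(pair["source_timestamp"])
        return datetime.now(timezone.utc) - created < FRESHNESS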

    This is chapter 1 of *Fine-tuning for Enterprise AI*.
