
Data Preparation

Format, Split, Budget

JSONL Format for Fine-tuning

Every major fine-tuning provider expects JSONL — one JSON object per line, each representing a single training example. The format is nearly identical across OpenAI, Anthropic, and HuggingFace:

{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "What's our win rate against Globex this quarter?"}, {"role": "assistant", "content": "Based on the latest pipeline data, our win rate against Globex is 34% this quarter, down from 41% last quarter. The primary loss reason has been pricing — they're undercutting on 2-year deals. I'd recommend leading with our ROI calculator to shift the conversation from price to value."}]}
{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "Draft a follow-up email after the Initech demo."}, {"role": "assistant", "content": "Subject: Next Steps After Today's Demo\n\nHi Sarah,\n\nGreat conversation today..."}]}

System/User/Assistant Structure

  • System message — Sets persona, company context, and behavioral guardrails. Keep this consistent across most examples so the model learns your baseline behavior.
  • User message — The rep's actual query. Include any retrieved context (RAG results) here, just like your Sales Companion does in production.
  • Assistant message — The ideal response. This is what you're training the model to produce.
  • For multi-turn conversations, alternate user/assistant messages. By default the model trains on every assistant turn; OpenAI's format lets you mask individual turns with a per-message weight field, as shown in the example after this list.
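
A minimal multi-turn sketch (the account and rep names are invented for illustration; the optional "weight": 0 is OpenAI-specific and excludes that assistant turn from the loss):

{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "Who owns the Initech account?"}, {"role": "assistant", "content": "Initech is owned by Dana Lee on the mid-market team.", "weight": 0}, {"role": "user", "content": "Draft a check-in email to her."}, {"role": "assistant", "content": "Subject: Checking In on Initech\n\nHi Dana,\n\n..."}]}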

Train/Validation Splits

Split your data before any cleaning or augmentation, so variants of the same example can't leak across the train/validation boundary:

Split              Purpose                                    Typical Ratio
Train              Model learns from these                    90%
Validation         Monitors overfitting during training       10%
Test (held out)    Final evaluation after training is done    Separate set entirely

Stratified Splitting

If your dataset has multiple task types (summarization, classification, Q&A, battlecard generation), split within each type so both train and validation sets have proportional representation:

from sklearn.model_selection import train_test_split

# Stratify on task_type so train and validation keep the same task mix
train, val = train_test_split(
    dataset,
    test_size=0.1,
    stratify=[example["task_type"] for example in dataset],
    random_state=42,
)
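
A quick sanity check that the stratified split preserved proportions (this assumes the train and val lists from the snippet above):

from collections import Counter

def task_mix(split):
    # Fraction of each task type within one split
    counts = Counter(example["task_type"] for example in split)
    total = sum(counts.values())
    return {task: round(count / total, 3) for task, count in counts.items()}

print("train:", task_mix(train))
print("val:", task_mix(val))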

A validation set with zero battlecard examples won't catch overfitting on that task.

Token Counting and Cost Estimation

Fine-tuning is priced per token per epoch. Before you commit money, count your tokens:

import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Rough estimate: counts tokens in the serialized messages. It ignores
# the few tokens of per-message chat formatting overhead, so treat the
# result as a floor rather than an exact bill.
total_tokens = sum(
    len(enc.encode(json.dumps(example["messages"])))
    for example in training_data
)

price_per_1k_tokens = 0.008  # $8 per 1M training tokens
num_epochs = 3

cost_per_epoch = (total_tokens / 1000) * price_per_1k_tokens
total_cost = cost_per_epoch * num_epochs

print(f"Tokens: {total_tokens:,}")
print(f"Cost for {num_epochs} epochs: ${total_cost:.2f}")

Example budget for the Sales Companion:

  • 500 training examples, average 800 tokens each = 400K tokens
  • At $8/1M training tokens (GPT-4o mini fine-tuning): $3.20/epoch
  • 3 epochs = $9.60 total
  • That's the training cost. Inference on your fine-tuned model also has a per-token price — usually 1.5-2x the base model rate. Factor that into your production cost model.

Data Augmentation Techniques

When you don't have enough examples, augment carefully:

Paraphrasing

Take existing user queries and rephrase them. "What's our pricing?" becomes "How much do we charge?" and "Walk me through the pricing tiers." The completion stays the same.
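
A minimal sketch of that fan-out, assuming single-turn examples and a paraphrase list you've already produced (pricing_example is a placeholder for one of your JSONL records):

def expand_with_paraphrases(example, paraphrases):
    """One new training example per paraphrase of the user turn."""
    variants = []
    for text in paraphrases:
        messages = [dict(m) for m in example["messages"]]  # copy each turn
        for m in messages:
            if m["role"] == "user":
                m["content"] = text  # swap in the rephrased query
        variants.append({"messages": messages})
    return variants

new_examples = expand_with_paraphrases(
    pricing_example,  # placeholder: one existing example
    ["How much do we charge?", "Walk me through the pricing tiers."],
)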

Few-shot Expansion

Use a strong base model to generate new training pairs from a few examples. Give it 5 real CRM-note pairs and ask it to generate 20 more in the same style. Always human-review generated pairs; they're a starting point, not a finished product.
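
One way to script that step with the OpenAI Python client (the model choice and prompt wording are placeholders; the output still needs human review):

import json
from openai import OpenAI

client = OpenAI()

def generate_pairs(seed_examples, n=20):
    """Ask a strong base model for new training pairs in the seeds' style."""
    prompt = (
        f"Here are {len(seed_examples)} CRM-note training examples:\n"
        + "\n".join(json.dumps(e) for e in seed_examples)
        + f"\n\nGenerate {n} new examples in the same style, "
        "one JSON object per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # Parse one JSON object per line; discard anything that doesn't parse
    pairs = []
    for line in response.choices[0].message.content.splitlines():
        try:
            pairs.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return pairs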

Context Variation

Same question, different retrieved documents. If your RAG system might surface different chunks for the same query, create training pairs with each context variant.
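
A sketch of how those pairs might be assembled, assuming your production prompt places retrieved chunks above the question (the template is illustrative, and the ideal answer is held fixed here; adjust it if the answer should change with the context):

def context_variants(question, contexts, ideal_answer, system_prompt):
    """One training example per retrieved-context variant of the question."""
    return [
        {"messages": [
            {"role": "system", "content": system_prompt},
            # Mirror production: retrieved chunks first, then the query
            {"role": "user", "content": f"Context:\n{ctx}\n\nQuestion: {question}"},
            {"role": "assistant", "content": ideal_answer},
        ]}
        for ctx in contexts
    ]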

Deduplication and Near-duplicate Detection

Duplicate or near-duplicate examples waste training budget and can cause the model to memorize rather than generalize.

Exact deduplication: Hash each example's messages and remove duplicates.

Near-duplicate detection: Compute embeddings for each example and flag pairs with cosine similarity > 0.95. Review flagged pairs and keep the better one; a sketch follows the exact-dedup code below.

import hashlib
import json

seen = set()
deduped = []
for example in training_data:
    # Canonical serialization so key order doesn't change the hash
    h = hashlib.sha256(
        json.dumps(example["messages"], sort_keys=True).encode()
    ).hexdigest()
    if h not in seen:
        seen.add(h)
        deduped.append(example)

print(f"Removed {len(training_data) - len(deduped)} exact duplicates")

A clean, deduplicated dataset of 400 examples will outperform a noisy dataset of 1,000.

This is chapter 2 of Fine-tuning for Enterprise AI.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
