
Data Preparation

Format, Split, Budget

JSONL Format for Fine-tuning

Every major fine-tuning provider expects JSONL — one JSON object per line, each representing a single training example. The format is nearly identical across OpenAI, Anthropic, and HuggingFace:

{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "What's our win rate against Globex this quarter?"}, {"role": "assistant", "content": "Based on the latest pipeline data, our win rate against Globex is 34% this quarter, down from 41% last quarter. The primary loss reason has been pricing — they're undercutting on 2-year deals. I'd recommend leading with our ROI calculator to shift the conversation from price to value."}]}
{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "Draft a follow-up email after the Initech demo."}, {"role": "assistant", "content": "Subject: Next Steps After Today's Demo\n\nHi Sarah,\n\nGreat conversation today..."}]}

System/User/Assistant Structure

  • System message — Sets persona, company context, and behavioral guardrails. Keep this consistent across most examples so the model learns your baseline behavior.
  • User message — The rep's actual query. Include any retrieved context (RAG results) here, just like your Sales Companion does in production.
  • Assistant message — The ideal response. This is what you're training the model to produce.
  • For multi-turn conversations, alternate user/assistant messages. By default the model trains on every assistant turn; OpenAI's format lets you mask individual turns with a per-message weight field, as shown in the example after this list.
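
A minimal multi-turn sketch (the account and rep names are invented for illustration; the optional "weight": 0 is OpenAI-specific and excludes that assistant turn from the loss):

{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "Who owns the Initech account?"}, {"role": "assistant", "content": "Initech is owned by Dana Lee on the mid-market team.", "weight": 0}, {"role": "user", "content": "Draft a check-in email to her."}, {"role": "assistant", "content": "Subject: Checking In on Initech\n\nHi Dana,\n\n..."}]}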

Train/Validation Splits

Split your data before any cleaning or augmentation, so variants of the same example can't leak across the train/validation boundary:

Split              Purpose                                    Typical Ratio
Train              Model learns from these                    90%
Validation         Monitors overfitting during training       10%
Test (held out)    Final evaluation after training is done    Separate set entirely

Stratified Splitting

If your dataset has multiple task types (summarization, classification, Q&A, battlecard generation), split within each type so both train and validation sets have proportional representation:

from sklearn.model_selection import train_test_split

# Stratify on task_type so train and validation keep the same task mix
train, val = train_test_split(
    dataset,
    test_size=0.1,
    stratify=[example["task_type"] for example in dataset],
    random_state=42,
)
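
A quick sanity check that the stratified split preserved proportions (this assumes the train and val lists from the snippet above):

from collections import Counter

def task_mix(split):
    # Fraction of each task type within one split
    counts = Counter(example["task_type"] for example in split)
    total = sum(counts.values())
    return {task: round(count / total, 3) for task, count in counts.items()}

print("train:", task_mix(train))
print("val:", task_mix(val))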

A validation set with zero battlecard examples won't catch overfitting on that task.

Token Counting and Cost Estimation

Fine-tuning is priced per token per epoch. Before you commit money, count your tokens:

import json
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

# Rough estimate: counts tokens in the serialized messages. It ignores
# the few tokens of per-message chat formatting overhead, so treat the
# result as a floor rather than an exact bill.
total_tokens = sum(
    len(enc.encode(json.dumps(example["messages"])))
    for example in training_data
)

price_per_1k_tokens = 0.008  # $8 per 1M training tokens
num_epochs = 3

cost_per_epoch = (total_tokens / 1000) * price_per_1k_tokens
total_cost = cost_per_epoch * num_epochs

print(f"Tokens: {total_tokens:,}")
print(f"Cost for {num_epochs} epochs: ${total_cost:.2f}")

Example budget for the Sales Companion:

  • 500 training examples, average 800 tokens each = 400K tokens
  • At $8/1M training tokens (GPT-4o mini fine-tuning): $3.20/epoch
  • 3 epochs = $9.60 total
  • That's the training cost. Inference on your fine-tuned model also has a per-token price — usually 1.5-2x the base model rate. Factor that into your production cost model.

Data Augmentation Techniques

When you don't have enough examples, augment carefully:

Paraphrasing

Take existing user queries and rephrase them. "What's our pricing?" becomes "How much do we charge?" and "Walk me through the pricing tiers." The completion stays the same.
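
A minimal sketch of that fan-out, assuming single-turn examples and a paraphrase list you've already produced (pricing_example is a placeholder for one of your JSONL records):

def expand_with_paraphrases(example, paraphrases):
    """One new training example per paraphrase of the user turn."""
    variants = []
    for text in paraphrases:
        messages = [dict(m) for m in example["messages"]]  # copy each turn
        for m in messages:
            if m["role"] == "user":
                m["content"] = text  # swap in the rephrased query
        variants.append({"messages": messages})
    return variants

new_examples = expand_with_paraphrases(
    pricing_example,  # placeholder: one existing example
    ["How much do we charge?", "Walk me through the pricing tiers."],
)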

Few-shot Expansion

Use a strong base model to generate new training pairs from a few examples. Give it 5 real CRM-note pairs and ask it to generate 20 more in the same style. Always human-review generated pairs; they're a starting point, not a finished product.
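
One way to script that step with the OpenAI Python client (the model choice and prompt wording are placeholders; the output still needs human review):

import json
from openai import OpenAI

client = OpenAI()

def generate_pairs(seed_examples, n=20):
    """Ask a strong base model for new training pairs in the seeds' style."""
    prompt = (
        f"Here are {len(seed_examples)} CRM-note training examples:\n"
        + "\n".join(json.dumps(e) for e in seed_examples)
        + f"\n\nGenerate {n} new examples in the same style, "
        "one JSON object per line."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model choice
        messages=[{"role": "user", "content": prompt}],
    )
    # Parse one JSON object per line; discard anything that doesn't parse
    pairs = []
    for line in response.choices[0].message.content.splitlines():
        try:
            pairs.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return pairs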

Context Variation

Same question, different retrieved documents. If your RAG system might surface different chunks for the same query, create training pairs with each context variant.
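
A sketch of how those pairs might be assembled, assuming your production prompt places retrieved chunks above the question (the template is illustrative, and the ideal answer is held fixed here; adjust it if the answer should change with the context):

def context_variants(question, contexts, ideal_answer, system_prompt):
    """One training example per retrieved-context variant of the question."""
    return [
        {"messages": [
            {"role": "system", "content": system_prompt},
            # Mirror production: retrieved chunks first, then the query
            {"role": "user", "content": f"Context:\n{ctx}\n\nQuestion: {question}"},
            {"role": "assistant", "content": ideal_answer},
        ]}
        for ctx in contexts
    ]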

Deduplication and Near-duplicate Detection

Duplicate or near-duplicate examples waste training budget and can cause the model to memorize rather than generalize.

Exact deduplication: Hash each example's messages and remove duplicates.

Near-duplicate detection: Compute embeddings for each example and flag pairs with cosine similarity > 0.95. Review flagged pairs and keep the better one; a sketch follows the exact-dedup code below.

import hashlib
import json

seen = set()
deduped = []
for example in training_data:
    # Canonical serialization so key order doesn't change the hash
    h = hashlib.sha256(
        json.dumps(example["messages"], sort_keys=True).encode()
    ).hexdigest()
    if h not in seen:
        seen.add(h)
        deduped.append(example)

print(f"Removed {len(training_data) - len(deduped)} exact duplicates")

A clean, deduplicated dataset of 400 examples will outperform a noisy dataset of 1,000.

This is chapter 2 of Fine-tuning for Enterprise AI.

Get the full hands-on course for $100 and build the complete system. Your projects become your portfolio.
