Data Preparation
Format, Split, Budget
JSONL Format for Fine-tuning
Every major fine-tuning provider expects JSONL — one JSON object per line, each representing a single training example. The format is nearly identical across OpenAI, Anthropic, and HuggingFace:
{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "What's our win rate against Globex this quarter?"}, {"role": "assistant", "content": "Based on the latest pipeline data, our win rate against Globex is 34% this quarter, down from 41% last quarter. The primary loss reason has been pricing — they're undercutting on 2-year deals. I'd recommend leading with our ROI calculator to shift the conversation from price to value."}]}
{"messages": [{"role": "system", "content": "You are a sales assistant for Acme Corp. Use a consultative tone."}, {"role": "user", "content": "Draft a follow-up email after the Initech demo."}, {"role": "assistant", "content": "Subject: Next Steps After Today's Demo\n\nHi Sarah,\n\nGreat conversation today..."}]}System/User/Assistant Structure
For multi-turn conversations, alternate user and assistant messages. By default, every assistant turn contributes to the loss; OpenAI additionally supports a per-message `weight` field (0 or 1) on assistant messages, so you can mask earlier turns and train only on the final assistant reply.
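A quick validator catches format problems before you upload anything. This is a minimal sketch assuming the schema shown above: an optional leading system message, then strictly alternating user/assistant turns ending on an assistant message.

```python
import json

def validate_jsonl(path):
    """Return a list of error strings, one per malformed line."""
    errors = []
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            try:
                msgs = json.loads(line)["messages"]
            except (json.JSONDecodeError, KeyError, TypeError):
                errors.append(f"line {i}: not a JSON object with a 'messages' key")
                continue
            # Optional system message comes first; strip it before checking turns
            if msgs and msgs[0].get("role") == "system":
                msgs = msgs[1:]
            roles = [m.get("role") for m in msgs]
            expected = ["user", "assistant"] * (len(roles) // 2 + 1)
            if not roles or roles != expected[:len(roles)] or roles[-1] != "assistant":
                errors.append(f"line {i}: roles {roles} must alternate user/assistant "
                              "and end with an assistant message")
    return errors
```

Run it on your training file before splitting or counting tokens; a single malformed line can fail an entire upload.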
Train/Validation Splits
Split your data before any augmentation — otherwise near-identical variants of the same example leak across train and validation and inflate your metrics:
| Split | Purpose | Typical Ratio |
|---|---|---|
| Train | Model learns from these | 90% |
| Validation | Monitors overfitting during training | 10% |
| **Test** (held out) | Final evaluation after training is done | Separate set entirely |
Stratified Splitting
If your dataset has multiple task types (summarization, classification, Q&A, battlecard generation), split *within each type* so both train and validation sets have proportional representation:
```python
from sklearn.model_selection import train_test_split

train, val = train_test_split(
    dataset,
    test_size=0.1,
    stratify=[example["task_type"] for example in dataset],
    random_state=42,
)
```

A validation set with zero battlecard examples won't catch overfitting on that task.
Token Counting and Cost Estimation
Fine-tuning is priced per token per epoch. Before you commit money, count your tokens:
```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4o")

total_tokens = sum(
    len(enc.encode(str(example["messages"])))
    for example in training_data
)

price_per_1k_tokens = 0.025  # placeholder — check your provider's current rate
num_epochs = 3

cost_per_epoch = (total_tokens / 1000) * price_per_1k_tokens
total_cost = cost_per_epoch * num_epochs
print(f"Tokens: {total_tokens:,}")
print(f"Cost for {num_epochs} epochs: ${total_cost:.2f}")
```

Example budget for the Sales Companion:
That's the training cost. Inference on your fine-tuned model also has a per-token price — usually 1.5-2x the base model rate. Factor that into your production cost model.
Data Augmentation Techniques
When you don't have enough examples, augment carefully:
Paraphrasing
Take existing user queries and rephrase them. "What's our pricing?" becomes "How much do we charge?" and "Walk me through the pricing tiers." The completion stays the same.
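In code, paraphrase augmentation is just duplicating the example with reworded prompts while the completion stays fixed. A minimal sketch (the answer text here is invented for illustration):

```python
def paraphrase_augment(paraphrases, completion):
    """One training example per phrasing; the assistant completion is shared."""
    return [
        {"messages": [
            {"role": "user", "content": q},
            {"role": "assistant", "content": completion},
        ]}
        for q in paraphrases
    ]

examples = paraphrase_augment(
    ["What's our pricing?",
     "How much do we charge?",
     "Walk me through the pricing tiers."],
    "Acme offers three tiers: Starter, Growth, and Enterprise...",
)
```

Because the completion is identical across variants, the model learns that different surface forms of the question map to the same answer.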
Few-shot Expansion
Use a strong base model to generate new training pairs from a few examples. Give it 5 real CRM-note pairs, ask it to generate 20 more in the same style. Always human-review generated pairs — they're a starting point, not a finished product.
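The few-shot prompt itself is the only reusable piece; the generation call goes to whichever provider you use. A sketch assuming each seed pair is a `(crm_note, reply)` tuple (hypothetical format for illustration):

```python
def build_expansion_prompt(seed_pairs, n_new=20):
    """Assemble a few-shot prompt asking a strong model for more training pairs."""
    shots = "\n\n".join(
        f"CRM note: {note}\nAssistant reply: {reply}"
        for note, reply in seed_pairs
    )
    return (
        "Here are real examples of CRM notes and the assistant replies we train on:\n\n"
        f"{shots}\n\n"
        f"Generate {n_new} new pairs in the same style, varying the accounts, "
        "numbers, and phrasing. Output one JSON object per line."
    )
```

Feed the result to a strong base model, then route every generated pair through human review before it enters the training set.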
Context Variation
Same question, different retrieved documents. If your RAG system might surface different chunks for the same query, create training pairs with each context variant.
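Expanding one Q&A pair across context variants can be sketched like this, assuming retrieved context is injected via the system message (your RAG setup may place it elsewhere):

```python
def context_variants(question, answer, contexts):
    """One training example per retrieved-context variant."""
    return [
        {"messages": [
            {"role": "system", "content": f"Use this context:\n{ctx}"},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        for ctx in contexts
    ]

variants = context_variants(
    "What's our win rate against Globex?",
    "34% this quarter, down from 41% last quarter.",
    ["Q3 pipeline report excerpt...", "Q2 pipeline report excerpt..."],
)
```

This teaches the model to give consistent answers regardless of which chunk the retriever happened to surface.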
Deduplication and Near-duplicate Detection
Duplicate or near-duplicate examples waste training budget and can cause the model to memorize rather than generalize.
Exact deduplication: Hash each example's messages and remove duplicates.
Near-duplicate detection: Compute embeddings for each example and flag pairs with cosine similarity > 0.95. Review flagged pairs and keep the better one.
```python
import hashlib
import json

seen = set()
deduped = []
for example in training_data:
    # Serialize deterministically so key order doesn't change the hash
    h = hashlib.sha256(
        json.dumps(example["messages"], sort_keys=True).encode()
    ).hexdigest()
    if h not in seen:
        seen.add(h)
        deduped.append(example)

print(f"Removed {len(training_data) - len(deduped)} exact duplicates")
```

A clean, deduplicated dataset of 400 examples will outperform a noisy dataset of 1,000.
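The near-duplicate pass can be sketched with precomputed embeddings from any embedding model. This brute-force version is O(n²), which is fine for a few thousand examples:

```python
import numpy as np

def near_duplicate_pairs(embeddings, threshold=0.95):
    """Flag index pairs whose cosine similarity exceeds the threshold."""
    X = np.asarray(embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # unit-normalize rows
    sims = X @ X.T                                     # cosine similarity matrix
    flagged = []
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            if sims[i, j] > threshold:
                flagged.append((i, j, float(sims[i, j])))
    return flagged
```

Review each flagged pair by hand and keep the better-written example rather than deleting blindly.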
This is chapter 2 of Fine-tuning for Enterprise AI.