How Models Learn
Training, Loss & Fine-Tuning
Training = Showing Examples and Adjusting Dials
Imagine tuning an old radio. You twist the dial, hear static, twist a little more, and suddenly the signal gets clearer. You keep nudging until the music comes through. Training a model works the same way — except instead of one dial, you have billions, and instead of your ear judging the signal, a mathematical function measures how wrong the model is.
Here's the loop: (1) show the model a batch of examples, (2) measure how wrong its predictions are, (3) nudge every parameter slightly in the direction that reduces the error, (4) repeat, often hundreds of thousands of times or more. That's it. Most of deep learning is a variation on this loop.
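To make the loop concrete, here is a minimal sketch with a single dial instead of billions. The toy data (points on the line y = 3x), the learning rate, and the batch size are all assumptions chosen for illustration, not values from any real training run.

```python
# Toy version of the training loop: fit y = w * x to data generated with w = 3.
# Data, learning rate, batch size, and step count are illustrative assumptions.
import random

random.seed(0)
data = [(x, 3.0 * x) for x in [random.uniform(-1, 1) for _ in range(200)]]

w = 0.0               # the single "dial" being tuned
learning_rate = 0.1
batch_size = 20

for step in range(500):
    # (1) show the model a batch of examples
    batch = random.sample(data, batch_size)
    # (2) measure how wrong the predictions are (mean squared error)
    errors = [w * x - y for x, y in batch]
    loss = sum(e * e for e in errors) / batch_size
    # (3) nudge the parameter in the direction that reduces the error
    grad = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / batch_size
    w -= learning_rate * grad
    # (4) repeat

print(f"learned w = {w:.4f}, final batch loss = {loss:.6f}")
```

The learned `w` ends up very close to 3, the value used to generate the data. Real models follow the same four steps with far more parameters and a more elaborate model in place of `w * x`.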
The Loss Function: "How Wrong Was I?"
The loss function collapses all of the model's mistakes into a single number. High loss = very wrong. Low loss = getting it right. Training is the process of minimizing this number.
For a next-token predictor, the loss is straightforward. Suppose the correct next token is "Paris" and the model assigned it 40% probability. The loss is -log(0.4) ≈ 0.92 (using the natural log). If the model had assigned "Paris" 99% probability, the loss would drop to about 0.01. Over billions of examples, the model learns to assign high probability to the correct next token.
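The arithmetic is easy to check yourself. This small sketch computes the per-token loss as the negative natural log of the probability given to the correct token; the function name `next_token_loss` is made up for this example.

```python
import math

def next_token_loss(prob_of_correct_token: float) -> float:
    """Loss for one prediction: -ln(probability assigned to the correct token)."""
    return -math.log(prob_of_correct_token)

print(round(next_token_loss(0.40), 2))  # 0.92
print(round(next_token_loss(0.99), 2))  # 0.01

# Training loss is typically reported as the average over many predictions.
probs = [0.40, 0.99, 0.75]
average_loss = sum(next_token_loss(p) for p in probs) / len(probs)
```

Notice how the loss shrinks toward zero as the probability on the correct token approaches 1, and grows without bound as it approaches 0.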
Why this matters practically: When you see a model's training loss curve going down, it means the model is getting better at predicting its training data. But that alone doesn't tell you if it's *useful* — it might just be memorizing.
Gradient Descent: Rolling Downhill
How does the model know *which direction* to nudge each parameter? Gradient descent. Picture a ball on a hilly landscape. The ball rolls downhill — toward lower loss. The gradient tells you the slope at your current position: which direction is downhill, and how steep it is.
Each parameter gets its own gradient, its own "which way is uphill." Multiply that by a small step size, move the parameter the opposite way (downhill), and you get the update. Do this for every parameter simultaneously, billions or more in the largest models, and you've completed one training step. Training a large model typically takes hundreds of thousands of these steps, sometimes more.
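Here is a hedged sketch of that update rule on a made-up two-parameter loss, (a - 2)² + (b + 1)², whose lowest point sits at a = 2, b = -1. The loss and the step size are assumptions for illustration; a real model's loss comes from its prediction errors.

```python
# Gradient descent on a toy loss: loss(a, b) = (a - 2)^2 + (b + 1)^2.
# The loss function and step size are illustrative assumptions.
params = {"a": 0.0, "b": 0.0}
step_size = 0.1

def loss(p):
    return (p["a"] - 2.0) ** 2 + (p["b"] + 1.0) ** 2

def gradients(p):
    # Each parameter gets its own slope (partial derivative of the loss).
    return {"a": 2 * (p["a"] - 2.0), "b": 2 * (p["b"] + 1.0)}

for _ in range(100):
    grads = gradients(params)
    # One training step: every parameter moves against its own gradient at once.
    params = {name: value - step_size * grads[name] for name, value in params.items()}

print(params, loss(params))
```

After 100 steps both parameters sit essentially at the bottom of the bowl.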
This isn't random guessing or brute force. Calculus (specifically, the chain rule via backpropagation) computes the exact gradient for every parameter in one backward pass through the network. It's why training is feasible at all.
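To see the chain rule at work, here is a minimal sketch of backpropagation through a tiny two-weight network, checked against a brute-force numerical estimate. The network shape, the weights, and the input are all made up for illustration.

```python
import math

def forward(w1, w2, x, target):
    """Tiny network: x -> tanh(w1 * x) -> multiply by w2 -> squared error."""
    h = math.tanh(w1 * x)
    y = w2 * h
    loss = (y - target) ** 2
    return loss, h, y

def backward(w1, w2, x, target):
    """One backward pass: apply the chain rule from the loss back to each weight."""
    _, h, y = forward(w1, w2, x, target)
    dloss_dy = 2 * (y - target)
    dloss_dw2 = dloss_dy * h            # y = w2 * h
    dloss_dh = dloss_dy * w2
    dloss_dz = dloss_dh * (1 - h * h)   # derivative of tanh(z) is 1 - tanh(z)^2
    dloss_dw1 = dloss_dz * x            # z = w1 * x
    return dloss_dw1, dloss_dw2

def numeric_grad(f, value, eps=1e-6):
    """Slow check: wiggle one weight and watch how the loss changes."""
    return (f(value + eps) - f(value - eps)) / (2 * eps)

w1, w2, x, target = 0.5, -1.2, 0.8, 0.3
g1, g2 = backward(w1, w2, x, target)
n1 = numeric_grad(lambda v: forward(v, w2, x, target)[0], w1)
n2 = numeric_grad(lambda v: forward(w1, v, x, target)[0], w2)
print(g1, n1)
print(g2, n2)
```

The two methods agree, but the numerical check needs extra forward passes for every single weight, while the backward pass produces all gradients in one sweep. That difference is what makes training with billions of parameters practical.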
The Human-Tuned Knobs
The model's parameters are learned automatically, but humans set the *hyperparameters* — the knobs that control the learning process:
| Knob | What It Controls | Too Low | Too High |
|---|---|---|---|
| Learning rate | Step size per update | Learns too slowly, gets stuck | Overshoots, never converges |
| Batch size | Examples per step | Noisy gradients, unstable | Needs more memory, less exploration |
| Epochs | Passes through data | Underfitting, hasn't learned enough | Overfitting, memorized too much |
Learning rate is usually the most critical. Too high and the model bounces around wildly. Too low and training takes forever. Most teams use a schedule: start with a higher rate to make progress quickly, then lower it so the model can settle into a good solution.
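One common shape for such a schedule is cosine decay, sketched below. The function name and the peak and minimum rates are assumptions for illustration; teams pick their own values and often add a short warmup phase, which is left out here.

```python
import math

def cosine_decay(step, total_steps, peak_lr=1e-3, min_lr=1e-5):
    """Learning rate that starts at peak_lr and eases down to min_lr."""
    progress = min(step, total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

total_steps = 1000
schedule = [cosine_decay(s, total_steps) for s in range(0, total_steps + 1, 100)]
for s, lr in zip(range(0, total_steps + 1, 100), schedule):
    print(f"step {s:4d}: lr = {lr:.6f}")
```

The rate begins at the peak, falls slowly at first, drops fastest in the middle of training, and flattens out near the minimum at the end.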
Overfitting: Memorizing the Exam
A student who memorizes every practice exam answer gets 100% on practice tests but bombs the real exam. That's overfitting — the model learns the training data so well that it fails on new data.
The telltale sign: training accuracy is 99% but test accuracy is 62%. The model didn't learn the underlying patterns; it memorized specific examples. "This exact sentence should be followed by this exact word" instead of "sentences about geography tend to be followed by place names."
Defenses against overfitting include using more diverse training data, stopping training once performance on a held-out validation set stops improving (early stopping), and techniques like dropout (randomly zeroing out a fraction of the network's units during training so it cannot lean on any single one).
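Early stopping is simple enough to sketch in a few lines. The version below assumes you record a validation loss after each epoch and stop once it has failed to improve for `patience` epochs in a row. The loss values are invented for illustration.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the best validation loss, stopping once
    `patience` consecutive epochs pass without improvement."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses):
        if val_loss < best_loss:
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss keeps rising: likely overfitting
    return best_epoch

# Made-up validation losses: improving at first, then getting worse.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61]
print(early_stopping_epoch(val_losses))  # 3
```

In practice you would also save a checkpoint at the best epoch so you can roll back to it after stopping.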
Pre-training vs Fine-tuning: Two Stages
Training a foundation model happens in two main stages, usually followed by an alignment step covered in the next section:
Pre-training is the expensive phase. Feed the model the entire internet (or a curated slice of it) and train it to predict the next token. This takes months on thousands of GPUs and costs millions of dollars. The result is a *base model* — a powerful autocomplete engine that can finish any text pattern but has no notion of being helpful, following instructions, or refusing harmful requests.
Fine-tuning is the targeted phase. Take the base model and train it further on a smaller, curated dataset — medical records for a healthcare model, legal briefs for a legal model, instruction-response pairs for a chatbot. This is like a medical student who already speaks fluent English (pre-training) now learning medical terminology and clinical reasoning (fine-tuning). It's cheaper, faster, and requires far less data.
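A toy sketch shows why starting from a pre-trained model is cheaper. Here "pre-training" fits a one-parameter model to one task (slope 3.0), and "fine-tuning" adapts it to a nearby task (slope 3.2) with only a few steps. The tasks, step counts, and learning rate are assumptions chosen to make the effect visible, and a real model is vastly more complex.

```python
# Toy comparison: fine-tuning from a pre-trained weight vs training from scratch.
# All numbers here are illustrative assumptions.
xs = [i / 10 for i in range(-10, 11)]

def train(w, true_slope, steps, lr=0.1):
    """Full-batch gradient descent on mean squared error for y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - true_slope * x) * x for x in xs) / len(xs)
        w -= lr * grad
    return w

pretrained = train(0.0, true_slope=3.0, steps=500)          # long, expensive phase
finetuned = train(pretrained, true_slope=3.2, steps=20)     # short, targeted phase
from_scratch = train(0.0, true_slope=3.2, steps=20)         # same budget, no head start

print(f"fine-tuned error:   {abs(finetuned - 3.2):.4f}")
print(f"from-scratch error: {abs(from_scratch - 3.2):.4f}")
```

With the same 20-step budget, the fine-tuned weight lands much closer to the new target because it started near the answer.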
RLHF: From Autocomplete to Assistant
Base models are impressive but chaotic. Ask "How do I break into a car?" and a base model will helpfully provide instructions — it's just completing the pattern. This is where Reinforcement Learning from Human Feedback (RLHF) comes in.
The process: (1) generate multiple responses to a prompt, (2) have humans rank them from best to worst, (3) train a "reward model" that predicts human preferences, (4) use that reward model to further train the language model toward responses humans prefer.
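Steps (2) and (3) can be sketched with a pairwise preference loss. This is one common formulation (a Bradley-Terry style loss), used here as an assumption; the text above does not say which loss any particular lab uses. The helper names and reward scores are made up for illustration.

```python
import math
from itertools import combinations

def ranking_to_pairs(ranked_responses):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_responses, 2))

def preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)): small when the preferred response scores higher."""
    margin = reward_chosen - reward_rejected
    return math.log1p(math.exp(-margin))

pairs = ranking_to_pairs(["response A", "response B", "response C"])
print(pairs)

# Hypothetical reward-model scores for one pair.
print(preference_loss(2.0, 0.0))  # reward model agrees with the humans: low loss
print(preference_loss(0.0, 2.0))  # reward model disagrees: high loss
```

Minimizing this loss over many human-ranked pairs pushes the reward model to score preferred responses higher. Step (4) then uses those scores as the training signal for the language model.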
RLHF is a large part of what turned "autocomplete on steroids" into "helpful, harmless, and honest assistant." It's also one reason models sometimes refuse to answer questions, hedge excessively, or add unnecessary disclaimers: when the human raters who provide feedback tend to prefer cautious responses, the model learns that pattern too.
The practical takeaway: When you use Claude, ChatGPT, or Gemini, you're using a model that went through all three stages — pre-trained on text, fine-tuned on instructions, and aligned via human feedback. Understanding this pipeline explains most of the model's strengths (fluent, helpful, broad knowledge) and weaknesses (hallucination, sycophancy, knowledge cutoff).
This is chapter 2 of AI Models Demystified.