
What Is a Model, Really?

Pattern Machines & Parameters

Not Magic — Pattern Recognition

When someone says "AI model," picture a student who's read a billion essays and can now write one that *sounds* right — without understanding a word. That's the core insight: an AI model is a pattern recognizer trained on examples. Show it 10,000 photos labeled "cat" and 10,000 labeled "dog," and it learns to spot the patterns that distinguish them — ear shapes, fur textures, snout proportions. It never "knows" what a cat is. It knows what cat-photos look like, statistically.

This matters because it sets your expectations correctly. Models are extraordinary at pattern-matching. They are far less reliable at reasoning from first principles, understanding causation, or knowing when they're wrong.

Everything Becomes Numbers

Here's the pipeline that turns your text into something a model can process:

Text → tokens → numbers → math → numbers → text. The model never sees letters or words. It operates entirely in a mathematical space where "king" is a list of numbers (an *embedding*, 4,096 numbers long in some models) and "queen" is a nearby list of the same length. The model does matrix multiplication, billions of multiply-and-add operations, to transform input numbers into output numbers.
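To make the pipeline concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption: the four-word vocabulary, the 3-number embeddings, the output weights, and the function names are invented for this example and do not come from any real model. A real model stacks many layers of this kind of arithmetic, but the shape of the flow is the same.

```python
# Toy pipeline: text -> tokens -> ids -> vectors -> math -> scores -> text.
# The vocabulary, embeddings, and weights below are made-up illustrative values.

VOCAB = ["the", "king", "queen", "rules"]
TOKEN_TO_ID = {token: i for i, token in enumerate(VOCAB)}

# One short embedding per token (real models use thousands of numbers each).
# Note that "king" and "queen" are deliberately close to each other.
EMBEDDINGS = [
    [0.1, 0.0, 0.2],  # the
    [0.9, 0.8, 0.1],  # king
    [0.9, 0.7, 0.2],  # queen
    [0.2, 0.1, 0.9],  # rules
]

# One row of output weights per vocabulary entry, used to score each candidate.
OUTPUT_WEIGHTS = [
    [0.0, 0.1, 0.0],  # the
    [0.1, 0.1, 0.0],  # king
    [0.1, 0.0, 0.1],  # queen
    [1.0, 1.0, 0.0],  # rules
]


def tokenize(text):
    """Text -> tokens (a naive whitespace split, for illustration only)."""
    return text.lower().split()


def encode(tokens):
    """Tokens -> integer ids."""
    return [TOKEN_TO_ID[token] for token in tokens]


def dot(a, b):
    """Multiply-and-add: the basic operation inside matrix multiplication."""
    return sum(x * y for x, y in zip(a, b))


def predict_next(text):
    """Numbers -> math -> numbers -> text, using only the last token's vector."""
    ids = encode(tokenize(text))
    last_vector = EMBEDDINGS[ids[-1]]
    scores = [dot(row, last_vector) for row in OUTPUT_WEIGHTS]
    best_id = max(range(len(scores)), key=scores.__getitem__)
    return VOCAB[best_id]


print(predict_next("the king"))
```

With these invented weights, the highest score after "the king" goes to "rules". Nothing in the code knows what a king is; it only multiplies and adds.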

Parameters: The Model's Learned Patterns

A parameter is a single number the model learned during training. Think of it as one dial on a mixing board with billions of dials. Each dial nudges how strongly one signal inside the model influences the next, and the patterns the model recognizes emerge from many dials working together.
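As a rough illustration of what "learning a parameter" means, the sketch below trains a model with exactly two dials, a weight and a bias, to guess whether an email is spam from a single made-up feature (the number of exclamation marks). The data, the feature choice, the learning rate, and the function names are all assumptions for this example, not a recommended spam filter.

```python
import math


def predict_spam_probability(exclamation_marks, weight, bias):
    """Logistic regression with one feature: two parameters in total."""
    z = weight * exclamation_marks + bias
    return 1.0 / (1.0 + math.exp(-z))


def train(examples, learning_rate=0.1, epochs=2000):
    """Turn the two dials a little after every example (stochastic gradient descent)."""
    weight, bias = 0.0, 0.0
    for _ in range(epochs):
        for feature, label in examples:
            error = predict_spam_probability(feature, weight, bias) - label
            weight -= learning_rate * error * feature
            bias -= learning_rate * error
    return weight, bias


# Hypothetical training data: (exclamation marks, 1 if spam else 0).
EXAMPLES = [(0, 0), (1, 0), (2, 0), (5, 1), (6, 1), (8, 1)]

weight, bias = train(EXAMPLES)
print("learned parameters:", weight, bias)
print("P(spam | 0 marks):", predict_spam_probability(0, weight, bias))
print("P(spam | 8 marks):", predict_spam_probability(8, weight, bias))
```

The two numbers that come out of `train` are the entire model. A large language model is trained in the same spirit, just with vastly more dials.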

Scale matters enormously:

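| Model | Parameters | What It Can Do |
|---|---|---|
| Logistic regression (one input feature) | 2 | "Is this email spam?" (yes/no) |
| BERT (base) | 110 million | Understand sentence meaning, answer questions |
| GPT-3 | 175 billion | Write essays, code, translate languages |
| GPT-4 | ~1.8 trillion (unconfirmed estimate; not officially disclosed) | Complex reasoning, nuanced analysis, multimodal |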

Going from 2 parameters to 1.8 trillion isn't just "bigger." It's qualitatively different — like comparing a calculator to a research library. More parameters means the model can store more patterns, capture subtler relationships, and handle more complex tasks. But it also means more compute, more memory, and more cost per query.
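To get a feel for the memory cost, here is a back-of-the-envelope sketch. It assumes each parameter is stored as a 16-bit float (2 bytes), which is a common choice but only one of several, and it counts the weights alone, ignoring activations and other runtime overhead. The function name is invented for this example.

```python
BYTES_PER_PARAMETER_FP16 = 2  # assumption: 16-bit floats, 2 bytes each


def weight_memory_gb(parameter_count, bytes_per_parameter=BYTES_PER_PARAMETER_FP16):
    """Approximate memory needed just to hold the weights, in gigabytes (10^9 bytes)."""
    return parameter_count * bytes_per_parameter / 1e9


for name, count in [
    ("Logistic regression", 2),
    ("BERT (base)", 110e6),
    ("GPT-3", 175e9),
]:
    print(f"{name}: {weight_memory_gb(count):g} GB")
```

Under these assumptions, a 175-billion-parameter model needs about 350 GB for its weights alone, which is why models of that size are spread across multiple accelerators.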

Next-Token Prediction: The One Trick

Every modern LLM runs on the same simple idea: given these words, what word comes next?

Feed it "The capital of France is" and it assigns probabilities: "Paris" (92%), "located" (3%), "a" (1%), etc. It picks one (usually the most likely) and appends it. Then it runs again with "The capital of France is Paris" as input, predicting the *next* next token. Repeat thousands of times and you get an essay, a poem, or a block of code.
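The loop described above can be sketched in a few lines. This is a toy under stated assumptions: the probability table is hand-written (using the figures from the example, with the remaining probability mass omitted), it looks only at the last token rather than the whole context as a real model would, and it always picks the most likely candidate (greedy decoding). The names `NEXT_TOKEN_PROBS`, `next_token`, and `generate` are invented for this illustration.

```python
# Hand-written stand-in for a model: last token -> candidate next tokens.
# Probabilities are illustrative and do not sum to 1 (the long tail is omitted).
NEXT_TOKEN_PROBS = {
    "is": {"Paris": 0.92, "located": 0.03, "a": 0.01},
    "Paris": {".": 0.90, ",": 0.05},
}


def next_token(tokens):
    """Return the most likely next token, or None if the toy table has no entry."""
    candidates = NEXT_TOKEN_PROBS.get(tokens[-1])
    if candidates is None:
        return None
    return max(candidates, key=candidates.get)


def generate(prompt, max_new_tokens=10):
    """Predict a token, append it, and run again on the longer input."""
    tokens = prompt.split()
    for _ in range(max_new_tokens):
        token = next_token(tokens)
        if token is None:
            break
        tokens.append(token)
    return tokens


print(generate("The capital of France is"))
```

A real model replaces the lookup table with billions of parameters and often samples from the probabilities instead of always taking the top choice, but the predict-append-repeat loop is the same.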

This is why the term "large language model" is slightly misleading. It's not a knowledge database. It's a next-token predictor that happens to encode an enormous amount of world knowledge in its parameters because that knowledge helps predict the next word.

Why Models Hallucinate

If you ask "Who won the 2028 Olympics 100m dash?" a model will often confidently name someone, probably a plausible sprinter. It's not lying. It's doing the only thing it knows how to do: completing the pattern. "Who won [event]?" patterns in the training data are almost always followed by a name, so the model produces a name.

Hallucination isn't a bug; it's the default behavior. The model always completes the pattern. On its own, next-token prediction has no built-in mechanism to say "I don't have reliable information about this." Every technique to reduce hallucination (RAG, chain-of-thought, tool use) is essentially adding guardrails around this fundamental behavior.
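One simplified way to see why an answer always comes out: the final step of a model turns raw scores into probabilities with a softmax, and a softmax always produces a distribution that sums to 1. The sketch below uses invented scores and placeholder candidate names; it is an illustration of that one step, not a full account of hallucination.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into probabilities that always sum to 1."""
    highest = max(scores)
    exps = [math.exp(score - highest) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def pick(candidates, scores):
    """Return the top candidate and its probability. There is no 'abstain' option."""
    probabilities = softmax(scores)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return candidates[best], probabilities[best]


CANDIDATES = ["Sprinter A", "Sprinter B", "Sprinter C"]  # placeholder names

# Case 1: the scores strongly favor one candidate.
print(pick(CANDIDATES, [8.0, 1.0, 0.5]))

# Case 2: the scores are nearly flat, yet a name is still produced.
print(pick(CANDIDATES, [0.2, 0.1, 0.0]))
```

In the second case the top probability is only slightly above one third, but the output is still a name. Unless something outside this step checks for that uncertainty, the reader sees an answer that looks just as confident as the first.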

Tokens: How Models See Text

Models don't process words or characters — they process tokens, which are subword pieces. The tokenizer splits text into chunks that balance vocabulary size with coverage:

  • "hello" → 1 token
  • "strawberry" → 3 tokens: "str" + "aw" + "berry"
  • "antidisestablishmentarianism" → 6 tokens
This is why models famously struggle to count letters in "strawberry": they never see the individual letters. They see three chunks and process each as a unit. It's also why API pricing is per-token, not per-word. A 1,000-word prompt might be 1,300 tokens (English averages ~1.3 tokens per word). Exact splits and counts vary from one tokenizer to another.

So what? Token awareness matters for three practical reasons: (1) you're billed per token, (2) context windows are measured in tokens, and (3) the tokenizer's splits affect what the model can "see" in your input.
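The splitting and the word-to-token arithmetic above can be sketched as follows. Two loud assumptions: the vocabulary is a tiny hand-picked set chosen to reproduce the "strawberry" example, and the greedy longest-match rule is a stand-in for the merge-based algorithms real tokenizers use. The ~1.3 tokens-per-word figure is the rough English average quoted above, not an exact rate.

```python
# Hypothetical mini-vocabulary, chosen only to reproduce the examples in the text.
TOY_VOCAB = {"hello", "str", "aw", "berry"}


def tokenize(word, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become single-character tokens."""
    tokens = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No vocabulary entry matched, so fall back to one character.
            tokens.append(word[start])
            start += 1
    return tokens


def estimate_tokens(word_count, tokens_per_word=1.3):
    """Rough token estimate for English text, useful for budgeting cost and context."""
    return round(word_count * tokens_per_word)


print(tokenize("hello"))
print(tokenize("strawberry"))
print(estimate_tokens(1000))
```

Once "strawberry" has become three chunks, the individual letters are no longer visible to whatever runs next. For real billing or context-window checks, count tokens with the tokenizer that belongs to the model you are calling rather than relying on an estimate.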

This is chapter 1 of AI Models Demystified.
