Types of Models
The AI Taxonomy
"AI Model" Is Like "Vehicle"
Saying "I need an AI model" is like saying "I need a vehicle." A bicycle, a pickup truck, and a freight train are all vehicles — but you'd never use them interchangeably. The same applies to AI models. Each type takes different inputs, produces different outputs, and excels at different tasks. Picking the wrong type wastes money and delivers poor results.
Foundation Models / LLMs
Input: text. Output: text.
These are the models most people interact with: Claude, GPT-4, Llama, Gemini. Under the hood, they use the Transformer architecture, which is built around a mechanism called attention: for every token (roughly, a word or word piece) in the input, the model calculates how much it should "attend to" every other token.
Why does attention matter? Consider: "The bank by the river was muddy." A model that gives each word one fixed representation treats "bank" the same way regardless of context. With attention, "bank" focuses on "river" and "muddy" to determine this means a riverbank, not a financial institution. This context-awareness is a large part of why Transformers outperform earlier architectures.
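To make the "bank" example concrete, here is a minimal sketch of scaled dot-product attention weights in plain Python. The 2-d vectors are hypothetical and hand-picked so that "bank" aligns with "river" and "muddy". In a real Transformer the queries and keys are learned projections of much larger vectors, so treat this as an illustration of the arithmetic only.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical vectors (assumption: chosen by hand, not learned).
tokens = ["the", "bank", "by", "the", "river", "was", "muddy"]
vectors = {
    "the": [0.0, 0.1],
    "bank": [1.0, 1.0],
    "by": [0.1, 0.0],
    "river": [2.0, 1.5],
    "was": [0.0, 0.2],
    "muddy": [1.5, 1.0],
}
keys = [vectors[t] for t in tokens]

# How much should "bank" attend to each token in the sentence?
weights = attention_weights(vectors["bank"], keys)
for token, weight in sorted(zip(tokens, weights), key=lambda pair: -pair[1]):
    print(f"{token:>6}: {weight:.2f}")
```

With these toy vectors, "river" and "muddy" receive the largest weights, which is the behavior the sentence example describes.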
Practical distinctions:
Embedding Models
Input: text. Output: a list of numbers (a vector / coordinates in high-dimensional space).
Embedding models don't generate text; they place text on a map. Similar meanings land near each other. This is how the classic word-embedding example "King - Man + Woman ≈ Queen" works: the model learned that the *direction* from Man to Woman is roughly the same as the direction from King to Queen in embedding space.
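The vector arithmetic can be sketched with toy numbers. The 3-d vectors below are hypothetical (the axes loosely stand for "royalty", "male", "female"). Real embeddings have hundreds or thousands of dimensions with no such readable axes, and the analogy holds only approximately.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made embeddings (assumption: not produced by any real model).
embeddings = {
    "king": [0.9, 0.9, 0.1],
    "queen": [0.9, 0.1, 0.9],
    "man": [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}

# king - man + woman, computed component by component.
target = [
    k - m + w
    for k, m, w in zip(embeddings["king"], embeddings["man"], embeddings["woman"])
]

# Find the closest word, excluding the three words used in the query.
candidates = {w: v for w, v in embeddings.items() if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))
print(nearest)
```

The same nearest-neighbor lookup, run over sentence or document vectors instead of single words, is the core of semantic search.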
Why this matters practically:
Popular embedding models: OpenAI's text-embedding-3, Cohere Embed, BGE, E5. They're cheap to run (a fraction of the cost of an LLM call) and critical infrastructure for any system that needs to *find* relevant information.
Vision Models
Input: images (pixels). Output: classifications, bounding boxes, segmentation masks, or descriptions.
Early vision models used Convolutional Neural Networks (CNNs) — layers that slide small filters across the image to detect edges, then shapes, then objects. Modern vision models increasingly use Vision Transformers (ViTs), which split the image into patches and apply the same attention mechanism that works so well for text.
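Both ideas in the paragraph above can be illustrated on a tiny image. This sketch slides a hand-written vertical-edge filter across a 4x4 grid (a CNN would learn its filter values during training) and then cuts the same grid into patches the way a ViT does before applying attention. The image, the filter, and the function names are all hypothetical.

```python
def convolve2d(image, kernel):
    """Slide `kernel` across `image` (no padding, stride 1) and return the response map.

    This is the cross-correlation that deep learning libraries call "convolution".
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for r in range(out_rows):
        row = []
        for c in range(out_cols):
            row.append(sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            ))
        output.append(row)
    return output

def split_into_patches(image, size):
    """Cut the image into non-overlapping size x size patches."""
    patches = []
    for r in range(0, len(image) - size + 1, size):
        for c in range(0, len(image[0]) - size + 1, size):
            patches.append([row[c:c + size] for row in image[r:r + size]])
    return patches

# Hypothetical grayscale image: dark left half, bright right half.
image = [
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
    [0, 0, 9, 9],
]
# Responds strongly where brightness jumps from left to right.
vertical_edge = [
    [-1, 1],
    [-1, 1],
]

edges = convolve2d(image, vertical_edge)
patches = split_into_patches(image, 2)
print(edges)         # strong response only in the middle column
print(len(patches))  # four 2x2 patches
```

The filter output is zero over flat regions and large exactly where the dark and bright halves meet, which is what "detecting an edge" means here.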
Three main tasks:
Enterprise uses: quality inspection on manufacturing lines, medical imaging analysis, document layout understanding, retail inventory tracking from shelf photos.
Multimodal Models
Input: text + images + audio (any combination). Output: text, images, or audio.
The real world isn't text-only, and neither are the latest models. GPT-4o, Gemini, and Claude can process images alongside text. You can upload a photo of a whiteboard and ask the model to convert the diagram to code, or show it a chart and ask for analysis.
This is a significant architectural shift. Instead of building separate vision and language models and stitching them together, multimodal models learn a shared representation space where images and text coexist. A photo of a cat and the words "a photo of a cat" land in similar regions of the model's internal space.
Why this matters: It eliminates brittle pipelines. Instead of OCR → text cleanup → LLM, you pass the document image directly to a multimodal model. Instead of describing a UI mockup in words, you screenshot it and say "build this."
Diffusion Models
Input: a text prompt (+ noise). Output: an image.
Diffusion models work backwards from chaos. Start with pure static (random noise), and the model iteratively *removes* noise, guided by your text prompt. Each step makes the image slightly clearer, nudging it toward something that matches "a golden retriever wearing sunglasses on a beach."
The training process teaches the model to reverse noise: take a real image, add a known amount of noise, and train the model to predict what the original looked like. Do this at every noise level and the model learns to go from pure static to coherent images.
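The noising step used to build training examples can be sketched in a few lines. This assumes the common formulation in which the noisy image is a weighted blend of the original and Gaussian noise, and the network is trained to predict the noise that was added. Given the noisy image and the noise level, that is equivalent to predicting the original. The `keep` parameter and the function names are hypothetical, and no neural network is trained here.

```python
import math
import random

def add_noise(image, keep, rng):
    """Blend an image with Gaussian noise.

    `keep` is the fraction of signal variance retained (1.0 = clean, near 0.0 = static).
    Returns the noisy image and the exact noise that was mixed in.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [
        math.sqrt(keep) * x + math.sqrt(1.0 - keep) * n
        for x, n in zip(image, noise)
    ]
    return noisy, noise

def recover_original(noisy, predicted_noise, keep):
    """Invert the blend, given a prediction of the noise (requires keep > 0)."""
    return [
        (y - math.sqrt(1.0 - keep) * n) / math.sqrt(keep)
        for y, n in zip(noisy, predicted_noise)
    ]

rng = random.Random(0)
image = [0.2, 0.8, 0.5, 0.1]  # hypothetical 4-pixel "image"

# One training example per noise level: (noisy image, noise level) -> target noise.
training_pairs = []
for keep in (0.9, 0.5, 0.1):
    noisy, noise = add_noise(image, keep, rng)
    training_pairs.append((noisy, keep, noise))

# If the noise prediction were perfect, the original comes back up to rounding.
noisy, keep, noise = training_pairs[-1]
restored = recover_original(noisy, noise, keep)
```

A trained model only approximates the noise, which is one reason generation proceeds in many small denoising steps rather than a single jump.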
Key players: DALL-E 3 (OpenAI), Stable Diffusion (Stability AI, openly released weights), Midjourney. Each has different strengths: Midjourney is known for artistic quality, Stable Diffusion offers the most control and customization, and DALL-E integrates tightly with ChatGPT.
Enterprise uses: product mockups, marketing visuals, design prototyping, synthetic training data for vision models.
Specialized Models
Some models are purpose-built for a single modality or task:
| Model | Input | Output | Use Case |
|---|---|---|---|
| Whisper (OpenAI) | Audio | Text | Transcription, meeting notes, subtitles |
| ElevenLabs | Text | Audio | Voiceovers, audiobooks, IVR systems |
| Codex / Code Llama | Code + comments | Code | Code generation, completion, refactoring |
| Suno / Udio | Text prompt | Music | Background tracks, jingles, prototyping |
These specialized models often outperform general-purpose models on their specific task because they are trained on large amounts of task-specific data. Whisper, for example, was trained on 680,000 hours of labeled audio, so it is likely to transcribe more reliably than a general-purpose model that was never trained for speech.
Picking the Right Model
The decision tree is simpler than it looks:
The practical rule: Start with a foundation model API (Claude, GPT-4). As a rough rule of thumb, it covers about 80% of use cases. Only reach for specialized models when you hit a performance ceiling, a cost wall, or a data privacy requirement that rules out APIs.
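One way to encode the choice is a lookup keyed on input and output type, plus a check for the three escape conditions in the practical rule. This is a hypothetical sketch assembled from the "Input / Output" lines in this chapter, not the course's own decision tree. The category labels and function names are assumptions.

```python
# Hypothetical lookup built from the "Input / Output" lines in this chapter.
MODEL_TYPES = {
    ("text", "text"): "foundation model / LLM",
    ("text", "vector"): "embedding model",
    ("image", "labels"): "vision model",
    ("text+image", "text"): "multimodal model",
    ("text", "image"): "diffusion model",
    ("audio", "text"): "speech-to-text model (e.g. Whisper)",
    ("text", "audio"): "text-to-speech model (e.g. ElevenLabs)",
    ("text", "music"): "music generation model (e.g. Suno / Udio)",
}

def pick_model_type(input_kind, output_kind):
    """Return the model family for an input/output pair, or None if not covered here."""
    return MODEL_TYPES.get((input_kind, output_kind))

def needs_specialized_model(hit_performance_ceiling, hit_cost_wall, api_ruled_out_by_privacy):
    """Stay on a foundation model API unless at least one of these is true."""
    return hit_performance_ceiling or hit_cost_wall or api_ruled_out_by_privacy

print(pick_model_type("text", "image"))
print(needs_specialized_model(False, False, False))
```

The lookup answers "which family fits this task", while the second function answers "is it time to leave the general-purpose API", so the two questions stay separate.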
This is chapter 3 of AI Models Demystified.