
Types of Models

The AI Taxonomy

"AI Model" Is Like "Vehicle"

Saying "I need an AI model" is like saying "I need a vehicle." A bicycle, a pickup truck, and a freight train are all vehicles — but you'd never use them interchangeably. The same applies to AI models. Each type takes different inputs, produces different outputs, and excels at different tasks. Picking the wrong type wastes money and delivers poor results.

Foundation Models / LLMs

Input: text. Output: text.

These are the models most people interact with: Claude, GPT-4, Llama, Gemini. Under the hood, they use the Transformer architecture, which is built around a mechanism called attention: for every token (roughly, a word or word piece) in the input, the model calculates how much it should "attend to" every other token.

Why does attention matter? Consider: "The bank by the river was muddy." A model that gives each word a single fixed representation treats "bank" the same way regardless of context. With attention, "bank" focuses on "river" and "muddy" to determine this means a riverbank, not a financial institution. This context-awareness is a large part of why Transformers perform so much better than earlier architectures.
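
To make the "bank" example concrete, here is a minimal sketch of how attention weights can be computed, using scaled dot products followed by a softmax. The 2-d vectors are hand-picked for illustration and do not come from any real model. As a simplification, the same vector serves as both query and key, whereas a real Transformer derives them from learned projections.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all keys."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)

# Hypothetical toy vectors (assumption: chosen by hand, not learned).
tokens = ["The", "bank", "by", "the", "river", "was", "muddy"]
vectors = {
    "The": [0.1, 0.0], "bank": [1.0, 1.0], "by": [0.0, 0.1], "the": [0.1, 0.0],
    "river": [2.0, 1.5], "was": [0.0, 0.1], "muddy": [1.5, 2.0],
}

keys = [vectors[t] for t in tokens]
weights = attention_weights(vectors["bank"], keys)

# Show how much "bank" attends to each token in the sentence.
for token, weight in zip(tokens, weights):
    print(f"{token:>6}: {weight:.2f}")
```

In this toy setup the largest weights fall on "river" and "muddy", which is the behavior the paragraph above describes.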

Practical distinctions:

  • Closed-source (Claude, GPT-4, Gemini): you call an API, pay per token, and can't see the weights
  • Open-weight (Llama, Mistral, Qwen): download and run on your own hardware with full control; there is no per-token cost, but you pay for compute
  • Size tradeoffs: as a rough guide, 7B-parameter models run on a laptop (usually quantized); 70B needs a serious GPU; 400B+ needs a cluster
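
The size tradeoff comes down to simple arithmetic: the memory needed for the weights is roughly the parameter count times the bytes per parameter. The sketch below assumes two common precisions (16-bit and 4-bit quantized) and counts the weights only. It ignores activations, the KV cache, and runtime overhead, so real requirements are higher. The function name is hypothetical.

```python
def weight_memory_gb(params_billions, bits_per_param):
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9

# Compare the three size classes at two assumed precisions.
for size in (7, 70, 400):
    fp16 = weight_memory_gb(size, 16)
    int4 = weight_memory_gb(size, 4)
    print(f"{size}B params: ~{fp16:.0f} GB at 16-bit, ~{int4:.1f} GB at 4-bit")
```

A 7B model works out to about 14 GB at 16-bit and about 3.5 GB at 4-bit, which is why quantized 7B models fit on a laptop while 400B+ models need multiple accelerators.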

Embedding Models

Input: text. Output: a list of numbers (a vector, i.e. coordinates in a high-dimensional space).

Embedding models don't generate text; they place text on a map. Similar meanings land near each other. This is how "King - Man + Woman ≈ Queen" works: the model learned that the *direction* from Man to Woman is roughly the same as the direction from King to Queen in embedding space.
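
The analogy can be reproduced with a few lines of vector arithmetic. This is a sketch with hypothetical 3-d embeddings chosen by hand (the axes loosely stand for royalty, maleness, and femaleness). Real embeddings have hundreds or thousands of dimensions, and the analogy holds only approximately.

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical hand-made embeddings (assumption: not from a real model).
emb = {
    "king":   [0.9, 0.8, 0.1],
    "queen":  [0.9, 0.1, 0.8],
    "man":    [0.1, 0.8, 0.1],
    "woman":  [0.1, 0.1, 0.8],
    "prince": [0.8, 0.8, 0.1],
    "apple":  [0.0, 0.1, 0.1],
}

# King - Man + Woman, computed component by component.
target = [k - m + w for k, m, w in zip(emb["king"], emb["man"], emb["woman"])]

# Exclude the words used in the query, as analogy evaluations usually do.
candidates = [word for word in emb if word not in ("king", "man", "woman")]
best = max(candidates, key=lambda word: cosine(target, emb[word]))
print(best)
```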

Why this matters practically:

  • Semantic search: instead of matching keywords, find documents that *mean* the same thing as the query
  • RAG retrieval: embed your documents, embed the user's question, find the closest matches
  • Clustering: group customer feedback by topic without manually reading thousands of messages
  • Anomaly detection: flag support tickets that are semantically unusual

Popular embedding models: OpenAI's text-embedding-3, Cohere Embed, BGE, E5. They're cheap to run (a fraction of the cost of an LLM call) and critical infrastructure for any system that needs to *find* relevant information.
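
The retrieval pattern described above (embed the documents, embed the question, find the closest matches) can be sketched as a ranking by cosine similarity. The vectors here are hypothetical placeholders typed in by hand. In a real system they would come from an embedding model, and `top_k` is an illustrative helper, not a library function.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical precomputed document embeddings (assumption: hand-made values).
index = [
    ("Refund policy: returns accepted within 30 days", [0.9, 0.1, 0.0]),
    ("Shipping times for international orders", [0.1, 0.9, 0.1]),
    ("How to reset your account password", [0.0, 0.1, 0.9]),
]

query_text = "How do I get my money back?"
query_vec = [0.8, 0.2, 0.1]  # hypothetical embedding of query_text

def top_k(query_vec, index, k=2):
    """Return the k (score, text) pairs most similar to the query."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:k]

results = top_k(query_vec, index)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

Note that the query shares no keywords with the refund document. The match comes entirely from the vectors being close, which is the point of semantic search.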

Vision Models

Input: images (pixels). Output: classifications, bounding boxes, segmentation masks, or descriptions.

Early vision models used Convolutional Neural Networks (CNNs): layers that slide small filters across the image to detect edges, then shapes, then objects. Modern vision models increasingly use Vision Transformers (ViTs), which split the image into patches and apply the same attention mechanism that works so well for text.
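
The "sliding filter" idea can be shown in a few lines. This is a minimal sketch with no padding and a stride of 1. The filter is a Sobel-style vertical-edge detector chosen for illustration, whereas a CNN learns its filter values during training. Like most deep learning libraries, it computes a cross-correlation (the filter is not flipped).

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the response map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[i + di][j + dj] * kernel[di][dj]
                for di in range(kh)
                for dj in range(kw)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]

# A 5x6 grayscale image: dark (0) on the left, bright (1) on the right.
image = [[0, 0, 0, 1, 1, 1] for _ in range(5)]

# Sobel-style filter that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-2, 0, 2],
    [-1, 0, 1],
]

response = convolve2d(image, vertical_edge)
for row in response:
    print(row)
```

Each output row is `[0, 4, 4, 0]`: the response is zero over the flat regions and peaks where the filter straddles the dark-to-bright boundary.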

Three main tasks:

  • Classification: "This image contains a golden retriever" (one label per image)
  • Object detection: "There's a car at coordinates (120,40) and a person at (300,180)" (labels + locations)
  • Segmentation: color every pixel by what object it belongs to (pixel-level precision)

Enterprise uses: quality inspection on manufacturing lines, medical imaging analysis, document layout understanding, retail inventory tracking from shelf photos.

Multimodal Models

Input: text + images + audio (any combination). Output: text, images, or audio.

The real world isn't text-only, and neither are the latest models. GPT-4o, Gemini, and Claude can process images alongside text. You can upload a photo of a whiteboard and ask the model to convert the diagram to code, or show it a chart and ask for analysis.

This is a significant architectural shift. Instead of building separate vision and language models and stitching them together, multimodal models learn a shared representation space where images and text coexist. A photo of a cat and the words "a photo of a cat" land in similar regions of the model's internal space.

Why this matters: It eliminates brittle pipelines. Instead of OCR → text cleanup → LLM, you pass the document image directly to a multimodal model. Instead of describing a UI mockup in words, you screenshot it and say "build this."

Diffusion Models

Input: a text prompt (+ noise). Output: an image.

Diffusion models work backwards from chaos. Start with pure static (random noise), and the model iteratively *removes* noise, guided by your text prompt. Each step makes the image slightly clearer, nudging it toward something that matches "a golden retriever wearing sunglasses on a beach."

The training process teaches the model to reverse noise: take a real image, add a known amount of noise, and train the model to predict what the original looked like (in practice, many implementations predict the noise that was added, which amounts to the same thing). Do this at every noise level and the model learns to go from pure static to coherent images.
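
The training step described above can be sketched on a tiny 4-"pixel" image. This assumes the common noise-prediction formulation, in which the noisy sample is a weighted blend of the clean image and Gaussian noise. `alpha_bar` (the fraction of signal kept at a given noise level) and all function names are illustrative assumptions, and no neural network is actually trained here.

```python
import math
import random

def add_noise(x0, alpha_bar, rng):
    """Forward process: blend the clean signal with a known amount of Gaussian noise."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [
        math.sqrt(alpha_bar) * x + math.sqrt(1 - alpha_bar) * e
        for x, e in zip(x0, eps)
    ]
    return xt, eps

def recover_x0(xt, predicted_eps, alpha_bar):
    """Invert the blend: given a noise prediction, estimate the clean signal."""
    return [
        (x - math.sqrt(1 - alpha_bar) * e) / math.sqrt(alpha_bar)
        for x, e in zip(xt, predicted_eps)
    ]

def mse(a, b):
    """Mean squared error, the usual training loss for noise prediction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
x0 = [0.0, 0.5, 1.0, 0.5]  # hypothetical clean "image"
alpha_bar = 0.3            # hypothetical noise level: 30% signal retained

xt, eps = add_noise(x0, alpha_bar, rng)

# A trained network would predict eps from (xt, noise level, prompt).
# Here we compare a naive all-zeros guess with a perfect prediction.
loss_untrained = mse([0.0] * len(x0), eps)
loss_oracle = mse(eps, eps)
reconstructed = recover_x0(xt, eps, alpha_bar)

print(f"loss (zero guess): {loss_untrained:.3f}, loss (perfect): {loss_oracle:.3f}")
```

With a perfect noise prediction, `reconstructed` matches `x0` up to floating-point error. Training pushes the network's prediction toward that ideal at every noise level.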

Key players: DALL-E 3 (OpenAI), Stable Diffusion (Stability AI, open-weight), Midjourney. Each has different strengths: Midjourney excels at artistic quality, Stable Diffusion offers the most control and customization, and DALL-E integrates tightly with ChatGPT.

Enterprise uses: product mockups, marketing visuals, design prototyping, synthetic training data for vision models.

Specialized Models

Some models are purpose-built for a single modality or task:

| Model | Input | Output | Use Case |
|---|---|---|---|
| Whisper (OpenAI) | Audio | Text | Transcription, meeting notes, subtitles |
| ElevenLabs | Text | Audio | Voiceovers, audiobooks, IVR systems |
| Codex / Code Llama | Code + comments | Code | Code generation, completion, refactoring |
| Suno / Udio | Text prompt | Music | Background tracks, jingles, prototyping |

These specialized models typically outperform general-purpose models on their specific task because their training focuses on domain-specific data. Whisper was trained on 680,000 hours of audio transcription data, so it's going to transcribe better than a general LLM hearing audio for the first time.

Picking the Right Model

The decision tree is simpler than it looks:

  • What's your input? Text only → LLM or embedding model. Images → vision or multimodal. Audio → a speech model such as Whisper.
  • What do you need back? Generated text → LLM. Similar items → embedding model. Generated images → diffusion.
  • What's your budget? Closed-source APIs for fast iteration. Open-weight for scale or data privacy.
  • How specialized is the task? General knowledge → foundation model. Domain-specific → fine-tuned or specialized model.

The practical rule: Start with a foundation model API (Claude, GPT-4). As a rule of thumb, it handles roughly 80% of use cases. Only reach for specialized models when you hit a performance ceiling, a cost wall, or a data privacy requirement that rules out APIs.
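
The first two questions of the checklist can be written down as a small lookup function. This is only a sketch: the function name and the string labels for inputs and outputs are hypothetical, and the budget and specialization questions are left out because they depend on project details.

```python
def pick_model_type(input_kind, output_kind):
    """Map (input, desired output) to a model family, following the checklist above."""
    if output_kind == "image":
        return "diffusion model"
    if output_kind == "similar items":
        return "embedding model"
    if input_kind == "audio":
        return "speech-to-text model (e.g. Whisper)"
    if input_kind == "image":
        return "vision or multimodal model"
    if input_kind == "text" and output_kind == "text":
        return "LLM"
    # Anything else: fall back to the practical rule.
    return "start with a foundation model API and re-evaluate"

# Hypothetical example requests.
print(pick_model_type("text", "text"))
print(pick_model_type("text", "similar items"))
print(pick_model_type("image", "labels"))
```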

This is chapter 3 of AI Models Demystified.
