
How Embedding Models Work

From Text to Vector

The Embedding Pipeline

Every embedding model follows the same basic pipeline: tokenize → encode → pool → normalize. Understanding each step helps you make better decisions about which model to use and how to prepare your data.

Step 1: Tokenization

Text is split into tokens — subword units that the model understands. "Embeddings" might become ["em", "bed", "ding", "s"]. Different models use different tokenizers:

| Tokenizer | Used By | Vocab Size |
| --- | --- | --- |
| **BPE** (Byte Pair Encoding) | OpenAI, GPT-family | 50K-100K |
| WordPiece | BERT, E5 | 30K |
| SentencePiece | T5, multilingual models | 32K-256K |

Tokenization matters because a model's context window, the maximum input it can process at once, is measured in tokens rather than characters or words. Longer text produces more tokens, and anything past the limit is typically truncated.
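To make the idea concrete, here is a minimal sketch of greedy longest-match subword tokenization followed by truncation to a context window. The vocabulary `TOY_VOCAB` and the helper names are made up for illustration; real tokenizers (BPE, WordPiece, SentencePiece) learn vocabularies of tens of thousands of pieces and use more elaborate rules.

```python
# Toy greedy longest-match subword tokenizer (WordPiece-like in spirit).
# TOY_VOCAB is a hypothetical vocabulary, not taken from any real model.
TOY_VOCAB = {"em", "bed", "ding", "s", "token", "ize", "r"}


def tokenize_word(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into the longest matching pieces, left to right."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate span until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:  # no piece matches here: treat the word as unknown
            return [unk]
        pieces.append(word[start:end])
        start = end
    return pieces


def encode(text, max_tokens):
    """Tokenize whitespace-separated words, then truncate to the context window."""
    tokens = [p for w in text.lower().split() for p in tokenize_word(w)]
    return tokens[:max_tokens], len(tokens) > max_tokens


print(tokenize_word("embeddings"))  # ['em', 'bed', 'ding', 's']
tokens, truncated = encode("Embeddings tokenizer", max_tokens=5)
print(tokens, truncated)  # ['em', 'bed', 'ding', 's', 'token'] True
```

Two words already produce seven tokens here, so a five-token window drops the tail of the input. The same effect applies at scale when a document exceeds a model's limit.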

Step 2: Transformer Encoding

Each token is converted into a contextualized vector. The word "bank" gets a different vector in "river bank" versus "savings bank" because the transformer considers surrounding tokens through self-attention.

This is the critical difference from older approaches like Word2Vec, which gave "bank" the same vector regardless of context.
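The difference can be sketched with a toy single-head self-attention step. Everything below is an illustrative assumption: the 2-d vectors in `STATIC` are invented, and the attention uses identity projections with no positional information, whereas a real transformer stacks many layers with learned weights.

```python
import math

# Hypothetical 2-d static embeddings: one fixed vector per word (Word2Vec-style).
STATIC = {
    "river": [1.0, 0.0],
    "savings": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product attention with identity Q/K/V projections."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        weights = softmax(scores)
        # Each output is a weighted mix of all input vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out


def contextual(words):
    return self_attention([STATIC[w] for w in words])


bank_in_river = contextual(["river", "bank"])[1]
bank_in_savings = contextual(["savings", "bank"])[1]
print(bank_in_river, bank_in_savings)
```

The static lookup gives "bank" the same vector in both phrases, while the attention output for "bank" is pulled toward "river" in one case and toward "savings" in the other.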

Step 3: Pooling

The transformer produces one vector per token, but we need a single vector for the entire input. Pooling strategies:

  • CLS token: Use the special [CLS] token's vector (BERT-style). Simple, but quality depends on the model having been trained to summarize the whole input in that vector.
  • Mean pooling: Average all token vectors, ignoring padding. A common and robust default, used by Sentence-BERT, E5, and many other modern models.
  • Last token: Use the final token's vector. Used by some GPT-based embedding models, where only the last token has attended to the full input.
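The three strategies can be sketched in a few lines of plain Python. The function names, the toy vectors, and the assumption that padding sits on the right (marked by 0 in the attention mask) are illustrative choices, not any specific library's API.

```python
def cls_pool(token_vecs):
    """Use the first token's vector (the [CLS] position in BERT-style models)."""
    return list(token_vecs[0])


def mean_pool(token_vecs, mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    kept = [v for v, m in zip(token_vecs, mask) if m == 1]
    dim = len(kept[0])
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def last_token_pool(token_vecs, mask):
    """Use the vector of the last real (non-padding) token."""
    last = max(i for i, m in enumerate(mask) if m == 1)
    return list(token_vecs[last])


# Three real tokens followed by one padding token.
token_vecs = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pool(token_vecs))               # [1.0, 0.0]
print(mean_pool(token_vecs, mask))        # [1.0, 2.0]
print(last_token_pool(token_vecs, mask))  # [2.0, 4.0]
```

Ignoring the mask would let the padding vector leak into the mean, which is a common source of subtly wrong embeddings in batched inference.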
Step 4: Normalization

The output vector is scaled to unit length (magnitude = 1.0). This ensures cosine similarity equals dot product, simplifying and speeding up search.
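A short check of that claim, using two made-up vectors: after L2 normalization, the plain dot product gives the same value as cosine similarity computed on the original vectors.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(vec, vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]  # illustrative values only
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# Both values are approximately 0.96.
print(cosine(a, b), dot(a_unit, b_unit))
```

Because the norms are divided out once at indexing time, search only needs multiply-and-add operations per comparison.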

Dimensionality: How Many Numbers?

Embedding dimension is a key design choice:

| Model | Dimensions | Context Window |
| --- | --- | --- |
| OpenAI text-embedding-3-small | 1,536 | 8,191 tokens |
| OpenAI text-embedding-3-large | 3,072 | 8,191 tokens |
| Cohere embed-v3 | 1,024 | 512 tokens |
| Voyage AI voyage-3 | 1,024 | 32,000 tokens |
| **BGE-large-en-v1.5** (open) | 1,024 | 512 tokens |
| **E5-mistral-7b** (open) | 4,096 | 32,768 tokens |
| **nomic-embed-text** (open) | 768 | 8,192 tokens |

More dimensions can capture finer distinctions in meaning, but they also cost more storage (each float32 value takes 4 bytes, so a 1,536-dimension vector takes about 6 KB) and slow down search.
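The storage side of this trade-off is simple arithmetic. The sketch below counts raw float32 vector bytes only; index structures and metadata in a real vector database add overhead that is not modeled here.

```python
BYTES_PER_FLOAT32 = 4


def index_size_bytes(num_vectors, dims):
    """Raw storage for num_vectors float32 embeddings of the given dimension."""
    return num_vectors * dims * BYTES_PER_FLOAT32


for dims in (768, 1024, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 2**30
    print(f"{dims:>5} dims: {gib:.2f} GiB per million vectors")
```

Doubling the dimension doubles the raw footprint, so moving from 1,536 to 3,072 dimensions is a meaningful cost at corpus scale.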

Matryoshka embeddings (supported by OpenAI and some open models) let you truncate vectors to fewer dimensions with minimal quality loss. A 3,072-dim vector can be cut to 256 dims and still perform well for many tasks.
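Truncation itself is just slicing, followed by re-normalization so that dot products still behave like cosine similarity. This is a sketch under the assumption that the model was trained so its leading dimensions carry most of the signal; applying it to an arbitrary model's vectors will likely hurt quality. The function name and the toy vector are illustrative.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # toy unit-length vector standing in for a real embedding
short = truncate_embedding(full, 2)
print(len(short), sum(x * x for x in short))  # length 2, squared norm close to 1.0
```

Some hosted APIs can return shortened vectors directly, in which case this step happens on the provider's side.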

Model Comparison: What Matters

Quality Benchmarks (MTEB)

The Massive Text Embedding Benchmark (MTEB) ranks models across tasks such as retrieval, classification, clustering, and reranking. Key findings:

  • Proprietary models (OpenAI, Cohere, Voyage) often rank near the top
  • Open models (E5-mistral, BGE) are competitive and improving fast
  • Domain matters: a model ranked #1 overall may not be best for your specific data
  • Multilingual models often sacrifice some English quality for language breadth

Cost Comparison

| Provider | Model | Price per 1M tokens |
| --- | --- | --- |
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Cohere | embed-v3 | $0.10 |
| Voyage AI | voyage-3 | $0.06 |
| Local (open) | nomic-embed-text | $0 (compute only) |

For a million-document corpus at ~500 tokens each (500M tokens in total), embedding everything once costs about $10 with OpenAI small and $65 with OpenAI large. A local model has no per-token fee but requires GPU infrastructure.
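The estimate is easy to reproduce for your own corpus. The prices below are copied from the table above and may be out of date, so treat them as placeholders. `Decimal` is used only to keep the currency arithmetic exact.

```python
from decimal import Decimal

# Price per 1M tokens, taken from the table in this guide (check current pricing).
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": Decimal("0.02"),
    "text-embedding-3-large": Decimal("0.13"),
    "embed-v3": Decimal("0.10"),
    "voyage-3": Decimal("0.06"),
}


def embedding_cost(num_docs, avg_tokens_per_doc, price_per_million):
    """One-time cost in dollars to embed the whole corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return Decimal(total_tokens) / Decimal(1_000_000) * price_per_million


for model, price in PRICE_PER_MILLION_TOKENS.items():
    print(f"{model}: ${embedding_cost(1_000_000, 500, price):.2f}")
```

Re-embedding after a model change or a chunking change incurs the same cost again, which is worth factoring into the choice.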

Chunking Strategies

Since embedding models have context windows, long documents must be chunked, that is, split into smaller pieces. This is one of the most impactful decisions in any embedding pipeline.

Common Strategies

| Strategy | How It Works | Best For |
| --- | --- | --- |
| Fixed-size | Split every N tokens with overlap | Simple, predictable |
| Sentence | Split at sentence boundaries | Natural units of meaning |
| Paragraph | Split at paragraph breaks | Well-structured docs |
| Semantic | Split when topic changes (using embeddings!) | Long, varied documents |
| Recursive | Try paragraph → sentence → token splits | General purpose |
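As one possible reading of the recursive strategy, the sketch below splits on paragraph breaks first, falls back to sentence boundaries, and finally cuts on a fixed word count. Counting whitespace-separated words as "tokens" is a simplifying assumption (a real pipeline would count with the model's tokenizer), and the function names are hypothetical.

```python
import re

# Coarse-to-fine split patterns: blank line (paragraph), then sentence end.
SPLIT_PATTERNS = (r"\n\s*\n", r"(?<=[.!?])\s+")


def n_tokens(text):
    # Assumption: whitespace-separated words stand in for model tokens.
    return len(text.split())


def recursive_split(text, max_tokens, patterns=SPLIT_PATTERNS):
    """Split only as finely as needed for every piece to fit in max_tokens."""
    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens:
        return [text]
    if not patterns:
        # Nothing finer to split on: hard cut into word windows.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    chunks = []
    for part in re.split(patterns[0], text):
        chunks.extend(recursive_split(part, max_tokens, patterns[1:]))
    return chunks


def merge_small(chunks, max_tokens):
    """Greedily join neighbouring pieces while the result still fits."""
    merged = []
    for chunk in chunks:
        if merged and n_tokens(merged[-1]) + n_tokens(chunk) <= max_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged


def chunk_document(text, max_tokens):
    return merge_small(recursive_split(text, max_tokens), max_tokens)


doc = (
    "Short intro paragraph.\n\n"
    "First long sentence with many words here. "
    "Second sentence also has several words. Third one."
)
for chunk in chunk_document(doc, max_tokens=8):
    print(repr(chunk))
```

The merge step here can join pieces across a paragraph boundary. Whether that is acceptable depends on the documents, so a production splitter might restrict merging to pieces from the same parent.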

Chunk Size Trade-offs

  • Too small (< 100 tokens): Loses context. In "It increased by 40%", what increased?
  • Too large (> 1,000 tokens): Dilutes meaning. A paragraph about pricing mixed with one about features produces a muddled vector.
  • Sweet spot: 200-500 tokens with 50-100 token overlap between chunks.

Overlap Matters

Adjacent chunks should share some text (overlap) so that information at chunk boundaries isn't lost. Typical overlap: 10-20% of chunk size, for example 50-100 tokens for a 500-token chunk.
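A minimal sliding-window sketch of fixed-size chunking with overlap follows. It operates on an already tokenized list. The function name and the small numbers are illustrative, chosen so the overlap is easy to see.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


tokens = list(range(10))  # stand-in for token IDs
chunks = chunk_with_overlap(tokens, chunk_size=4, overlap=1)
print(chunks)  # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

With the sizes suggested in this guide, a call such as `chunk_with_overlap(tokens, chunk_size=500, overlap=50)` would advance 450 tokens per chunk.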

Instruction-Tuned Embeddings

Many modern embedding models accept instructions that tell the model what kind of similarity to capture:

    # Illustrative pseudocode: embed() stands in for any embedding call.

    # Without instruction
    embed("Apple released new products")
    # → vector near fruits AND tech companies

    # With instruction: "Retrieve financial news"
    embed("Retrieve financial news: Apple released new products")
    # → vector near tech companies, far from fruits

Models like E5, Voyage, and Cohere embed-v3 support different input types (query vs document) that optimize the embedding for asymmetric search, where short queries are matched against longer documents.
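For models that take the input type as a text prefix, preparing inputs is plain string formatting. The sketch below assumes the `query: ` and `passage: ` prefixes used by the E5 family. Other providers typically expose the input type as a request parameter instead, so check the model's documentation before reusing these strings. `prepare_input` is a hypothetical helper.

```python
# Prefix convention assumed from the E5 model family; verify for your model.
E5_PREFIXES = {"query": "query: ", "document": "passage: "}


def prepare_input(text, input_type, prefixes=E5_PREFIXES):
    """Prepend the prefix that tells the model which side of the search this text is."""
    if input_type not in prefixes:
        raise ValueError(f"unknown input_type: {input_type!r}")
    return prefixes[input_type] + text.strip()


print(prepare_input("how does mean pooling work?", "query"))
print(prepare_input("Mean pooling averages all token vectors.", "document"))
```

Using the wrong prefix, or none at all, usually still returns a vector, so this kind of mistake tends to show up only as degraded retrieval quality.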

Key Takeaways

  • Embedding pipeline: tokenize → encode → pool → normalize
  • Mean pooling over transformer outputs is the standard approach
  • Dimensionality trades nuance for storage and speed
  • Chunking is critical: 200-500 tokens with overlap is the sweet spot
  • Instruction-tuned models let you control what similarity means
  • Open models are catching up to proprietary ones in quality

This is chapter 2 of Vector Databases & Embeddings.
