How Embedding Models Work
From Text to Vector
The Embedding Pipeline
Every embedding model follows the same basic pipeline: tokenize → encode → pool → normalize. Understanding each step helps you make better decisions about which model to use and how to prepare your data.
Step 1: Tokenization
Text is split into tokens — subword units that the model understands. "Embeddings" might become ["em", "bed", "ding", "s"]. Different models use different tokenizers:
| Tokenizer | Used By | Vocab Size |
|---|---|---|
| **BPE** (Byte Pair Encoding) | OpenAI, GPT-family | 50K-100K |
| WordPiece | BERT, E5 | 30K |
| SentencePiece | T5, multilingual models | 32K-256K |
Tokenization matters because a model's context window is measured in tokens, not characters or words. The tokenizer determines how many tokens a given text consumes: longer text means more tokens, and anything beyond the window is typically truncated.
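To make the splitting and truncation concrete, here is a minimal sketch of greedy longest-match subword tokenization, a simplified WordPiece-style approach. The vocabulary `TOY_VOCAB` and both function names are hypothetical, chosen only to reproduce the "Embeddings" split above; real tokenizers learn their vocabulary from data and handle continuation markers, casing, and bytes differently.

```python
def greedy_subword_tokenize(word, vocab):
    """Split a word into the longest matching vocabulary pieces, left to right."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocabulary entry.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matched at this position: give up on the whole word.
            return ["[UNK]"]
        tokens.append(word[start:end])
        start = end
    return tokens


def truncate_to_window(tokens, max_tokens):
    """Keep only the tokens that fit in the context window."""
    return tokens[:max_tokens]


# Hypothetical toy vocabulary (an assumption for illustration only).
TOY_VOCAB = {"em", "bed", "ding", "s", "e", "m", "b", "d"}

pieces = greedy_subword_tokenize("embeddings", TOY_VOCAB)
print(pieces)
print(truncate_to_window(pieces, 3))  # the final piece no longer fits
```

With a window of 3 tokens, the trailing "s" is silently dropped, which is the same thing that happens to the end of a long document when it exceeds a model's limit.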
Step 2: Transformer Encoding
Each token is converted into a contextualized vector. The word "bank" gets a different vector in "river bank" versus "savings bank" because the transformer considers surrounding tokens through self-attention.
This is the critical difference from older approaches like Word2Vec, which gave "bank" the same vector regardless of context.
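The "bank" example can be reproduced with a toy version of self-attention. This is a sketch under strong assumptions: the 2-D vectors in `STATIC` are made up, and the attention layer uses identity projections instead of the learned query, key, and value matrices (plus multiple heads and layers) that a real transformer has.

```python
import math

# Hypothetical 2-D static embeddings: one fixed vector per word, Word2Vec-style.
STATIC = {
    "river": [1.0, 0.0],
    "savings": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """One attention layer with identity query/key/value projections."""
    dim = len(vectors[0])
    outputs = []
    for query in vectors:
        # Scaled dot-product scores of this token against every token.
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in vectors])
        # The output is a weighted mix of all token vectors in the input.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]
        outputs.append(mixed)
    return outputs


def contextual_vector(sentence, target):
    """Return the attention output at the position of the target word."""
    words = sentence.split()
    contextual = self_attention([STATIC[w] for w in words])
    return contextual[words.index(target)]


river_bank = contextual_vector("river bank", "bank")
savings_bank = contextual_vector("savings bank", "bank")
print(river_bank, savings_bank)
```

The static lookup gives "bank" the same vector in both phrases, while the attention output is pulled toward "river" in one case and toward "savings" in the other.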
Step 3: Pooling
The transformer produces one vector per token, but we need a single vector for the entire input. Pooling strategies:
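As an illustration, the sketch below implements three pooling strategies that are widely used in practice: mean pooling, CLS pooling, and last-token pooling. The function names and the toy vectors are assumptions for this example; which strategy a given model expects is defined by how it was trained, so check the model's documentation.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask == 0)."""
    dim = len(token_vectors[0])
    kept = [v for v, m in zip(token_vectors, attention_mask) if m == 1]
    return [sum(v[i] for v in kept) / len(kept) for i in range(dim)]


def cls_pool(token_vectors):
    """Use the first token's vector (the [CLS] position in BERT-style models)."""
    return list(token_vectors[0])


def last_token_pool(token_vectors, attention_mask):
    """Use the vector of the last non-padding token."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(token_vectors[last_index])


# Toy input: three real tokens followed by one padding position.
token_vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]]
attention_mask = [1, 1, 1, 0]

print(mean_pool(token_vectors, attention_mask))
print(cls_pool(token_vectors))
print(last_token_pool(token_vectors, attention_mask))
```

Note that the padding position is excluded from the mean; averaging over it would drag the result toward zero.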
Step 4: Normalization
The output vector is scaled to unit length (magnitude = 1.0). For unit-length vectors, cosine similarity equals the dot product, so search can use the cheaper dot product without changing the ranking.
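The equivalence is easy to check numerically. The following sketch uses two arbitrary example vectors and helper names of my own choosing; it is not tied to any particular library.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(dot(vector, vector))
    if norm == 0.0:
        return list(vector)
    return [x / norm for x in vector]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a = [3.0, 4.0]
b = [1.0, 2.0]
a_unit = l2_normalize(a)
b_unit = l2_normalize(b)

# Cosine similarity of the raw vectors vs. plain dot product of the unit vectors.
print(cosine_similarity(a, b), dot(a_unit, b_unit))
```

Both printed values agree up to floating-point rounding, which is why many vector stores only need to compute a dot product at query time.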
Dimensionality: How Many Numbers?
Embedding dimension is a key design choice:
| Model | Dimensions | Context Window |
|---|---|---|
| OpenAI text-embedding-3-small | 1,536 | 8,191 tokens |
| OpenAI text-embedding-3-large | 3,072 | 8,191 tokens |
| Cohere embed-v3 | 1,024 | 512 tokens |
| Voyage AI voyage-3 | 1,024 | 32,000 tokens |
| **BGE-large-en-v1.5** (open) | 1,024 | 512 tokens |
| **E5-mistral-7b** (open) | 4,096 | 32,768 tokens (model card recommends inputs of at most 4,096) |
| **nomic-embed-text** (open) | 768 | 8,192 tokens |
More dimensions = more nuance in meaning representation, but also more storage (each float32 = 4 bytes) and slower search.
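The storage side of this trade-off is simple arithmetic. The sketch below estimates the raw size of the vectors only, assuming float32 values and ignoring index structures and metadata, which add overhead that varies by database.

```python
BYTES_PER_FLOAT32 = 4


def raw_vector_bytes(num_vectors, dimensions):
    """Raw storage for the vectors alone, excluding any index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


# Dimensions taken from the model table above.
for dims in (768, 1024, 1536, 3072):
    gib = raw_vector_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>5} dims: {gib:.2f} GiB per million vectors")
```

Doubling the dimension doubles both the storage and the number of multiplications per similarity comparison.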
Matryoshka embeddings (supported by OpenAI and some open models) let you truncate vectors to fewer dimensions with minimal quality loss. A 3,072-dim vector can be cut to 256 dims and still perform well for many tasks.
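Truncation itself is a two-step operation: keep the leading dimensions, then rescale to unit length so that dot product still equals cosine similarity. A minimal sketch follows, with a hypothetical helper name and a toy 4-D vector; it only gives good results for models that were trained to support this kind of truncation.

```python
import math


def truncate_embedding(vector, target_dims):
    """Keep the first target_dims values, then rescale to unit length."""
    head = list(vector[:target_dims])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


# Toy unit-length vector standing in for a full-size embedding.
full = [0.5, 0.5, 0.5, 0.5]
short = truncate_embedding(full, 2)

print(short)
print(math.sqrt(sum(x * x for x in short)))  # length after rescaling
```

Skipping the rescaling step leaves a vector shorter than unit length, which would distort similarity scores in an index that assumes normalized inputs.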
Model Comparison: What Matters
Quality Benchmarks (MTEB)
The Massive Text Embedding Benchmark ranks models across retrieval, classification, clustering, and reranking. Key findings:
Cost Comparison
| Provider | Model | Price per 1M tokens |
|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 |
| OpenAI | text-embedding-3-large | $0.13 |
| Cohere | embed-v3 | $0.10 |
| Voyage AI | voyage-3 | $0.06 |
| Local (open) | nomic-embed-text | $0 (compute only) |
For a million-document corpus at ~500 tokens each (500M tokens in total): OpenAI small costs $10 and large costs $65. A local model has no per-token fee, but you pay for the hardware, typically a GPU, needed to run it.
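The same estimate can be scripted for any corpus size. The prices below are copied from the table above and may be out of date; the function name is illustrative.

```python
# Price per 1M tokens in USD, copied from the table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-v3": 0.10,
    "voyage-3": 0.06,
}


def embedding_cost(num_documents, tokens_per_document, price_per_million):
    """One-time cost in USD to embed a corpus once."""
    total_tokens = num_documents * tokens_per_document
    return total_tokens / 1_000_000 * price_per_million


for model, price in PRICE_PER_MILLION_TOKENS.items():
    cost = embedding_cost(1_000_000, 500, price)
    print(f"{model}: ${cost:.2f}")
```

This covers the initial indexing pass only. Re-embedding after a model change, and embedding every incoming query, add to the total.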
Chunking Strategies
Since embedding models have context windows, long documents must be chunked — split into smaller pieces. This is one of the most impactful decisions in any embedding pipeline.
Common Strategies
| Strategy | How It Works | Best For |
|---|---|---|
| Fixed-size | Split every N tokens with overlap | Simple, predictable |
| Sentence | Split at sentence boundaries | Natural units of meaning |
| Paragraph | Split at paragraph breaks | Well-structured docs |
| Semantic | Split when topic changes (using embeddings!) | Long, varied documents |
| Recursive | Try paragraph → sentence → token splits | General purpose |
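The recursive strategy in the last row can be sketched in a few lines. This version counts whitespace-separated words as a stand-in for tokens, which is an assumption; it also does not merge small neighbouring pieces back together or restore the separators it splits on, both of which production splitters usually do.

```python
def recursive_split(text, max_words, separators=("\n\n", ". ")):
    """Split on the coarsest separator first, recursing into oversized pieces."""
    stripped = text.strip()
    if not stripped:
        return []
    words = stripped.split()
    if len(words) <= max_words:
        return [stripped]
    if not separators:
        # No separators left: fall back to hard cuts every max_words words.
        return [
            " ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)
        ]
    chunks = []
    for piece in stripped.split(separators[0]):
        chunks.extend(recursive_split(piece, max_words, separators[1:]))
    return chunks


doc = "One two three.\n\nFour five six seven eight. Nine ten."
chunks = recursive_split(doc, max_words=4)
for chunk in chunks:
    print(repr(chunk))
```

The short first paragraph survives intact, while the longer second paragraph is broken at the sentence boundary and then, for the sentence that is still too long, at a fixed word count.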
Chunk Size Trade-offs
Overlap Matters
Adjacent chunks should share some text (overlap) so that information at chunk boundaries isn't lost. Typical overlap: 10-20% of chunk size.
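A minimal sketch of fixed-size chunking with overlap, operating on a list of tokens that is assumed to come from whichever tokenizer the embedding model uses. The function name and the small numbers are illustrative; the example uses a 20% overlap.

```python
def chunk_with_overlap(tokens, chunk_size, overlap):
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            # This window already reached the end of the input.
            break
    return chunks


tokens = list(range(10))  # stand-in for real token ids
chunks = chunk_with_overlap(tokens, chunk_size=5, overlap=1)
print(chunks)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary appears in full in at least one chunk as long as it is shorter than the overlap.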
Instruction-Tuned Embeddings
Many modern embedding models accept an instruction, usually prepended to the input text, that tells the model what kind of similarity to capture:
# Without instruction
embed("Apple released new products")
→ vector near fruits AND tech companies
# With instruction: "Retrieve financial news"
embed("Retrieve financial news: Apple released new products")
→ vector near tech companies, far from fruits

Models like E5, Voyage, and Cohere embed-v3 support different input types (query vs document) that optimize the embedding for asymmetric search.
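How the input type is passed depends on the model. Some models take it as an API parameter, while others expect a text prefix. The sketch below shows the prefix approach using the `query: ` and `passage: ` convention from the E5 family; the helper names are hypothetical, and other models use different prefixes, so treat the mapping as an assumption to verify against the model's documentation.

```python
# Prefix convention used by E5-family models; other models differ.
E5_STYLE_PREFIXES = {"query": "query: ", "document": "passage: "}


def with_input_type(text, input_type):
    """Prepend the prefix that marks a text as a query or a document."""
    if input_type not in E5_STYLE_PREFIXES:
        raise ValueError(f"unknown input type: {input_type!r}")
    return E5_STYLE_PREFIXES[input_type] + text


def with_instruction(instruction, text):
    """Prepend a free-form task instruction, as in the example above."""
    return f"{instruction}: {text}"


print(with_input_type("how do embeddings work", "query"))
print(with_input_type("Embeddings map text to vectors.", "document"))
print(with_instruction("Retrieve financial news", "Apple released new products"))
```

The key point is consistency: documents must be indexed with the document form and queries embedded with the query form, otherwise the two sides end up in mismatched regions of the vector space.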
Key Takeaways
This is chapter 2 of Vector Databases & Embeddings.