
How Embedding Models Work

From Text to Vector

The Embedding Pipeline

Every embedding model follows the same basic pipeline: tokenize → encode → pool → normalize. Understanding each step helps you make better decisions about which model to use and how to prepare your data.


Step 1: Tokenization

Text is split into tokens — subword units that the model understands. "Embeddings" might become ["em", "bed", "ding", "s"]. Different models use different tokenizers:

Tokenizer	Used By	Vocab Size
BPE (Byte Pair Encoding)	OpenAI, GPT-family	50K-100K
WordPiece	BERT, E5	30K
SentencePiece	T5, multilingual models	32K-256K

Tokenization matters because the model's context window is measured in tokens, not characters or words. Longer text produces more tokens, and input beyond the window is truncated or rejected, depending on the provider, so it never reaches the embedding.
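To make the splitting and truncation concrete, here is a minimal sketch of a greedy longest-match subword tokenizer. The vocabulary and function names are hypothetical, and real tokenizers (BPE, WordPiece, SentencePiece) learn their vocabularies from data and add details such as continuation markers that are omitted here.

```python
# Toy vocabulary (an assumption for illustration, not a real model's vocab).
TOY_VOCAB = {"em", "bed", "ding", "s", "vec", "tor"}


def tokenize_word(word, vocab=TOY_VOCAB, unk="[UNK]"):
    """Split one lowercase word into the longest vocabulary pieces, left to right."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate piece until it is found in the vocabulary.
        while end > start and word[start:end] not in vocab:
            end -= 1
        if end == start:
            # No piece matches at this position: mark the whole word as unknown.
            return [unk]
        tokens.append(word[start:end])
        start = end
    return tokens


def tokenize(text, max_tokens=None):
    """Tokenize whitespace-separated words, optionally truncating to max_tokens."""
    tokens = []
    for word in text.lower().split():
        tokens.extend(tokenize_word(word))
    return tokens if max_tokens is None else tokens[:max_tokens]


print(tokenize("Embeddings"))
print(tokenize("Embeddings vectors", max_tokens=5))
```

The second call shows truncation: the two words produce seven tokens, and everything past the fifth is dropped before the model would ever see it.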

Step 2: Transformer Encoding

Each token is converted into a contextualized vector. The word "bank" gets a different vector in "river bank" versus "savings bank" because the transformer considers surrounding tokens through self-attention.

This is the critical difference from older approaches like Word2Vec, which gave "bank" the same vector regardless of context.
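The effect of context can be sketched with a single self-attention pass over toy vectors. This is a simplification under stated assumptions: the 2-dimensional static vectors are made up, and the query, key, and value projections are the identity, whereas a real transformer learns those projections and stacks many layers and heads.

```python
import math

# Hypothetical context-free vectors, the kind a static model would assign.
STATIC = {
    "river": [1.0, 0.0],
    "savings": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores):
    """Convert raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(tokens):
    """One scaled dot-product self-attention pass with identity projections."""
    vectors = [STATIC[t] for t in tokens]
    dims = len(vectors[0])
    scale = math.sqrt(dims)
    output = []
    for query in vectors:
        # Score this token against every token in the input, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        # The new vector is a weighted mix of all token vectors.
        mixed = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dims)]
        output.append(mixed)
    return output


bank_in_river = contextualize(["river", "bank"])[1]
bank_in_savings = contextualize(["savings", "bank"])[1]
print(bank_in_river, bank_in_savings)
```

The static vector for "bank" is the same in both inputs, but its output leans toward the first dimension next to "river" and toward the second next to "savings".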

Step 3: Pooling

The transformer produces one vector per token, but we need a single vector for the entire input. Pooling strategies:

CLS token: Use the special [CLS] token's vector (BERT-style). Cheap to compute, but reliable only when the model was trained to summarize the whole input in that token.

Mean pooling: Average all token vectors, ignoring padding. The most common choice and a strong default. Used by Sentence-BERT, E5, and many modern models.

Last token: Use the final token's vector. Used by some GPT-based (decoder-only) embedding models, where only the last token has attended to the entire input.
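The three strategies can be sketched over a plain list of token vectors. The function names are illustrative, and the attention mask convention (1 for real tokens, 0 for padding, with padding on the right) is an assumption that matches common practice but should be checked against your model.

```python
def cls_pool(token_vectors):
    """Return the first token's vector (the [CLS] position in BERT-style models)."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask=None):
    """Average token vectors, skipping positions where the mask is 0."""
    if attention_mask is None:
        attention_mask = [1] * len(token_vectors)
    kept = [v for v, m in zip(token_vectors, attention_mask) if m]
    dims = len(token_vectors[0])
    return [sum(v[d] for v in kept) / len(kept) for d in range(dims)]


def last_token_pool(token_vectors, attention_mask=None):
    """Return the vector of the last non-padding token."""
    if attention_mask is None:
        return list(token_vectors[-1])
    last = max(i for i, m in enumerate(attention_mask) if m)
    return list(token_vectors[last])


# Three real tokens followed by one padding position.
vectors = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(cls_pool(vectors))
print(mean_pool(vectors, mask))
print(last_token_pool(vectors, mask))
```

Passing the mask matters for mean and last-token pooling: without it, the padding vector would leak into the result.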

Step 4: Normalization

The output vector is scaled to unit length (magnitude = 1.0). For unit-length vectors, cosine similarity equals the dot product, so search can skip the norm computation on every comparison.
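The relationship between cosine similarity and the dot product can be checked in a few lines. This is a pure-Python sketch with made-up vectors; in practice a numerical library would do the same arithmetic.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


a, b = [3.0, 4.0], [4.0, 3.0]
a_unit, b_unit = l2_normalize(a), l2_normalize(b)

# After normalization, the plain dot product matches cosine similarity.
print(cosine_similarity(a, b), dot(a_unit, b_unit))
```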

Dimensionality: How Many Numbers?

Embedding dimension is a key design choice:

Model	Dimensions	Context Window
OpenAI text-embedding-3-small	1,536	8,191 tokens
OpenAI text-embedding-3-large	3,072	8,191 tokens
Cohere embed-v3	1,024	512 tokens
Voyage AI voyage-3	1,024	32,000 tokens
BGE-large-en-v1.5 (open)	1,024	512 tokens
E5-mistral-7b (open)	4,096	32,768 tokens
nomic-embed-text (open)	768	8,192 tokens

More dimensions can capture finer distinctions in meaning, but they also cost more storage (each float32 value takes 4 bytes, so a 1,536-dimension vector takes 6,144 bytes) and make search slower.
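The storage side of this trade-off is simple arithmetic. The sketch below counts raw vector bytes only and assumes float32 values; a real vector index adds its own overhead on top, and the corpus size of one million vectors is just an example.

```python
def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for the vectors alone, ignoring index overhead and metadata."""
    return num_vectors * dims * bytes_per_value


for dims in (768, 1024, 1536, 3072):
    gib = index_size_bytes(1_000_000, dims) / 1024**3
    print(f"{dims:>5} dims: {gib:.2f} GiB per million vectors")
```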

Matryoshka embeddings (supported by OpenAI's text-embedding-3 models and some open models) let you truncate vectors to fewer dimensions with minimal quality loss, provided you re-normalize the shortened vector afterward. A 3,072-dim vector can be cut to 256 dims and still perform well for many tasks.
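Truncating by hand amounts to keeping the leading dimensions and rescaling to unit length. The helper below is a hypothetical sketch, and it is only meaningful for models trained so that the leading dimensions carry the most information; applying it to other models is likely to hurt quality.

```python
import math


def truncate_embedding(vector, dims):
    """Keep the first `dims` values and rescale the result to unit length."""
    if not 0 < dims <= len(vector):
        raise ValueError("dims must be between 1 and the vector length")
    head = vector[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("truncated vector has zero length")
    return [x / norm for x in head]


full = [0.5, 0.5, 0.5, 0.5]  # already unit length
short = truncate_embedding(full, 2)
print(short)
```

The rescaling step is what keeps dot-product search equivalent to cosine similarity after truncation.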

Model Comparison: What Matters

Quality Benchmarks (MTEB)

The Massive Text Embedding Benchmark ranks models across retrieval, classification, clustering, and reranking. Key findings:

Proprietary models (OpenAI, Cohere, Voyage) generally score well, though leaderboard positions shift as new models are released

Open models (E5-mistral, BGE) are competitive and improving fast

Domain matters — a model ranked #1 overall may not be best for your specific data

Multilingual models sacrifice some English quality for language breadth

Cost Comparison

Provider	Model	Price per 1M tokens
OpenAI	text-embedding-3-small	$0.02
OpenAI	text-embedding-3-large	$0.13
Cohere	embed-v3	$0.10
Voyage AI	voyage-3	$0.06
Local (open)	nomic-embed-text	$0 (compute only)

A million-document corpus at ~500 tokens each is 500M tokens: OpenAI small costs about $10, large about $65, and a local model has no API fee but requires your own compute (typically a GPU).
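The estimate above can be reproduced with a small calculator. The prices are copied from the table in this section and will drift over time, so treat them as placeholders; the function name is illustrative.

```python
# Price per 1M tokens in USD, taken from the table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "embed-v3": 0.10,
    "voyage-3": 0.06,
}


def embedding_cost(num_docs, tokens_per_doc, price_per_million):
    """One-time API cost in USD to embed a corpus once."""
    total_tokens = num_docs * tokens_per_doc
    return total_tokens / 1_000_000 * price_per_million


for model, price in PRICE_PER_MILLION_TOKENS.items():
    cost = embedding_cost(1_000_000, 500, price)
    print(f"{model}: ${cost:,.2f}")
```

This covers the initial indexing pass only. Query embeddings and any re-embedding after a model change add to the total.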

Chunking Strategies

Since embedding models have limited context windows, long documents must be chunked, that is, split into smaller pieces. This is one of the most impactful decisions in any embedding pipeline.

Common Strategies

Strategy	How It Works	Best For
Fixed-size	Split every N tokens with overlap	Simple, predictable
Sentence	Split at sentence boundaries	Natural units of meaning
Paragraph	Split at paragraph breaks	Well-structured docs
Semantic	Split when topic changes (using embeddings!)	Long, varied documents
Recursive	Try paragraph → sentence → token splits	General purpose
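The recursive strategy in the last row can be sketched as follows. This is a simplified illustration, not a production splitter: it counts whitespace-separated words as a stand-in for model tokens, it drops the separators it splits on, and it does not merge small neighboring pieces back together.

```python
def n_tokens(text):
    """Approximate token count using whitespace-separated words."""
    return len(text.split())


def recursive_split(text, max_tokens, separators=("\n\n", ". ")):
    """Split on the coarsest separator first, recursing only into oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens:
        return [text]
    if not separators:
        # No separators left: fall back to a hard cut every max_tokens words.
        words = text.split()
        return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), max_tokens)]
    pieces = []
    for part in text.split(separators[0]):
        pieces.extend(recursive_split(part, max_tokens, separators[1:]))
    return pieces


doc = (
    "Pricing starts at ten dollars. Discounts apply to annual plans.\n\n"
    "The editor supports live collaboration across many teams and many devices at once"
)
chunks = recursive_split(doc, max_tokens=6)
for chunk in chunks:
    print(repr(chunk))
```

The first paragraph splits cleanly at its sentence boundary, while the second has no sentence break and falls through to the hard cut.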

Chunk Size Trade-offs

Too small (< 100 tokens): Loses context. "It increased by 40%" — what increased?

Too large (> 1000 tokens): Dilutes meaning. A paragraph about pricing mixed with one about features produces a muddled vector.

Sweet spot: 200-500 tokens with 50-100 token overlap between chunks.

Overlap Matters

Adjacent chunks should share some text (overlap) so that information at chunk boundaries isn't lost. Typical overlap: 10-20% of chunk size.
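Fixed-size chunking with overlap comes down to a sliding window whose step is the chunk size minus the overlap. The sketch below works on any list of tokens; the function name and default sizes are illustrative choices within the ranges mentioned above.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Slide a window of chunk_size over tokens, advancing by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            # This window already reached the end of the input.
            break
    return chunks


# Integers stand in for tokens so the shared boundary tokens are easy to see.
example = chunk_tokens(list(range(10)), chunk_size=4, overlap=1)
print(example)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary appears in both.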

Instruction-Tuned Embeddings

Many modern embedding models accept an instruction, usually prepended to the input text, that tells the model what kind of similarity to capture:

# Without instruction (embed is a placeholder for your model's embedding call)
embed("Apple released new products")
# → vector near fruits AND tech companies

# With instruction: "Retrieve financial news"
embed("Retrieve financial news: Apple released new products")
# → vector near tech companies, far from fruits

Models like E5, Voyage, and Cohere embed-v3 support different input types (query vs document) that optimize the embedding for asymmetric search, where short queries are matched against longer documents.
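For models that take the input type as a text prefix, the preparation step can be as small as the helper below. The "query: " and "passage: " prefixes follow the convention documented for E5 models; the helper name and the "query"/"document" keys are hypothetical, and other providers expose the input type as an API parameter instead, so check your model's documentation.

```python
# Prefix convention used by E5 models; other models may differ.
E5_PREFIXES = {"query": "query: ", "document": "passage: "}


def prepare_input(text, input_type):
    """Prepend the prefix that tells the model which side of the search this text is."""
    try:
        prefix = E5_PREFIXES[input_type]
    except KeyError:
        raise ValueError(f"input_type must be one of {sorted(E5_PREFIXES)}") from None
    return prefix + text.strip()


print(prepare_input("how do embeddings work?", "query"))
print(prepare_input("Embeddings map text to vectors.", "document"))
```

Using the same prefix scheme at indexing time and at query time is what matters; mixing them silently degrades retrieval quality.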

Key Takeaways

Embedding pipeline: tokenize → encode → pool → normalize

Mean pooling over transformer outputs is the standard approach

Dimensionality trades nuance for storage and speed

Chunking is critical — 200-500 tokens with overlap is the sweet spot

Instruction-tuned models let you control what similarity means

Open models are catching up to proprietary ones in quality

This is chapter 2 of Vector Databases & Embeddings.
