
What Are Embeddings?

Turning Meaning into Math

The Problem with Keywords

Traditional keyword search is literal: it matches exact words. Search "car insurance" and you won't find documents that say "auto coverage" or "vehicle protection." This is called the vocabulary mismatch problem, and it has been one of the most persistent limitations of search for decades.
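To make the mismatch concrete, here is a minimal sketch of literal word matching. The helper name `keyword_overlap` and the third sample document are illustrative assumptions, not part of any search library.

```python
def keyword_overlap(query: str, document: str) -> set[str]:
    """Return the words that the query and the document literally share."""
    query_words = set(query.lower().split())
    document_words = set(document.lower().split())
    return query_words & document_words


query = "car insurance"
documents = ["auto coverage", "vehicle protection", "cheap car insurance quotes"]

for doc in documents:
    shared = keyword_overlap(query, doc)
    # An empty set means a purely literal matcher has nothing to match on,
    # even when the document is about the same topic.
    print(f"{doc!r}: shared words = {sorted(shared)}")
```

The first two documents share no words with the query, so a purely literal matcher has no signal to rank them by, even though they are about the same thing.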

Embeddings solve this by converting text into vectors — arrays of numbers that capture meaning, not just characters.

What Is a Vector?

A vector is simply a list of numbers. A 3D vector might be [0.2, -0.5, 0.8]. Modern embedding models produce vectors with hundreds or thousands of dimensions — OpenAI's text-embedding-3-large produces 3,072 numbers per input.

Each dimension contributes to the encoding of meaning, but individual dimensions generally don't map to a human concept like "formality" or "topic." Meaning is distributed across all dimensions simultaneously, which is why embeddings are sometimes called distributed representations.
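As a small illustration of "a list of numbers," the sketch below treats a vector as a plain Python list and estimates how much space 3,072-dimensional vectors would take. Storing each number as a 4-byte float32 is an assumption for the arithmetic; actual storage formats vary by system.

```python
import math
import struct

# A 3D vector is just three numbers.
v = [0.2, -0.5, 0.8]


def magnitude(vector):
    """Length of the vector (Euclidean norm)."""
    return math.sqrt(sum(x * x for x in vector))


DIMENSIONS = 3072                          # text-embedding-3-large, per the text above
BYTES_PER_FLOAT32 = struct.calcsize("<f")  # 4 bytes; float32 storage is an assumption


def storage_bytes(num_vectors, dimensions=DIMENSIONS):
    """Raw bytes needed to store the vectors, ignoring any index overhead."""
    return num_vectors * dimensions * BYTES_PER_FLOAT32


print(len(v), round(magnitude(v), 3))
print(storage_bytes(1), "bytes for one vector")
print(f"{storage_bytes(1_000_000) / 1e9:.1f} GB for one million vectors")
```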

Semantic Similarity

The magic: texts with similar meanings end up as vectors that are close together in this high-dimensional space. "The cat sat on the mat" and "A kitten rested on the rug" produce vectors that lie very close to each other, even though they share almost no content words.

This enables semantic search — finding results by meaning rather than keywords.
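A rough sketch of what semantic search looks like in code: score every document vector against the query vector and return the closest ones. The three-dimensional vectors in `TOY_EMBEDDINGS` and the query vector are invented by hand for illustration; a real system would get them from an embedding model. Closeness is scored here with cosine similarity, which the Distance Metrics section describes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical toy embeddings, hand-written for this example only.
TOY_EMBEDDINGS = {
    "auto coverage": [0.9, 0.1, 0.0],
    "vehicle protection": [0.8, 0.2, 0.1],
    "chocolate cake recipe": [0.0, 0.1, 0.9],
}


def semantic_search(query_vector, embeddings, top_k=2):
    """Return the top_k (score, text) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in embeddings.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


# Pretend this is the embedding of the query "car insurance".
query_vector = [0.85, 0.15, 0.05]

for score, text in semantic_search(query_vector, TOY_EMBEDDINGS):
    print(f"{score:.3f}  {text}")
```

Scanning every vector like this is fine for a handful of documents; at larger scale the same comparison is usually delegated to a vector index.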

Distance Metrics

How do we measure "closeness" between vectors? Three main approaches: cosine similarity, dot product, and Euclidean distance.

Cosine similarity is the most widely used because it measures the angle between vectors, ignoring their magnitude. Two vectors pointing in the same direction have cosine similarity of 1.0, perpendicular vectors score 0.0, and opposite vectors score -1.0.
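The three reference values above can be checked with a few lines of plain Python. This is a minimal sketch, and the two-dimensional vectors are chosen only to make the geometry obvious.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Same direction, different magnitude: the length difference is ignored.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# At right angles to each other.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 1.0])
# Pointing in exactly opposite directions.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

# Rounded because floating-point arithmetic can land a hair off the exact value.
print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))
```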

When to Use Which

Metric	Best For	Watch Out
Cosine	General-purpose semantic search	Slightly slower than dot product
Dot product	Pre-normalized vectors (many APIs)	Skewed by vector magnitude if vectors aren't normalized
Euclidean	When magnitude matters (e.g., popularity)	Dominated by high-magnitude dimensions

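In practice, many embedding APIs return normalized vectors (length = 1), in which case cosine similarity and dot product give identical results. When that holds, use dot product, which skips the length computation and is faster. If you aren't sure, check the model's documentation or normalize the vectors yourself.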
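The equivalence is easy to verify. The sketch below normalizes two arbitrary example vectors to length 1 and compares the metrics. It also checks a related identity: for unit vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, so all three metrics rank neighbors the same way once vectors are normalized.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def normalize(a):
    """Scale a vector to length 1 without changing its direction."""
    length = norm(a)
    return [x / length for x in a]


def cosine_similarity(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = [3.0, 4.0]
b = [1.0, 7.0]
a_unit, b_unit = normalize(a), normalize(b)

print("raw dot product:       ", dot(a, b))                # depends on magnitude
print("cosine similarity:     ", cosine_similarity(a, b))  # magnitude ignored
print("dot of unit vectors:   ", dot(a_unit, b_unit))      # matches cosine
print("squared L2, unit vecs: ", euclidean_distance(a_unit, b_unit) ** 2)
print("2 - 2 * cosine:        ", 2 - 2 * cosine_similarity(a, b))
```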

Why Keywords Fail: Real Examples

Query	Keyword Match	Semantic Match
"how to reduce cloud costs"	Documents with exact phrase	Documents about "infrastructure optimization," "right-sizing instances," "reserved capacity"
"employee unhappy with manager"	Documents with those words	Documents about "workplace conflict resolution," "management feedback," "retention risk signals"
"GDPR data deletion"	Documents mentioning GDPR	Documents about "right to erasure," "data subject requests," "PII removal workflows"

The Embedding Space

Think of the embedding space as a map of concepts. Similar concepts cluster together:

"Python," "JavaScript," "TypeScript" cluster near each other

"Revenue," "profit," "earnings" form another cluster

"Happy," "joyful," "elated" form yet another

But it's more nuanced than simple clustering. The relationships between concepts are also preserved, at least approximately. The vector from "king" to "queen" is roughly parallel to the vector from "man" to "woman." This is often described as analogical reasoning in embedding space.
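The analogy can be sketched with vector arithmetic: subtract "man" from "king," add "woman," and look for the nearest remaining word. The three-dimensional vectors below are hand-built for illustration (loosely "royalty," "gender," and "fruit" axes). Real embeddings have no such labeled axes, and the relationship holds only approximately.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


# Hypothetical toy vectors, invented for this example.
WORDS = {
    "king":  [0.9, 0.8, 0.0],
    "queen": [0.9, -0.8, 0.0],
    "man":   [0.1, 0.9, 0.0],
    "woman": [0.1, -0.9, 0.0],
    "apple": [0.0, 0.0, 1.0],
}

# The two offsets point in the same direction.
king_to_queen = subtract(WORDS["queen"], WORDS["king"])
man_to_woman = subtract(WORDS["woman"], WORDS["man"])
offset_similarity = cosine_similarity(king_to_queen, man_to_woman)

# king - man + woman should land near queen.
target = add(subtract(WORDS["king"], WORDS["man"]), WORDS["woman"])
candidates = {w: v for w, v in WORDS.items() if w not in ("king", "man", "woman")}
nearest = max(candidates, key=lambda w: cosine_similarity(target, candidates[w]))

print(round(offset_similarity, 6), nearest)
```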

Limitations of Embeddings

Embeddings aren't magic. Key limitations:

Context window — Most embedding models handle 512-8,192 tokens. Longer documents need chunking.

Domain specificity — A general embedding model may not capture domain jargon well. "Transformer" means different things in ML vs electrical engineering.

No reasoning — Embeddings capture similarity, not logic. "The cat chased the dog" and "The dog chased the cat" may have very similar embeddings despite opposite meanings.

Stale knowledge — Embeddings reflect training data. New terms or concepts won't be represented well until the model is updated.

Dimensionality trade-off — More dimensions = more nuance but more storage and slower search.
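For the context window limit, chunking can be sketched as a sliding window with some overlap so that sentences cut at a boundary still appear whole in one chunk. This example counts whitespace-separated words as a stand-in for tokens, which is an assumption: real models count tokens with their own tokenizer, and the function name and default sizes here are illustrative.

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Split text into overlapping chunks of at most max_tokens words.

    Words are used as an approximation of tokens; a real pipeline
    would count with the embedding model's tokenizer instead.
    """
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be >= 0 and smaller than max_tokens")

    words = text.split()
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        # Stop once this window has reached the end of the document.
        if start + max_tokens >= len(words):
            break
    return chunks


# A stand-in document of 1,200 "tokens".
document = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_text(document)

for i, chunk in enumerate(chunks):
    print(i, len(chunk.split()), "words")
```

Each chunk is then embedded separately, so a long document ends up as several vectors rather than one.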

Key Takeaways

Embeddings convert text into vectors that capture meaning

Similar meanings produce nearby vectors in high-dimensional space

Cosine similarity is the most widely used similarity measure; on normalized vectors, dot product gives the same results faster

Embeddings enable semantic search that keywords can't match

They have real limitations around context, domain, and reasoning

This is chapter 1 of Vector Databases & Embeddings.
