What Are Embeddings?
Turning Meaning into Math
The Problem with Keywords
Traditional keyword search is literal: it matches exact words. Search for "car insurance" and you won't find documents that say "auto coverage" or "vehicle protection." This is called the vocabulary mismatch problem, and it has been a long-standing limitation of keyword search.
Embeddings address this by converting text into vectors: arrays of numbers that capture meaning, not just characters.
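To make the vocabulary mismatch concrete, here is a minimal sketch of a literal keyword matcher. The `keyword_match` helper and the three documents are invented for illustration; real keyword engines are more sophisticated, but they share the same blind spot.

```python
def keyword_match(query: str, document: str) -> bool:
    """Return True only if every query word appears literally in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())


# Hypothetical documents, invented for this example.
documents = [
    "Compare car insurance quotes online",
    "Affordable auto coverage for new drivers",
    "Vehicle protection plans explained",
]

query = "car insurance"
matches = [doc for doc in documents if keyword_match(query, doc)]

# Only the document containing the literal words is returned.
print(matches)
```

The two documents about "auto coverage" and "vehicle protection" are relevant to the query but are never returned, because they share no words with it.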
What Is a Vector?
A vector is simply a list of numbers. A 3D vector might be [0.2, -0.5, 0.8]. Modern embedding models produce vectors with hundreds or thousands of dimensions — OpenAI's text-embedding-3-large produces 3,072 numbers per input.
Together, the dimensions capture aspects of meaning, but no single dimension maps cleanly to a human concept like "formality" or "topic." Instead, meaning is distributed across all dimensions at once, which is why embeddings are sometimes called distributed representations.
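In code, a vector is nothing more exotic than a list of floats. The sketch below uses the 3D example from the text and computes two properties that come up when comparing vectors: the number of dimensions and the length (magnitude). The variable names are illustrative only.

```python
import math

# The 3D example vector from the text.
v = [0.2, -0.5, 0.8]

# Number of dimensions: how many numbers the vector holds.
dimensions = len(v)

# Length (Euclidean norm): square root of the sum of squared components.
length = math.sqrt(sum(x * x for x in v))

print(dimensions)
print(round(length, 3))
```

A vector from a real embedding model works the same way, just with hundreds or thousands of components instead of three.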
Semantic Similarity
The magic: texts with similar meanings end up as vectors that are close together in this high-dimensional space. "The cat sat on the mat" and "A kitten rested on the rug" produce vectors that lie very close to each other, even though they share almost no content words.
This enables semantic search — finding results by meaning rather than keywords.
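Semantic search then reduces to ranking documents by how close their vectors are to the query vector. The sketch below assumes hand-written 3D toy vectors standing in for real embeddings; they are not output from any model, and the `corpus` and `query_vector` names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors written by hand for illustration, NOT real model output.
corpus = {
    "A kitten rested on the rug": [0.9, 0.1, 0.0],
    "Dogs love playing fetch": [0.6, 0.6, 0.1],
    "Quarterly revenue grew 12%": [0.0, 0.2, 0.9],
}

# Stands in for the embedding of "The cat sat on the mat".
query_vector = [1.0, 0.0, 0.1]

# Rank documents from most to least similar to the query.
ranked = sorted(
    corpus,
    key=lambda text: cosine_similarity(query_vector, corpus[text]),
    reverse=True,
)

for text in ranked:
    print(round(cosine_similarity(query_vector, corpus[text]), 3), text)
```

With a real model, the only change is where the vectors come from; the ranking step stays the same.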
Distance Metrics
How do we measure "closeness" between vectors? There are three main approaches: cosine similarity, dot product, and Euclidean distance.
Cosine similarity is the most widely used because it measures the angle between vectors, ignoring their magnitude. Two vectors pointing in the same direction have cosine similarity of 1.0, perpendicular vectors score 0.0, and opposite vectors score -1.0.
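A minimal, dependency-free sketch of the three metrics follows. The function names are my own; in practice you would likely use a numerical library or the vector database's built-in scoring.

```python
import math


def dot(a, b):
    """Dot product: sum of pairwise products."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by both lengths, so magnitude cancels out."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Same direction, different magnitude.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
# At right angles.
perpendicular = cosine_similarity([1.0, 0.0], [0.0, 3.0])
# Pointing in opposite directions.
opposite = cosine_similarity([1.0, 2.0], [-1.0, -2.0])

print(round(same_direction, 6), round(perpendicular, 6), round(opposite, 6))

# The other two metrics DO react to magnitude for the same-direction pair.
print(dot([1.0, 2.0], [2.0, 4.0]))
print(round(euclidean_distance([1.0, 2.0], [2.0, 4.0]), 3))
```

The first pair scores a cosine similarity of 1.0 even though one vector is twice as long as the other, while the dot product and Euclidean distance for the same pair both reflect the difference in magnitude.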
When to Use Which
| Metric | Best For | Watch Out |
|---|---|---|
| Cosine | General-purpose semantic search | Slightly slower than dot product |
| Dot product | Pre-normalized vectors (many APIs) | Scores mix direction with magnitude if vectors aren't normalized |
| Euclidean | When magnitude matters (e.g., popularity) | Dominated by high-magnitude dimensions |
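In practice, many embedding APIs return normalized vectors (length = 1); check your provider's documentation to confirm. For normalized vectors, cosine similarity and dot product give identical results, so use dot product: it skips the division by the vector lengths and is faster.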
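The equivalence is easy to check numerically. This sketch uses two arbitrary vectors chosen for illustration and a hypothetical `normalize` helper that scales a vector to length 1.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale v to unit length (length = 1)."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Arbitrary un-normalized vectors, chosen for illustration.
a = [3.0, 4.0]
b = [1.0, 2.0]

raw_dot = dot(a, b)                          # affected by magnitude
cosine = cosine_similarity(a, b)             # magnitude divided out
unit_dot = dot(normalize(a), normalize(b))   # dot product after normalizing

print(raw_dot)
print(math.isclose(cosine, unit_dot))
```

On the raw vectors the dot product differs from the cosine similarity; once both vectors are normalized, the two scores agree up to floating-point rounding.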
Why Keywords Fail: Real Examples
| Query | Keyword Match | Semantic Match |
|---|---|---|
| "how to reduce cloud costs" | Documents with exact phrase | Documents about "infrastructure optimization," "right-sizing instances," "reserved capacity" |
| "employee unhappy with manager" | Documents with those words | Documents about "workplace conflict resolution," "management feedback," "retention risk signals" |
| "GDPR data deletion" | Documents mentioning GDPR | Documents about "right to erasure," "data subject requests," "PII removal workflows" |
The Embedding Space
Think of the embedding space as a map of concepts, where similar concepts cluster together.
But it's more nuanced than simple clustering. The relationships between concepts are also preserved, at least approximately. In the classic example from word embeddings, the vector from "king" to "queen" is similar to the vector from "man" to "woman." This is often described as analogical reasoning in embedding space.
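The analogy can be sketched with vector arithmetic. The 2D vectors below are built by hand so that the relationship holds exactly; real embeddings have no neatly labeled axes, and the analogy holds only approximately. All names and values here are assumptions for illustration.

```python
import math

# Hand-built toy vectors, NOT real model output.
# For readability only: axis 0 loosely means "royalty", axis 1 "gender".
words = {
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "child": [0.1, 0.5],  # distractor
}


def subtract(a, b):
    return [x - y for x, y in zip(a, b)]


def add(a, b):
    return [x + y for x, y in zip(a, b)]


# The offset from "king" to "queen" matches the offset from "man" to "woman".
king_to_queen = subtract(words["queen"], words["king"])
man_to_woman = subtract(words["woman"], words["man"])

# Classic analogy query: king - man + woman = ?
target = add(subtract(words["king"], words["man"]), words["woman"])

# Nearest remaining word to the target, excluding the three query words.
candidates = [w for w in words if w not in ("king", "man", "woman")]
nearest = min(candidates, key=lambda w: math.dist(words[w], target))

print(nearest)
```

With these toy values the nearest word to `king - man + woman` is "queen"; with real embeddings the result is a nearest-neighbor lookup that usually, but not always, lands on the expected word.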
Limitations of Embeddings
Embeddings aren't magic. Key limitations:
Key Takeaways
This is chapter 1 of Vector Databases & Embeddings.