What Are Embeddings?

Turning Meaning into Math

The Problem with Keywords

Traditional keyword search is literal: it matches exact words. Search for "car insurance" and you won't find documents that say "auto coverage" or "vehicle protection." This is called the vocabulary mismatch problem, and it has been a long-standing limitation of keyword search.
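
To make the mismatch concrete, here is a minimal sketch of a literal keyword matcher. The function name `keyword_match` and the sample documents are made up for illustration; real search engines add stemming, ranking, and more, but the core limitation is the same.

```python
def keyword_match(query, document):
    """Return True only if every query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return all(word in doc_words for word in query.lower().split())


# Hypothetical documents: all three are about the same topic.
docs = [
    "Compare car insurance quotes online",
    "Affordable auto coverage for new drivers",
    "Vehicle protection plans explained",
]

matches = [d for d in docs if keyword_match("car insurance", d)]
print(matches)  # only the first document contains both literal words
```

The second and third documents are relevant to the query, yet a purely literal matcher never sees them.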

Embeddings solve this by converting text into vectors — arrays of numbers that capture meaning, not just characters.

What Is a Vector?

A vector is simply a list of numbers. A 3D vector might be [0.2, -0.5, 0.8]. Modern embedding models produce vectors with hundreds or thousands of dimensions — OpenAI's text-embedding-3-large produces 3,072 numbers per input.
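
As a rough illustration of what those sizes mean in practice, the sketch below represents a vector as a plain Python list and estimates raw storage. It assumes each number is stored as a 4-byte float, which is a common choice but an assumption here; `storage_bytes` is an illustrative helper, not part of any library.

```python
# The 3D example from the text, as a plain Python list.
toy_vector = [0.2, -0.5, 0.8]
print(len(toy_vector))  # number of dimensions: 3


def storage_bytes(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone (assumes 4-byte floats, no index overhead)."""
    return num_vectors * dimensions * bytes_per_value


one_vector = storage_bytes(1, 3072)           # one 3,072-dimension vector
one_million = storage_bytes(1_000_000, 3072)  # a million of them
print(one_vector)   # 12288 bytes, about 12 KB
print(one_million)  # 12288000000 bytes, about 12.3 GB
```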

Each dimension captures some aspect of meaning. No single dimension maps to a human concept like "formality" or "topic" — meaning is distributed across all dimensions simultaneously. This is why embeddings are sometimes called distributed representations.

Semantic Similarity

The key property: texts with similar meanings end up as vectors that are close together in this high-dimensional space. "The cat sat on the mat" and "A kitten rested on the rug" should produce vectors that lie very close to each other, even though they share almost no content words.

This enables semantic search — finding results by meaning rather than keywords.

Distance Metrics

How do we measure "closeness" between vectors? There are three main approaches: cosine similarity, dot product, and Euclidean distance.

Cosine similarity is the most widely used because it measures the angle between vectors, ignoring their magnitude. Two vectors pointing in the same direction have cosine similarity of 1.0, perpendicular vectors score 0.0, and opposite vectors score -1.0.
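
The three reference values above can be checked with a minimal sketch of cosine similarity in plain Python. The function name is illustrative, and the sketch assumes neither vector is all zeros (otherwise the division fails).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1, 0], [2, 0]))   # same direction: 1.0
print(cosine_similarity([1, 0], [0, 3]))   # perpendicular: 0.0
print(cosine_similarity([1, 0], [-4, 0]))  # opposite direction: -1.0
```

Note that the second vector in each pair has a different length from the first, yet the scores depend only on direction.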

When to Use Which

| Metric | Best For | Watch Out |
| --- | --- | --- |
| Cosine | General-purpose semantic search | Slightly slower than dot product |
| Dot product | Pre-normalized vectors (most APIs) | Meaningless if vectors aren't normalized |
| Euclidean | When magnitude matters (e.g., popularity) | Dominated by high-magnitude dimensions |

In practice, most embedding APIs return normalized vectors (length = 1), so cosine similarity and dot product give identical results. Use dot product — it's faster.
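
The equivalence for unit-length vectors can be verified with a short sketch. The helper names are illustrative and the input vectors are arbitrary examples; whether a given API returns normalized vectors is something to confirm in its documentation.

```python
import math


def normalize(v):
    """Scale v to length 1 (assumes v is not all zeros)."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

# Full cosine formula, including the division by both lengths.
cosine = dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# For unit vectors the lengths are 1, so the division changes nothing.
print(math.isclose(dot(a, b), cosine))  # True

# Squared Euclidean distance is also tied to the dot product: |a - b|^2 = 2 - 2 * dot(a, b)
print(math.isclose(euclidean(a, b) ** 2, 2 - 2 * dot(a, b)))  # True
```

Because of the second identity, ranking unit vectors by dot product, cosine similarity, or Euclidean distance yields the same order; the dot product simply skips the extra arithmetic.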

Why Keywords Fail: Real Examples

| Query | Keyword Match | Semantic Match |
| --- | --- | --- |
| "how to reduce cloud costs" | Documents with exact phrase | Documents about "infrastructure optimization," "right-sizing instances," "reserved capacity" |
| "employee unhappy with manager" | Documents with those words | Documents about "workplace conflict resolution," "management feedback," "retention risk signals" |
| "GDPR data deletion" | Documents mentioning GDPR | Documents about "right to erasure," "data subject requests," "PII removal workflows" |
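
A semantic match like the ones in the table boils down to ranking stored vectors by similarity to the query vector. The sketch below shows that ranking step only. The 3D vectors are hand-made toy values chosen for illustration, not output from any real embedding model, and `semantic_search` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_search(query_vec, index, top_k=2):
    """Return the top_k texts whose vectors are most similar to query_vec."""
    scored = [(cosine(query_vec, vec), text) for text, vec in index.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [text for _, text in scored[:top_k]]


# Toy index: text -> made-up vector (a real system would call an embedding model).
index = {
    "right-sizing instances": [0.9, 0.1, 0.0],
    "reserved capacity": [0.8, 0.2, 0.1],
    "workplace conflict resolution": [0.0, 0.9, 0.2],
    "right to erasure": [0.1, 0.1, 0.9],
}

# Stands in for the embedding of "how to reduce cloud costs".
query_vec = [1.0, 0.1, 0.0]

results = semantic_search(query_vec, index)
print(results)  # ['right-sizing instances', 'reserved capacity']
```

None of the returned texts shares a word with the query; they rank highest only because their vectors point in a similar direction. A linear scan like this is fine for a handful of vectors, while larger collections typically rely on a vector index.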

The Embedding Space

Think of the embedding space as a map of concepts. Similar concepts cluster together:

  • "Python," "JavaScript," "TypeScript" cluster near each other
  • "Revenue," "profit," "earnings" form another cluster
  • "Happy," "joyful," "elated" form yet another

But it's more nuanced than simple clustering. The relationships between concepts are also preserved. The vector from "king" to "queen" is similar to the vector from "man" to "woman." This is called analogical reasoning in embedding space.
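
The king/queen relationship can be sketched with vector arithmetic. The 2D vectors below are hand-picked toy values with deliberately clean axes so the effect is easy to see; real embeddings have no such interpretable dimensions, and the analogy holds only approximately in practice. The `analogy` helper is illustrative.

```python
import math

# Toy vectors, invented for this example (not from a real model).
vectors = {
    "man": [0.1, 0.9],
    "woman": [0.1, 0.1],
    "king": [0.9, 0.9],
    "queen": [0.9, 0.1],
    "apple": [0.5, 0.5],
}


def analogy(a, b, c, vocab):
    """Solve 'a is to b as c is to ?' by computing b - a + c and finding the nearest word."""
    target = [vb - va + vc for va, vb, vc in zip(vocab[a], vocab[b], vocab[c])]
    candidates = (w for w in vocab if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(target, vocab[w]))


answer = analogy("man", "woman", "king", vectors)
print(answer)  # queen
```

The offset from "man" to "woman" is applied to "king", and the closest remaining word to the resulting point is "queen".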

Limitations of Embeddings

Embeddings aren't magic. Key limitations:

  • Context window — Most embedding models handle 512-8,192 tokens. Longer documents need chunking.
  • Domain specificity — A general embedding model may not capture domain jargon well. "Transformer" means different things in ML vs electrical engineering.
  • No reasoning — Embeddings capture similarity, not logic. "The cat chased the dog" and "The dog chased the cat" may have very similar embeddings despite opposite meanings.
  • Stale knowledge — Embeddings reflect training data. New terms or concepts won't be represented well until the model is updated.
  • Dimensionality trade-off — More dimensions = more nuance but more storage and slower search.
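
The chunking mentioned under the context window limitation can be sketched as a sliding window with overlap. This version counts whitespace-separated words as a rough stand-in for tokens, which is an assumption: real models use their own tokenizers, so actual token counts will differ. The function name and default sizes are illustrative.

```python
def chunk_words(text, max_words=200, overlap=20):
    """Split text into chunks of at most max_words words, repeating `overlap` words between chunks."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


chunks = chunk_words("a b c d e f g", max_words=3, overlap=1)
print(chunks)  # ['a b c', 'c d e', 'e f g']
```

Each chunk is then embedded separately. The overlap keeps a sentence that straddles a boundary from being cut off from its context in both chunks.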

Key Takeaways

  • Embeddings convert text into vectors that capture meaning
  • Similar meanings produce nearby vectors in high-dimensional space
  • Cosine similarity is the most widely used similarity measure; on normalized vectors, dot product gives the same result faster
  • Embeddings enable semantic search that keywords can't match
  • They have real limitations around context, domain, and reasoning

This is chapter 1 of Vector Databases & Embeddings.
