Embeddings don’t work // where hallucinations begin

Embedding similarity can’t capture simple logic such as negation, and that failure leads to retrieval errors

sbagency
2 min read · Mar 11, 2024
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# 1. load model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

# For retrieval, this model expects queries to be prefixed with an instruction:
#query = 'Represent this sentence for searching relevant passages: A man is eating a piece of bread'
query = "A man is not eating food."

docs = [
query,
"A man is eating food.",
"A man is eating pasta.",
"The girl is carrying a baby.",
"A man is riding a horse.",
]

# 2. Encode
embeddings = model.encode(docs)

# 3. Calculate cosine similarity
similarities = cos_sim(embeddings[0], embeddings[1:])

print(similarities)

#tensor([[0.7354, 0.5976, 0.1352, 0.3812]])
# The direct contradiction "A man is eating food." scores highest (0.7354):
# the embedding treats the negated query as most similar to its opposite.
https://arxiv.org/pdf/2403.05440.pdf — “Is Cosine-Similarity of Embeddings Really About Similarity?” (abstract below)

Cosine-similarity is the cosine of the angle between two vectors, or equivalently the dot product between their normalizations. A popular application is to quantify semantic similarity between high-dimensional objects by applying cosine-similarity to a learned low-dimensional feature embedding. This can work better but sometimes also worse than the unnormalized dot-product between embedded vectors in practice. To gain insight into this empirical observation, we study embeddings derived from regularized linear models, where closed-form solutions facilitate analytical insights. We derive analytically how cosine-similarity can yield arbitrary and therefore meaningless ‘similarities.’ For some linear models the similarities are not even unique, while for others they are implicitly controlled by the regularization. We discuss implications beyond linear models: a combination of different regularizations is employed when learning deep models; these have implicit and unintended effects when taking cosine-similarities of the resulting embeddings, rendering results opaque and possibly arbitrary. Based on these insights, we caution against blindly using cosine-similarity and outline alternatives.
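A minimal NumPy sketch of the equivalence stated above — cosine similarity as the dot product of L2-normalized vectors — and of how the unnormalized dot product differs; the vectors are made up for illustration:

```python
import numpy as np

# Two arbitrary example vectors (made up for illustration).
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, 1.0])

# Cosine similarity: cosine of the angle between a and b.
cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Equivalently: the dot product of the L2-normalized vectors.
cos_via_norm = np.dot(a / np.linalg.norm(a), b / np.linalg.norm(b))

# The unnormalized dot product also depends on vector magnitudes,
# so it can rank pairs differently than cosine similarity does.
dot = np.dot(a, b)

print(cos, cos_via_norm, dot)
```

Because cosine similarity throws away magnitude, any rescaling of individual embedding dimensions (e.g. by regularization during training) changes the angles — which is exactly the arbitrariness the paper analyzes.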

The real solution is to extract and represent knowledge in a structured form (knowledge graphs, tables, trees, etc.) and to analyze and process the logic explicitly (reasoning).
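As a hedged sketch of what such a structured check could look like: if each sentence is reduced to a (subject, predicate, polarity) triple — hand-written here; in practice these would come from a parser or information-extraction step — a symbolic comparison flags the contradiction that cosine similarity scored as the closest match:

```python
from dataclasses import dataclass

# Hypothetical structured representation of a simple sentence.
# These triples are hand-built for illustration, not extracted by a real pipeline.
@dataclass(frozen=True)
class Fact:
    subject: str
    predicate: str
    negated: bool = False

def relation(a: Fact, b: Fact) -> str:
    """Classify the logical relation between two simple facts."""
    if a.subject == b.subject and a.predicate == b.predicate:
        return "equivalent" if a.negated == b.negated else "contradiction"
    return "unrelated"

query = Fact("man", "eat food", negated=True)   # "A man is not eating food."
doc = Fact("man", "eat food", negated=False)    # "A man is eating food."

print(relation(query, doc))
```

The point is not this toy code but the division of labor: embeddings for fuzzy recall, a structured/logical layer for the relations (negation, contradiction, entailment) that similarity scores cannot express.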

Written by sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.
