
Comment by yobbo

10 months ago

Embeddings represent more than P("found in the same context").

It is true that cosine similarity is unhelpful if you expect it to be a distance measure.

[0,0,1] and [0,1,0] are orthogonal (cosine 0) but have euclidean distance √2, and 1/3 of vector elements are identical.
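A quick way to check the numbers above, as a minimal sketch in plain Python. The helper names `cosine_similarity` and `euclidean_distance` are made up for illustration and are not from any particular library.

```python
import math

def cosine_similarity(a, b):
    # Dot product divided by the product of the two vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def euclidean_distance(a, b):
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

a = [0, 0, 1]
b = [0, 1, 0]

cos_ab = cosine_similarity(a, b)          # orthogonal vectors
dist_ab = euclidean_distance(a, b)        # sqrt(0 + 1 + 1)
shared = sum(x == y for x, y in zip(a, b))  # positions with equal values

print(cos_ab, dist_ab, f"{shared}/{len(a)} elements identical")
```

The cosine says nothing about how far apart the two points are; it only reports the angle between them.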

It is better if embeddings also encode angles and absolute and relative distances in some meaningful way. Testing only cosine similarity ignores all distances.

Modern embeddings lie on the surface of a hypersphere, which makes Euclidean distance a monotonic function of cosine similarity, so the two rank neighbors the same way. And if they don't, I probably wouldn't want to use them.

  • True, on a hypersphere cosine and euclidean are equivalent.

    But if random embeddings are Gaussian, they are distributed in a "cloud" around the hypersphere rather than exactly on it, so the two measures are not equivalent.
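To make the hypersphere point concrete: for unit-length vectors, squared Euclidean distance equals 2 − 2·cos, so one determines the other. The sketch below checks that identity on two random Gaussian vectors after normalizing them, and shows it failing on the raw vectors. The dimension (64) and the seed are arbitrary assumptions chosen only for the illustration, and the helper names are hypothetical.

```python
import math
import random

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    # Project a vector onto the unit hypersphere.
    n = norm(v)
    return [x / n for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)   # fixed seed, arbitrary choice
dim = 64                 # arbitrary dimension for the illustration

# Two raw Gaussian vectors: close to a sphere of radius ~sqrt(dim), but not on it.
a = [rng.gauss(0, 1) for _ in range(dim)]
b = [rng.gauss(0, 1) for _ in range(dim)]

# Cosine similarity is unchanged by normalization.
cos_ab = cosine_similarity(a, b)

# On the unit hypersphere: ||u - v||^2 == 2 - 2 * cos(u, v).
u, v = normalize(a), normalize(b)
unit_gap = abs(euclidean_distance(u, v) ** 2 - (2 - 2 * cos_ab))

# On the raw vectors the same identity does not hold.
raw_gap = abs(euclidean_distance(a, b) ** 2 - (2 - 2 * cos_ab))

print("norms:", norm(a), norm(b))
print("gap after normalizing:", unit_gap)
print("gap on raw vectors:", raw_gap)
```

Once norms vary from vector to vector, Euclidean distance carries information that cosine similarity discards, which is the sense in which the two stop being interchangeable.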