Comment by bjourne

4 days ago

So word vectors solve the problem that two words may never appear in the same context, yet can be strongly correlated. "Python" may never be found close to "Ruby", yet "scripting" is likely to be found in both their contexts so the embedding algorithm will ensure that they are close in some vector space. Except it rarely works well because of the curse of dimensionality.
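
To make that concrete, here is a toy illustration (not any particular embedding algorithm, and the counts are made up): raw context-count vectors already give "python" and "ruby" a positive cosine similarity through the shared "scripting" dimension, even though the two words never co-occur.

    # Toy context-count vectors over the context words [scripting, snake, gem].
    # "python" and "ruby" never co-occur, but both co-occur with "scripting",
    # so their context vectors overlap and their cosine similarity is positive.
    import numpy as np

    python = np.array([3.0, 2.0, 0.0])  # counts for: scripting, snake, gem
    ruby = np.array([2.0, 0.0, 4.0])

    cos = python @ ruby / (np.linalg.norm(python) * np.linalg.norm(ruby))
    print(round(cos, 3))  # ~0.372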

Perhaps one could represent word embeddings as vertices rather than vectors? Suppose you find "Python" and "scripting" in the same context. You draw a weighted edge between them. If you find the same pair again, you reduce the weight of the edge, so that more frequent co-occurrence means a shorter path. Then to compute the similarity between two words, just compute the weighted shortest path between their vertices. You could extend it to pairwise sentence similarity using Steiner trees. Of course it would be much slower than cosine similarity, but probably also much more useful.
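
A minimal sketch of that graph idea, assuming networkx; the contexts and the 1/count weighting are illustrative choices, not a fixed recipe:

    # Build a co-occurrence graph and use weighted shortest paths as a
    # (dis)similarity measure. Edge weight is 1 / co-occurrence count, so
    # frequently co-occurring pairs get shorter edges.
    from collections import Counter
    from itertools import combinations

    import networkx as nx

    contexts = [
        ["python", "scripting", "language"],
        ["ruby", "scripting", "language"],
        ["python", "scripting"],
    ]

    counts = Counter()
    for ctx in contexts:
        for a, b in combinations(sorted(set(ctx)), 2):
            counts[(a, b)] += 1

    G = nx.Graph()
    for (a, b), c in counts.items():
        G.add_edge(a, b, weight=1.0 / c)  # more co-occurrences -> lower weight

    # "python" and "ruby" never co-occur directly, but they are connected
    # through the shared "scripting" context.
    print(nx.shortest_path_length(G, "python", "ruby", weight="weight"))  # 1.5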

Embeddings represent more than P("found in the same context").

It is true that cosine similarity is unhelpful if you expect it to be a distance measure.

[0,0,1] and [0,1,0] are orthogonal (cosine 0) but have euclidean distance √2, and 1/3 of vector elements are identical.
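
A quick numpy check of those three numbers:

    import numpy as np

    a = np.array([0.0, 0.0, 1.0])
    b = np.array([0.0, 1.0, 0.0])

    print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))  # 0.0, orthogonal
    print(np.linalg.norm(a - b))                            # sqrt(2) ~ 1.414
    print(np.mean(a == b))                                  # 1/3 of elements match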

It is better if embeddings also encode angles and absolute and relative distances in some meaningful way. Testing only cosine ignores all of that distance information.

  • Modern embeddings lie on a hypersphere surface, which makes euclidean distance equivalent to cosine similarity (one is a monotonic function of the other). And if they don't, I probably wouldn't want to use them.

    • True, on a hypersphere cosine and euclidean are equivalent.

      But if random embeddings are Gaussian, they are distributed in a "cloud" around the hypersphere rather than exactly on it, so the two are not equal (see the sketch below).
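
      A quick numerical check of both points, assuming numpy (the dimension and seed are arbitrary): for unit vectors ||a - b||^2 = 2 - 2*cos(a, b), while unnormalized Gaussian vectors only concentrate near radius sqrt(d) instead of lying exactly on a sphere.

        import numpy as np

        rng = np.random.default_rng(0)
        d = 512

        # Normalized vectors: squared euclidean distance is 2 - 2*cosine.
        a, b = rng.normal(size=(2, d))
        a /= np.linalg.norm(a)
        b /= np.linalg.norm(b)
        print(np.isclose(np.linalg.norm(a - b) ** 2, 2 - 2 * (a @ b)))  # True

        # Unnormalized Gaussian vectors: norms cluster near sqrt(d) ~ 22.6,
        # with a small but nonzero spread, i.e. a thin shell, not a sphere.
        norms = np.linalg.norm(rng.normal(size=(1000, d)), axis=1)
        print(norms.mean(), norms.std())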

This used to be called an ontology or semantic network; see e.g. OpenCyc (although it is rather more elaborate). What you propose is rather different from word embeddings, since it can't capture word features (think: connotations) or handle ambiguity, and discovering similarities symbolically is not a well-understood problem.