← Back to context

Comment by pamelafox

5 days ago

I should clarify that I experienced these issues with text-embedding-ada-002 and the Azure AI vision model (based on Florence). I have not tested many other embedding models to see if they'd have the same issue.

FWIW I think you're right, we have very different stacks, and I've observed the same thing, with a much clunkier description thank your elegant way of putting it.

I do embeddings on arbitrary websites at runtime, and had a persistent problem with the last chunk of a web page matching more things. In retrospect, its obvious that the smaller the chunk was, the more it was matching everything

Full details: MSMARCO MiniLM L6V3 inferenced using ONNX on iOS/web/android/macos/windows/linux

You could also work around this by adding a scaling transformation that normalizes and centers (e.g. sklearn StandardScaler) in between the raw embeddings — based on some example data points from your data set. Might introduce some bias, but I’ve found this helpful in some cases with off the shelf embeddings.

Use horrible quality embeddings and get horrible results. No surprise there. ada is obsolete - I would never want to use it.