Comment by celestialcheese

3 years ago

Very interested in this - I've been using embeddings / semantic search for information retrieval from PDFs with ada-002, and have been impressed by the results in testing.

The reasons the article listed, namely a) lock-in and b) cost, have given me pause about embedding our whole corpus of data. I'd much rather use an open model, but I don't have much experience evaluating these embedding models and their search performance - it's still very new to me.

Similar to what you did with ada-002 vs Instruct XL, have there been any papers or prior work evaluating the different embedding models?

You can find some comparisons and evaluation datasets/tasks here: https://www.sbert.net/docs/pretrained_models.html

Generally MiniLM is a good baseline. For faster models you want this library:

https://github.com/oborchers/Fast_Sentence_Embeddings

For higher-quality ones, just take the bigger/slower models in the SentenceTransformers library.
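
If you want to sanity-check retrieval quality quickly, here's a minimal sketch using the SentenceTransformers API with the all-MiniLM-L6-v2 checkpoint (the model name, query, and documents are just placeholders, not from the thread):

```python
from sentence_transformers import SentenceTransformer, util

# Small, fast baseline from the SentenceTransformers model zoo;
# swap in a bigger/slower checkpoint for higher quality.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "Embeddings map text to dense vectors.",
    "Semantic search retrieves documents by meaning rather than keywords.",
]
query = "How can I search documents by meaning?"

# Normalized embeddings, so cosine similarity reduces to a dot product.
doc_emb = model.encode(docs, convert_to_tensor=True, normalize_embeddings=True)
query_emb = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Score the query against every document and print the ranking.
scores = util.cos_sim(query_emb, doc_emb)[0]
for doc, score in sorted(zip(docs, scores.tolist()), key=lambda x: x[1], reverse=True):
    print(f"{score:.3f}  {doc}")
```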

  • Are there performance comparisons for Apple Silicon machines?

    • Performance in terms of model quality would be the same.

      The fast-se library uses C++ code and averages word embeddings to generate sentence embeddings, so it would be similarly fast, or faster, on Apple Silicon than on x86 (see the averaging sketch below).

      For the SentenceTransformer library models I'm not sure, but I think they would run on the CPU on an M1/M2 machine.
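
      To make the averaging idea concrete, here's a minimal sketch of averaged word embeddings using plain gensim + numpy (this is not the fast-se API itself, and the pretrained vector name is just an example):

      ```python
      import numpy as np
      import gensim.downloader as api

      # Pretrained word vectors; any gensim KeyedVectors model works the same way.
      wv = api.load("glove-wiki-gigaword-50")

      def sentence_embedding(sentence: str) -> np.ndarray:
          """Average the word vectors of the in-vocabulary tokens."""
          tokens = [t for t in sentence.lower().split() if t in wv]
          if not tokens:
              return np.zeros(wv.vector_size)
          return np.mean([wv[t] for t in tokens], axis=0)

      a = sentence_embedding("the cat sat on the mat")
      b = sentence_embedding("a kitten rests on a rug")
      print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # cosine similarity
      ```

      Since this is just vector lookup and averaging on the CPU, it runs the same on Apple Silicon as on x86.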