Comment by electroglyph

2 days ago

I don't think many people are having luck replicating those benchmarks; the models are a bit weird.

I can't trust MTEB, as there's been a huge gap between benchmark scores and actual performance in my experience.

I made a small tool to help me compare various embedding models: https://www.vectorsimilaritytest.com/

Qwen embedding models score very highly but are highly sensitive to word order (they use last-token pooling, which, greatly simplified, means the embedding leans heavily on the final token of the input). Change the word order and the scores change completely. Voyage models score highly too, but changing a word from singular to plural can again completely change the scores.
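To make the pooling point concrete, here's a toy sketch. The per-token vectors are made up, and real causal models do attend across the whole sequence, so this is only the simplified intuition, not how Qwen actually computes hidden states:

```python
import numpy as np

# Made-up per-token hidden states for a 3-token input (purely illustrative).
tokens_original = np.array([
    [1.0, 0.0],  # "cat"
    [0.0, 1.0],  # "chases"
    [0.5, 0.5],  # "mouse"
])
tokens_reordered = tokens_original[[2, 1, 0]]  # "mouse chases cat"

def last_token_pool(h):
    # Embedding = the final token's state only.
    return h[-1]

def mean_pool(h):
    # Embedding = average over all token states.
    return h.mean(axis=0)

# Last-token pooling: reordering the tokens changes the embedding entirely.
print(last_token_pool(tokens_original))   # [0.5 0.5]
print(last_token_pool(tokens_reordered))  # [1.  0. ]

# Mean pooling: the same bag of tokens yields the same embedding.
print(mean_pool(tokens_original))         # [0.5 0.5]
print(mean_pool(tokens_reordered))        # [0.5 0.5]
```

Under this simplification, any pooling that keys on the last position will swing with word order, while order-insensitive pooling won't.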

I find myself doing a hybrid search, reranking and shortlisting the results, then feeding the shortlist to an LLM to judge what is and isn't relevant.
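That pipeline can be sketched roughly like this. The document names, scores, and blending weight are all hypothetical placeholders, and the rerank/LLM steps are stand-ins for whatever reranker and judge prompt you actually use:

```python
def hybrid_scores(keyword, vector, alpha=0.5):
    # Blend normalized keyword (BM25-style) and vector-similarity scores.
    return {doc: alpha * keyword[doc] + (1 - alpha) * vector[doc]
            for doc in keyword}

# Hypothetical pre-normalized scores from the two retrievers.
keyword_scores = {"doc_a": 0.9, "doc_b": 0.2, "doc_c": 0.6}
vector_scores  = {"doc_a": 0.3, "doc_b": 0.8, "doc_c": 0.7}

combined = hybrid_scores(keyword_scores, vector_scores)

# Shortlist the top candidates; in practice these would go through a
# reranker and then an LLM prompt asking "is this relevant to the query?"
shortlist = sorted(combined, key=combined.get, reverse=True)[:2]
print(shortlist)  # ['doc_c', 'doc_a']
```

The point of the blend is that neither retriever's ranking is trusted on its own; the LLM judge at the end catches the false positives both let through.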