
Comment by brig90

2 months ago

Totally fair — I defaulted to paraphrase-multilingual-MiniLM-L12-v2 mostly for speed and wide compatibility, but you’re right that it’s long in the tooth by today’s standards. I’d be really curious to see how something like all-mpnet-base-v2 or even text-embedding-ada-002 would behave, especially if we keep the suffixes in and lean into full contextual embeddings rather than reducing to root forms.
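For what it's worth, a rough sketch of how that comparison might look with sentence-transformers, keeping the suffixed forms instead of reducing to roots (the sample words here are just made up for illustration, and the model names are the ones mentioned above):

    from sentence_transformers import SentenceTransformer, util

    # Hypothetical word forms; the idea is to keep suffixes rather than stemming.
    texts = ["connect", "connection", "connected"]

    for name in ("paraphrase-multilingual-MiniLM-L12-v2", "all-mpnet-base-v2"):
        model = SentenceTransformer(name)
        emb = model.encode(texts, convert_to_tensor=True)
        # Cosine similarity of the root form against each suffixed form.
        print(name, util.cos_sim(emb[0], emb[1:]).tolist())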

Appreciate you calling that out — that’s a great push toward iteration.

Be careful: those models have very short context lengths AND silently truncate anything longer. To me there's really no reason to use them.

I recommend Ollama for running the arctic-embed-v2 model; it's also multilingual, and you can pass --quantize when creating it from the Modelfile to make it even smaller.
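Once the model is pulled (or built from a Modelfile with --quantize for a smaller version), getting embeddings out is something along these lines; the exact model tag is my guess at the Ollama name, so check the library listing:

    import requests

    # Assumes a local Ollama server on the default port; the model tag
    # ("snowflake-arctic-embed2") is an assumption -- verify it in the Ollama library.
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": "snowflake-arctic-embed2", "prompt": "text to embed"},
    )
    print(len(resp.json()["embedding"]))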