Comment by deepsquirrelnet

3 days ago

I think what you’re getting at could be addressed in a few ways. One is explainability: with an LLM, you can ask it to tell you why it chose one or the other.

That’s not practical for a lot of applications, but it can do it.

For the cross encoder I trained, I have a pretty good idea what similar means because I created a semi-synthetic dataset that has variants based on 4 types of similarity.
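To make that concrete, here is a rough sketch of what a dataset like that could look like. It is only an illustration: the four variant types below (shuffled words, truncation, casing, a swapped-character typo) are made-up placeholders, not the four types I actually used, and `build_pairs` is a hypothetical helper name. The idea is that each row records which kind of similarity produced it, so you know what the model was taught "similar" means.

```python
import random

# Hypothetical variant generators. These four are placeholders chosen
# for illustration; they are NOT the similarity types from the comment.
def shuffled_words(text, rng):
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def truncated(text, rng):
    words = text.split()
    keep = max(1, len(words) // 2)  # keep the first half, at least one word
    return " ".join(words[:keep])

def uppercased(text, rng):
    return text.upper()

def swapped_chars(text, rng):
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)  # swap characters i and i+1
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]

VARIANTS = {
    "shuffled_words": shuffled_words,
    "truncated": truncated,
    "uppercased": uppercased,
    "swapped_chars": swapped_chars,
}

def build_pairs(seed_texts, seed=0):
    """Return one (anchor, variant) row per seed text and variant type."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    rows = []
    for anchor in seed_texts:
        for kind, make_variant in VARIANTS.items():
            rows.append({
                "anchor": anchor,
                "variant": make_variant(anchor, rng),
                "similarity_type": kind,
            })
    return rows

if __name__ == "__main__":
    for row in build_pairs(["the quick brown fox jumps"]):
        print(row["similarity_type"], "->", row["variant"])
```

Because every pair carries its `similarity_type`, you can later check which kinds of similarity the trained model scores highly and which it doesn't.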

Perhaps not a perfect solution when you’re really trying to split hairs about what is more similar between texts that are all pretty similar, but not all applications need that level of specificity either.