Comment by janalsncm

9 months ago

Cross encoders still don’t solve the fundamental problem of defining similarity that the author is referring to.

Frankly, the LLM approach the author talks about in the end doesn’t either. What does “similar” mean here?

Given inputs A, B, and C, you have to decide whether A and B are more similar or A and C are more similar. The algorithm (or architecture, depending on how you look at it) can’t do that for you. Dual encoder, cross encoder, bag of words, it doesn’t matter.

I think what you’re getting at could be addressed a few way. One is explainability — and with an llm you can ask it to tell you why it chose one or the other.

That’s not practical for a lot of applications, but it can do it.

For the cross encoder I trained, I have a pretty good idea what similar means because I created a semi-synthetic dataset that has variants based on 4 types of similarity.

Perhaps not a perfect solution when you’re really trying to split hairs about what is more similar between texts that are all pretty similar, but not all applications need that level of specificity either.