Comment by deepsquirrelnet

4 days ago

> So, what can we use instead?

> The most powerful approach

> The best approach is to directly use LLM query to compare two entries.

Cross encoders are a solution I’m quite fond of: high performing and much faster than an LLM query. I recently put an STS cross encoder up on huggingface, based on ModernBERT, that performs very well.

I had to look that up… for others:

An STS cross encoder is a model that uses the CrossEncoder class to predict the semantic similarity between two sentences. STS stands for Semantic Textual Similarity.

Technically speaking, cross encoders are LLMs - they use the last layer to predict similarity (a single number) rather than the probability of the next token. They are faster than generative models only if they are simpler - otherwise, there is no speed gain (the cost of the last layer is negligible). In any case, even the simplest cross encoders are more computationally intensive than approaches that take a dot product of pre-computed vectors.

That said, for many applications, we may be perfectly fine with some version of a fine-tuned BERT-like model rather than using the newest AGI-like SoTA just to check whether two products are vaguely similar and whether it is worth putting one in the other’s suggestions.
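To make the cost difference concrete, here is a minimal sketch of how many model calls each approach needs. The functions `toy_encode` and `toy_pair_score` are hypothetical stand-ins for real models (they are not from any library), and the product names are made up; only the call counts matter here, not the scores.

```python
class CallCounter:
    """Wraps a function and counts how many times it is called."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        return self.fn(*args)


def toy_encode(text):
    # Hypothetical stand-in for an embedding model: a tiny vowel-count vector.
    return [text.count(ch) for ch in "aeiou"]


def toy_pair_score(a, b):
    # Hypothetical stand-in for one cross-encoder forward pass over the pair (a, b).
    return len(set(a.split()) & set(b.split()))


def dot(u, v):
    # Cheap arithmetic, no model involved.
    return sum(x * y for x, y in zip(u, v))


queries = ["red running shoes", "blue rain jacket"]
catalog = ["red trail shoes", "green rain jacket", "running socks"]

encode = CallCounter(toy_encode)
pair_score = CallCounter(toy_pair_score)

# Dual-encoder style: one model call per text, then only dot products.
catalog_vecs = [encode(text) for text in catalog]  # can be pre-computed offline
query_vecs = [encode(text) for text in queries]
dual_scores = [[dot(q, c) for c in catalog_vecs] for q in query_vecs]

# Cross-encoder style: one model call per (query, item) pair.
cross_scores = [[pair_score(q, c) for c in catalog] for q in queries]

print("dual-encoder model calls:", encode.calls)       # len(queries) + len(catalog)
print("cross-encoder model calls:", pair_score.calls)  # len(queries) * len(catalog)
```

With 2 queries and 3 items the gap is small (5 calls versus 6), but the first count grows additively and the second multiplicatively, which is why cross encoders tend to be used to rerank a short candidate list rather than to score a whole catalog.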

  • This is true, and I’ve done quite a bit with static embeddings. You can check out my wordllama project if that’s interesting to you.

    https://github.com/dleemiller/WordLlama

    There’s also model2vec doing some cool things as well in that area. So it’s cool to see recent progress in 2024/5 on simple static embedding models.

    On the computational performance note, the cross encoder I trained using ModernBERT base performs on par with the roberta large model, while being about 7-8x faster. It’s still way more complex than static embeddings, but on benchmark datasets, it’s much more capable too.

Cross encoders still don’t solve the fundamental problem of defining similarity that the author is referring to.

Frankly, the LLM approach the author talks about in the end doesn’t either. What does “similar” mean here?

Given inputs A, B, and C, you have to decide whether A and B are more similar or A and C are more similar. The algorithm (or architecture, depending on how you look at it) can’t do that for you. Dual encoder, cross encoder, bag of words, it doesn’t matter.
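A toy illustration of that point, as a sketch only: the sentences, the hand-assigned sentiment labels, and both similarity functions below are made up for the example. Two perfectly reasonable definitions of "similar" rank the same triple in opposite ways, and nothing in the code can say which one is right.

```python
A = "the movie was great"
B = "the movie was not great"
C = "i loved this film"

# Hypothetical hand-assigned labels, standing in for a notion of "same opinion".
SENTIMENT = {A: "positive", B: "negative", C: "positive"}


def jaccard(a, b):
    # Lexical similarity: shared tokens divided by all tokens.
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)


def same_sentiment(a, b):
    # Opinion similarity: 1.0 if both texts carry the same label, else 0.0.
    return 1.0 if SENTIMENT[a] == SENTIMENT[b] else 0.0


def closer_to_a(sim):
    # Which of B or C does this similarity function place nearer to A?
    return "B" if sim(A, B) > sim(A, C) else "C"


print(closer_to_a(jaccard))         # B (word overlap 0.8 vs 0.0)
print(closer_to_a(same_sentiment))  # C (same label as A)
```

The choice between the two answers has to come from the application, whether that choice is encoded in a scoring function, in training data, or in a prompt.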

  • I think what you’re getting at could be addressed a few ways. One is explainability: with an LLM, you can ask it to tell you why it chose one or the other.

    That’s not practical for a lot of applications, but it can do it.

    For the cross encoder I trained, I have a pretty good idea what similar means because I created a semi-synthetic dataset that has variants based on 4 types of similarity.

    Perhaps not a perfect solution when you’re really trying to split hairs about what is more similar between texts that are all pretty similar, but not all applications need that level of specificity either.