
Comment by refulgentis

2 years ago

Really happy to see this: kNN + a classification task + classification driven by pure text similarity is a recipe for results stacked in the method's favor.

Schadenfreude responses to this paper miss why the natural language understanding is crucially important for embeddings: sure, phrases that share words will classify well and GZIP well, so GZIP can be used as ersatz classification.
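
For concreteness, here's a minimal sketch of the gzip-as-classifier idea under discussion: normalized compression distance (NCD) with gzip, plus a kNN vote over training labels. The toy strings and the value of k are my own illustration, not the paper's benchmark setup:

```python
import gzip
from collections import Counter

def clen(s: str) -> int:
    """Length of the gzip-compressed UTF-8 bytes of s."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(a: str, b: str) -> float:
    """Normalized Compression Distance between two strings."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def knn_gzip_classify(query: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label the query by majority vote over its k nearest training
    examples under NCD -- no parameters, no learned embeddings."""
    neighbors = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Toy usage: shared surface words dominate the distance.
train = [
    ("the movie was great and I loved it", "positive"),
    ("the movie was terrible and I hated it", "negative"),
]
print(knn_gzip_classify("the movie was great, loved it", train, k=1))
```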

The miracle of BERT / embeddings is _not_ having to share words: for instance, "what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but not with "my jewelry is stored safely in the safe".

This is also important to consider with LLMs: people are using embeddings intended for text similarity, when what you want is an SBERT model trained to match a question to a document that will answer it.
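
To make that asymmetric-retrieval point concrete, a hedged sketch with the sentence-transformers library; the specific model name below is just one example of a question-to-passage model, my pick rather than anything the paper or comment prescribes:

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

# Illustrative choice: multi-qa-MiniLM-L6-cos-v1 is one of the
# sentence-transformers models trained on question/answer pairs,
# as opposed to symmetric sentence-similarity models.
model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

query = "what is my safe passcode?"
docs = [
    "my lockbox pin is 1234",
    "my jewelry is stored safely in the safe",
]

q_emb = model.encode(query, convert_to_tensor=True)
d_emb = model.encode(docs, convert_to_tensor=True)

# Cosine similarity: the passage that answers the question should score
# higher even though it shares no words with the query.
scores = util.cos_sim(q_emb, d_emb)[0]
for doc, score in zip(docs, scores):
    print(f"{float(score):.3f}  {doc}")
```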

https://news.ycombinator.com/item?id=35377935

> The miracle of BERT / embeddings is _not_ having to share words

To be fair, the original task is specifically chosen so that something like kNN + compression has a chance of being good: i.e. out-of-domain + low-resource.

Under these conditions the training inputs could be too sparse for a highly parameterized model to learn good embeddings from.

In traditional in-domain + big-data classification settings, there's no chance that non-parametric methods like compression would beat a learned representation.