Comment by refulgentis
2 years ago
Really happy to see this: kNN + a classification task + classification based on pure text similarity is a recipe for inflated results.
Schadenfreude responses to this paper miss that the natural-language understanding is what's crucially important about embeddings: sure, phrases that share words will classify well and GZIP well, so GZIP can serve as an ersatz classifier.
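For context, the paper's approach boils down to roughly this (a minimal sketch from my reading, not their exact code): score each training example against the query with normalized compression distance, then take a kNN vote. Texts that share surface words compress well together, which is exactly why this looks competitive on lexical-overlap data.

    import gzip

    def clen(s: str) -> int:
        # Length of the gzip-compressed text, a stand-in for Kolmogorov complexity.
        return len(gzip.compress(s.encode()))

    def ncd(a: str, b: str) -> float:
        # Normalized compression distance between two strings.
        ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
        return (cab - min(ca, cb)) / max(ca, cb)

    def knn_classify(query: str, train, k: int = 2):
        # train: list of (text, label) pairs.
        # Vote among the k training examples with the smallest NCD to the query.
        neighbors = sorted(train, key=lambda pair: ncd(query, pair[0]))[:k]
        labels = [label for _, label in neighbors]
        return max(set(labels), key=labels.count)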
The miracle of BERT / embeddings is _not_ having to share words: for instance, "what is my safe passcode?" has a strong match with "my lockbox pin is 1234", but not "my jewelry is stored safely in the safe".
This is also an important thing to consider with LLMs: people are using embeddings intended for plain text similarity, when what you want is an SBERT model trained to match a question to the document that will answer it.
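To make the passcode example concrete, here's a rough sketch with the sentence-transformers library (the model name is just one example of a question-to-answer-tuned model, not a recommendation; the expected ranking is my assumption, not a guarantee):

    from sentence_transformers import SentenceTransformer, util

    # A model tuned for question -> answering-passage retrieval (asymmetric search),
    # rather than plain sentence-to-sentence similarity.
    model = SentenceTransformer("multi-qa-MiniLM-L6-cos-v1")

    query = "what is my safe passcode?"
    docs = [
        "my lockbox pin is 1234",                   # answers the question, shares no words
        "my jewelry is stored safely in the safe",  # shares words, doesn't answer it
    ]

    scores = util.cos_sim(model.encode(query), model.encode(docs))
    # Expectation: the first document scores higher despite zero lexical overlap,
    # which gzip-style matching can't reproduce.
    print(scores)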
> The miracle of BERT / embeddings is _not_ having to share words
To be fair, the original tasks were specifically chosen to be settings where something like kNN + compression has a chance of being good: i.e. out-of-domain + low-resource.
Under these conditions the training inputs could be too sparse for a highly parameterized model to learn good embeddings from.
In traditional in-domain + big-data classification settings there's no chance that non-parametric methods like compression would beat a learned representation.
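If you want to sanity-check that yourself, a rough sketch (reusing the knn_classify sketch from the parent comment; 20 Newsgroups is just an arbitrary in-domain stand-in, and the gzip side is subsampled because it needs a compression per train/test pair):

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.pipeline import make_pipeline

    # In-domain setup: train and test drawn from the same distribution.
    cats = ["sci.space", "rec.autos"]
    train = fetch_20newsgroups(subset="train", categories=cats)
    test = fetch_20newsgroups(subset="test", categories=cats)

    # Parametric baseline: TF-IDF + logistic regression.
    clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    clf.fit(train.data, train.target)
    print("tf-idf + logreg:", accuracy_score(test.target, clf.predict(test.data)))

    # Compression baseline: knn_classify from the parent comment (slow, so subsample).
    labeled = list(zip(train.data[:500], train.target[:500]))
    preds = [knn_classify(x, labeled) for x in test.data[:200]]
    print("gzip + kNN:     ", accuracy_score(test.target[:200], preds))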