Comment by jabbany

2 years ago

> The miracle of BERT / embeddings is _not_ having to share words

To be fair, the original task was specifically chosen to be one where something like kNN+compression has a chance of doing well: i.e. out-of-domain + low-resource (a rough sketch of that approach is below).

Under those conditions the training data can be too sparse for a heavily parameterized model to learn good embeddings from.

In a traditional in-domain + big-data classification setting, there's no chance a non-parametric method like compression would beat a learned representation.
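
For reference, the kind of kNN+compression classifier being discussed works roughly like this: a minimal sketch using gzip and a normalized-compression-distance style similarity, with a simple majority vote (function names and the k value are just illustrative, not the exact setup from the paper):

```python
import gzip
from collections import Counter

def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two strings via gzip sizes."""
    cx = len(gzip.compress(x.encode()))
    cy = len(gzip.compress(y.encode()))
    cxy = len(gzip.compress((x + " " + y).encode()))
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_compress_predict(test_text, train_texts, train_labels, k=3):
    """Label a test string by majority vote over its k nearest training strings."""
    order = sorted(range(len(train_texts)), key=lambda i: ncd(test_text, train_texts[i]))
    top = [train_labels[i] for i in order[:k]]
    return Counter(top).most_common(1)[0][0]
```

No training step at all, which is exactly why it can hold up when there isn't enough data to fit a parameterized model, and why it shouldn't be expected to win once there is.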