Comment by tensor

2 years ago

I honestly don't know why anyone would use this gzip approach in production. If you want to do text classification, really the two options you should consider are a best-in-class linear model, like confidence-weighted linear classification by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml_variance.p...), or a much more expensive LLM.
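
For a rough sense of the linear-model side of that trade-off: CW itself isn't in the common libraries, so the sketch below substitutes scikit-learn's PassiveAggressiveClassifier (another large-margin online linear model) over n-gram features; the toy data and parameters are only placeholders, not a tuned setup.

    # Not CW itself: a stand-in large-margin online linear model over n-gram features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import PassiveAggressiveClassifier
    from sklearn.pipeline import make_pipeline

    train_texts = ["cheap meds now", "meeting moved to 3pm"]   # toy placeholder corpus
    train_labels = ["spam", "ham"]

    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2)),          # word unigrams + bigrams
        PassiveAggressiveClassifier(max_iter=1000),   # large-margin online updates
    )
    clf.fit(train_texts, train_labels)
    print(clf.predict(["free meds today"]))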

Do you happen to be familiar with audio classification? There's been a ton of research on text classification and prediction, but I haven't seen many good papers for general audio classification. I'm talking more about feature extraction, not speech recognition; there are a lot of speech recognition papers. So far I've been stuck on an FFT → image-processing pipeline, but I haven't gotten great results in real-world tests, only on nice test datasets.
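
Roughly, that kind of FFT → image pipeline looks like the sketch below; librosa, the file name, and the parameters are placeholders, not what I'm actually running:

    # Rough shape of an FFT -> "image" pipeline: log-mel spectrogram as a 2D feature map.
    import numpy as np
    import librosa

    y, sr = librosa.load("clip.wav", sr=16000)                   # placeholder audio file
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)  # (n_mels, frames) power spectrogram
    log_mel = librosa.power_to_db(mel)                           # compress dynamic range

    # Either treat log_mel like a grayscale image for a CNN, or pool it into a
    # fixed-length vector for a simpler classifier:
    features = np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])  # 128-dim summary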

Personally I don't have much experience working beyond mlp/rnn/lstm/cnn models.

Don't look at it as a suggestion to use gzip in production, but as an invitation to reconsider the unassailable superiority of BERT over simpler, tailored solutions.
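
For readers who haven't seen it, the gzip approach under discussion boils down to normalized compression distance plus a k-nearest-neighbour vote; here's a minimal sketch with toy data, not the paper's exact setup:

    # Minimal sketch of the gzip/NCD classifier: use compressed sizes as a similarity
    # proxy and classify by majority vote over the k nearest training examples.
    import gzip

    def ncd(a: str, b: str) -> float:
        ca = len(gzip.compress(a.encode()))
        cb = len(gzip.compress(b.encode()))
        cab = len(gzip.compress((a + " " + b).encode()))
        return (cab - min(ca, cb)) / max(ca, cb)

    def classify(x: str, train: list[tuple[str, str]], k: int = 3) -> str:
        neighbours = sorted(train, key=lambda pair: ncd(x, pair[0]))[:k]
        labels = [label for _, label in neighbours]
        return max(set(labels), key=labels.count)   # majority vote

    train = [("the match ended in a draw", "sports"),
             ("the team won the cup", "sports"),
             ("parliament passed the bill", "politics")]
    print(classify("the striker scored twice", train))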

  • I don't think anyone actually doing NLP research has thought that BERT is always better than simpler methods. Linear classifiers with n-grams, or even better, large-margin linear classifiers, are well known to be competitive with things like BERT on a variety of tasks, with orders of magnitude better runtime.

    In contrast, this gzip technique is considered a cute application of information theory, but even in academia it is rarely included in studies, because there are simpler and better techniques for NLP.

    Yes, if you are chasing the ultimate accuracy, then using an LLM (not necessarily BERT either) is going to be best. But for a practical system, trading some accuracy for vastly improved runtime is usually a very good trade-off. And again, it depends on your domain. Topic classification? Stick with a linear model. Sentiment analysis? OK, here an LLM actually gives substantially better results, so it's worth the extra cost if sentiment is crucial to your application.

    I personally like the CW algorithm I mentioned because it's relatively easy to implement and has excellent qualities. But if I were a dev looking for a ready-to-go, already-implemented production system, I'd go for vowpal wabbit and move up to an LLM if I'm not getting the accuracy I need for my application.

  • Is it really an invitation? The paper shows that the current models are worse for some marginalized languages that are used as OOD datasets. I am not really that surprised that the models don't speak those languages, and I don't know anybody who would use BERT like that.

  • But FastText (2015) already exists and beats this gzip approach on all criteria (a rough sketch of its use is below). So the invitation already existed before BERT (2018) and continues to exist.
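
    To make that concrete, the supervised FastText route via the official Python bindings looks roughly like this; the training file, label format, and hyperparameters are placeholders:

        # Sketch of a supervised fastText baseline; "train.txt" is a placeholder file
        # with one "__label__<class> <text>" line per example.
        import fasttext

        model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)
        labels, probs = model.predict("the match ended in a draw")
        print(labels, probs)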