
Comment by eyegor

2 years ago

You may want to consider adding a note to the top. Seems like a lot of people are lazily skimming the headline and seeing it as "gzip paper full of beans, gzip approach sucks" when really I see this as "gzip approach not better than dnn models but mostly competes and much cheaper to run". The paper is still solid.

>gzip approach not better than dnn models but mostly competes and much cheaper to run

Does it? It appears to do worse than FastText in all benchmarks, and kNN is not a cheap algorithm to run, so it might actually be slower than FastText.

edit: It looks like FastText takes 5 seconds to train on the Yahoo Answers dataset, while the gzip approach took them 6 days. So it's definitely not faster.

  • I'm not familiar with most of these models in detail, but training time is generally less interesting to me than inference time. I don't care if it takes a month to train on $10k of GPU rentals if it can be deployed and run on a Raspberry Pi. I should definitely look into FastText, though.

    • As described in the paper, it didn't look like the gzip classifier trained at all. Inference involved reading the entire training set.

      One could surely speed this up by preprocessing the training set and snapshotting the resulting gzip state, but that wouldn't affect the asymptotic complexity. In effect, the number of parameters equals the size of the entire training set. (Of course, lots of fancy models scale roughly like this too, so this isn't necessarily a loss.)

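      For concreteness, here's a minimal sketch of that inference loop, built around the normalized compression distance the paper uses (variable names and the value of k are illustrative):

      ```python
      import gzip

      def clen(s: str) -> int:
          """Length of s in bytes after gzip compression."""
          return len(gzip.compress(s.encode("utf-8")))

      def ncd(x: str, y: str) -> float:
          """Normalized compression distance between strings x and y."""
          cx, cy = clen(x), clen(y)
          return (clen(x + " " + y) - min(cx, cy)) / max(cx, cy)

      def classify(query: str, train: list[tuple[str, str]], k: int = 2) -> str:
          """kNN over NCD: every query rescans the entire training set."""
          neighbors = sorted(train, key=lambda item: ncd(query, item[0]))[:k]
          labels = [label for _, label in neighbors]
          return max(set(labels), key=labels.count)  # majority vote
      ```

      Every call to classify() recompresses the query against every training document, which is why caching can only shave off constants, not change the asymptotics.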

  • FastText isn't an LLM; it's a token-embedding model with a simple classifier on top.

    • Sure, but its existence means the statement is really "gzip approach not better than DNN models, and doesn't compete with or run cheaper than previous models like FastText." That's not a very meaningful value statement for the approach (although why gzip is even half-decent might be a very interesting research question).
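
      For reference, the basic supervised FastText workflow is just a couple of calls (a minimal sketch using the fasttext Python package; the file name and hyperparameters are placeholders):

      ```python
      import fasttext  # pip install fasttext

      # train.txt holds one example per line, label-prefixed, e.g.:
      #   __label__sports the match went to overtime
      model = fasttext.train_supervised(input="train.txt", epoch=5, wordNgrams=2)

      # Top label and its probability for a new document.
      labels, probs = model.predict("new phone released today")
      ```

      That one-call training step over a label-prefixed text file is what produces the seconds-scale training times quoted above.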

I honestly don't know why anyone would use this gzip approach in production. If you want to do text classification, the two options you should really consider are a best-in-class linear model, like confidence-weighted linear classification by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml_variance.p...), or a much more expensive LLM.

  • Do you happen to be familiar with audio classification? There's been a ton of research on text classification and prediction, but I haven't seen many good papers on general audio classification. I'm talking about feature extraction, not speech recognition; there are a lot of speech recognition papers. So far I've been stuck on an FFT → image-processing pipeline, but I haven't gotten great results in real-world tests, only on nice test datasets.

    Personally, I don't have much experience working beyond MLP/RNN/LSTM/CNN models.
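
    For what it's worth, the "FFT → image-processing pipeline" above usually means a log-mel spectrogram fed to a CNN; here's a minimal sketch using librosa (the file path and parameters are illustrative):

    ```python
    import librosa
    import numpy as np

    # Load a mono audio clip (placeholder path).
    y, sr = librosa.load("clip.wav", sr=22050)

    # Short-time FFT -> mel filter bank -> log scale: a 2-D "image".
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=512, n_mels=128)
    S_db = librosa.power_to_db(S, ref=np.max)  # shape (n_mels, n_frames)

    # S_db can now be treated as a single-channel image for a CNN classifier.
    ```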

  • Don't look at it as a suggestion to use gzip in production, but as an invitation to reconsider the unassailable superiority of BERT over simpler, tailored solutions.

    • I don't think anyone actually doing NLP research has thought that BERT is always better than simpler methods. Linear classifiers with n-grams, or better still, large-margin linear classifiers, are well known to be competitive with things like BERT on a variety of tasks, with orders-of-magnitude better runtime.

      In contrast, this gzip technique is considered a cute application of information theory, but even in academia it is rarely included in studies, because there are simpler and better techniques for NLP.

      Yes, if you are chasing the ultimate accuracy, then an LLM (not necessarily BERT) is going to be best. But for a practical system, trading some accuracy for vastly improved runtime is usually a very good trade-off. And again, it depends on your domain. Topic classification? Stick with a linear model. Sentiment analysis? OK, here an LLM actually gives substantially better results, so it's worth the extra cost if sentiment is crucial to your application.

      I personally like the CW algorithm I mentioned because it's relatively easy to implement and has excellent qualities. But if I were a dev looking for a ready-to-go, already-implemented production system, I'd go with Vowpal Wabbit and move up to an LLM if I'm not getting the accuracy I need for my application.
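
      To make "linear classifier with n-grams" concrete, here's a minimal sketch (scikit-learn's LinearSVC standing in for CW or Vowpal Wabbit; the toy data is a placeholder):

      ```python
      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.pipeline import make_pipeline
      from sklearn.svm import LinearSVC

      # Placeholder training data.
      train_texts = ["the match went to overtime", "new phone released today"]
      train_labels = ["sports", "tech"]

      # TF-IDF-weighted word unigrams + bigrams into a large-margin linear model.
      clf = make_pipeline(
          TfidfVectorizer(ngram_range=(1, 2), min_df=1),
          LinearSVC(),
      )
      clf.fit(train_texts, train_labels)
      print(clf.predict(["overtime thriller last night"]))
      ```

      Inference here is one sparse dot product per class, which is where the orders-of-magnitude runtime advantage over an LLM comes from.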

    • Is it really an invitation? The paper shows that the current models are worse on some marginalized languages used as OOD datasets. I'm not really surprised that the models don't speak those languages, and I don't know anybody who would use BERT like that.

    • But FastText (2015) already exists and beats this gzip approach on all criteria. So the invitation already existed before BERT (2018), and it continues to exist.

If this story is true, the paper is not solid.

The claims in the abstract, claim 3 in the paper, and much of the publicity around the paper are just wrong.

It takes gzip from being great out of domain to middling at best, and from something really interesting to a "meh" model. The main part that was intellectually interesting is how robust gzip is out of domain; if that's gone, there isn't much here.

If I were the reviewer for this paper, this would take it from an accept to a "submit to a workshop".

Also, kNN methods are slow: there is no training step, but every query has to be compared against all n training examples, so classifying a test set of comparable size costs O(n^2) distance computations.
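
To put rough numbers on that (assuming the standard Yahoo Answers split of 1.4M training and 60k test examples): kNN inference computes one NCD per (test, train) pair, so classifying the test set takes roughly 6×10^4 × 1.4×10^6 ≈ 8.4×10^10 distance computations, each of which gzips the concatenation of two documents. That lines up with the multi-day runtime mentioned upthread.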