
Comment by ks2048

2 years ago

Hi, that's my blog post. I'm pretty sure about what I wrote here, but may need the authors to chime in in case I am missing something. I just submitted an issue on GitHub: https://github.com/bazingagin/npc_gzip/issues/3

You may want to consider adding a note to the top. Seems like a lot of people are lazily skimming or just reading the headline and seeing it as "gzip paper full of beans, gzip approach sucks", when really I see this as "gzip approach not better than dnn models but mostly competes and much cheaper to run". The paper is still solid.

  • >gzip approach not better than dnn models but mostly competes and much cheaper to run

    Does it? It looks like it does worse than FastText in all benchmarks, and kNN is not a cheap algorithm to run, so it might actually be slower than FastText.

    edit: It looks like FastText takes 5 seconds to train on the Yahoo Answers dataset, while the gzip approach took them 6 days. So definitely not faster.

    • I'm not familiar with most of these models in detail, but training time is generally less interesting to me than inference time. I don't care if it takes a month to train on $10k of GPU rentals if it can be deployed and run on a Raspberry Pi. I should definitely look into FastText though.


  • I honestly don't know why anyone would use this gzip approach in production. If you want to do text classification, the two options you should really consider are a best-in-class linear model, like confidence-weighted linear classification by Crammer (https://www.cs.jhu.edu/~mdredze/publications/icml_variance.p...), or much more expensive LLMs.

    • Do you happen to be familiar with audio classification? There's been a ton of research on text classification and prediction, but I haven't seen many good papers on general audio classification. I'm talking about feature extraction, not speech recognition; there are a lot of speech recognition papers. So far I've been stuck on an FFT-to-image-processing pipeline, but I haven't gotten great results in real-world tests, only on nice test datasets.

      Personally, I don't have much experience working beyond MLP/RNN/LSTM/CNN models.

    • Don't look at it as a suggestion to use gzip in production, but an invitation to reconsider the unassailable superiority of BERT over simpler, tailored solutions.


  • If this story is true, the paper is not solid.

    The claims in the abstract, claim 3 in the paper, and much of the publicity around the paper are just wrong.

    It takes gzip from being great out of domain to being middling at best. It goes from something really interesting to a "meh" model. The main part that was intellectually interesting is how robust gzip is out of domain; if that's gone, there isn't much here.

    If I were the reviewer for this paper, this would take it from an accept to a "submit to a workshop".

    Also, kNN methods are slow, O(n^2): every test document has to be compared against every training document.
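
For context on the cost being discussed: the paper's method computes a gzip-based Normalized Compression Distance (NCD) from each test document to every training document and then takes a kNN vote over the nearest ones. A minimal sketch of the idea, assuming gzip as the compressor (the names, and details like the space-joined concatenation, are illustrative rather than the repo's exact code):

```python
import gzip

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with gzip as the compressor."""
    cx = len(gzip.compress(x))
    cy = len(gzip.compress(y))
    cxy = len(gzip.compress(x + b" " + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def classify(test_doc: bytes, train: list[tuple[bytes, str]], k: int) -> str:
    # Each test document is compared against every training document,
    # so a full evaluation costs O(num_test * num_train) compressions.
    nearest = sorted((ncd(test_doc, doc), label) for doc, label in train)[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)  # majority vote; ties arbitrary
```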

Hi, I'm the first author of the paper, and I read your blog post. The reason I chose k=2 is that it's recommended to use k = n^{1/2}, and I wanted to pick a k compatible with the 5-shot setting. But you are right that this is a bit odd.

As I mentioned in both the paper and on Twitter, different values of k will affect the result, and I reported the _max_ result we can get, so it does assume an ideal situation where the guess is always right. I also used this strategy for W2V and SentBERT. But it doesn't make the result top-2 accuracy. As far as I know, top-2 accuracy means we score if either of the top-2 predicted classes is correct. However, as you mentioned, in kNN with k=2 there is a situation where the 2 nearest neighbours point to the same class - in that case we are missing another class candidate, unlike true top-2 accuracy.

I would love to add results for different strategies and different k values when I upload the newest version to arXiv, when I have time. The "decrement" strategy you mentioned in the blog is really nice, and I would love to add it to the repo if you want.

I apologize for the short and late reply; I haven't checked the repo yet. I'm preparing for tomorrow's thesis defence and will reply & resolve the issue when I finish.
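
To make the strategies in that reply concrete, here is a minimal sketch of the three scoring options under discussion: the "max" strategy (count a hit if any tied candidate is correct, which for k=2 behaves like top-2 accuracy), an unbiased random tie-break, and the blog post's "decrement" strategy (fall back to the nearest neighbor on a tie). This is an illustrative reconstruction, not code from the repo:

```python
import random
from collections import Counter

def scored_correct(neighbor_labels: list[str], true_label: str, strategy: str) -> bool:
    """neighbor_labels holds the labels of the k nearest training docs, nearest first."""
    counts = Counter(neighbor_labels).most_common()
    tied = [label for label, c in counts if c == counts[0][1]]  # majority candidates
    if strategy == "max":
        # Count a hit if ANY majority candidate is correct; with k=2 and
        # two distinct neighbors this gives the model two guesses (top-2).
        return true_label in tied
    if strategy == "random":
        # Unbiased: pick one tied candidate at random.
        return random.choice(tied) == true_label
    if strategy == "decrement":
        # On a tie, fall back to the nearest neighbor (effectively k=1).
        prediction = tied[0] if len(tied) == 1 else neighbor_labels[0]
        return prediction == true_label
    raise ValueError(f"unknown strategy: {strategy}")
```

The three strategies differ only when the two nearest neighbors disagree; "max" then gets credit for either candidate, which is why it coincides with top-2 accuracy.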

Just wanted to say, thanks for your work debugging this.

You have probably saved other researchers an unfathomable amount of time.

Thanks for the replication, this is important.

One question, did you try to replicate the other result table (Table 3)?

If I understand correctly, top-2 accuracy would be 1 if you have only 2 classes, but it will differ from "normal" accuracy less and less as the number of classes increases (on average). So this shouldn't change the results for Table 3 thaaat much, as the datasets have large numbers of classes (see Table 1).
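
As a rough illustration of that point (back-of-the-envelope arithmetic, not figures from the paper): a classifier that ranked the C classes uniformly at random would score top-1 = 1/C and top-2 = 2/C, so the gap between the two metrics is 1/C and shrinks as C grows:

```python
# Random-ranking baseline: top-1 = 1/C, top-2 = 2/C, so the gap is 1/C.
for C in (2, 5, 20, 100):  # e.g. 20-newsgroups has C = 20
    print(f"C={C:3d}  top-1={1 / C:.3f}  top-2={2 / C:.3f}  gap={1 / C:.3f}")
```

With C = 2, top-2 "accuracy" is trivially 1, exactly as noted above.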

In any case, top-2 accuracy of 0.685 on the 20-newsgroups dataset is pretty neat for a method that doesn't even consider characters as characters[1], let alone tokens, n-grams, embeddings, and all the nice stuff that those of us working in NLP have been devoting years to.

[1] In my understanding of gzip, it considers only bit sequences, which are not necessarily aligned with character boundaries (i.e., bytes).

  • I haven't yet replicated Table 3 because most of those datasets are much larger and it will take a while to run (they said the Yahoo Answers dataset took them 6 days).

    Also, I have only tried the "gzip" row, because that is all that is in the GitHub repo they referenced.

    Yeah, you're right: the more classes there are, the smaller the effect this will have.

Did you try contacting the authors before you went public with your discovery?

  • We're adult enough to have discussions like this in public. They are healthy to have. People make mistakes. Kudos to the original authors for releasing the source code so people could inspect and replicate their results.

    • I agree, and just want to add: nowadays it's super common for researchers to widely publicize their new work on social media. The blog post here even mentions "Table 5 from the paper was often included in tweets".

      In this context of sharing your results very publicly, it seems only fair that rebuttals would be very public, too. Otherwise researchers would be highly incentivized to very publicly publish weak results because they would get a lot of positive publicity when they share the results, but not much negative publicity when the weaknesses are found and shared.

  • It isn't a security issue and doesn't warrant responsible disclosure, so why would the OP be expected to?

  • I did not, but I see why that could be a better approach. I'm mainly trying to be more open with the little side projects I do, so I want to start blogging about what I'm working on. Also, this paper was being widely discussed, so I thought this would be one more entry in that discussion.

Whoa, I just read your post and saw this:

> tldr: it is reporting a top-2 accuracy rather than a kNN(k=2) accuracy

If the accuracy figures shown for other models are top-1, that's a pretty significant mistake, hopefully an innocent one.

Thank you for doing this and sharing your findings!

---

Previous discussion on HN: https://news.ycombinator.com/item?id=36707193

  • Also:

    >k=2 is a bit of an odd choice for kNN classification

    That's an understatement. Choosing an even number for any sort of voting algorithm doesn't make much sense, and choosing 2 specifically probably makes the least sense of all.

    • Yeah, that is a big red flag - as the OP mentions, there is basically no way of making k=2 statistically different from k=1 (see the sketch at the end of this thread); that's why nobody uses it.

      I suppose the authors just tried many different k and selected k=2 because it performed surprisingly well (likely due to the bug the OP found). But if the results were significantly better than k=1 or k=3, it's a bit weird that the authors never double-checked why that was the case. I guess it can be one of those things you settle on early in the overall process with a few experiments and just take for granted afterwards, never checking it again, but still, it sounds like something that should pop out at some point while writing the paper...?

    • Yeah, I think this is one part where a reviewer or advisor could have focused questions.

      There is a sentence in Appendix C: "We set k = 2 for all the methods on all the datasets and we report the maximum possible accuracy getting from the experiments for each method."

      I'm not sure what the second part of that means exactly.

    • True. You always want to use an odd number so there are no ties.

      I'm guessing they were doing a parameter sweep and found that (thanks to the bug) they got the best results for k=2.

      This, too, is problematic in its own way.

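On the k=2 point in this thread, a minimal sketch of why k=2 cannot behave differently from k=1, assuming the common tie-break of preferring the nearer neighbor (an illustration, not the repo's code):

```python
def predict_k1(neighbors: list[str]) -> str:
    return neighbors[0]                  # label of the nearest neighbor

def predict_k2(neighbors: list[str]) -> str:
    a, b = neighbors[0], neighbors[1]    # labels of the two nearest neighbors
    if a == b:
        return a                         # unanimous 2-0 vote
    return a                             # 1-1 tie -> the nearer neighbor wins

# Every path in predict_k2 returns neighbors[0], i.e. exactly predict_k1:
# with this tie-break there is no way for k=2 to behave differently from k=1.
```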