Gzip and KNN Outperform Transformers on Text Classification

2 years ago (twitter.com)

Direct link to paper: https://aclanthology.org/2023.findings-acl.426.pdf

Intuitively, the key idea is this: given two documents x1 and x2 and a new document x, if x's statistical regularities are more similar to x1's than to x2's, then len(compress(cat(x1,x))) - len(compress(x)) < len(compress(cat(x2,x))) - len(compress(x)), where "cat" is concatenation and "compress" is a compressor like gzip.

Quite literally, len(compress(cat(x1,x))) - len(compress(x)) is the number of additional bytes we need to compress the statistical regularities in x1 given the statistical regularities in x. The more similar the statistical regularities of x1 and x, the fewer additional bytes we need to compress cat(x1,x) compared to compressing only x.
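
A minimal sketch of that delta using only Python's standard library (the toy strings below are my own; on inputs this short the effect can be noisy):

    import gzip

    def clen(data: bytes) -> int:
        """Length of the gzip-compressed data in bytes."""
        return len(gzip.compress(data))

    x  = b"shares rallied after the company reported strong quarterly earnings"
    x1 = b"the stock market climbed as quarterly earnings beat expectations"
    x2 = b"preheat the oven and whisk two eggs into the flour"

    # Extra bytes needed to encode cat(x1, x) beyond x alone, and likewise for x2.
    # The more regularities x shares with x1, the smaller the first delta should be.
    d1 = clen(x1 + x) - clen(x)
    d2 = clen(x2 + x) - clen(x)
    print(d1, d2)  # expect d1 < d2 when x is closer to x1's topic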

The authors apply kNN on the compressed documents using a distance function called "normalized compression distance" (NCD), based on the above idea. They also discuss the relationship of NCD to information, Shannon entropy, and Kolmogorov complexity.
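
Concretely, NCD(x,y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. A short sketch of the whole classifier along the lines of the paper's description (not the authors' actual code):

    import gzip
    from collections import Counter

    def C(s: str) -> int:
        """Compressed length of s under gzip."""
        return len(gzip.compress(s.encode("utf-8")))

    def ncd(x: str, y: str) -> float:
        """Normalized compression distance between two documents."""
        cx, cy = C(x), C(y)
        return (C(x + y) - min(cx, cy)) / max(cx, cy)

    def classify(x: str, train: list[tuple[str, str]], k: int = 3) -> str:
        """kNN over NCD: train is a list of (document, label) pairs."""
        neighbors = sorted(train, key=lambda pair: ncd(x, pair[0]))[:k]
        return Counter(label for _, label in neighbors).most_common(1)[0][0]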

Remarkably, this simple, intuitive method outperforms BERT (but not necessarily larger, more recent transformers) on a variety of zero-shot classification tasks!

  • I'll add that this only outperforms on out-of-distribution data and in instances where tokens overlap. There's no semantic capability here. It's a win for sure, but the title is misleading.

  • I wonder if you could get slightly better results by using zstd and taking advantage of zstd's support for "compression dictionaries" instead of simply concatenating the documents. Then compare the compressed size of a document with the compression dictionary versus without it. I know that zstd is able to achieve significantly higher compression ratios (at least at level 20+) than gzip, so whatever makes this work well with gzip (approximating Kolmogorov complexity?) might work better.

    • gzip/deflate also supports dictionaries[1], and in a roughly similar manner--you feed the encoder/decoder the dictionary data purely for the side-effect of updating internal state, without generating any output. But just as zstd supports much larger windows, it also supports much larger dictionaries.

      [1] There just aren't any good open source tools for creating that dictionary (but see https://blog.cloudflare.com/improving-compression-with-prese...), few examples of how to use the low-level library APIs to manage dictionaries (see, e.g., deflateSetDictionary at https://www.zlib.net/manual.html#Advanced), and common utilities don't expose this functionality, either.
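
      A rough sketch of the dictionary idea from the parent comment, using the python-zstandard bindings and a raw-content dictionary (whether this actually beats plain concatenation is untested speculation):

        import zstandard as zstd

        def csize(x: bytes, dictionary=None, level: int = 19) -> int:
            """Compressed size of x, optionally priming zstd with a raw-content dictionary."""
            if dictionary is None:
                cctx = zstd.ZstdCompressor(level=level)
            else:
                cctx = zstd.ZstdCompressor(level=level, dict_data=zstd.ZstdCompressionDict(dictionary))
            return len(cctx.compress(x))

        def dict_savings(x: bytes, candidate: bytes) -> int:
            """Bytes saved by using `candidate` as the dictionary; larger suggests more similarity."""
            return csize(x) - csize(x, dictionary=candidate)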

  • So the problem to solve is "which of x1 and x2 is x more similar to", then? It feels like this is not what LLMs are solving so it doesn't surprise me that you can outperform them.

    What if x1 is in English and x is the same document in Hebrew? Would an LLM fare better?

  • Nitpick: it's not zero-shot but few-shot - it still needs a training set with prototypes to go off of.

    • Yes, it needs prototypes (i.e., possible neighbors) -- but it doesn't need any kind of pretraining to learn to map samples to a feature space.

      It uses gzip instead of a pretrained model to map samples to the feature space.

  • Fascinating. Thank you for the explanation

    I wonder if the same can be done for images

    Been playing with images lately and outputting JPEGs. It's amazing how the exact same base pixels can generate so many different images, but the more noisy/random the images get, the bigger their JPEG file size is, and conversely, the more photo-like the image, the smaller the JPEG size.

    • I doubt it will work as well for image classification, because the statistical regularities in pixel values of two images depicting the same object can differ significantly. For example, the statistical regularities of pixel values depicting a black cat at night are more similar to those depicting a black dog at night than those depicting a white cat in the middle of the day, making it difficult to use the compressed statistical regularities to classify cats vs. dogs correctly.

For anyone interested in the equivalence between AI and compression, take a look at the Hutter Prize :) http://prize.hutter1.net/

Also worth a look is the Large Text Compression Benchmark http://mattmahoney.net/dc/text.html - currently the world's best compressor is a neural network made by ... the renowned Fabrice Bellard, creator of ffmpeg and QEMU!

And I really dig these pages' refreshingly appropriate text-only style!

  • Especially compression algos that use arithmetic coding, with interval weights adjusted based on a prediction of what is likely coming next, are very similar. They adjust the arithmetic coding (https://en.wikipedia.org/wiki/Arithmetic_coding) based on the context of the byte/bit to predict, so the more accurate the predicted continuation is, the more efficient the encoding is. The task is very similar to that of transformers like GPT. A perfect prediction has almost no additional storage cost, because the arithmetic interval barely shrinks and thus hardly any bits get stored - but you still have to count the size of the decompressor to get a fair benchmark.
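
    A toy illustration of that point (an adaptive order-0 byte model standing in for a real context model; the ideal coded size is just the sum of -log2 p over the symbols actually seen):

      import math
      from collections import defaultdict

      def ideal_bits(data: bytes) -> float:
          """Ideal arithmetic-coded size in bits under a Laplace-smoothed adaptive byte model."""
          counts = defaultdict(lambda: 1)   # every byte value starts with a pseudo-count of 1
          total = 256
          bits = 0.0
          for b in data:
              bits += -math.log2(counts[b] / total)  # cost of coding b given the model so far
              counts[b] += 1
              total += 1
          return bits

      print(ideal_bits(b"aaaaaaaaaaaaaaaa"))  # highly predictable -> few bits per symbol
      print(ideal_bits(bytes(range(16))))     # unpredictable under this model -> ~8 bits each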

    • How is it similar?

      There’s been a lot of study going the other direction - using neural networks to aid the entropy prediction in classical compression algorithms - but I’m not seeing the conceptual link between how transformer/attention models work internally and how gzip works internally beyond “similar words are easy to compress”

      I’m not seeing it because GPT representations are just vectors of fixed, not varying, size

  • When you dive deep into the math, many things are fundamentally the same. Superresolution is just glorified deconvolution. A single layer perceptron is a linear kernel SVM is a logistic regression. FFT is just factorisation.

  • Something important to note is that the authors use normalized compression distance (NCD). NCD is a way to approximate Kolmogorov complexity.

    This is a pretty old idea, see [1,2]. Old but still very useful, like perceptrons.

    [1] Li and Vitanyi. An Introduction to Kolmogorov Complexity and Its Applications.

    [2] Cilibrasi and Vitanyi. Clustering by compression. https://arxiv.org/pdf/cs/0312044

  • yes, absolutely!

    this kind of "compression" is essentially "understanding through theories" where theories are like those from physics

    Theories, in turn, are a lot like stories that explain many things with just the same "characters". In this context, a "character" is more like a concept; an atom, for example, would fit the bill.

I'd like to note that this is only stronger on news.

On Yahoo Answers it is not the top performer. It's not far-fetched to think that news articles are written in a similar way, sometimes even partly copied, and therefore have a lot of words in common. Yahoo Answers is a forum and I'd expect a greater variation of words there, but the words themselves have semantic similarity.

That is, gzip is strong when many words overlap (the size increase when gzipped is smaller), but when it comes to semantic similarity, DNNs win every day.

The results are interesting but not as interesting as it sounds IMO.

  • How do they work, then, such that semantic similarity would be any different? That's surely just a matter of grouping semantically similar 'representations' in training?

    • Yes, what I'm saying is that gzip does not perform as well when tokens don't overlap exactly.

      Gzip does not support a "semantic" mode, hence it won't and does not (according to the paper's metric) perform as well.

      Deep learning can capture these semantic similarities.

Very important to note that this is on out-of-distribution data (news in languages like 'Kinyarwanda, Kirundi, or Pinyin'). In a more normal setting, BERT still wins hands down.

Really nice to see that such a simple method can be very effective though. Still, people should not oversell this too much.

  • Yeah, this really should be emphasized more. Reading the headline I was absolutely amazed; this would be like stumbling upon evidence of a previously unknown, not-yet-described law of physics (or rather of linguistics, in this case).

    But given your quotation this is actually quite intuitive IMO. What is classification of texts in a completely unknown language? What would it be like to be asked to classify texts in Kirundi? You have no idea what they mean; the best you can do is find the frequency of some words (char sequences) and try to group texts with similar frequency fingerprints together. You would still have no clue what these texts actually mean, but it might (and it turns out that it does) get you somewhere better than random. Well, good news: that's exactly what gzip+KNN do; it's their bread and butter, it's literally the only thing they live for.

    Reading these texts (trying to understand them, predicting the next character) gets you pretty much nowhere. As a sensible human being, you wouldn't even try that, because it's just hopeless: you don't speak the language, so what more is there to say… Well, unfortunately, that's exactly what BERT does. It's the only thing it knows how to do. We can congratulate it on getting more use out of that than a typical (and not quite typical too, I suppose) human would.

That's actually very clever and intuitively understandable.

If you concatenate two similar pieces of text, they will compress better than if you concatenate two dissimilar pieces of text.

  • Yeah, this is a known (if obscure) technique. The main contribution here is the formalization and measurement.

To me this is more of a negative for DL based similarity than a win for that method.

With this whole LLM craze (and they are incredible) I think a lot of people just assume we made similar advancements on the Embedding layer for pure text similarity.

Thus, all this embeddings db bonanza. But as far as I can see there is close to no evidence for that.

  • https://twitter.com/eugeneyan/status/1678060204943097863

    >When Deepmind needs semantic retrieval, they just use the largest index on the planet.

    Fun fact: query-doc similarity was done via simple TF-IDF instead of vectors. It performed better than vector retrieval when the number of retrieved docs was > 45 (they used 50).

    https://blog.vespa.ai/improving-zero-shot-ranking-with-vespa...

    >This case illustrates that in-domain effectiveness does not necessarily transfer to an out-of-domain zero-shot application of the model. Generally, as observed on the BEIR dense leaderboard, dense embeddings models trained on NQ labels underperform the BM25 baseline across almost all BEIR datasets.
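
    For reference, the kind of TF-IDF query-doc scoring being alluded to is roughly this (scikit-learn used here for convenience; obviously not the actual retrieval stack from the tweet):

      from sklearn.feature_extraction.text import TfidfVectorizer
      from sklearn.metrics.pairwise import cosine_similarity

      docs = ["gzip and knn for text classification",
              "dense vector retrieval with transformers",
              "classical tf-idf ranking for search"]
      query = "compression-based text classification"

      vec = TfidfVectorizer()
      doc_matrix = vec.fit_transform(docs)                       # sparse doc-term matrix
      scores = cosine_similarity(vec.transform([query]), doc_matrix)[0]
      ranking = sorted(zip(scores, docs), reverse=True)          # best-matching docs first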

  • Could you answer a question, please? To make a text embedding with an LLM, the kind you would use for similarity metrics, which layer is used? The input layer? The input layer + positional encoding? A hidden layer? The output layer?

Compression algorithms are an economization/compression of space (bits and bytes). ML models, especially generative models, are an economization/compression of human expression and thought. Text classification is a type of compression over human expression. Is there perhaps some fundamental property of human language and data that can explain which one does better at ML tasks?

There may come a day when such a theory takes shape, and it may not be surprising that the two could be related (somehow) in some space where the encodings of compressed bits and bytes and of compressed human expression are closely related. Indeed, such a theory (entropy-based? physics-based?) might help people choose a compression algorithm over an ML one for certain types of compression over human expression.

Looking at the problem from a data-driven POV, what are the hard negatives that cause these algorithms to behave poorly? Maybe, for now, such a theory can only be approximated in terms of the data on the varied kinds of human text available. One example of this kind-of-text problem (among many): predicting mixtures using statistical topic models does well on academic text but has a hard time with internet text.

Is there someone out there working on such theories (besides Wolfram physics, which I know of)?

This makes perfect sense. Compression is about "understanding", that is, representing the input in a way in which it can be recognized and labeled. When the recognized bits are larger than the labels, voila, you get compression. I'm not surprised that gzip could be better at this task than a DNN.

  • I think compression is a subset of understanding. When your child starts to speak grammatically correctly, they have compressed all the language patterns they were exposed to into grammar rules.

    I say subset because understanding is more general. There might be a specific compression algorithm that performs well on floating-point numbers. In contrast, brains and ANNs might be able to compress any input pattern, just with worse performance.

I don’t see how gzip would handle words like “not” that flip the entire meaning of a sentence.

Anyone understand that?

  • As some of the comments on Twitter point out, this is for topic modelling. Negations might be less relevant there than in eg sentiment analysis.

Am I understanding correctly that the models are being tested on languages they've barely/never seen before?

How is it surprising that they would not be able to beat basic statistical techniques? That seems intuitive to me.

Also the title is misleading at best.

  • Yes, it only outperforms on out-of-distribution data, and even then it only outperforms BERT and not bigger transformers. It's a win for sure, but a very misleading title.

Very cool and intuitive; if x0 and x1 are in category A, while x2 is in category B, then the concatenation of x0 and x1 is more compressible than the concatenation of x0 and x2. It makes sense that this would do better than an out-of-distribution NN.

Simplicity aside, practicality is unclear. It seems you can’t escape the need to perform at least one compression operation, per class, per inference. Eg classifying X into one of 10 categories requires a minimum of 10 string compressions. Probably more if you want better accuracy.

One important thing not to do here, at any cost, is to compare GPTs with KNNs and conclude "GPTs or BERT are meh, like gzip".

Remember that GPTs, for all their flaws[1], are learning the joint distribution P(X,Y), which can then generate text (or image/audio), because knowing the joint is what allows generation and prediction. Certain prediction tasks alone can be done by discriminative models that learn P(Y|X) well, but these models generally have no clue (or much less of a clue) what human text actually is.

[1] Which is that they are, in the end, a highly efficient and generalized stochastic 16k* n-gram model on steroids over the human language on the web, with an RLHF-based lobotomy of the n-gram model's conditional distribution to make it say "non-offensive" things.

On out-of-distribution datasets, more specifically: Kinyarwanda news, Kirundi news, Filipino dengue, Swahili news, and Sogou news. On in-distribution datasets, BERT has better accuracy on every dataset. It would be better if "out-of-distribution" or OOD were added to the title.

A caveat buried in the abstract is that this beats BERT and non-pretrained Transformers. It looks like GPT-style models should still be better, but they naturally come at a higher computation cost.

How would gzip know "prohibit" is much closer to "ban" than to "allow"? It seems the other embeddings were just bad.

I am curious to know the results when using a text-specific compression method. Text-specific compression would more likely approximate Kolmogorov complexity for their use case.

I'm also curious whether a lossy text compression scheme would work better. Lossy compression schemes reduce data size by eliminating certain information and accepting some degree of loss of original data quality, similar to how JPEGs work for images. Examples include simplifying language (e.g., converting British English to American English), using synonyms, abbreviating words, or even removing certain parts of the text that are deemed less important.

Even if you want to use a compression algo for this, you probably want to use lz4 with optimal parsing; you don't need the entropy coding, and you don't need the checksum that something like gzip (or the more modern zstd) provides.

It sounds like things are about to get absurdly fast. Finding traditional algorithms that are equivalent to machine learning methods is huge.

Jaw-dropping... so essentially DNNs also just "compress" the information? Is that the takeaway here?

  • Why does this conclusion follow?

    Of course similar text compresses more efficiently, but NNs don't work with compressed (varying-size) representations; they work with vector representations which happen to be close in similarity space.

    • They work with compressed representations: you take arbitrary information with varying entropy and map it into a fixed-size vector representation; that's compression.

  • Well, yeah, but the training process means that the compression is both lossy and much less efficient than a standard compression method like gzip. You could even train your NN on its ability to losslessly recall, but we generally call that "overfitting" in the lingo.

    • The way you'd do compression using a NN, is using the NN to predict the probability of the next symbol, and feeding that into an arithmetic coder to produce a compressed representation. This process is lossless, and better prediction quality directly translates into better compression.
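
      In other words, the coded size is essentially the model's log-loss on the data. A sketch, with next_token_probability as a hypothetical stand-in for any trained predictor:

        import math

        def compressed_size_bits(tokens, next_token_probability) -> float:
            """Approximate number of bits an arithmetic coder would emit for `tokens`,
            given a model that returns P(token | preceding context)."""
            bits = 0.0
            for i, tok in enumerate(tokens):
                p = next_token_probability(tokens[:i], tok)
                bits += -math.log2(p)   # a confident, correct prediction costs ~0 bits
            return bits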

  • Yes, the biggest mindfuck is autoencoders: you literally brute-force train a lossy compressor.