
Comment by cs702

2 years ago

Direct link to paper: https://aclanthology.org/2023.findings-acl.426.pdf

Intuitively, the key idea is that if you have two documents, x1 and x2, and a new document x whose statistical regularities are more similar to x1's than to x2's, then len(compress(cat(x1,x))) - len(compress(x)) < len(compress(cat(x2,x))) - len(compress(x)), where "cat" is concatenation and "compress" is a compressor like gzip.

Quite literally, len(compress(cat(x1,x))) - len(compress(x)) is the number of additional bytes we need to compress the statistical regularities in x1 given the statistical regularities in x. The more similar the statistical regularities of x1 and x, the fewer additional bytes we need to compress cat(x1,x) compared to compressing only x.
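
As a quick sanity check of that intuition, here is my own sketch (not code from the paper; the example strings are made up):

```python
# Fewer extra bytes => x shares more statistical regularities with the candidate.
import gzip

def clen(s: str) -> int:
    """Compressed length in bytes."""
    return len(gzip.compress(s.encode("utf-8")))

def extra_bytes(candidate: str, x: str) -> int:
    # len(compress(cat(candidate, x))) - len(compress(x))
    return clen(candidate + x) - clen(x)

x1 = "gzip finds repeated substrings and encodes them with short references"
x2 = "the quarterback threw three touchdowns in the fourth quarter"
x  = "compressors exploit repeated substrings to shorten their output"

print(extra_bytes(x1, x), extra_bytes(x2, x))  # expect the first number to be smaller
```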

The authors run kNN over documents using a distance function called "normalized compression distance" (NCD), which is computed from compressed lengths and is based on the above idea. They also discuss how NCD relates to information, Shannon entropy, and Kolmogorov complexity.
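
A minimal sketch of that pipeline in Python, using the standard NCD formula with gzip and a plain kNN vote (the toy training set and the space separator in the concatenation are my own choices, not details from the paper):

```python
import gzip
from collections import Counter

def C(s: str) -> int:
    """Compressed length in bytes."""
    return len(gzip.compress(s.encode("utf-8")))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy, cxy = C(x), C(y), C(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def knn_classify(x: str, train: list[tuple[str, str]], k: int = 3) -> str:
    """Label x by majority vote among its k nearest (by NCD) training documents."""
    neighbors = sorted(train, key=lambda pair: ncd(x, pair[0]))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

# Toy usage with made-up documents:
train = [
    ("the striker scored a late goal in the final", "sports"),
    ("the match ended two to one after extra time", "sports"),
    ("the central bank raised interest rates again", "finance"),
    ("stocks fell as inflation data surprised markets", "finance"),
]
print(knn_classify("a last-minute goal decided the championship match", train))
```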

Remarkably, this simple, intuitive method outperforms BERT (but not necessarily larger, more recent transformers) on a variety of zero-shot classification tasks!

I wonder if you could get slightly better results by using zstd and taking advantage of its support for "compression dictionaries" instead of simply concatenating the documents: compare the compressed size of a document with the compression dictionary versus without it. I know that zstd achieves significantly higher compression ratios than gzip (at least at level 20+), so whatever makes this work well with gzip (approximating Kolmogorov complexity?) might work even better with zstd.
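
Here's a rough sketch of what I mean, using the Python zstandard bindings. Everything below (the class_docs layout, the dictionary size, the compression level) is a placeholder of my own, not something from the paper, and dictionary training needs a reasonably large sample set per class:

```python
import zstandard as zstd

def train_dict(samples: list[bytes], size: int = 16 * 1024) -> zstd.ZstdCompressionDict:
    # zstd dictionary training wants many samples; tiny toy inputs will raise an error
    return zstd.train_dictionary(size, samples)

def compressed_size(data: bytes, dict_data=None) -> int:
    cctx = zstd.ZstdCompressor(level=19, dict_data=dict_data)
    return len(cctx.compress(data))

def classify(x: bytes, class_docs: dict[str, list[bytes]]) -> str:
    """Pick the class whose trained dictionary saves the most bytes when compressing x."""
    baseline = compressed_size(x)
    savings = {
        label: baseline - compressed_size(x, train_dict(docs))
        for label, docs in class_docs.items()
    }
    return max(savings, key=savings.get)
```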

  • gzip/deflate also supports dictionaries[1], and in a roughly similar manner: you feed the encoder/decoder the dictionary data purely for the side effect of updating internal state, without generating any output. But just as zstd supports much larger windows, it also supports much larger dictionaries.

    [1] There just aren't any good open-source tools for creating that dictionary (but see https://blog.cloudflare.com/improving-compression-with-prese...), there are few examples of how to use the low-level library APIs to manage dictionaries (see, e.g., deflateSetDictionary at https://www.zlib.net/manual.html#Advanced), and common utilities don't expose this functionality either.
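
For what it's worth, Python's zlib does expose the deflate dictionary mechanism described above via its zdict argument (which corresponds to deflateSetDictionary under the hood). A minimal illustration with a made-up dictionary and payload:

```python
import zlib

# Made-up dictionary and payload, purely for illustration.
dictionary = b"normalized compression distance kolmogorov complexity gzip"
data = b"normalized compression distance approximates kolmogorov complexity"

plain = zlib.compressobj(level=9)
baseline = plain.compress(data) + plain.flush()

primed_compressor = zlib.compressobj(level=9, zdict=dictionary)
primed = primed_compressor.compress(data) + primed_compressor.flush()

print(len(baseline), len(primed))  # the preset dictionary should shave off some bytes

# The decompressor must be handed the same dictionary.
d = zlib.decompressobj(zdict=dictionary)
assert d.decompress(primed) == data
```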

So the problem to solve is "which of x1 and x2 is x more similar to", then? It feels like this is not what LLMs are solving, so it doesn't surprise me that you can outperform them.

What if x1 is in English and x is the same document in Hebrew? Would an LLM fare better?

Nitpick: it's not zero-shot but few-shot; it still needs a training set with prototypes to go off of.

  • Yes, it needs prototypes (i.e., possible neighbors), but it doesn't need any kind of pretraining to learn to map samples to a feature space.

    It uses gzip instead of a pretrained model to map samples to the feature space.

Fascinating. Thank you for the explanation

I wonder if the same can be done for images

Been playing with images lately and outputting JPEGs. It’s amazing how the exact same base pixels can generate so many different images, but the noisier and more random the images get, the bigger their JPEG file size is; conversely, the more photo-like the image, the smaller the JPEG.
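
A quick way to see that effect (my own sketch, assuming Pillow and NumPy; the image dimensions and quality setting are arbitrary):

```python
import io
import numpy as np
from PIL import Image

def jpeg_size(pixels: np.ndarray) -> int:
    """Encode an RGB array as JPEG in memory and return the byte count."""
    buf = io.BytesIO()
    Image.fromarray(pixels).save(buf, format="JPEG", quality=85)
    return buf.tell()

rng = np.random.default_rng(0)
noisy = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)  # pure noise

smooth = np.zeros((256, 256, 3), dtype=np.uint8)
smooth[..., 0] = np.arange(256, dtype=np.uint8)  # a gentle horizontal gradient

print(jpeg_size(noisy), jpeg_size(smooth))  # the noisy image should be far larger
```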

  • I doubt it will work as well for image classification, because the statistical regularities in the pixel values of two images depicting the same object can differ significantly. For example, the statistical regularities of pixel values depicting a black cat at night are more similar to those depicting a black dog at night than to those depicting a white cat in the middle of the day, making it difficult to use the compressed statistical regularities to classify cats vs. dogs correctly.

    • I’m sure not all algorithms work for everything, but if it can group images by some sort of meaning, then it can definitely do categorization

      In the case of text documents you might have similar situations, and the algorithm still works, even across languages

      I guess without trying, we’ll never know