Comment by skrebbel

2 years ago

Can anyone explain to me how a compression algorithm can beat an LLM at anything? Isn’t that like saying horses are better than graffiti?

I’m sure the answer is in there somewhere but I’m not well versed in AI and I simply can’t figure it out.

Generally, compression = model + entropy coding.

The model's job is to predict what comes next. The entropy coder's job is to encode the difference between the prediction and what actually comes next, so that the most likely outcome uses as few bits as possible. The more accurate the model, the smaller the gap between prediction and reality, the fewer bits the entropy coder needs, and the better the compression.

Simple compression algorithms have simple models, like "if I see the same byte 10 times, the 11th is likely to be the same". But you can also use an LLM as your model, since predicting the most likely next word is exactly what LLMs do.
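A rough sketch of that split (the toy models and probabilities below are made up for illustration, not taken from the article): the model assigns a probability to the next symbol, and an ideal entropy coder charges about -log2(p) bits for the symbol that actually occurs, so a sharper model means fewer total bits.

```python
import math

def total_bits(text, predict):
    """Ideal compressed size: an entropy coder pays about -log2(p) bits
    for a symbol the model predicted with probability p."""
    bits = 0.0
    for i, ch in enumerate(text):
        p = predict(text[:i], ch)  # model's probability for the actual next char
        bits += -math.log2(p)
    return bits

# A deliberately dumb model: every byte value is equally likely.
def uniform(context, ch):
    return 1 / 256

# A slightly smarter model, like the "repeat the last byte" rule above:
# bet heavily on the previous character occurring again.
def repeat_model(context, ch):
    if context and ch == context[-1]:
        return 0.5
    return 0.5 / 255  # spread the remaining mass over the other byte values

text = "aaaaaaaaaabbbbbbbbbb"
print(total_bits(text, uniform))       # 160 bits (8 bits per character)
print(total_bits(text, repeat_model))  # roughly 36 bits on this repetitive input
```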

Here they did the opposite: instead of using a model for compression, they used a few tricks to turn a compression algorithm into a model, where the most likely outcome is the one the compression algorithm encodes in the fewest bits. And the original authors showed that, on some tasks, the simple model that can be extracted from gzip beats much more complex LLMs.
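As I understand it, the trick in the gzip paper is roughly this: compressed lengths give you a similarity score (a normalized compression distance), and a k-nearest-neighbour vote over labelled examples does the classification. A minimal sketch with invented placeholder data:

```python
import gzip

def clen(s: str) -> int:
    """Length of the gzip-compressed string, in bytes."""
    return len(gzip.compress(s.encode()))

def ncd(a: str, b: str) -> float:
    """Normalized compression distance: small when gzip finds a lot of
    structure shared between a and b."""
    ca, cb, cab = clen(a), clen(b), clen(a + " " + b)
    return (cab - min(ca, cb)) / max(ca, cb)

def classify(snippet, labelled_examples, k=3):
    """k-nearest-neighbour vote using NCD as the distance."""
    ranked = sorted(labelled_examples, key=lambda ex: ncd(snippet, ex[0]))
    top_labels = [label for _, label in ranked[:k]]
    return max(set(top_labels), key=top_labels.count)

# Toy labelled set (placeholder data, not from the paper).
examples = [
    ("def add(a, b):\n    return a + b", "code"),
    ("for i in range(10):\n    print(i)", "code"),
    ("import os\nprint(os.getcwd())", "code"),
    ("the weather today is sunny and warm", "prose"),
    ("she walked slowly along the quiet beach", "prose"),
    ("he read the letter twice before replying", "prose"),
]
print(classify("while True:\n    print('hello')", examples))  # likely "code"
```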

  • I almost feel like compression and embeddings are duals of each other, but I can't quite articulate it.

    Embeddings use fixed-size vectors and try to maximize the dot product (i.e., minimize the distance) between vectors of similar inputs. Compressors use a variable-length encoding to minimize the overall stored size.

    • they are, in a way: both encode a representation of a larger amount of information in a smaller footprint.

  • Generally, compression algorithms try to give structured data a distribution more similar to that of random data.

    If any byte sequence were a valid compressed file (unlikely, mostly because compression formats try to be robust against corruption), this would be easy to reverse: just generate a random sequence of bytes and decompress it.

    Basically, you can turn a compression algorithm into a probability distribution by inserting random bytes wherever the decompression algorithm tries to read one; the caveat is that not every byte value is allowed at every position.

    You can then reason about this probability distribution and see what its properties are. Typically, an outcome with probability p will require about -log2(p) bits.

  • Why gzip, and not a more complex compression algorithm like 'xz'? Or, if you want it to be fast, 'zstd' or 'lz4'? Gzip is an odd choice: it has neither the highest compression ratio nor the fastest speed.

A language model estimates the probability of a sequence of words P(w_1, ..., w_n), or equivalently (via the chain rule) P(word | context).

For compression, word sequences that have higher probability should be encoded with shorter codes, so there is a direct relationship. A well-known method for constructing such codes from probabilities is Huffman coding.

This works whether you use a statistical language model based on word frequencies or an LLM to estimate probabilities. The better your language model (the lower its perplexity), the shorter the compressed output will be.
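For instance, here is a minimal Huffman sketch over a made-up next-word distribution (the words and probabilities are invented): likelier words get shorter codewords, and a sharper, lower-perplexity distribution gives a shorter expected code length.

```python
import heapq

def huffman_code(probs):
    """Build a prefix code from a {symbol: probability} dict:
    likelier symbols end up with shorter codewords."""
    # Heap entries: (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least likely subtrees...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (p1 + p2, counter, merged))  # ...get merged
        counter += 1
    return heap[0][2]

# Made-up "next word" distribution from some language model.
probs = {"the": 0.5, "a": 0.25, "cat": 0.15, "xylophone": 0.10}
code = huffman_code(probs)
print(code)  # {'the': '0', 'a': '10', 'xylophone': '110', 'cat': '111'}

# Expected bits per word; a sharper (better) model gives a smaller number.
print(sum(p * len(code[w]) for w, p in probs.items()))  # 1.75 bits/word here
```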

Conversely, you can probably argue that a compression algorithm implicitly defines a language model by the code lengths, e.g., it assumes duplicate strings are more likely than random noise.

The intuition about how the gzip method works goes like so:

If you compress `ABC`, it will be X bytes. If you then compress `ABCABC`, it will not take 2X bytes. The more similar the two strings that you concatenate, the fewer bytes it will take: `ABCABD` will take more than `ABCABC`, but less than `ABCXYZ`.
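You can check this directly with Python's gzip module; the exact byte counts depend on the compressor and its settings, but the ordering should come out as described:

```python
import gzip

def size(s: str) -> int:
    return len(gzip.compress(s.encode()))

base = "the quick brown fox jumps over the lazy dog"
print(size(base))                    # X bytes
print(size(base + base))             # well under 2*X: the repeat is nearly free
print(size(base + base[:-1] + "s"))  # about the same or a byte more: one character differs
print(size(base + "completely unrelated words about gzip and BERT"))  # bigger still
```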

BERT is, by today's standards, a very small LLM, which we know has weaker performance than the billion-parameter-scale models most of us are interacting with today.

  • > very small LLM

    Heh. So does that make it an MLM (medium)?

    I've always found it funny that we've settled on a term for a class of models that has a size claim... Especially given how fast things are evolving...

It's a very limited task: take a document and classify it into one of (for example) 10 or so categories. Simple things like detecting certain words can do pretty well in some cases. And texts that compress well together share common substrings.

LLMs, and essentially all neural networks, can be viewed as learned compression algorithms, where the compression behavior is learned and potentially subject to constraints beyond mere file reconstruction.

Highly recommend reading Ted Chiang's "ChatGPT Is a Blurry JPEG of the Web"[0] to get a better sense of this.

Keeping this fact in your mental model of neural networks can also go a long way toward demystifying them.

0. https://www.newyorker.com/tech/annals-of-technology/chatgpt-...

  • (The human brain is also, in part, a blurry JPEG of the world.)

    • We currently have no reason to believe this, and the information we do have suggests it is very unlikely to be the case. I'm also guessing that, from my username, you can infer I don't think we even know enough to concretely say what this "world" you are referencing is.

Many other replies here are wrong: the primary reason is that the LLMs were used on completely out-of-distribution data (e.g. trained on English, evaluated on a completely different language that shares some characters). The points about compression's relatedness to understanding are valid, but they are not the primary reason for the LLMs underperforming relative to naive compression.

The other reply is great; more in-depth details from me here: https://news.ycombinator.com/item?id=36758681

Plain-English TL;DR:

- if you limit your task to binning snippets of text

- and the snippets are very well-defined (ex. code vs. Filipino text)

- the snippets' _bytes_ can be compared and score well, with no text understanding needed

- the size delta of a GZIP after adding one more sentence acts as an ersatz way to compare sets of bytes to each other (ex. you can imagine a GZIP containing 0xFFABCDEF that has 0xFFABCDEF added to it will have a size delta of roughly 0)

  • Did you read the recent Douglas Hofstadter article or do you just always use the word ersatz?

    • I was homeschooled till 12 and mostly left to my own devices as long as I was reading - I believe that has caused a lifelong issue where I unintentionally sound like a tryhard :( (TL;DR: I use the word but IDK when I picked it up; I didn't see the Hofstadter article, but now I'm looking forward to it)
