Comment by GuB-42

2 years ago

Generally, compression = model + entropy coding.

The model's job is to predict what comes next. The entropy coder's job is to encode the difference between the prediction and what actually comes next, so that the most likely outcome uses as few bits as possible. The more accurate the model, the smaller the difference between prediction and reality, the fewer bits the entropy coder needs, and the better the compression.
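To make the "few bits for likely outcomes" idea concrete, here is a minimal sketch (not any real codec) of the cost an ideal entropy coder pays under a simple order-0 model, where a symbol of probability p costs about -log2(p) bits:

    import math
    from collections import Counter

    def ideal_code_length(data: bytes) -> float:
        # Ideal compressed size in bits under an order-0 model: an ideal
        # entropy coder spends about -log2(p) bits on a symbol of probability p.
        counts = Counter(data)
        total = len(data)
        return sum(-math.log2(counts[b] / total) for b in data)

    print(ideal_code_length(b"aaaaaaaaab"))  # ~4.7 bits, vs 80 bits raw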

Simple compression algorithms have simple models, like "if I see the same byte 10 times, the 11th is likely to be the same". But you can also use an LLM as your model, since completing text with the most likely word is exactly what LLMs do.
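A toy version of that "repeated byte" model, just to make the model/coder split concrete (the 0.9 is an arbitrary choice, not taken from any real compressor):

    import math

    def repeat_model_bits(data: bytes, p_repeat: float = 0.9) -> float:
        # Toy model: predict that the previous byte repeats with probability
        # p_repeat, spreading the remainder uniformly over the other 255 values.
        bits = 8.0  # first byte: no prediction yet, charge a full 8 bits
        for prev, cur in zip(data, data[1:]):
            p = p_repeat if cur == prev else (1 - p_repeat) / 255
            bits += -math.log2(p)
        return bits

    print(repeat_model_bits(b"a" * 100))         # long run: ~23 bits
    print(repeat_model_bits(bytes(range(100))))  # no runs: ~1128 bits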

Here they did the opposite. Instead of using a model for compression, they used, with a few tricks, a compression algorithm as a model: the most likely outcome is the one the compression algorithm encodes in the fewest bits. And the original authors have shown that, on some tasks, the simple model that can be extracted out of gzip beats much more complex LLMs.
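If I understood the paper correctly, the trick is roughly a nearest-neighbour classifier over normalized compression distance; a minimal sketch with gzip (the example strings are mine):

    import gzip

    def C(x: bytes) -> int:
        return len(gzip.compress(x))

    def ncd(a: bytes, b: bytes) -> float:
        # Normalized compression distance: small when a and b share
        # structure, because gzip can reuse matches across the boundary.
        return (C(a + b) - min(C(a), C(b))) / max(C(a), C(b))

    labeled = [
        (b"the cat sat on the mat", "english"),
        (b"le chat est assis sur le tapis", "french"),
    ]
    query = b"the dog sat on the log"
    # Nearest labeled example under NCD; gzip is noisy on strings this
    # short, so treat the result as illustrative.
    print(min(labeled, key=lambda ex: ncd(query, ex[0]))[1])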

I almost feel like compression and embeddings are duals of each other, but I can't quite articulate it.

Embeddings use fixed-size vectors, trained so that similar inputs end up with similar vectors (high dot product, low distance). Compressors use a variable-length encoding to minimize the overall stored size.

  • They are, in a way: both encode a representation of a larger amount of information in a smaller footprint.

Generally, compression algorithms try to give structured data a distribution closer to that of random data.

If every byte sequence were a valid compressed file (unlikely in practice, mostly because compression formats try to be robust against corruption), this would be easy to reverse: just generate a random sequence of bytes and decompress it.

Basically, you can turn a compression algorithm into a probability distribution by feeding the decompressor a random byte wherever it tries to read one; the catch is that not every byte is allowed at every point.

You can then reason about this probability distribution and see what its properties are. Typically, something with probability p will require -log2(p) bits.
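You can even read those implied probabilities off a real compressor, very roughly, by looking at how much the compressed size grows when you append a symbol (gzip's headers and block structure make this noisy, so this is only a sketch):

    import gzip

    def bits(x: bytes) -> int:
        return 8 * len(gzip.compress(x))

    # Implied probability of `nxt` following `ctx`:
    # p(nxt | ctx) ~= 2 ** -(bits(ctx + nxt) - bits(ctx))
    ctx = b"ab" * 1000
    for nxt in (b"ab", b"zq"):
        cost = bits(ctx + nxt) - bits(ctx)
        print(nxt, cost, 2.0 ** -cost)  # continuing the pattern costs fewer bits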

Why gzip, and not a more complex compression algorithm like 'xz'? Or, if you want it to be fast, 'zstd' or 'lz4'? Gzip is an odd choice: it offers neither the highest compression ratio nor the best speed.
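For what it's worth, the standard library makes the ratio/speed comparison easy to run yourself (the input path is just an example; substitute any sizable file):

    import bz2, gzip, lzma, time

    data = open("/usr/share/dict/words", "rb").read()  # any sizable file

    for name, compress in [("gzip", gzip.compress),
                           ("bz2", bz2.compress),
                           ("xz", lzma.compress)]:
        t0 = time.perf_counter()
        out = compress(data)
        print(f"{name}: {len(out)} bytes in {time.perf_counter() - t0:.3f}s")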