Comment by goodside

6 years ago

Not to diminish what a cool idea this is, but isn’t it cheating to not count the size of the GPT2 parameters as part of the final compression ratio?

Assuming the decompressor already has GPT2 weights is analogous to assuming it has a massive fixed dictionary of English words and phrases and doing code substitution — it’s likely the pragmatic answer in some scenarios, but it’s not a fair basis for comparison. Real-world compressors do use dictionary coders, but they build the dictionary specifically for the data being compressed and then count that dictionary in the compressed size. For competitions like the Hutter Compression Prize (1GB of English Wikipedia), the reported size includes the complete binary of the decompressor program too.
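
To make the accounting concrete, here’s a toy sketch in Python (purely hypothetical: the function name, the 8-entry dictionary, and the sample text are all made up, and this is nowhere near a real coder’s bit-level format) of a dictionary coder that builds its dictionary from the data and counts the serialized dictionary as part of the compressed size:

    from collections import Counter

    def dict_compress(text, n_entries=8):
        # Build the dictionary from the data being compressed: the most common
        # words get one-byte codes (bytes 0..n_entries-1 never appear in ASCII text).
        words = text.split()
        dictionary = [w for w, _ in Counter(words).most_common(n_entries)]
        codes = {w: bytes([i]) for i, w in enumerate(dictionary)}
        body = b" ".join(codes.get(w, w.encode()) for w in words)
        shipped_dict = "\n".join(dictionary).encode()
        # Fair accounting counts the dictionary shipped alongside the body;
        # the "cheating" accounting pretends the receiver already has it.
        return len(shipped_dict) + len(body), len(body)

    sample = "the cat sat on the mat and the dog sat on the mat " * 40
    fair, cheating = dict_compress(sample)
    print(f"original {len(sample)} B, fair {fair} B, 'cheating' {cheating} B")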

GPT2 model weights require over 5GB of storage, so you’d need a corpus orders of magnitude larger for it to be even close to competitive by that standard. And it appears it would lose anyway — the OP claims ~15% ratio even with “cheating”, and the current Hutter Prize winner for 1GB of enwiki is ~11% without “cheating”.

Static dictionaries or models in compression algorithms are not “cheating”. Brotli, for example, achieves amazing results with its [static dictionary](https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e).

However, I agree with you on the real-world uselessness of a GPT-based compression algorithm.

  • That’s why I put “cheating” in quotes — it’s pragmatic, but it complicates the comparison into something that can’t be measured in a single number. I grant you that typical benchmarks ignore the static dictionary in comparing Brotli to other compressors, but they also ignore the size of the binary itself. This is because both are assumed to be small and highly general, and GPT2 violates both assumptions. Brotli’s dictionary is 122 KB and covers many natural and programming languages, whereas GPT2 weights are 5 GB and only cover English. No real-world static dictionary is even a thousandth of that size.

    Large static dictionaries exploit a loophole that would make comparisons meaningless if carried to the extreme — you could trivially include the entire benchmark corpus in the decompressor itself and claim your compressed file size is 0 bytes. That’s why the Hutter Prize rules are what they are.

    • The parameters are a 'signature' of previous texts in many ways. Perhaps this is essentially a form of authentication?

Yes, but you can compress arbitrarily many texts of arbitrary size with that 5GB fixed cost. As the number of total bytes compressed goes to infinity, that vanishes in the numerator.
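
Concretely: if M is the fixed model size and r the per-byte ratio, the effective ratio (M + rN)/N tends to r as the total corpus size N grows. A back-of-the-envelope sketch using the numbers floated in this thread (5 GB of weights, ~15% ratio, both taken on faith rather than measured):

    # M = fixed cost of shipping the model, r = per-byte ratio without it
    M = 5 * 10**9    # ~5 GB of GPT2 weights
    r = 0.15         # the claimed ~15% ratio when the weights aren't counted

    for gb in (1, 10, 100, 1000):
        n = gb * 10**9
        print(f"{gb:>4} GB corpus -> effective ratio {(M + r * n) / n:.1%}")
    # 1 GB: 515.0% (worse than no compression at all); 1000 GB: 15.5%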

  • In theory, yes, but a constant 5 GB penalty is enormous in practice — orders of magnitude bigger than anything used in the real world. Brotli’s static dictionary is only 122 KB, and covers many natural and programming languages beyond just English.

    • > a constant 5 GB penalty is enormous in practice
      > orders of magnitude bigger than anything used in the real world

          [root@archlinux mail]# pwd
          /mail
          [root@archlinux mail]# du -skh .
          68G     .
      

      And this is a tiny personal mailserver. There are loads of applications where a 5 GB penalty is well below the amount of text you're looking at (Wikipedia springs to mind, since it's in the same kind of size range for text).

    • Not to mention how inefficient a 5 GB dictionary is on ordinary hardware. Poor cache behavior. :(

This is exactly what I was thinking. It seems like he’s just taking advantage of the fact that CJK characters are a lot more information-dense than English letters, and has a turnkey encoder/decoder in the form of the pre-trained GPT-2 language model.

I am, in fact, currently involved in research that uses GPT-2 for format-transforming encryption, and we follow the exact same recipe, but in reverse. Encrypt a message, then "decompress" the random bits into English text using GPT-2. Assuming the receiver has the same GPT-2 state (model parameters and prompt), they will reproduce the same random bits via arithmetic coding, which can then be decrypted with the shared key.
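
For anyone unfamiliar with the mechanism, here’s a minimal sketch of language-model-driven arithmetic coding, with a hypothetical toy bigram table standing in for GPT-2 and exact fractions standing in for a real fixed-precision coder. It shows the text → bits → text round trip (the compression direction); the encryption recipe runs the same interval machinery in the bits → text direction first.

    from fractions import Fraction
    from math import ceil

    # Hypothetical toy bigram "model": P(next token | previous token).
    # In the real scheme this table is replaced by GPT-2's output at each step.
    MODEL = {
        None:  {"the": Fraction(1, 2), "a": Fraction(1, 2)},
        "the": {"cat": Fraction(3, 4), "dog": Fraction(1, 4)},
        "a":   {"cat": Fraction(1, 2), "dog": Fraction(1, 2)},
        "cat": {"sat": Fraction(2, 3), "ran": Fraction(1, 3)},
        "dog": {"sat": Fraction(1, 2), "ran": Fraction(1, 2)},
    }

    def compress(tokens):
        # Narrow [0, 1) by each token's conditional probability, then emit
        # just enough bits to name a point inside the final interval.
        low, high, prev = Fraction(0), Fraction(1), None
        for tok in tokens:
            cum = Fraction(0)
            for t, p in MODEL[prev].items():
                width = (high - low) * p
                if t == tok:
                    low, high = low + cum, low + cum + width
                    break
                cum += width
            prev = tok
        n = 0
        while Fraction(1, 2 ** n) > high - low:  # enough bits that some k/2^n fits
            n += 1
        k = ceil(low * 2 ** n)                   # smallest n-bit point >= low (and < high)
        return format(k, f"0{n}b")

    def decompress(bits, n_tokens):
        # Reverse: treat the bits as a binary fraction and repeatedly pick the
        # token whose probability subinterval contains it.
        x = Fraction(int(bits, 2), 2 ** len(bits))
        low, high, prev, out = Fraction(0), Fraction(1), None, []
        for _ in range(n_tokens):
            cum = Fraction(0)
            for t, p in MODEL[prev].items():
                width = (high - low) * p
                if low + cum <= x < low + cum + width:
                    low, high, prev = low + cum, low + cum + width, t
                    out.append(t)
                    break
                cum += width
        return out

    msg = ["a", "dog", "ran"]
    bits = compress(msg)                         # "111": 3 bits, since P(msg) = 1/8
    assert decompress(bits, len(msg)) == msg
    print(msg, "->", bits)

The more probable the model finds each token, the wider the subinterval it keeps and the fewer bits the final interval costs, which is both why a strong language model compresses English well and why feeding it uniformly random bits produces plausible-looking text.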

  • At some point, isn't this just Shannon coding that takes advantage of higher-order probabilities?

No, having an arbitrarily complex dictionary or compressor is not counted as cheating. The model is basically that you are allowed to grab as many hard drives as you want before going out into the wilderness. From then on, all your news arrives over a low-bandwidth carrier pigeon, and you have to decompress the transmissions with what you remembered to bring.

  • Counted by whom? What benchmark follows the model you’re describing? Does any real-world compressor use dictionaries anywhere near this big?

    If you can bring the complete benchmark corpus (or substantial subsets of it) “into the wilderness”, the benchmark isn’t worth running. It’s not a compressor, it’s a database with stable keys. A Library of Congress Control Number uniquely identifies the complete text of a published book, but it doesn’t contain a compressed copy of that book.