Comment by Jaxkr
6 years ago
Static dictionaries or models in compression algorithms are not “cheating”. Brotli, for example, achieves amazing results with its [static dictionary](https://gist.github.com/klauspost/2900d5ba6f9b65d69c8e).
However, I agree with you on the real-world uselessness of a GPT-based compression algorithm.
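For a feel of what that dictionary buys, here is a minimal sketch comparing Brotli against dictionary-less DEFLATE on a short English string. It assumes the `brotli` pip package (the sample string and exact byte counts are illustrative; results will vary):

```python
# Brotli ships a ~122 KB static dictionary of common English and code
# fragments, which is most visible on short inputs where DEFLATE has
# nothing to back-reference. Requires: pip install brotli
import zlib
import brotli

text = (b"The quick brown fox jumps over the lazy dog while the government "
        b"announced a new international development program.")

print("raw:   ", len(text))
print("zlib:  ", len(zlib.compress(text, 9)))          # DEFLATE, no dictionary
print("brotli:", len(brotli.compress(text, quality=11)))  # uses the static dictionary
```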
That’s why I put “cheating” in quotes: it’s pragmatic, but it turns the comparison into something that can’t be measured with a single number. I grant that typical benchmarks ignore the static dictionary when comparing Brotli to other compressors, but they also ignore the size of the compressor binary itself, because both are assumed to be small and highly general. GPT-2 violates both assumptions: Brotli’s dictionary is 122 KB and covers many natural and programming languages, whereas the GPT-2 weights are 5 GB and cover only English. No real-world static dictionary is even a thousandth of that size.
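To make the trade-off concrete, here is a sketch of how one would estimate the size a GPT-2-based compressor could reach: an ideal arithmetic coder gets within a couple of bits of the model’s total negative log-likelihood of the text. It assumes the Hugging Face `transformers` package and the small 124M-parameter GPT-2 (the sample text is illustrative):

```python
# Compression is prediction: the bits needed to encode a text under a
# language model equal its negative log-likelihood in bits, up to a tiny
# coder overhead. Requires: pip install torch transformers
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

text = "Static dictionaries or models in compression algorithms are not cheating."
ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    # With labels=input_ids the model returns the mean cross-entropy
    # (in nats) over the predicted tokens.
    loss = model(ids, labels=ids).loss

n_predicted = ids.shape[1] - 1          # the first token has no prediction
bits = loss.item() * n_predicted / math.log(2)
print(f"{len(text.encode())} bytes of text -> ~{bits / 8:.1f} bytes under GPT-2")
```

Of course, the decoder needs the same 5 GB of weights to reverse this, which is the whole point of the comparison above.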
Large static dictionaries exploit a loophole that would make comparisons meaningless if carried to the extreme: you could trivially include the entire benchmark corpus in the decompressor itself and claim a compressed file size of 0 bytes. That’s why the Hutter Prize counts the size of the decompressor against the compressed output.
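The degenerate case is easy to write down. A minimal sketch (the `enwik9` corpus path is hypothetical; this is exactly what the Hutter Prize rules exclude):

```python
# A "compressor" that exploits the loophole: the entire benchmark corpus
# is baked into the decompressor, so every compressed file is 0 bytes.
# Counting the decompressor's size against the result closes this hole.
CORPUS = open("enwik9", "rb").read()  # hypothetical benchmark corpus file

def compress(data: bytes) -> bytes:
    assert data == CORPUS, "this 'compressor' only handles the benchmark file"
    return b""

def decompress(blob: bytes) -> bytes:
    return CORPUS
```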
In many ways the parameters are a 'signature' of the texts the model was trained on. Perhaps this is essentially a form of authentication?