Comment by Ari_Rahikkala

2 years ago

> Models like ChatGPT aren’t eligible for the Hutter Prize for a variety of reasons, one of which is that they don’t reconstruct the original text precisely—i.e., they don’t perform lossless compression.

Small nit: The lossiness is not a problem at all. Entropy coding turns an imperfect, lossy predictor into a lossless data compressor, and the better the predictor, the better the compression ratio. All Hutter Prize contestants anywhere near the top use it. The connection at a mathematical level is direct and straightforward enough that "bits per byte" is a common number used in benchmarking language models, despite the fact that they are generally not intended to be used for data compression.
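To make that connection concrete, here is a minimal Python sketch of the accounting an entropy coder does: each symbol costs roughly -log2(p) bits under the model's prediction, so the total compressed size is just the model's cross-entropy on the data. The toy adaptive byte-counting model below stands in for a real predictor (the serious entrants use neural nets); nothing here is specific to the Hutter Prize setup.

```python
import math

def ideal_compressed_bits(data: bytes) -> float:
    """Code length an entropy coder would achieve with this predictor."""
    counts = [1] * 256           # Laplace-smoothed byte frequencies: the "model"
    total = 256
    bits = 0.0
    for b in data:
        p = counts[b] / total    # model's predicted probability of the next byte
        bits += -math.log2(p)    # entropy coder spends about -log2(p) bits on it
        counts[b] += 1           # adapt; the decoder repeats the exact same updates
        total += 1
    return bits

data = b"the better the predictor, the better the compression ratio " * 20
bits = ideal_compressed_bits(data)
print(f"{bits / 8:.0f} bytes vs. {len(data)} raw, {bits / len(data):.3f} bits per byte")
```

A real compressor swaps the counting model for the language model's next-token distribution and feeds the same probabilities into an arithmetic or range coder, which actually emits the bits; better predictions mean fewer bits.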

The practical reason why a ChatGPT-based system won't be competing for the Hutter Prize is simply that it's a contest about compressing a 1GB file, with the size of the decompressor counted in the total, and GPT-3's weights are both proprietary and take up hundreds of times more space than that.

Fabrice Bellard has a project that does precisely this, and apparently does it extremely well. Previously on HN: http://www.mattmahoney.net/dc/text.html. Not sure why it isn't eligible for the Hutter Prize; there's some speculation in the previous discussion, but I don't know whether it's true.

  • Thank you! Turns out that GPT does in fact perform lossless compression if you want it to, like in this demo.

    • The main issue is that most ML frameworks aren't reliably reproducible: for lossless compression the decoder has to recompute exactly the same probabilities as the encoder, or the entropy-coded stream decodes to garbage. They simply aren't designed for that use case (see the sketch below).

      Bellard's solution was to code up his own neural network library in C.
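      As a rough sketch of the determinism requirement (the quantization scheme here is illustrative, not Bellard's actual code): predictions are typically rounded to integer frequencies before entropy coding, so that the decoder, rerunning the same deterministic model, builds bit-identical tables.

      ```python
      def quantize_probs(probs, scale=1 << 14):
          """Map float probabilities to integer frequencies summing exactly to scale."""
          freqs = [max(1, int(p * scale)) for p in probs]   # keep every symbol codable
          # Absorb rounding drift so encoder and decoder agree on the exact total.
          freqs[freqs.index(max(freqs))] += scale - sum(freqs)
          return freqs

      # Illustrative only, not from NNCP/LibNC. Both sides call this on the model's
      # output; as long as the model itself is deterministic, the range coder sees
      # identical tables on encode and decode, and decoding is exact.
      print(quantize_probs([0.7, 0.2, 0.1]))   # e.g. [11470, 3276, 1638]
      ```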