Comment by grohan

2 days ago

Bellard has trained various models, so it may not be the specific 169M-parameter LLM, but his Transformer-based `nncp` is indeed #1 on the "Large Text Compression Benchmark" [1], which correctly accounts for the total size of compressed enwik9 plus the decompressor size (zipped).

There is no unfair advantage here. This was also achieved in the 2019-2021 period; it seems safe to say that Bellard could have pushed the frontier considerably further with modern compute and techniques.

[1] https://www.mattmahoney.net/dc/text.html

Okay, that's a much better claim. nncp has sizes of 15.5 MB and 107 MB including the decompressor. The one that's linked, ts_zip, has sizes of 13.8 MB and 135 MB excluding the decompressor, and it's from 2023-2024.
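To make the comparison concrete, here's a minimal sketch of the benchmark's ranking metric as described above: compressed enwik9 size plus the zipped decompressor size, where an entry quoted "excluding the decompressor" omits the second term and so can't be directly compared. The function name and the example byte figures are illustrative, not the benchmark's exact numbers.

```python
def benchmark_total(compressed_bytes, zipped_decompressor_bytes=0):
    """Large Text Compression Benchmark metric (as described above):
    compressed enwik9 size plus the zipped decompressor size.
    A figure quoted 'excluding the decompressor' leaves the
    second term at zero and thus understates the true total."""
    return compressed_bytes + zipped_decompressor_bytes

# Illustrative only: a 107 MB figure that already includes the
# decompressor is the full metric; a 135 MB figure that excludes
# it would still need the decompressor size added on top.
including = benchmark_total(107_000_000)            # full metric
excluding = benchmark_total(135_000_000)            # understated
print(including, excluding)
```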