Comment by Dylan16807
2 days ago
> Presciently, Hutter appears to be absolutely right. His enwik8 and enwik9’s benchmark datasets are, today, best compressed by a 169M parameter LLM
Okay, that's not fair. There's a big advantage to having an external compressor and reference file whose bytes aren't counted, whether or not your compressor models knowledge.
More importantly, even with that advantage it only wins on the much smaller enwik8. It loses pretty badly on enwik9.
Bellard has trained various models, so it may not be that specific 169M parameter LLM, but his Transformer-based `nncp` is indeed #1 on the "Large Text Compression Benchmark" [1], which correctly accounts for the total size of compressed enwik9 plus the decompressor size (zipped); that accounting is sketched below.
There is no unfair advantage here. This was also achieved in the 2019-2021 period; it seems safe to say that Bellard could likely have pushed the frontier much further with modern compute/techniques.
[1] https://www.mattmahoney.net/dc/text.html
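For concreteness, a minimal sketch of that scoring (the file paths are hypothetical; this is not the benchmark's actual script):

```python
# Rough sketch of how the Large Text Compression Benchmark ranks entries:
# the metric is the size of the compressed enwik9 output plus the size of
# the zipped decompressor program needed to reproduce it.
import os

def benchmark_total(compressed_path: str, zipped_decompressor_path: str) -> int:
    """Ranking metric: compressed output size + zipped decompressor size, in bytes."""
    return os.path.getsize(compressed_path) + os.path.getsize(zipped_decompressor_path)

# Hypothetical usage; nncp's published total on enwik9 comes out around 107 MB by this measure.
# benchmark_total("enwik9.nncp", "nncp_decompressor.zip")
```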
Okay, that's a much better claim. nncp has sizes of 15.5 MB and 107 MB (enwik8 and enwik9 respectively) including the decompressor. The one that's linked, ts_zip, has sizes of 13.8 MB and 135 MB excluding the decompressor. And it's from 2023-2024.
Any manually designed algorithm is external to the compressed data while also being a model for it; it's just designed by hand rather than found by automatic optimization. I'd say the line is pretty blurred here.
It is also wrong because the current state-of-the-art Hutter Prize entry gets enwik9 down to about 110 MB, and that figure includes the actual compression and decompression logic.
Yep, this is like taking a file, saving a different, empty file whose name is the base64-encoded contents of the first, and claiming you compressed it down by 100%.
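To make the analogy concrete, here's a toy sketch (a hypothetical helper, for illustration only): the file body ends up empty, but the data has just moved into the filename, which an honest accounting would have to count.

```python
# Toy "compressor" that hides the data in the filename of an empty file.
# The body is 0 bytes ("100% compression"), but the name alone is longer
# than the original input, so counting every byte gives negative savings.
import base64
import os

def fake_compress(data: bytes, out_dir: str = ".") -> str:
    name = base64.urlsafe_b64encode(data).decode()  # data smuggled into the name
    path = os.path.join(out_dir, name)
    open(path, "wb").close()                        # empty file on disk
    return path

path = fake_compress(b"Hello, Hutter Prize")
print(os.path.getsize(path))           # 0 bytes of content...
print(len(os.path.basename(path)))     # ...but 28 bytes of filename for a 19-byte input
```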
> Okay, that's not fair. There's a big advantage to having an external compressor and reference file whose bytes aren't counted, whether or not your compressor models knowledge.
The benchmark in question (Hutter Prize) does count the size of the decompressor/reference file (as per the rules, the compressor is supposed to produce a self-decompressing file).
The article mentions Bellard's work, but I don't see his name among the prize's top contenders, so I'm guessing his attempt wasn't competitive enough once you take the LLM size into account, as the rules require.
The benchmark counts it, but the LLM compressor linked in that sentence clearly doesn't count that size.