Comment by eigenvalue

3 years ago

I wonder if you could get slightly better results by using zstd and taking advantage of its support for "compression dictionaries" instead of simply concatenating the documents: compare the compressed size of a document with the compression dictionary versus without it. I know that zstd is able to achieve significantly higher compression ratios than gzip (at least at level 20+), so whatever makes this work well with gzip (approximating Kolmogorov complexity?) might work even better with zstd.
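To make the comparison concrete, here is a rough sketch of the "with dictionary versus without" measurement. It is not zstd: to stay inside the Python standard library it uses zlib's preset dictionary (`zdict`) as a stand-in, and the helper names `compressed_size` and `dict_gain` are made up for illustration. The idea carries over to a zstd dictionary, but the numbers will differ.

```python
import zlib


def compressed_size(data: bytes, zdict: bytes = b"") -> int:
    """Bytes produced by deflate for `data`, optionally primed with a preset dictionary."""
    if zdict:
        # Assumption: zlib's zdict stands in for a zstd dictionary here.
        # Deflate only looks back 32 KB, so only the tail of a large dictionary is used.
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return len(comp.compress(data) + comp.flush())


def dict_gain(doc: bytes, reference: bytes) -> int:
    """Bytes saved on `doc` by using `reference` as the dictionary (can be negative)."""
    return compressed_size(doc) - compressed_size(doc, zdict=reference)


reference = b"zstd and gzip both support preset compression dictionaries for small documents"
related = b"gzip and zstd both support preset compression dictionaries"
unrelated = b"1234567890 QWERTY UIOP ASDF GHJKL ZXCVBNM 0987654321"

# A larger gain suggests the document shares more substrings with the reference.
print(dict_gain(related, reference), dict_gain(unrelated, reference))
```

A classifier along these lines would build one dictionary per class and assign a document to the class whose dictionary saves the most bytes. Whether that beats concatenation is exactly the open question.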

wahern 3 years ago

gzip/deflate also supports dictionaries[1], and in a roughly similar manner--you feed the encoder/decoder the dictionary data purely for the side-effect of updating internal state, without generating any output. But just as zstd supports much larger windows, it also supports much larger dictionaries.

[1] There just aren't any good open source tools for creating that dictionary (but see https://blog.cloudflare.com/improving-compression-with-prese...), there are few examples of how to use the low-level library APIs to manage dictionaries (see, e.g., deflateSetDictionary at https://www.zlib.net/manual.html#Advanced), and common utilities don't expose this functionality, either.
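As one small example of the dictionary API in use: Python's standard `zlib` module exposes zlib's preset dictionary support through the `zdict` argument of `compressobj` and `decompressobj`. The sketch below is only a round trip under that assumption. The sample `preset` and `message` values and the two helper names are invented for illustration, and the dictionary is written by hand instead of being generated by a tool.

```python
import zlib

# Hand-written dictionary: byte strings expected to recur in the messages.
preset = b'{"user": "", "status": "ok", "items": []}'
message = b'{"user": "alice", "status": "ok", "items": [1, 2, 3]}'


def deflate_with_dict(data: bytes, zdict: bytes) -> bytes:
    """Compress `data` with the encoder state primed by `zdict`."""
    enc = zlib.compressobj(level=9, zdict=zdict)
    return enc.compress(data) + enc.flush()


def inflate_with_dict(blob: bytes, zdict: bytes) -> bytes:
    """Decompress `blob`; the decoder must be given the same dictionary."""
    dec = zlib.decompressobj(zdict=zdict)
    return dec.decompress(blob) + dec.flush()


blob = deflate_with_dict(message, preset)
restored = inflate_with_dict(blob, preset)
print(len(message), len(zlib.compress(message, 9)), len(blob))
```

The dictionary itself is never written to the output, so both sides have to agree on it out of band. Decompressing `blob` without it fails with `zlib.error`.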