Comment by fracus
7 months ago
I'm curious why a 10GB file of all zeroes would compress only to 10MB. I mean theoretically you could compress it to one byte. I suppose the compression happens on a stream of data instead of analyzing the whole, but I'd assume it would still do better than 10MB.
A compressed file that is only one byte long can only represent maximally 256 different uncompressed files.
Signed, a kid in the 90s who downloaded some "wavelet compression" program from a BBS because it promised to compress all his WaReZ even more so he could then fit moar on his disk. He ran the compressor and hey golly that 500MB ISO fit into only 10MB of disk now! He found out later (after a defrag) that the "compressor" was just hiding data in unused disk sectors and storing references to them. He then learned about Shannon entropy from comp.compression.research and was enlightened.
> He found out later (after a defrag) that the "compressor" was just hiding data in unused disk sectors and storing references to them
So you could access the files until you wrote more data to disk?
Strange to think that is approach would actually work pretty damn well for most people because most people aren’t using therefore hard drive space
Ha ha, that compressor is some evil genius.
Brings to mind this 30+ year old IOCCC entry for compressing C code by storing the code in the file names.
https://www.ioccc.org/1993/lmfjyh/index.html
man, a comment that brings back memories. you and me both.
It has to cater for any possible input. Even with special case handling for this particular (generally uncommon) case of vast runs of the same value: the compressed data will probably be packetized somehow, and each packet can reproduce only so many repeats, so you'll need to repeat each packet enough times to reproduce the output. With 10 GB, it mounts up.
I tried this on my computer with a couple of other tools, after creating a file full of 0s as per the article.
gzip -9 turns it into 10,436,266 bytes in approx 1 minute.
xz -9 turns it into 1,568,052 bytes in approx 4 minutes.
bzip2 -9 turns it into 7,506 (!) bytes in approx 5 minutes.
I think OP should consider getting bzip2 on the case. 2 TBytes of 0s should compress nicely. And I'm long overdue an upgrade to my laptop... you probably won't be waiting long for the result on anything modern.
The reason why the discussion in this thread centers around gzip (and brotli / zstd) is because those are standard compression schemes that HTTP clients will generally support (RFCs 1952, 7932, and 8478).
As far as I can tell, the biggest amplification you can get out of zstd is 32768 times: per the standard, the maximum decompressed block size is 128KiB, and the smallest compressed block is a 3-byte header followed by a 1-byte block (e.g. run-length-encoded). Indeed, compressing a 1GiB file of zeroes yields 32.9KiB of output, which is quite close to that theoretical maximum.
Brotli promises to allow for blocks that decompress up to 16 MiB, so that actually can exceed the compression ratios that bzip2 gives you on that particular input. Compressing that same 1 GiB file with `brotli -9` gives an 809-byte output. If I instead opt for a 16 GiB file (dd if=/dev/zero of=/dev/stdout bs=4M count=4096 | brotli -9 -o zeroes.br), the corresponding output is 12929 bytes, for a compression ratio of about 1.3 million; theoretically this should be able to scale another 2x, but whether that actually plays out in practice is a different matter.
(The best compression for brotli should be available at -q 11, which is the default, but it's substantially slower to compress compared to `brotli -9`. I haven't worked out exactly what the theoretical compression ratio upper bound is for brotli, but it's somewhere between 1.3 and 2.8 million.)
Also note that zstd provides very good compression ratios for its speed, so in practice most use cases benefit from using zstd.
That's a good point, thanks - I was thinking of this from the point of view of the client downloading a file and then trying to examine it, but of course you'd be much better off fucking up their shit at an earlier stage in the pipeline.
I get your point(and have no idea why it isn't compressed more), but is the theoretical value of 1 byte correct? With just one single byte, how does it know how big should the file be after being decompressed?
In general, this theoretical problem is called the Kolmogorov Complexity of a string: the size of the smallest program that outputs a the input string, for some definition of "program", e.g., an initial input tape for a given universal turing machine. Unfortunately, Kolmogorov Complexity in general is incomputable, because of the halting problem.
But a gzip decompressor is not turing-complete, and there are no gzip streams that will expand to infinitely large outputs, so it is theoretically possible to find the pseudo-Kolmogorov-Complexity of a string for a given decompressor program by the following algorithm:
Let file.bin be a file containing the input byte sequence.
1. BOUNDS=$(gzip --best -c file.bin | wc -c)
2. LENGTH=1
3. If LENGTH==BOUNDS, run `gzip --best -o test.bin.gz file.bin` and HALT.
4. Generate a file `test.bin.gz` LENGTH bytes long containing all zero bits.
5. Run `gunzip -k test.bin.gz`.
6. If `test.bin` equals `file.bin`, halt.
7. If `test.bin.gz` contains only 1 bits, increment LENGTH and GOTO 3.
8. Replace test.bin.gz with its lexicographic successor by interpreting it as a LENGTH-byte unsigned integer and incrementing it by 1.
9. GOTO 5.
test.bin.gz contains your minimal gzip encoding.
There are "stronger" compressors for popular compression libraries like zlib that outperform the "best" options available, but none of them are this exhaustive because you can surely see how the problem rapidly becomes intractable.
For the purposes of generating an efficient zip bomb, though, it doesn't really matter what the exact contents of the output file are. If your goal is simply to get the best compression ratio, you could enumerate all possible files with that algorithm (up to the bounds established by compressing all zeroes to reach your target decompressed size, which makes a good starting point) and then just check for a decompressed length that meets or exceeds the target size.
I think I'll do that. I'll leave it running for a couple days and see if I can generate a neat zip bomb that beats compressing a stream of zeroes. I'm expecting the answer is "no, the search space is far too large."
I'm an idiot, of course the search space is too large. It outgrows what I can brute force by the heat death of the universe by the time it gets to 16 bytes, even if the "test" is a no-op.
I would need to selectively generate grammatically valid zstd streams for this to be tractable at all.
It’s a zip bomb, so does the creator care? I just mean from a practical standpoint - overflows and crashes would be a fine result.
Good question. The "ultimate zip bomb" looks something like https://github.com/iamtraction/ZOD - this produces the infamous "42.zip" file, which is about 42KiB, but expands to 3.99 PiB (!).
There's literally no machine on Earth today that can deal with that (as a single file, I mean).
> There's literally no machine on Earth today that can deal with that (as a single file, I mean).
Oh? Certainly not in RAM, but 4 PiB is about 125x 36TiB drives (or 188x 24TiB drives). (You can go bigger if you want to shell out tens of thousands per 100TB SSD, at which point you "only" need 45 of those drives.)
These are numbers such that a purpose-built server with enough SAS expanders could easily fit that within a single rack, for less than $100k (based on the list price of an Exos X24 before even considering any bulk discounts).
I think you can rent a server with about 4.5 PiB from OVH - as a standard product offering, not even a special request. It costs a lot, obviously.
1 reply →
That's far from the ultimate zip bomb.
42.zip has five layers. But you can make a zip file that has an infinite number of layers. See https://research.swtch.com/zip or https://alf.nu/ZipQuine
Do must unzip programs work recursively by default?
No, at least not the ones I am aware of. iirc these kinds of attacks usually targeted content scanners (primarily antivirus). And an AV program would of course have to recursively de compress everything
It'd have to be more than one byte. There's the central directory, zip header, local header then the file itself you need to also tell it how many zeros to make when decompressing the actual file but most compression algorithms don't work like that because they're designed for actual files not essentially blank files so you get larger than the absolute minimum compression.
I mean, if I make a new compression algorithm that says a 10GB file of zeros is represented with a single specific byte, that would technically be compression.
All depends on how much magic you want to shove into an "algorithm"
If it's not standard I count the extra program required to decompress it as part of the archive.
1 reply →
There probably aren’t any perfectly lossless compression algorithms, I guess? Nothing would ever be all zeroes, so it might not be an edge case accounted for or something? I have no idea, just pulling at strings. Maybe someone smarter can jump in here.
No lossless algorithm can compress all strings; some will end up larger. This is a consequence of the pigeonhole principle.
It requires at leadt few bytes, there is no way to represent 10GB of data in 8 bits.
But of course there is. Imagine the following compression scheme:
Of course this is an artificial example, but theoretically it's perfectly sound. In fact, I think you could get there with static huffman trees supported by some formats, including gzip.
What you suggest is saving the information somewhere else and putting a number to represent it. That is not compression, that is mapping. By using this logic, one can argue that one bit is enough as well.
> 254 followed by 0: output 254
126, surely?
There's around a 64KB block size limit for a block of compressed data. That sets a max compression ratio.
gzip isn't optimal for this case. It divides the file into blocks and each one has a header. Apparently that's about 1 byte per 1000.