Comment by wvenable
2 years ago
JPEG relies on the limitations of human vision to make an image largely indistinguishable from the original. It specifically throws away information that we are less likely to notice. So yes, a good JPEG should be indistinguishable (to humans) from the original. Obviously the more you turn up the compression, the harder that is.
It's not quite that straightforward, though, in that there are two competing goals: small size, and looking as similar as possible to the original. We're explicitly willing to trade accuracy for size. How much depends on the use, but sometimes we're willing to trade away so much quality that the artefacts are plainly visible. And we're willing to trade more accuracy for size when the artefacts don't distract. For some uses compression artefacts are better than misleading changes to the original, but for other uses, misleading changes would be preferable as long as they give fewer noticeable artefacts for a given size.
I don't think you disagree. The point is that JPEG has the constraint: make an image as similar as possible to the source image while not going over x kilobytes. LLMs have no similar constraint, so calling them "compression" is a false analogy; they're not trying to compress information, they're using their dataset to learn general facts about e.g. syntax and culture.
I was really mainly responding to the point about JPEG aiming to be indistinguishable. Point being that for a lot of purposes we're fine with, and might even be happier with, very different tradeoffs than those JPEG makes.
Going specifically to AI, we do agree that the lack of such a constraint means they're not compressors in and of themselves. The training compresses information, but that does not make them compressors. Learning and compressing information are, however, at least in some respects very similar. A key part of the LZW family of compression algorithms, for example, is building a dictionary of byte sequences (terms) learned from the input as it is scanned.
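To make the "learning a dictionary from the input" point concrete, here's a minimal sketch of classic LZW in Python. This is a textbook rendition for illustration, not a production codec; real implementations pack the codes into variable-width bit streams rather than a list of ints.

```python
def lzw_compress(data: bytes) -> list[int]:
    # Start with a dictionary of all 256 single-byte strings,
    # then "learn" longer sequences as they appear in the input.
    dictionary = {bytes([i]): i for i in range(256)}
    next_code = 256
    result = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate  # keep extending the match
        else:
            result.append(dictionary[current])
            dictionary[candidate] = next_code  # learn a new term
            next_code += 1
            current = bytes([byte])
    if current:
        result.append(dictionary[current])
    return result


def lzw_decompress(codes: list[int]) -> bytes:
    # Rebuild the same dictionary on the fly; no dictionary is
    # transmitted, which is what makes the learning "free".
    dictionary = {i: bytes([i]) for i in range(256)}
    next_code = 256
    prev = dictionary[codes[0]]
    out = [prev]
    for code in codes[1:]:
        if code in dictionary:
            entry = dictionary[code]
        else:
            # Special case: the code refers to the entry still being built.
            entry = prev + prev[:1]
        out.append(entry)
        dictionary[next_code] = prev + entry[:1]
        next_code += 1
        prev = entry
    return b"".join(out)
```

The dictionary is never stored in the output; the decompressor reconstructs it by replaying the same learning process, which is the parallel to a model internalising regularities of its training data.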
AI models could eventually be used as the base of a compression scheme, because the models encode a lot of information that can potentially be referenced in space-efficient ways.
E.g. if I have a picture of a sunset, and can find a way of getting Stable Diffusion or similar to generate a similar-enough image of a sunset from a description smaller than the original image, then I have a compressor and decompressor.
Ignoring the runtime cost (and the question of whether it could ever be brought down to levels where this would actually produce a benefit), depending on how close the output is, it may be a totally useless algorithm leading to images that are way too far from the input, or it might turn out pretty good. But the tradeoffs would also be very different from JPEG's. For some uses I might be happy with a quite different-looking sunset as long as it's "close enough" and high quality, even at very high compression ratios. E.g. "A sunset over the horizon. Photo taken from a beach. A fishing boat in the water" fed to [1] produced a pretty nice sunset. Couple that with a seed to make it deterministic, and I might be happy with that as a compression of an image of a quite different sunset. For other uses I'd much prefer JPEG artefacts and something that is clearly the same sunset.

For "real" use of this for compression you'd want someone to research ways of guiding the model to produce something much closer to the input (maybe heavily downscaling the original image and using that as the starting point, coupled with a description; maybe a set of steps including instructions for infilling, etc.). I think finding the limits of how closely you can get these models to reproduce a specific input from the most minimal possible input would make for fascinating research.
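The prompt-plus-seed idea can be sketched as a codec interface. This is a toy sketch only: `generate_image` here is a hypothetical stand-in that deterministically expands (prompt, seed) into bytes, where a real system would invoke something like Stable Diffusion with a fixed seed. The point is purely that the "compressed file" is just the prompt and seed, which is tiny compared to the pixels it regenerates.

```python
import hashlib
import random


def generate_image(prompt: str, seed: int, size: int = 64 * 64 * 3) -> bytes:
    """Hypothetical stand-in for a text-to-image model.

    Deterministically expands (prompt, seed) into pixel bytes; a real
    system would run a diffusion model with a fixed seed here.
    """
    rng = random.Random(hashlib.sha256(f"{prompt}/{seed}".encode()).digest())
    return bytes(rng.randrange(256) for _ in range(size))


def compress(prompt: str, seed: int) -> bytes:
    # The entire "compressed file" is the seed and the description.
    return f"{seed}|{prompt}".encode()


def decompress(blob: bytes) -> bytes:
    seed_str, prompt = blob.decode().split("|", 1)
    return generate_image(prompt, int(seed_str))
```

The compression ratio is enormous precisely because the output is only loosely tied to the input: it's a lossy codec whose "loss" is semantic (a different sunset) rather than perceptual (blocky artefacts), which is the tradeoff distinction above.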
[1] https://huggingface.co/stabilityai/stable-diffusion-2?text=A...