
Comment by janalsncm

1 day ago

Quantization is not lossless.

Nobody really cares if it meets a strict definition of lossless.

  • And when you consider that the usual final step in the pipeline is that a sampler goes ham on the probabilities and just picks some random nonsense, the tolerance for lossy compression is fairly high.

    In fact, there's this funny occurrence where Q4 models on occasion perform better than their fp16 counterparts on benchmarks run with top_k=1: the quantization noise perturbs the logits slightly, so they can stumble past a local maximum the fp16 model deterministically commits to and land on a more correct solution (a toy sketch of the mechanism follows below).
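    A minimal numpy sketch of that mechanism (not from the thread; the logit values and the 0.05 noise scale are assumptions standing in for real quantization error): with top_k=1 only the argmax matters, so small perturbations on near-tied logits flip the pick.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical logits for a tiny vocabulary; the top two are near-tied.
    logits_fp16 = np.array([2.00, 1.98, -1.0, -3.0])

    flips = 0
    trials = 10_000
    for _ in range(trials):
        # Stand-in for quantization error: small per-logit noise.
        noise = rng.normal(scale=0.05, size=logits_fp16.shape)
        if np.argmax(logits_fp16 + noise) != np.argmax(logits_fp16):
            flips += 1

    # Flipping the argmax changes the greedy output token outright, which is
    # how a quantized model can wander off a path the fp16 model commits to.
    print(f"greedy pick flipped in {flips / trials:.1%} of trials")
    ```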

  • I do? I spend a ton of time post-training models for creative tasks.

    The effects of model quantization are usually quantified in terms of performance on benchmaxxed tasks with strong logit probabilities, temp 0, and a "right" answer the model has to pick. Or, even worse, they'll be measured on metrics that don't map to anything except themselves, like perplexity (https://arxiv.org/pdf/2407.09141); see the sketch after this comment for what perplexity actually computes.

    I agree Q8 is strong, but I also think the effects of quantization are consistently underappreciated. People talk about how "these models" perform while the community is, in practice, running 10+ variants of a single model, each with a distinct performance profile.

    Even knowing the bits per weight used isn't enough to know exactly how a given quant method affects the model (the second sketch below illustrates this): https://docs.unsloth.ai/basics/unsloth-dynamic-v2.0-ggufs
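    On the perplexity point: it is just exp of the mean per-token negative log-likelihood, a model-internal "surprise" score. A minimal sketch with made-up log-probs shows how two quants can look near-identical on it while still being free to diverge wherever logits are close:

    ```python
    import numpy as np

    def perplexity(token_log_probs):
        # exp(mean negative log-likelihood) over the evaluated tokens.
        return float(np.exp(-np.mean(token_log_probs)))

    # Hypothetical per-token probabilities from two quants of one model.
    fp16_lp = np.log([0.50, 0.40, 0.30, 0.60])
    q4_lp   = np.log([0.48, 0.41, 0.29, 0.59])

    print(perplexity(fp16_lp))  # ~2.30
    print(perplexity(q4_lp))    # ~2.33, despite distinct token-level behavior
    ```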
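    And on bits per weight not pinning down the damage, a round-to-nearest int4 sketch where the only thing that changes is scale granularity (per-tensor vs. per-group, roughly the axis dynamic quant schemes move along). The outlier pattern is synthetic, and both variants are nominally 4 bpw (ignoring scale-storage overhead), yet the error differs:

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    w = rng.normal(size=4096).astype(np.float32)
    w[::512] *= 20.0  # a few synthetic outlier weights, common in LLM layers

    def int4_rmse(weights, group_size):
        # Symmetric round-to-nearest int4 (-7..7), one scale per group.
        err = 0.0
        for i in range(0, len(weights), group_size):
            g = weights[i:i + group_size]
            scale = np.abs(g).max() / 7.0
            deq = np.clip(np.round(g / scale), -7, 7) * scale
            err += float(((g - deq) ** 2).sum())
        return (err / len(weights)) ** 0.5

    print("one scale per tensor  :", int4_rmse(w, len(w)))
    print("one scale per 64 group:", int4_rmse(w, 64))
    ```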

  • That's not true if there are measurable performance differences.

    • "Strict" means something. People, including yourself, only care if there is a practical difference in performance; "this is lossless and that isn't lossless" is a completely useless statement in this realm. In many domains, though, lossy compression is either not tolerated, not legal, or not practical (a round-trip sketch below shows what the strict sense actually checks).
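      For the strict sense of the word, a round-trip makes it concrete (toy int4 scheme, illustrative only): quantize-then-dequantize does not reproduce the original weights bit for bit, so by the information-theoretic definition it is lossy, whatever the practical impact.

      ```python
      import numpy as np

      rng = np.random.default_rng(2)
      w = rng.normal(size=8).astype(np.float32)

      # Round-trip through symmetric int4 quantization.
      scale = np.abs(w).max() / 7.0
      w_back = (np.clip(np.round(w / scale), -7, 7) * scale).astype(np.float32)

      print(np.array_equal(w, w_back))   # False: information was discarded
      print(np.abs(w - w_back).max())    # the residual the debate is about
      ```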