Comment by lambda

6 days ago

This isn't the paper that I was thinking of, but it shows a similar trend to the one I was looking at. In this particular case, even down to 5 bits showed no measurable reduction in performance (actually a slight increase, but that probably just means that you're withing the noise of what this test can distinguish), then you see performance dropping off rapidly as it gets down to 3 various 3 bit quants: https://arxiv.org/pdf/2601.14277

There was another paper that did a similar test, but with several models in a family, and all the way down to 1 bit, and it was only at 1 bit that it crossed over to having worse performance than the next smaller model. But yeah, I'm having a hard time finding that paper again.

2 comments

lambda

Wowfunhappy 6 days ago

So, why does ChatGPT not use fewer bits? Sure they have big data centers but they still have to pay for those.

lambda 5 days ago

Why do you think ChatGPT doesn't use a quant? GPT-OSS, which OpenAI released as open weights, uses a 4 bit quant, which is in some ways a sweet spot, it loses a small amount of performance in exchange for a very large reduction in memory usage compared to something like fp16. I think it's perfectly reasonable to expect that ChatGPT also uses the same technique, but we don't know because their SOTA models aren't open.
https://arxiv.org/pdf/2508.10925