Comment by Gracana

7 hours ago

The level of deceit you're describing is kind of ridiculous. Anybody talking about their specific setup is going to be happy to tell you the model and quant they're running and the speeds they're getting, and if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around.

> if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around

Fwiw, not necessarily. I've noticed quantized models have strange and surprising failure modes: everything seems to be working well, and then the model enters a death spiral repeating a specific word, or it completely fails on one task out of a handful of similar tasks.

8-bit vs 4-bit can be almost imperceptible or night and day.
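One reason the gap can be so uneven: the round-trip error from quantization grows fast as you drop bits, but whether that error matters depends on the task. A toy sketch (plain symmetric per-tensor quantization on random numbers, not any particular inference stack or quant format) shows the raw error difference:

```python
import random

def quantize_roundtrip(values, bits):
    """Symmetric per-tensor quantization: floats -> signed ints -> floats."""
    qmax = 2 ** (bits - 1) - 1               # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

random.seed(0)
weights = [random.gauss(0, 1) for _ in range(10_000)]  # stand-in weight tensor

for bits in (8, 4):
    deq = quantize_roundtrip(weights, bits)
    err = sum(abs(a - b) for a, b in zip(weights, deq)) / len(weights)
    print(f"{bits}-bit mean abs error: {err:.5f}")
```

Real quant schemes (group-wise scales, k-quants, etc.) shrink that gap, but the error is still there, and whether it surfaces as "imperceptible" or "night and day" depends on which weights it lands on and what you ask the model to do.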

This isn't something you'd necessarily see just playing around, but it shows up when you're trying to do something specific.