Comment by jasonjmcghee

5 hours ago

> if you want to understand the effects of quantization on model quality, it's really easy to spin up a GPU server instance and play around

Fwiw, not necessarily. I've noticed that quantized models have strange and surprising failure modes: everything seems to be working well, and then the model goes into a death spiral repeating a specific word, or it completely fails on one task out of a handful of similar ones.
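If you're running a batch of prompts rather than eyeballing outputs, one cheap way to catch the "repeating a specific word" failure is to flag any output where the same word shows up many times in a row. Here's a minimal sketch, assuming plain-text outputs. The function name `has_repetition_loop` and the default threshold of 5 are made up for illustration, and this won't catch longer repeated phrases.

```python
import string

def has_repetition_loop(text: str, min_repeats: int = 5) -> bool:
    """Return True if any word occurs at least min_repeats times in a row.

    Hypothetical helper; min_repeats=5 is an arbitrary threshold.
    """
    # Normalize case and strip surrounding punctuation so "Yes," and "yes" match.
    words = [w.strip(string.punctuation).lower() for w in text.split()]
    words = [w for w in words if w]  # drop tokens that were pure punctuation
    if not words:
        return False
    if min_repeats <= 1:
        return True
    run = 1
    for prev, cur in zip(words, words[1:]):
        # Extend the current run on a repeat, otherwise start a new run.
        run = run + 1 if cur == prev else 1
        if run >= min_repeats:
            return True
    return False

print(has_repetition_loop("the the the the the the"))  # True
print(has_repetition_loop("the cat sat on the mat"))   # False
```

A check like this only tells you something went wrong; it says nothing about the quieter failure where one task out of several just gets worse.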

The gap between 8-bit and 4-bit can be almost imperceptible or night and day.
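For intuition on why 4-bit is so much riskier than 8-bit: with simple symmetric round-to-nearest quantization, 8 bits give 127 positive levels while 4 bits give only 7, so each weight gets rounded far more coarsely. The sketch below assumes a single absmax scale over synthetic Gaussian "weights". Real quantization formats use per-group scales and other tricks, so treat this as an illustration of the rounding error only, not of any particular tool.

```python
import random

def quantize_dequantize(weights: list[float], bits: int) -> list[float]:
    """Symmetric absmax round-to-nearest quantization, then map back to floats."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    scale = max(abs(w) for w in weights) / qmax
    out = []
    for w in weights:
        q = max(-qmax, min(qmax, round(w / scale)))  # integer code
        out.append(q * scale)                        # dequantized value
    return out

def mse(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

rng = random.Random(0)
# Synthetic weights for illustration, not taken from any real model.
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]
for bits in (8, 4):
    err = mse(weights, quantize_dequantize(weights, bits))
    print(f"{bits}-bit MSE: {err:.3e}")
```

The 4-bit error comes out a couple of orders of magnitude larger per weight, but that number alone doesn't tell you whether a given task survives it, which is why testing on the specific thing you care about matters.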

This isn't something you'd necessarily see just playing around; it shows up when you're trying to do something specific.