Comment by api

7 hours ago

Read the headline and thought it rescaled LLMs down for your hardware. That would be fascinating but would degrade performance.

Any work on that? Like let's say I have 64GB of memory and I want to run a 256B parameter model. At 4-bit quantization that's 128 gigs and usually works well; 2 bits usually degrades it too much. But what if you could lose data (parameters) instead of precision? It would probably imply a fine-tuning run afterward, so very compute intensive.
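
A quick back-of-the-envelope sketch of the memory math above, ignoring KV cache and activation overhead (the 128B pruned size is just a hypothetical to illustrate the "lose data instead of precision" idea):

```python
def weight_gb(params_b: float, bits: float) -> float:
    """Approximate weight memory in GB: (params_b * 1e9) * (bits / 8) bytes."""
    return params_b * bits / 8

# Precision route: keep all 256B parameters, shrink the bit width.
for bits in (16, 8, 4, 2):
    print(f"256B @ {bits}-bit: {weight_gb(256, bits):.0f} GB")
# -> 512, 256, 128, 64 GB: only 2-bit fits in 64GB, at a heavy quality cost.

# Data route: keep 4-bit precision, prune parameters instead.
print(f"128B @ 4-bit: {weight_gb(128, 4):.0f} GB")  # -> 64 GB, fits
```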