Comment by regularfry

3 days ago

At some point you hit information limits. With conventional quantisation you see a marked capability fall-off below q5. All else being equal, you'd expect an N-parameter 5-bit quant to be roughly comparable to a 3N-parameter ternary model, if they're trained to the same level, purely in terms of the amount of information they can possibly hold. So yes, a 100B ternary model would be in the ballpark of a 30B q5 conventional model, with a lot of hand-waving and sufficiently smart training.
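The back-of-envelope arithmetic can be made explicit. This is a sketch of the "capacity ≈ parameters × bits per parameter" model only; the function name is illustrative, and equal capacity of course doesn't guarantee equal capability:

```python
import math

def capacity_bits(n_params, bits_per_param):
    """Crude information capacity: parameter count times bits per parameter."""
    return n_params * bits_per_param

# A ternary weight has 3 states, so it carries log2(3) ~= 1.585 bits.
ternary_100b = capacity_bits(100e9, math.log2(3))  # ~158.5 Gbits
q5_30b = capacity_bits(30e9, 5)                    # 150 Gbits

# The two totals are within ~6% of each other, hence "roughly comparable".
ratio = ternary_100b / q5_30b
```

The ~3x parameter multiplier comes straight out of this: 5 bits divided by log2(3) ≈ 3.15, so a ternary model needs roughly three times the parameters to store the same raw information as a q5 model.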

I assume that theoretically, 1-bit models could be the most efficient, given that modern models have already moved from 32-bit to 16-bit to 8-bit per parameter (without quantization).

  • It's not clear where the efficiency frontier actually is. We're good at measuring size and FLOPS; we're really not very good at measuring capability. Because of that, we don't really know yet whether we can do meaningfully better at 1 bit per parameter than we currently get out of quantising down to that size. Probably, is the answer, but it's going to be a while before anyone working at 1 bit per param has sunk as many FLOPS into it as the frontier labs have at higher bit counts.

    • The thing with efficiency is that it's relative to both inference and training compute. If you do quantization, you need a more powerful, higher-precision model to quantize from, which doesn't exist if you're trying to create a frontier model. In that case the question is only whether you get better inference and/or training performance from natively training, e.g., a 1-bit model.

      Currently the optimal training precision seems to be 8 bit (at least that's what DeepSeek and some other open-weight companies use). But this might change with training methods optimized for 1-bit training, like the one in the paper I linked before: https://proceedings.neurips.cc/paper_files/paper/2024/hash/7...
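The dependency on a full-precision parent model is easy to see in the simplest post-training scheme, round-to-nearest with a per-tensor scale. This is a toy sketch (not any particular library's method): the fp weights are a required input, so there is nothing to quantize from when no bigger model exists:

```python
def quantize_rtn(weights, n_bits):
    """Round-to-nearest symmetric quantization to n_bits signed integers."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax   # per-tensor scale
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.8, -0.3, 0.05, -0.62]           # full-precision weights: required input
q, s = quantize_rtn(w, 8)              # 8-bit integer codes plus one scale
w_hat = dequantize(q, s)               # lossy reconstruction, error <= scale/2
```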
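Native low-bit training typically keeps a full-precision "shadow" copy of the weights, binarizes only in the forward pass, and passes gradients straight through the non-differentiable rounding. This is a generic straight-through-estimator sketch (not the specific method of the linked paper; all names are illustrative):

```python
def binarize(w):
    """Forward: project full-precision weights onto {-1, +1}."""
    return [1.0 if wi >= 0 else -1.0 for wi in w]

def ste_step(shadow, x, target, lr=0.1):
    """One SGD step for y = dot(binarize(w), x) with squared loss.
    Backward treats binarize() as the identity (straight-through)."""
    wb = binarize(shadow)                            # forward uses 1-bit weights
    y = sum(wi * xi for wi, xi in zip(wb, x))
    grad_y = 2.0 * (y - target)                      # dL/dy for L = (y - t)^2
    # The gradient updates the full-precision shadow copy, not wb itself.
    return [wi - lr * grad_y * xi for wi, xi in zip(shadow, x)]

shadow = [-0.3, 0.2]
for _ in range(5):
    shadow = ste_step(shadow, x=[1.0, 1.0], target=2.0)
```

The accumulated small updates eventually flip the sign of a shadow weight, which is how a 1-bit model learns at all; the open question in the thread is whether this matches the quality of quantizing a higher-precision model.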