Comment by visarga

10 months ago

Very interesting trick: using a dictionary of basis vectors that are computed quickly from a seed, with no storage needed. But the result is the same 3- or 4-bit quantization, with only a slight improvement. Their tiles are small, just 8 or 12 weights, which is why the compression doesn't go very far. It would have been great if this trick had pushed quantization below 1 bit/weight, but that would require longer tiles. I wonder what the limits are if we use a larger reservoir of cheap entropy as part of the neural net architecture, even during training.
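
To make the tile-length point concrete, here is a rough back-of-the-envelope sketch of the storage cost per weight. The 16-bit seed, 4-bit coefficients, and 4-bit shared exponent are illustrative assumptions, not figures quoted in this thread, and `bits_per_weight` is a hypothetical helper.

```python
def bits_per_weight(block_size, seed_bits, num_coeffs, coeff_bits, shared_bits=0):
    """Average storage cost per weight for one seed-compressed tile.

    Each tile stores one seed, a few quantized coefficients, and
    optionally some shared metadata (e.g. a shared exponent).
    """
    total_bits = seed_bits + num_coeffs * coeff_bits + shared_bits
    return total_bits / block_size


# Assumed settings (illustrative): 16-bit seed, 4-bit coefficients,
# 4-bit shared exponent per tile.
print(bits_per_weight(block_size=8, seed_bits=16, num_coeffs=3, coeff_bits=4, shared_bits=4))
print(bits_per_weight(block_size=12, seed_bits=16, num_coeffs=4, coeff_bits=4, shared_bits=4))

# With a 16-bit seed, the seed alone costs 16 / block_size bits per weight,
# so getting under 1 bit/weight needs tiles longer than 16 weights.
print(bits_per_weight(block_size=64, seed_bits=16, num_coeffs=3, coeff_bits=4, shared_bits=4))
```

Under these assumed settings, tiles of 8 and 12 weights land at 4 and 3 bits/weight, and only much longer tiles get below 1 bit/weight.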

Congrats to Apple and Meta. It makes sense that they did the research, since this will go towards efficient serving of LLMs on phones. And it's very easy to implement.

kingsleyopara 10 months ago

I was about to post something similar. While the research is interesting, it doesn’t offer any advantages over 3- or 4-bit quantization. I also have to assume they explored using longer tiles but found it to be ineffective — which would make sense to me from an information theory perspective.

timschmidt 10 months ago
> it doesn’t offer any advantages over 3- or 4-bit quantization.
"zero-shot accuracy retention at 4- and 3-bit compression to be on par with or better than state-of-the-art methods, while maintaining performance comparable to FP16 baselines."
My reading of that is FP16-level accuracy at Q3 or Q4 size and memory bandwidth, which is a huge advantage.
- kingsleyopara 10 months ago
  
  For zero-shot accuracy from Table 3:
  * LLaMA 3 8B: baseline 72.26, 4-bit 71.31, 3-bit 62.79
  * LLaMA 3 70B: baseline 79.51, 4-bit 78.06, 3-bit 74.68
  These results seem comparable to modern quantization methods—for example, the ~4-bit results for smaller LLaMA models listed here: https://ai.meta.com/blog/meta-llama-quantized-lightweight-mo...
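
A quick way to read those numbers is as accuracy retained relative to the FP16 baseline. The sketch below uses only the Table 3 figures quoted above; `retention` is a hypothetical helper name.

```python
# Zero-shot accuracies quoted above from Table 3:
# (FP16 baseline, 4-bit, 3-bit)
scores = {
    "LLaMA 3 8B": (72.26, 71.31, 62.79),
    "LLaMA 3 70B": (79.51, 78.06, 74.68),
}


def retention(baseline, quantized):
    """Quantized accuracy as a percentage of the baseline accuracy."""
    return 100.0 * quantized / baseline


for model, (base, q4, q3) in scores.items():
    print(f"{model}: 4-bit keeps {retention(base, q4):.1f}%, "
          f"3-bit keeps {retention(base, q3):.1f}%")
```

On these figures, 4-bit keeps roughly 98.7% (8B) and 98.2% (70B) of baseline accuracy, while 3-bit keeps roughly 86.9% (8B) and 93.9% (70B).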
  
jsenn 10 months ago

I think the main advantage is that you can compute the extra parameters (the PRNG seeds) from the network weights alone, whereas most other quantization methods require simulating the quantization procedure at training time (Quantization-Aware Training) or setting them from a calibration dataset (Post-Training Quantization).
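
A minimal sketch of what "computing the seeds from the weights alone" could look like: for each tile, try a range of seeds, regenerate a pseudo-random basis from each, fit a few coefficients by least squares, and keep the best seed. This is an illustration under assumptions, not the paper's implementation: it swaps in NumPy's generator for a hardware-friendly PRNG, uses a ±1 basis, skips coefficient quantization, and all function names are hypothetical.

```python
import numpy as np


def basis_from_seed(seed, block_size, rank):
    """Regenerate a pseudo-random +/-1 basis from a seed (nothing is stored)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(block_size, rank))


def compress_block(w, rank=3, num_seeds=256):
    """Search seeds 0..num_seeds-1 and return (seed, coeffs, squared_error).

    Only the weights themselves are needed: no training loop and no
    calibration data.
    """
    best = None
    for seed in range(num_seeds):
        U = basis_from_seed(seed, w.size, rank)
        # Least-squares fit of the tile as a combination of the basis columns.
        t, *_ = np.linalg.lstsq(U, w, rcond=None)
        err = float(np.sum((U @ t - w) ** 2))
        if best is None or err < best[2]:
            best = (seed, t, err)
    return best


def decompress_block(seed, coeffs, block_size):
    """Rebuild the tile from the seed and its coefficients."""
    U = basis_from_seed(seed, block_size, coeffs.size)
    return U @ coeffs


w = np.random.default_rng(0).normal(size=8)  # one tile of 8 weights
seed, coeffs, err = compress_block(w)
w_hat = decompress_block(seed, coeffs, w.size)
print(seed, err)
```

A real scheme would also quantize the coefficients to a few bits each, which this sketch leaves out to keep the seed search easy to follow.
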
hedgehog 10 months ago

This technique has three significant advantages over popular low-bit quantization methods: 1) it retains more accuracy, 2) it does not require calibration data, 3) it's easier to implement in hardware.

samus 10 months ago

It should definitely be worth it, because you can reuse databases of sequence-to-seed mappings for all future models.