This technique showed that there are ways to optimize weights during training so that they quantize cleanly while remaining performant. It isn't post-training quantization to a format like int4.
For Kimi, quantization is also part of training. Specifically, they say they use QAT (quantization-aware training).
That doesn't mean training entirely in integer math; rather, certain tricks are used to plan ahead for the final weight precision, i.e. fake quantization nodes are inserted into the forward pass to simulate int4.
IIRC the paper was solid, but it still hasn't been adopted or proven out at large scale. It's harder to adapt hardware and kernels to something like this than to int4.
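For anyone unfamiliar with the fake-quantization trick mentioned above, here's a minimal sketch of what it can look like in PyTorch. This is illustrative only, assuming symmetric per-tensor int4 scaling and a straight-through estimator; the class names are made up and this isn't Kimi's actual training code:

```python
import torch
import torch.nn as nn


class FakeQuantInt4(nn.Module):
    """Simulates int4 weight quantization during training (QAT-style).

    Weights stay in floating point; the forward pass rounds them onto a
    4-bit grid and immediately dequantizes, so the rest of the network
    "sees" int4-shaped weights. A straight-through estimator lets
    gradients flow through the non-differentiable rounding step.
    """

    def __init__(self, num_levels: int = 16):  # int4 -> 16 levels
        super().__init__()
        self.num_levels = num_levels

    def forward(self, w: torch.Tensor) -> torch.Tensor:
        # Symmetric per-tensor scale; real systems typically use
        # per-channel or per-group scales instead.
        qmax = self.num_levels // 2 - 1                      # 7 for int4
        scale = w.abs().max().clamp(min=1e-8) / qmax
        w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
        # Straight-through estimator: forward uses the quantized weights,
        # backward treats the quantizer as the identity function.
        return w + (w_q - w).detach()


class QATLinear(nn.Module):
    """Linear layer whose weights pass through the fake-quant node."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.fake_quant = FakeQuantInt4()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.fake_quant(self.weight).t()
```

The point is that the optimizer keeps updating full-precision weights, but it only ever gets feedback through their int4-rounded versions, so the weights it converges to are ones that survive the final quantization.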
just call it one trit