Comment by WhitneyLand

13 days ago

Interesting.

If this were the case, however, why would labs go to the trouble of distilling smaller models rather than releasing quantized versions of their flagships?

You can't quantize a 1T model down to "flash"-model speed or token price. 4bpw is about the limit of reasonable quantization, so you only get a 2-4x weight-size reduction (fp8/fp16 -> 4bpw). Easier to serve, sure, but maybe not cheap enough to offer as a free tier.
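
Quick napkin math on raw weight bytes, just to illustrate (the 1T figure is the round number above; this ignores KV cache, activations, and serving overhead):

    # Raw weight footprint for a hypothetical 1T-parameter model
    # at different precisions (weight bytes only).
    params = 1e12

    def weight_size_gb(n_params: float, bits_per_weight: int) -> float:
        return n_params * bits_per_weight / 8 / 1e9

    for label, bpw in [("fp16", 16), ("fp8", 8), ("4bpw", 4)]:
        print(f"{label:>5}: {weight_size_gb(params, bpw):,.0f} GB")
    # fp16: 2,000 GB | fp8: 1,000 GB | 4bpw: 500 GB -> only 2-4x smaller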

With distillation you're training a new model, so its size is arbitrary: say a 1T -> 20B (50x) reduction, which can then also be quantized. AFAIK distillation is also simply faster/cheaper than training the smaller model from scratch.
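
Same napkin math for the distilled case (again purely illustrative, reusing the 1T and 20B figures above):

    # Hypothetical 1T flagship at fp16 vs. a distilled 20B student at 4bpw,
    # raw weight bytes only.
    flagship_gb = 1e12 * 16 / 8 / 1e9   # ~2,000 GB of weights
    student_gb = 20e9 * 4 / 8 / 1e9     # ~10 GB of weights
    print(f"flagship: {flagship_gb:,.0f} GB, student: {student_gb:,.0f} GB, "
          f"~{flagship_gb / student_gb:.0f}x less weight memory")
    # ~200x less weight memory, vs. at most 2-4x from quantization alone.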

Hanlon's razor.

"Never attribute to malice that which is adequately explained by stupidity."

Yes, I'm calling labs that don't distill smaller models stupid for not doing so.