
Comment by achierius

13 days ago

While the other commenter is correct -- you can't just pick arbitrary floating-point formats if you want to run performantly on existing hardware -- there is some variety to choose from once you get down to the lower precisions. At 16 bits you can take either the standard IEEE fp16 format (1/5/10) or the exponent-heavy bf16 (1/8/7); at 8 bits there is technically no IEEE specification, but in practice the E5M2 format (1/5/2) serves as the "IEEE-equivalent" while E4M3 (1/4/3) takes some liberties with NaNs and drops infinities altogether -- and both are supported on recent Nvidia GPUs.
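To make the range/precision trade-off concrete, here's a rough sketch (mine, not from the spec) that computes the numeric envelope of each format from its exponent/mantissa bit counts under plain IEEE-style conventions. Note that real E4M3 deviates from this naive formula: per the OCP FP8 spec it reclaims the all-ones exponent for finite values, so its true max is 448 rather than the 240 computed here -- exactly the "liberties" mentioned above.

```python
# Approximate stats for an IEEE-style binary format with E exponent bits and
# M mantissa bits: bias = 2**(E-1) - 1, all-ones exponent reserved for inf/NaN.
FORMATS = {
    "fp16 (1/5/10)": (5, 10),
    "bf16 (1/8/7)":  (8, 7),
    "E5M2 (1/5/2)":  (5, 2),
    "E4M3 (1/4/3)":  (4, 3),   # real E4M3 reaches 448; see note above
}

def ieee_like_stats(exp_bits: int, mant_bits: int):
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias          # top exponent reserved for inf/NaN
    max_normal = (2 - 2 ** -mant_bits) * 2.0 ** max_exp
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -mant_bits                   # relative step size near 1.0
    return max_normal, min_normal, epsilon

for name, (e, m) in FORMATS.items():
    mx, mn, eps = ieee_like_stats(e, m)
    print(f"{name}: max~{mx:.3g}, min normal~{mn:.3g}, eps~{eps:.3g}")
```

Running it shows the split plainly: bf16 spans roughly the same range as fp32 (~3.4e38) with only ~1% relative precision, while fp16 tops out at 65504 but resolves ~0.1%; the two fp8 formats push the same trade another step down.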

So between these four you honestly cover _most_ of the desired solution space: e.g. it's hard to imagine wanting to give up more of the mantissa than you already do with E5M2, while E4M3 is already at the lower bound of dynamic range before you have to start giving up IEEE compatibility (which can definitely be a pain). There's some room left at the fp16 level, but bf16 was designed for use in neural networks from the start, so in practice people are happy using it for training and leaving inference to fp16 (which has higher precision).

The only thing that's missing is support for more esoteric formats, e.g. fp4 (E2M1, E3M0) and maybe packed ternary.
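For a sense of just how coarse fp4 is, here's a small illustration (assuming the OCP MX-style E2M1 convention: bias 1, no infinities or NaNs -- other conventions exist) that enumerates every representable value. Sixteen bit patterns yield only 15 distinct numbers.

```python
# Decode a 4-bit E2M1 value (1 sign bit, 2 exponent bits, 1 mantissa bit),
# assuming bias = 1 and no inf/NaN encodings.
def decode_e2m1(bits: int) -> float:
    sign = -1.0 if bits & 0b1000 else 1.0
    exp = (bits >> 1) & 0b11
    mant = bits & 0b1
    if exp == 0:                              # subnormal: 0 or 0.5
        return sign * mant * 0.5
    return sign * (1 + mant / 2) * 2.0 ** (exp - 1)

print(sorted({decode_e2m1(b) for b in range(16)}))
# [-6.0, -4.0, -3.0, -2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
```

At that granularity the format is basically a lookup table, which is why fp4 and ternary schemes lean so heavily on per-block scaling factors.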