Comment by user070223

3 months ago

Never trained a model, but the precision confused me as I've never considered how many bits should be reserved for exponent/mentisa. Has anyone architected a model(somehow) such that it has a free hand at using the give bits / choosing the type, or changed types from layer to layer, I mean surely when training for example vision models the first layers deal with the "big(yet simpler) picture"(light/dark, lines etc) where as the last layers are with the fine details etc.

Even though it may not suitable for (existing) hardware impl, it may be advantageous in other place for example in learning rate speed.

2 comments

user070223

apsec112 3 months ago

You can't choose arbitrary bits of mantissa, because what types are allowed is defined by the underlying hardware and instruction set (PTX for Nvidia). People have done some exploration of which layers can be quantized more vs. which need to be kept in higher precision, but this is usually done post-training (at inference time) and is largely empirical.

achierius 3 months ago

While the other commentator is correct -- you can't just choose arbitrary floating-point formats if you want to run performantly on existing hardware -- there is some variety to choose from once you get down to the lower precisions. At 16 bits you can take either the standard IEEE fp16 format (1/5/10) or the exponent-heavy bf16 (1/8/7); for 8 bits, there technically is no IEEE specification, but in practice the E5M2 format (1/5/2) serves as "IEEE-equivalent" while E4M3 (1/4/3) takes some liberties with NaNs and drops infinities altogether -- and both are supported on recent Nvidia GPUs.

So between these four you honestly cover _most_ of the desired solution space: e.g. it's hard to imagine wanting to give up more of the mantissa than you already do on E5M2, while E4M3 is already at the lower bound of dynamic range before you need to start giving up IEEE compatability (which can definitely be a pain). There's some room left at the fp16 level but in practice bf16 was already designed for use in neural networks, so in practice people are happy using it for training and then leaving inference to fp16 (which has higher precision).

The only thing that's missing is support for more esoteric formats, e.g. fp4 (E2M1, E3M0) and maybe packed ternary.