Comment by pavelstoev
14 days ago
Model training observations from both Llama 3 and 4 papers:
Meta’s Llama 3 was trained on ~16k H100s, achieving ~380–430 TFLOPS per GPU in BF16 precision, translating to a solid 38 - 43% hardware efficiency [Meta, Llama 3].
For Llama 4 training, Meta doubled the compute, using ~32K H100s and switched to FP8 precision. Despite the precision gain, observed efficiency dropped to about 19.7%, with GPUs delivering ~390 TFLOPS out of a theoretical 1,979 FP8 TFLOPS [Meta, Llama 4].
I am not the one to critique, and rather, this is a recognition of the enormous complexity of operating GPUs at this scale. Training massive models across tens of thousands of GPUs stretches today’s AI infrastructure to its limit.
Besides accelerating inference workloads, advanced GPU optimizations can be integrated into training and fine-tuning pipelines. From various kernel optimization techniques (over 90) to increasing memory access efficiency and scaling up to cluster-wide resource coordination, efficiency can be maximized with some complex software.
References: [Meta, Llama 3] https://ai.meta.com/research/publications/the-llama-3-herd-o... [Meta, Llama 4] https://ai.meta.com/blog/llama-4-multimodal-intelligence/
That's about the same number for DeepSeek-V3. If you count in fp8 MFU is about 20%. MoEs are hard.
That could also be why they did fp8. If we use theoretical performance of bf16 as baseline (I know this makes few sense, but for compare with previous trainings it's convenient) the about 40% MFU, not too bad.
IOW, MoE kills training MFU and they had to do fp8 to make it not looking funny. Both DeepSeek and Meta GenAI.
It's not just scale. Even for single GPU, it is hard to acheive 2x speed improvement as the GPU specs states. Even NVIDIA's own Tensor Engine acheives 28% extra FLOP/s[1].
[1]: https://arxiv.org/pdf/2310.18313
The H100 theoretical flops number is just marketing, as it relies on sparsity that LLMs don’t use
And the practical flops always end up lower. As an example a V100 has 125 according to spec, but the ideal case is more like 100 and non-ideal like 60.
Never trained a model, but the precision confused me as I've never considered how many bits should be reserved for exponent/mentisa. Has anyone architected a model(somehow) such that it has a free hand at using the give bits / choosing the type, or changed types from layer to layer, I mean surely when training for example vision models the first layers deal with the "big(yet simpler) picture"(light/dark, lines etc) where as the last layers are with the fine details etc.
Even though it may not suitable for (existing) hardware impl, it may be advantageous in other place for example in learning rate speed.
You can't choose arbitrary bits of mantissa, because what types are allowed is defined by the underlying hardware and instruction set (PTX for Nvidia). People have done some exploration of which layers can be quantized more vs. which need to be kept in higher precision, but this is usually done post-training (at inference time) and is largely empirical.
While the other commentator is correct -- you can't just choose arbitrary floating-point formats if you want to run performantly on existing hardware -- there is some variety to choose from once you get down to the lower precisions. At 16 bits you can take either the standard IEEE fp16 format (1/5/10) or the exponent-heavy bf16 (1/8/7); for 8 bits, there technically is no IEEE specification, but in practice the E5M2 format (1/5/2) serves as "IEEE-equivalent" while E4M3 (1/4/3) takes some liberties with NaNs and drops infinities altogether -- and both are supported on recent Nvidia GPUs.
So between these four you honestly cover _most_ of the desired solution space: e.g. it's hard to imagine wanting to give up more of the mantissa than you already do on E5M2, while E4M3 is already at the lower bound of dynamic range before you need to start giving up IEEE compatability (which can definitely be a pain). There's some room left at the fp16 level but in practice bf16 was already designed for use in neural networks, so in practice people are happy using it for training and then leaving inference to fp16 (which has higher precision).
The only thing that's missing is support for more esoteric formats, e.g. fp4 (E2M1, E3M0) and maybe packed ternary.
I think BF16 and FP16 are 1979 TFPOPs, but FP8 is 2x faster at 3958 TFLOPs. So only 10% efficiency, down from 20%. That’s not good.
That’s with sparsity. So it’s 29% down from 40%.