Comment by yusufozkan

14 days ago

> while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs

I thought they used a lot more GPUs to train frontier models (e.g. xAI training on 100k). Can someone explain why they are using so few?

I don't want to hunt down the details of each of these releases, but:

* You can use fewer GPUs if you decrease the batch size and increase the number of steps, which leads to a longer training time

* FP8 is pretty efficient; if Grok was trained in BF16, then Llama 4 could need fewer GPUs because of that

* It also depends on the size of the model and the number of tokens used for training; it's unclear whether the total FLOPs for each model are the same

* MFU (Model FLOPs Utilization) can also vary depending on the setup, which means that if you use better kernels and/or better sharding you can reduce the number of GPUs needed (rough sketch of the compute arithmetic below)
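
To make the last three points concrete, here's a back-of-the-envelope sketch using the common ~6 FLOPs per parameter per token approximation. All of the numbers (parameter count, token count, MFU, training duration, per-GPU throughput) are illustrative assumptions, not figures from the Llama 4 or Grok releases.

```python
# Back-of-the-envelope GPU-count estimate. Every input below is an
# illustrative assumption, not a published number for Llama 4 or Grok.

def gpus_needed(params, tokens, peak_flops_per_gpu, mfu, training_days):
    """Estimate how many GPUs are needed to finish the run in `training_days` days."""
    total_flops = 6 * params * tokens                 # ~6 FLOPs per parameter per token
    effective_flops_per_gpu = peak_flops_per_gpu * mfu
    seconds = training_days * 24 * 3600
    return total_flops / (effective_flops_per_gpu * seconds)

# Hypothetical model: 300B active parameters trained on 15T tokens over 90 days.
params, tokens, days = 300e9, 15e12, 90

# Rough H100-class dense peak throughput: ~2e15 FLOP/s in FP8, ~1e15 in BF16.
fp8  = gpus_needed(params, tokens, peak_flops_per_gpu=2e15, mfu=0.4, training_days=days)
bf16 = gpus_needed(params, tokens, peak_flops_per_gpu=1e15, mfu=0.4, training_days=days)

print(f"FP8:  ~{fp8:,.0f} GPUs")   # FP8 needs about half the GPUs of BF16 at equal MFU...
print(f"BF16: ~{bf16:,.0f} GPUs")  # ...or the same GPU count for roughly 2x the wall-clock time
```

The takeaway is the relationship, not the specific output: halve the per-GPU throughput (BF16 instead of FP8), halve the MFU, or double the token count, and the required GPU count (or the training time) scales proportionally.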