Comment by yusufozkan
14 days ago
> while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs
I thought they used a lot more GPUs to train frontier models (e.g. xAI training on 100k). Can someone explain why they are using so few?
14 days ago
> while pre-training our Llama 4 Behemoth model using FP8 and 32K GPUs
> I thought they used a lot more GPUs to train frontier models (e.g. xAI training on 100k). Can someone explain why they are using so few?
I don't want to hunt down the details on each of these releases, but:
* You can use fewer GPUs if you decrease the batch size and increase the number of steps, which would lead to a longer training time (see the first sketch after this list).
* FP8 is pretty efficient; if Grok was trained in BF16, then Llama 4 could need fewer GPUs because of that.
* It also depends on the size of the model and the number of tokens used for training; it's unclear whether the total FLOPs for each model is the same.
* MFU (Model FLOPs Utilization) can also vary depending on the setup, which means that if you use better kernels and/or better sharding you can reduce the number of GPUs needed (a rough estimate combining these factors is sketched in the second snippet below).
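
To make the first bullet concrete, here is a tiny sketch. Every constant and name in it (`TOKENS_TOTAL`, `SEQ_LEN`, `PER_GPU_BATCH`, `steps_needed`) is a made-up placeholder, not a figure from Meta or xAI; the point is just that with a fixed token budget and a fixed per-GPU batch, fewer GPUs means proportionally more optimizer steps and a proportionally longer run, not a different model.

```python
# Illustrative only: fixed token budget and fixed per-GPU batch, so cutting
# the GPU count just increases the step count (and wall-clock time).

TOKENS_TOTAL = 30e12   # assumed total training tokens (hypothetical)
SEQ_LEN = 8192         # assumed sequence length (hypothetical)
PER_GPU_BATCH = 2      # assumed sequences per GPU per step (hypothetical)

def steps_needed(num_gpus: int) -> float:
    """Optimizer steps required to consume the full token budget."""
    tokens_per_step = num_gpus * PER_GPU_BATCH * SEQ_LEN
    return TOKENS_TOTAL / tokens_per_step

for gpus in (32_000, 100_000):
    print(f"{gpus:>7,} GPUs -> {steps_needed(gpus):,.0f} steps")
# ~3x fewer GPUs -> ~3x more steps, and roughly 3x longer wall clock if the
# per-step time stays about the same.
```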
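And for the FLOPs/MFU bullets, a rough back-of-the-envelope in the same spirit: take the usual 6·N·D approximation for training FLOPs and divide by effective per-GPU throughput (peak × MFU) to get a GPU count for a target schedule. Every number below (active parameter count, tokens, peak TFLOP/s, MFU, training days) is an assumption picked for illustration (the peaks are roughly H100-class dense figures), not a number from any of these releases.

```python
# Back-of-the-envelope GPU-count estimate under assumed numbers.

ACTIVE_PARAMS = 300e9   # assumed active parameters (for MoE, use active, not total)
TOKENS = 30e12          # assumed training tokens
MFU = 0.40              # assumed model FLOPs utilization
TRAIN_DAYS = 90         # assumed wall-clock budget

# Standard approximation: training FLOPs ~= 6 * N * D
total_flops = 6 * ACTIVE_PARAMS * TOKENS

def gpus_needed(peak_tflops: float) -> float:
    """GPUs required to finish within TRAIN_DAYS at the assumed MFU."""
    effective_flops_per_gpu = peak_tflops * 1e12 * MFU
    gpu_seconds = total_flops / effective_flops_per_gpu
    return gpu_seconds / (TRAIN_DAYS * 86_400)

print(f"BF16 (~1000 TFLOP/s peak): {gpus_needed(1000):,.0f} GPUs")
print(f"FP8  (~2000 TFLOP/s peak): {gpus_needed(2000):,.0f} GPUs")
# Doubling the usable per-GPU throughput (FP8 vs BF16) roughly halves the GPU
# count for the same schedule; lower MFU, more tokens, or a bigger model
# pushes it back up.
```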