Comment by joaogui1

14 days ago

I don't want to hunt down the details for each of these releases, but:

* You can use fewer GPUs if you decrease the batch size and increase the number of steps, which would lead to a longer training time (see the first sketch after this list)

* FP8 is pretty efficient; if Grok was trained with BF16 then Llama 4 could need fewer GPUs because of that

* It also depends on the size of the model and the number of tokens used for training; it's unclear whether the total FLOPs for each model is the same (see the second sketch after this list)

* MFU (Model FLOPs Utilization) can also vary depending on the setup, which means that if you use better kernels and/or better sharding you can reduce the number of GPUs needed
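
To make the first bullet concrete, here's a minimal sketch (all numbers are made up, not from either release): with a fixed per-GPU batch size and a fixed token budget, halving the GPU count halves the global batch, so you need twice the steps, and since each GPU does the same work per step, wall-clock time roughly doubles.

```python
# Minimal sketch of the batch-size / steps tradeoff. Made-up numbers.
def steps_needed(n_gpus, per_gpu_batch, seq_len, token_budget):
    """Steps required to consume `token_budget` tokens at a given global batch."""
    tokens_per_step = n_gpus * per_gpu_batch * seq_len
    return token_budget / tokens_per_step

full = steps_needed(n_gpus=16384, per_gpu_batch=2, seq_len=8192, token_budget=15e12)
half = steps_needed(n_gpus=8192,  per_gpu_batch=2, seq_len=8192, token_budget=15e12)

print(f"16384 GPUs: {full:,.0f} steps")
# Per-step time is roughly unchanged (each GPU does the same work per step),
# so 2x the steps means roughly 2x the wall-clock time.
print(f" 8192 GPUs: {half:,.0f} steps (~{half / full:.0f}x the steps and wall-clock time)")
```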
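
And a back-of-envelope sketch tying the last three bullets together, using the standard ~6 × params × tokens estimate for dense-transformer training FLOPs. The model size, token count, MFU values, and H100 peak-throughput numbers are all assumptions for illustration, not reported figures for Grok or Llama 4.

```python
# Back-of-envelope GPU count from total FLOPs, peak throughput, MFU, and a
# time budget. All concrete numbers below are assumptions for illustration.

def gpus_needed(params, tokens, peak_flops_per_gpu, mfu, training_days):
    """Rough GPU count to finish training within `training_days`."""
    total_flops = 6 * params * tokens              # ~6ND for a dense transformer
    effective_flops = peak_flops_per_gpu * mfu     # what each GPU actually sustains
    seconds = training_days * 24 * 3600
    return total_flops / (effective_flops * seconds)

# Hypothetical 400B-parameter dense model trained on 15T tokens.
PARAMS, TOKENS = 400e9, 15e12

# Approximate H100 SXM dense peak throughput (spec-sheet ballpark): FP8 is ~2x BF16.
BF16_PEAK = 990e12
FP8_PEAK = 1979e12

print(f"BF16, 40% MFU, 90 days: {gpus_needed(PARAMS, TOKENS, BF16_PEAK, 0.40, 90):,.0f} GPUs")
print(f"FP8,  40% MFU, 90 days: {gpus_needed(PARAMS, TOKENS, FP8_PEAK, 0.40, 90):,.0f} GPUs")
print(f"BF16, 50% MFU, 90 days: {gpus_needed(PARAMS, TOKENS, BF16_PEAK, 0.50, 90):,.0f} GPUs")
```

So switching precision, squeezing out better MFU, or just accepting a longer schedule all move the required GPU count, which is why the headline numbers aren't directly comparable.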