Comment by joaogui1
14 days ago
I don't want to hunt down the details on each of these releases, but
* You can use fewer GPUs if you decrease the batch size and increase the number of steps, which would lead to a longer training time
* FP8 is pretty efficient; if Grok was trained with BF16, then Llama 4 could need fewer GPUs because of that
* It also depends on the size of the model and the number of tokens used for training; it's unclear whether the total FLOPs for each model is the same
* MFU (Model FLOPs Utilization) can also vary depending on the setup, which means that if you use better kernels and/or better sharding you can reduce the number of GPUs needed (see the back-of-the-envelope sketch below)
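
To make the arithmetic concrete, here's a rough sketch with my own illustrative numbers (not figures from either release): it uses the standard ~6·N·D FLOPs estimate for dense transformers, an assumed 40% MFU, and approximate H100-class dense peak throughputs for FP8 vs BF16.

```python
# Back-of-the-envelope GPU count estimate (illustrative assumptions, not release figures).

def gpus_needed(params, tokens, peak_flops_per_gpu, mfu, training_days):
    """Estimate how many GPUs are needed to finish training in `training_days`."""
    total_flops = 6 * params * tokens                   # ~6*N*D rule of thumb for dense transformers
    seconds = training_days * 24 * 3600
    effective_flops_per_gpu = peak_flops_per_gpu * mfu  # MFU discounts the theoretical peak
    return total_flops / (effective_flops_per_gpu * seconds)

# Hypothetical run: 400B params, 15T tokens, 90-day schedule, 40% MFU.
# Approximate H100 dense peaks: ~1e15 FLOP/s in FP8, ~0.5e15 in BF16.
for label, peak in [("FP8", 1.0e15), ("BF16", 0.5e15)]:
    print(f"{label}: ~{gpus_needed(400e9, 15e12, peak, mfu=0.4, training_days=90):,.0f} GPUs")
```

Under these assumptions, halving the per-GPU throughput (BF16 vs FP8) roughly doubles the GPU count for the same wall-clock time, and the same lever works in reverse: accept a longer schedule or achieve a higher MFU and the number of GPUs drops.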