Comment by rfoo

14 days ago

That's about the same number for DeepSeek-V3. If you count in fp8, MFU is about 20%. MoEs are hard.

That could also be why they did fp8. If we use the theoretical bf16 peak as the baseline (I know this makes little sense, but it's convenient for comparing with previous training runs), that's about 40% MFU, not too bad.
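The arithmetic behind that, as a quick sketch: fp8 peak throughput is roughly 2x the bf16 peak on the same hardware, so the same achieved FLOP rate doubles when you divide by the bf16 peak instead. The H800 peak numbers below are approximate public specs I'm assuming, not figures from the DeepSeek report.

```python
# MFU = achieved FLOP rate / theoretical peak FLOP rate.
# Assumed H800 dense peaks (approximate public specs):
PEAK_FP8_TFLOPS = 1979.0
PEAK_BF16_TFLOPS = 990.0

# Suppose the run achieves ~20% MFU against the fp8 peak:
achieved_tflops = 0.20 * PEAK_FP8_TFLOPS

mfu_vs_fp8 = achieved_tflops / PEAK_FP8_TFLOPS    # 20%
mfu_vs_bf16 = achieved_tflops / PEAK_BF16_TFLOPS  # ~40%, since fp8 peak ~= 2x bf16 peak

print(f"MFU vs fp8 peak:  {mfu_vs_fp8:.0%}")
print(f"MFU vs bf16 peak: {mfu_vs_bf16:.0%}")
```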

IOW, MoE kills training MFU, and they had to do fp8 to make the numbers not look bad. Both DeepSeek and Meta GenAI.