Comment by rfoo

14 days ago

That's about the same number for DeepSeek-V3. If you count in fp8, MFU is about 20%. MoEs are hard.

That could also be why they did fp8. If we use the theoretical bf16 peak as the baseline (I know this makes little sense, but it's convenient for comparing with previous training runs), that's about 40% MFU, not too bad.
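The arithmetic behind that, as a quick sketch: fp8 peak throughput is roughly 2x the bf16 peak on the same hardware, so the same achieved FLOP rate doubles when you divide by the bf16 peak instead. The H800 peak numbers below are approximate public specs I'm assuming, not figures from the DeepSeek report.

```python
# MFU = achieved FLOP rate / theoretical peak FLOP rate.
# Assumed H800 dense peaks (approximate public specs):
PEAK_FP8_TFLOPS = 1979.0
PEAK_BF16_TFLOPS = 990.0

# Suppose the run achieves ~20% MFU against the fp8 peak:
achieved_tflops = 0.20 * PEAK_FP8_TFLOPS

mfu_vs_fp8 = achieved_tflops / PEAK_FP8_TFLOPS    # 20%
mfu_vs_bf16 = achieved_tflops / PEAK_BF16_TFLOPS  # ~40%, since fp8 peak ~= 2x bf16 peak

print(f"MFU vs fp8 peak:  {mfu_vs_fp8:.0%}")
print(f"MFU vs bf16 peak: {mfu_vs_bf16:.0%}")
```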

IOW, MoE kills training MFU, and they had to do fp8 to make the numbers not look bad. Both DeepSeek and Meta GenAI.