Comment by benreesman

15 hours ago

NVFP4 is the thing no one saw coming. I wasn't really watching the MX process, so I pass no judgment, but it's exactly what it sounds like: a serious compromise for resource-constrained settings. And it's in the silicon pipeline.

NVFP4 is, to put it mildly, a masterpiece: the UTF-8 of its domain, and in strikingly similar ways it is 1. general, 2. robust to gross misuse, and 3. not optional if success and cost both matter.
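For anyone who hasn't looked at the format: NVFP4 stores values as FP4 (E2M1) with one FP8 (E4M3) scale per 16-element block, plus a per-tensor FP32 scale. A minimal NumPy sketch of that block quantization — my own illustration, not NVIDIA's kernels, and it skips rounding the scale itself to E4M3:

```python
import numpy as np

# The 8 non-negative magnitudes representable in FP4 (E2M1).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_block_nvfp4(block):
    """Quantize one 16-element block to FP4 with a per-block scale.

    Simplification: the scale is kept exact here; real NVFP4 rounds
    it to E4M3 and applies a second-level per-tensor FP32 scale.
    """
    amax = np.abs(block).max()
    scale = amax / 6.0 if amax > 0 else 1.0  # map the block max onto the top FP4 code
    scaled = block / scale
    # Round each magnitude to the nearest FP4 grid point.
    idx = np.abs(FP4_GRID[None, :] - np.abs(scaled)[:, None]).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

np.random.seed(0)
x = np.random.randn(16).astype(np.float32)
xq = quantize_block_nvfp4(x)
```

Because the block max lands exactly on the top code, the worst-case absolute error per element is bounded by amax/6 (half the widest grid gap, times the scale) — which is why per-16-element scaling degrades so gracefully.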

It's not a gap that can be closed by a process node or an architecture tweak: it's an order of magnitude where the polynomials that were killing you on the way up are now working for you.

sm_120 (what NVIDIA's quiet repos call CTA1) consumer gear does softmax attention and projection/MLP block-scaled GEMM at a bit over a petaflop at 300W, and close to two (dense) at 600W.

This changes the whole game, and it's not clear anyone outside the lab even knows the new equilibrium points. It's nothing like Flash3 on Hopper: a lot of stuff looks FLOPs-bound, and GDDR7 looks like a better deal than HBM3e. The DGX Spark is in no way deficient; it has ample memory bandwidth.

This has been in the pipe for something like five years, and even if everyone else had started at the beginning of the year, when this became knowable, it would still be 12-18 months until tape-out. And they haven't started.

Years Until Anyone Can Compete With NVIDIA is back up to the 2-5 it was 2-5 years ago.

This was supposed to be the year ROCm and the new Intel stuff became viable.

They had a plan.

This reads like a badly done, sponsored hype video on YouTube.

So if we look at what NVIDIA has to say about NVFP4, it sure sounds impressive [1]. But look closely: that initial graph never compares FP8 and FP4 on the same hardware. They jump from H100 to B200 while implying a 5x gain from going to FP4, which it isn't. This comes accompanied by scary words, like the warning that MXFP4 carries a "Risk of noticeable accuracy drop compared to FP8".

Contrast that with what AMD has to say about the open MXFP4 approach, which is quite similar to NVFP4 [2]. Oh, the horrors of getting 79.6 instead of 79.9 on GPQA Diamond when using MXFP4 instead of FP8.
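The public specs put the structural difference mostly in the scale: MXFP4 uses one power-of-two (E8M0) scale per 32 elements, while NVFP4 uses a finer FP8 (E4M3) scale per 16. A rough sketch of that difference — my own comparison, using an exact per-block ratio as a stand-in for the E4M3 scale (an assumption; the real format rounds the scale too):

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def round_to_fp4(x, scale):
    """Round values to the nearest FP4 code under a given scale."""
    s = x / scale
    idx = np.abs(FP4_GRID[None, :] - np.abs(s)[:, None]).argmin(axis=1)
    return np.sign(s) * FP4_GRID[idx] * scale

np.random.seed(0)
block = np.random.randn(32)
amax = np.abs(block).max()

# MXFP4-style: one power-of-two (E8M0) scale for all 32 elements.
mx_scale = 2.0 ** np.ceil(np.log2(amax / 6.0))
mx_err = np.abs(round_to_fp4(block, mx_scale) - block).max()

# NVFP4-style: a finer scale per 16-element half (exact ratio here;
# real NVFP4 rounds it to E4M3, which is far finer than powers of two).
nv_err = 0.0
for half in block.reshape(2, 16):
    s = np.abs(half).max() / 6.0
    nv_err = max(nv_err, float(np.abs(round_to_fp4(half, s) - half).max()))
```

The power-of-two scale can overshoot the block max by up to ~2x, wasting dynamic range, while the per-16 fractional scale pins each block's max onto the top code — which is roughly why the accuracy deltas in both vendors' write-ups are small but consistently favor the finer scaling.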

[1] https://developer.nvidia.com/blog/introducing-nvfp4-for-effi...

[2] https://rocm.blogs.amd.com/software-tools-optimization/mxfp4...