Comment by xpct
7 hours ago
Thank you for sharing the link. It's fascinating that the models score only 10-20% on the hard subset, and I wonder why that is. The fact that they get only 30-40% out of the fp8 GEMM seems unintuitive to me; I would've expected convergence near ~80%.
I'm not entirely up to date with the latest batch, but I've reviewed some of the rollouts in the past, and my sense is that the models are surprisingly good at getting custom kernels correct on the happy path but still weak on sustained or shape-robust workloads. Writing the full path from scratch, compounded by weird memory layouts, odd sizes, routing, unpacking quantized weights, etc., is definitely challenging.
Also, you could argue that at least a portion of this is arbitrary and an artifact of the eval itself. The fp8 GEMM score could be low simply because one of the shapes is fairly skinny (i.e. not enough math work to keep the compute engine busy for a meaningful amount of time).
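To make the skinny-shape point concrete, here is a rough roofline sketch. It is only one way to read the argument, and every number in it is an assumption: the peak throughput, the memory bandwidth, and the two shapes are placeholders I picked for illustration, not the eval's actual hardware or shapes. It also assumes 1-byte fp8 inputs, a 2-byte output, and that each matrix is read or written exactly once.

```python
def gemm_arithmetic_intensity(m, n, k, in_bytes=1, out_bytes=2):
    """FLOPs per byte moved for C[m,n] = A[m,k] @ B[k,n].

    Assumes fp8 inputs (1 byte each), a 2-byte output, and that every
    matrix crosses the memory bus exactly once (an idealized lower bound).
    """
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = in_bytes * (m * k + k * n) + out_bytes * m * n
    return flops / bytes_moved


def roofline_fraction(m, n, k, peak_flops, mem_bw):
    """Upper bound on the fraction of peak compute a shape can reach."""
    ai = gemm_arithmetic_intensity(m, n, k)
    attainable = min(peak_flops, ai * mem_bw)  # classic roofline model
    return attainable / peak_flops


# Hypothetical accelerator, NOT taken from the eval.
PEAK_FLOPS = 2000e12  # assumed fp8 peak, FLOP/s
MEM_BW = 3e12         # assumed memory bandwidth, bytes/s

# Hypothetical shapes: one square, one skinny (small M).
for name, (m, n, k) in {"square": (8192, 8192, 8192),
                        "skinny": (16, 8192, 8192)}.items():
    ai = gemm_arithmetic_intensity(m, n, k)
    frac = roofline_fraction(m, n, k, PEAK_FLOPS, MEM_BW)
    print(f"{name}: {ai:.1f} FLOP/byte, at most {frac:.1%} of peak")
```

Under these assumptions the square shape can in principle hit peak, while the skinny one is capped at a few percent of peak no matter how good the kernel is. If the eval averages a score over shapes like that, a low aggregate number would say more about the shape mix than about kernel quality.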