
Comment by benreesman

5 days ago

sm_120 (aka 1CTA) supports tensor cores and TMEM just fine: example 83 shows block-scaled NVFP4 (I've gotten 1850-ish dense TFLOPs at 600W; the 300W part caps out more like 1150). sage3 (which is no way in hell from China, myelin knows it by heart) cracks a petaflop in bidirectional noncausal attention.

The nvFuser code doesn't even distinguish sm_100 vs. sm_120: NVIDIA's internal nomenclature seems to be 2CTA/1CTA; it's a bin. So there are fewer MMA tilings in the released ISA as of 13.1 / r85 44.

The mnemonic `tcgen05.mma` doesn't mean anything on its own; it's lowered onto real SASS. FWIW, the people I know doing their own drivers say the whole ISA is there, but it doesn't matter.
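A quick way to see that for yourself is to compile a small kernel and inspect what it lowers to. A minimal sketch, assuming the CUDA toolkit is on your PATH and `kernel.cu` is a hypothetical test kernel of your own:

```shell
# Compile for the 1CTA (sm_120) bin and keep the cubin.
nvcc -arch=sm_120a -cubin -o kernel.cubin kernel.cu

# Dump the generated SASS and look for the tensor-core MMA opcodes the
# PTX-level mnemonic was lowered onto.
cuobjdump -sass kernel.cubin | grep -i mma
```

The PTX mnemonic never survives into SASS; what you see in the dump is what the hardware actually runs.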

The family of mnemonics that hits the "Jensen Keynote" path is roughly here: https://docs.nvidia.com/cuda/parallel-thread-execution/#warp....

The 10x path is hot today on Thor, Spark, 5090, 6000, and data center parts.

Getting it to trigger reliably on real tilings?

Well that's the game just now. :)

Edit: https://customer-1qh1li9jygphkssl.cloudflarestream.com/1795a...

Wait, so are you telling me all of the hardware/ISA is actually fully accessible and functional, and it's just an artificial PTX -> SASS compiler limitation?

Because the official NVidia stance is definitely that TMEM, etc. is not supported and doesn't work.

...I don't suppose you have a link to a repo with code that can trigger any of this officially forbidden functionality?

  • I'm telling you it works now. It's just not called `tcgen05`.

    Put this in nsight compute: https://github.com/NVIDIA/cutlass/blob/main/examples/79_blac...

    (I said 83, it's 79).
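
    Since that example is what you'd feed Nsight Compute, a minimal sketch of the invocation (the binary name here is a guess; check your CUTLASS build tree for the actual example 79 target):

    ```shell
    # Profile the block-scaled GEMM example under Nsight Compute with the
    # full metric set; the tensor-pipe utilization sections show whether
    # the fast MMA path was actually hit. Binary name is hypothetical.
    ncu --set full ./79_blackwell_geforce_gemm
    ```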

    If you want to know what NVIDIA really thinks, watch this repo: https://github.com/nVIDIA/fuser. The Polyhedral Wizards at play. All the big not-quite-Fields players are splashing around there. I'm doing Lean 4 proofs of a bunch of their stuff. https://v0-straylight-papers-touchups.vercel.app

    It works now. It's just not the PTX mnemonic that you want to see.

    • Very interesting! Thanks! I'll definitely keep a close eye on that repo.

      Anyhow, be that as it may, I was talking about the PTX mnemonics and such because I'd like to use this functionality from my own, custom kernels, and not necessarily only indirectly by triggering whatever lies at the bottom of NVidia's abstraction stack.

      So what's your endgame with your proofs? You wrote "the breaking point was implementing an NVFP4 matmul" - so do you actually intend to implement an NVFP4 matmul? (: If so I'd be very much interested; personally I'm definitely still in the "cargo-cults from CUTLASS examples" camp, but would love something more principled.