Comment by mmaunder
5 days ago
For those of you wondering if this fits your use case vs the RTX 5090 the short answer is this:
The desktop RTX 5090 has 1,792 GB/s of memory bandwidth, thanks in part to its 512-bit bus, compared to the DGX Spark's 256-bit bus and 273 GB/s of memory bandwidth.
The RTX 5090 has 32 GB of VRAM vs the 128 GB of “VRAM” in the DGX Spark, which is really unified memory shared with the CPU.
Also, the RTX 5090 has 21,760 CUDA cores vs 6,144 in the DGX Spark (about 3.5x as many), and with the much higher bandwidth in the 5090 you have a better shot at keeping them fed. So for embarrassingly parallel workloads the 5090 crushes the Spark.
So if you need to fit big models into VRAM and don’t care too much about speed because you are, for example, building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer.
If you need speed and 32 GB of VRAM is plenty, and you don’t need to model the network interconnect you’ll have in production, then the RTX 5090 is what you want.
> building something on your desktop that’ll run on data center hardware in production, the DGX Spark is your answer
It isn't, because it's a different architecture from the datacenter hardware. They're both called "Blackwell", but that's a lie[1], and you still need a "real" datacenter Blackwell card for development work. (For example, you can't configure/tune vLLM on Spark, then move it to a B200, and expect it to just work.)
[1] -- https://github.com/NVIDIA/dgx-spark-playbooks/issues/22
sm_120 (aka 1CTA) supports tensor cores and TMEM just fine: example 83 shows block-scaled NVFP4 (I've gotten ~1,850 dense TFLOPs at 600 W; the 300 W part caps out more like 1,150). sage3 (which is no way in hell from China; myelin knows it by heart) cracks a petaflop in bidirectional noncausal.
The nvfuser code doesn't even call it sm_100 vs. sm_120: NVIDIA's internal nomenclature seems to be 2CTA/1CTA; it's a bin. So there are fewer MMA tilings in the released ISA as of 13.1 / r85 44.
The mnemonic tcgen05.mma doesn't mean anything by itself; it's lowered onto real SASS. FWIW, the people I know doing their own drivers say the whole ISA is there, but it doesn't matter.
The family of mnemonics that hits the "Jensen Keynote" path is roughly here: https://docs.nvidia.com/cuda/parallel-thread-execution/#warp....
The 10x path is hot today on Thor, Spark, the 5090, the 6000, and data center parts.
Getting it to trigger reliably on real tilings?
Well, that's the game just now. :)
Edit: https://customer-1qh1li9jygphkssl.cloudflarestream.com/1795a...
Wait, so are you telling me all of the hardware/ISA is actually fully accessible and functional, and it's just an artificial PTX -> SASS compiler limitation?
Because the official NVIDIA stance is definitely that TMEM, etc. is not supported and doesn't work.
...I don't suppose you have a link to a repo with code that can trigger any of this officially forbidden functionality?
Note that sm_110 (Jetson Thor) has the tcgen05 ISA exposed (with TMEM and all) instead of the sm_120 model.
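To keep the naming straight, here's the arch picture as claimed in this thread, condensed into a lookup table (my unofficial summary of the comments above, not NVIDIA documentation; the sm_100/2CTA pairing in particular is my reading of the thread, and `tcgen05_in_ptx` reflects what the released PTX path exposes, not what the silicon can do):

```python
# Compute-capability bins as claimed in this thread (unofficial summary,
# not NVIDIA documentation).
ARCH_BINS = {
    "sm_100": {"examples": ["B200 (data center)"], "bin": "2CTA", "tcgen05_in_ptx": True},
    "sm_110": {"examples": ["Jetson Thor"], "bin": "1CTA", "tcgen05_in_ptx": True},
    "sm_120": {"examples": ["DGX Spark (GB10)", "RTX 5090"], "bin": "1CTA", "tcgen05_in_ptx": False},
}

def same_bin(a: str, b: str) -> bool:
    """Per the thread, 2CTA vs. 1CTA is the bin NVIDIA actually keys on,
    not the sm_1x0 names."""
    return ARCH_BINS[a]["bin"] == ARCH_BINS[b]["bin"]
```

For example, `same_bin("sm_110", "sm_120")` is `True`: Thor and Spark sit in the same 1CTA bin, yet per the comment above only sm_110 gets the tcgen05 ISA exposed in released PTX.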
It's also worth noting that the 128 GB of "VRAM" in the GB10 is even less straightforward than just being memory shared with the CPU cores. There are a lot of details in memory performance that differ across both the different core types and the two core clusters:
https://chipsandcheese.com/p/inside-nvidia-gb10s-memory-subs...