Comment by melodyogonna

10 hours ago

> Why should they? CUDA is a GPGPU paradigm, AMD/Apple/Intel all ship diverse raster-focused hardware, and TPUs are a systolic array. How much can you realistically expect to abstract with unified primitives?

Ah, it seems impossible to you. These are very different kinds of hardware... It's hard enough to maintain compatibility across different hardware from the same vendor; it's very difficult to imagine building primitives for hardware with completely different memory layouts.

> How much performance do you perceive to be left on the table with native CUDA-based implementations?

Zero, ideally. And I wasn't saying there should be a native CUDA-based implementation; I'm asking you to imagine how much easier everything would be if CUDA were cross-platform without any performance or ergonomic penalties.

Mojo is a foundational step here. The big HOW is powerful parametric programming: so much information can be passed at compile time, which the compiler then uses to specialize code for the target hardware.

> Ah, it seems impossible to you. These are very different kinds of hardware...

In effect, they are completely different hardware. The only thing any of them have in common is rasterization primitives, so unless you're focused on render workloads you're often better off software-accelerating the language on CPUs instead. As a point of comparison, look at early ray-tracing implementations on GPUs with no dedicated RT blocks or hardware-accelerated denoising: they were often slower than running the same thing in software on a cheaper CPU.

> I'm asking you to imagine how much easier everything would have been if Cuda was cross-platform

...you are aware of what CUDA actually does, right? Mojo is not a cross-platform version of it. At the risk of repeating myself, creating an OpenCL-style library without a Khronos-style consortium does not address the problem. CUDA is a hardware solution; you need to define standards around GPGPU programming, because that's what Nvidia does internally. Ignoring industry-wide standardization is the express bet that Nvidia and Google are already making, and the one you make by investing in Mojo. Without any hardware stakeholders, Mojo's only opportunity is to become DirectML/ONNX-style middleware, as @pjmlp suggested. Spoiler: that's not a super disruptive or successful goal, and certainly not an LLVM-scale opportunity.

> So much information could be passed during compile time which the compiler uses to specialize.

Like what? None of these are rhetorical questions: explain to me what information would help bring other GPUs and TPUs up to parity with CUDA. What information are you thinking of that isn't currently expressed in SPIR-V and MLIR? What optimizations do you have in mind?