
Comment by rl3

12 hours ago

>What would be cool is the dream that's been building for decades about parallel computing abstractions where you write what looks like normal single-threaded CPU code, but it automagically works on SIMD instructions or GPU.

I've had that same dream at various points over the years, and prior to AI my conclusion was that it was untenable barring a very large, world-class engineering team with truckloads of money.

I'm guessing a much smaller (but obviously still world-class!) team now has a shot at it, and if that is indeed what they're going for, then I could understand them perhaps being a bit coy.

It's one heck of a crazy hard problem to tackle. It really depends on what levels of abstraction are targeted, in addition to how much one cares about existing languages and supporting infra.

It's really nice to see a Rust-only shop, though.

Edit: Turns out it helps to RTFA in its entirety:

>>Our approach differs in two key ways. First, we target Rust's std directly rather than introducing a new GPU-specific API surface. This preserves source compatibility with existing Rust code and libraries. Second, we treat host mediation as an implementation detail behind std, not as a visible programming model.

>>In that sense, this work is less about inventing a new GPU runtime and more about extending Rust's existing abstraction boundary to span heterogeneous systems.

That last sentence is interesting in combination with this:

>>Technologies such as NVIDIA's GPUDirect Storage, GPUDirect RDMA, and ConnectX make it possible for GPUs to interact with disks and networks more directly in the datacenter.

Perhaps their modified std could enable distributed compute just by virtue of running on the GPU, so long as the GPU hardware topology supports it.

Exciting times if some of the hardware and software infra largely intended for disaggregated inference ends up as a runtime for [compiled] code originally intended for the CPU.

There was a library for Rust called “faster” which worked similarly to Rayon, but for SIMD.

The simpleminded way to do what you’re saying would be to have the compiler create separate PTX and native versions of a Rayon structure, and then choose which to invoke at runtime.
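A rough sketch of what that runtime choice could look like, std-only so it compiles without any crates. Everything here is hypothetical: `Backend`, `gpu_available`, `pick_backend`, and the `1 << 16` size threshold are made-up names and numbers, and `scale_gpu` is only a placeholder standing in for launching a PTX-compiled kernel. It is not how Rayon or the article's toolchain actually work.

```rust
/// Which compiled version of the operation gets invoked.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Backend {
    Native,
    Gpu,
}

/// Hypothetical device probe. A real build would ask the driver;
/// this stub always reports that no GPU is present.
fn gpu_available() -> bool {
    false
}

/// Pick a backend at runtime: use the GPU only if one exists and the
/// input is large enough to be worth the transfer (threshold is an assumption).
fn pick_backend(len: usize, gpu_threshold: usize) -> Backend {
    if gpu_available() && len >= gpu_threshold {
        Backend::Gpu
    } else {
        Backend::Native
    }
}

/// Native version: a plain loop the compiler is free to auto-vectorize.
fn scale_native(data: &mut [f32], factor: f32) {
    for x in data.iter_mut() {
        *x *= factor;
    }
}

/// Placeholder for the PTX version. In this sketch it just reuses the
/// native code so both paths produce the same result.
fn scale_gpu(data: &mut [f32], factor: f32) {
    scale_native(data, factor);
}

/// Single entry point the caller sees; returns which backend ran.
fn scale(data: &mut [f32], factor: f32) -> Backend {
    let backend = pick_backend(data.len(), 1 << 16);
    match backend {
        Backend::Native => scale_native(data, factor),
        Backend::Gpu => scale_gpu(data, factor),
    }
    backend
}
```

Calling `scale(&mut v, 2.0)` doubles every element in place and tells you which path was taken. The hard part a compiler would have to solve is generating the two bodies from one source, which this sketch sidesteps by writing both by hand.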