Comment by kevmo314

3 days ago

Isn't this turning a GPU into a slower CPU? It's not like CPUs are slow; in fact, they're quite a bit faster than any single GPU thread. If code is written in a GPU-unaware way, it's not going to take advantage of the reasons for being on the GPU in the first place.

We have this issue in GFQL right now. We wrote the first OSS GPU Cypher query language implementation, where we make a query plan of GPU-friendly collective operations... But today those steps are coordinated via Python, which has high constant overheads.

We are looking to shed some of the Python<->C++<->GPU overhead by pushing macro steps out of Python and into C++. However, it'd probably be way better to skip all the CPU<->GPU back-and-forth by coordinating the task queue on the GPU to begin with. It's 2026, so ideally we can use modern tools and type safety for this.
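
A rough sketch of what GPU-side coordination could look like: a persistent kernel drains a device-side queue of macro steps so that step-to-step scheduling never round-trips through the host. This is illustrative only; `Task`, `queue_head`, and `run_step` are hypothetical names, not GFQL's actual API.

```cuda
// Hypothetical sketch: a persistent kernel pulls "macro steps" from a
// device-resident queue, so coordination never bounces back to the CPU.
#include <cuda_runtime.h>

struct Task { int op; int arg; };   // illustrative task descriptor

__device__ int queue_head = 0;      // shared cursor into the task array

__global__ void scheduler(Task* tasks, int n_tasks) {
    // A single block acts as the coordinator here; a real design would
    // likely use cooperative groups or several blocks with work stealing.
    while (true) {
        int i = atomicAdd(&queue_head, 1);
        if (i >= n_tasks) return;            // queue drained, kernel exits
        Task t = tasks[i];
        // Dispatch the collective op entirely on-device, e.g. via
        // device-side launch (CUDA dynamic parallelism):
        // run_step<<<grid, block>>>(t.op, t.arg);
    }
}
```

The design point is that the queue cursor lives in device memory, so only the initial task list and the final results ever cross the PCIe bus.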

Note: I looked at the company's GitHub and didn't see any relevant OSS, which changes the calculus for a team like ours. Sustainable infra is hard!

> It's not like CPUs are slow, in fact they're quite a bit faster than any single GPU thread.

This was overwhelmingly true ten years ago, not so much now.

Modern GPU threads run at about 3 GHz; CPUs are still slightly faster in theory, but the larger amount of fast local memory makes GPU threads pretty competitive in practice.

  • Are you writing this from the future? The latest-gen Nvidia GPUs sit at around 2-2.5 GHz and the latest-gen AMD CPUs sit at 4-5 GHz.

    That matches my personal experience too: naive CUDA code that doesn't take advantage of parallelism runs at roughly half the speed of the same code on a CPU.

I've seen this objection pop up every single time and I still don't get it.

GPUs run 32, 64, or even 128 vector lanes at once. If you have a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence, etc., how is it supposed to be slower?

Consider the following:

You have a hyper-optimized matrix multiplication kernel, and you also have your inference engine code that previously ran on the CPU. You now port the critical inference engine code to run directly on the GPU, thereby implementing paged attention and prefix caching while avoiding data transfers, context switches, etc. You still call into your optimized GPU kernels.

Where is the magical slowdown supposed to come from? The mega kernel researchers are moving more and more code to the GPU and they got more performance out of it.

Is it really that hard to understand that the CUDA-style programming model is inherently inflexible and limiting? I think the fundamental problem here is that Nvidia marketing gave an incredibly misleading picture of how the hardware actually works. GPUs don't have thousands of cores like the "CUDA core" marketing suggests; they have around a hundred "barrel CPU"-like cores.

The RTX 5090 is advertised as having 21,760 CUDA cores. That is a meaningless number in practice, since "CUDA cores" are purely a software concept that doesn't exist in hardware; the vector processing lanes are not cores. What the RTX 5090 actually has is 170 streaming multiprocessors, each with its own instruction pointer that you can target independently, just like a CPU. The key restriction is that for maximum performance you need to use all 128 lanes, and you also need enough thread copies, differing only in the subset of data they process, so that the GPU can switch between them while it is waiting on multi-cycle instructions (memory loads and the like). That's it.
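
The arithmetic is easy to check from the host side: the runtime reports the SM count directly, and the advertised "core" figure is just SMs times FP32 lanes. A minimal sketch (the 128-lanes-per-SM figure is the one cited above, not something the API reports):

```cuda
// Host-side check of the "cores" arithmetic: advertised CUDA-core counts
// are just SMs x lanes. On an RTX 5090 this should report 170 SMs.
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp p;
    cudaGetDeviceProperties(&p, 0);
    // p.multiProcessorCount = independent SMs, each with its own front end.
    // 128 FP32 lanes per SM is assumed here (the Blackwell figure above).
    printf("SMs: %d, warp size: %d, 'CUDA cores': %d\n",
           p.multiProcessorCount, p.warpSize, p.multiProcessorCount * 128);
    return 0;
}
```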

Here is what you can do: take a handful of streaming multiprocessors, let's say 8, and use them to run your management code on the GPU side without having to transfer data back to the CPU. When you want to do heavy lifting you are in luck, because you still have 162 streaming multiprocessors left to do whatever you want. You proceed to call into cuDNN and get great performance.
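
One way to sketch that split: camp a small persistent kernel on a high-priority stream so it effectively occupies a few SMs, and launch the heavy kernels (or cuDNN calls) on a second stream that gets the rest of the machine. `manage_loop` and `heavy_kernel` are hypothetical placeholders, and host-side polling of managed memory like this assumes a GPU with concurrent managed access.

```cuda
// Sketch of "8 SMs for management, the rest for math": a tiny persistent
// kernel coordinates on one stream while heavy work runs on another.
#include <cuda_runtime.h>

__global__ void manage_loop(volatile int* stop) {
    while (!*stop) { /* poll queues, schedule work, etc. */ }
}

int main() {
    int lo, hi;
    cudaDeviceGetStreamPriorityRange(&lo, &hi);
    cudaStream_t mgmt, work;
    cudaStreamCreateWithPriority(&mgmt, cudaStreamNonBlocking, hi);
    cudaStreamCreateWithPriority(&work, cudaStreamNonBlocking, lo);

    int* stop;
    cudaMallocManaged(&stop, sizeof(int));
    *stop = 0;
    manage_loop<<<8, 32, 0, mgmt>>>(stop);  // ~8 SMs stay busy coordinating
    // heavy_kernel<<<grid, block, 0, work>>>(...);  // uses the remaining SMs
    *stop = 1;                              // tell the management loop to exit
    cudaDeviceSynchronize();
    cudaFree(stop);
    return 0;
}
```

Note the residency claim is best-effort: stream priorities bias the scheduler rather than hard-partitioning SMs, so a production design might use occupancy queries or green contexts instead.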

  • > a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

    But the library is using a warp as a single thread

  • Each SM should have 4 independent SMSPs (32 lanes each), no? Effectively a "4-core" task-parallel system per SM.

    • SMSP = Streaming Multiprocessor Sub-Partition, found in recent Nvidia architectures. It effectively partitions each streaming multiprocessor into multiple complete sub-cores with separate register files and program counters, but sharing the same local memory. (AMD architectures have a similar development with "dual" compute units.) This creates overhead when running very large warps, since they only have access to a fraction of the complete SM. But warps under the VectorWare model should be fairly small (running CPU-like code with fairly limited use of lane parallelism), so from that POV it doesn't have much impact.

  • > a block of Rust threads that are properly programmed to take advantage of the vector processing by avoiding divergence

    Sure, if you have that then of course it would be fast. But that’s not what this library is proposing.

  • I really appreciate the way you've explained this. Are there any resources you recommend to reach your level of understanding?

Additionally, there is still too much performance left on the table by not properly using CPU vector units.

  • SIMD performance on modern Intel and AMD CPUs is so bad that it is useless outside very specific circumstances.

    This is mainly because vector instructions are implemented by sharing resources with other parts of the CPU, which more or less stalls pipelines, significantly reduces IPC, and makes out-of-order execution ineffective.

    The shared resources often involve floating-point registers and compute, so it's a double whammy.

    • Yet it is still faster than doing nothing, or than calling into the GPU, on workloads where bus traffic takes the majority of execution time.

      4 replies →