
Comment by shihab

11 hours ago

To the author (or anyone from the VectorWare team): can you please give me, admittedly a skeptic, a motivating example of a "GPU-native" application?

That is, where does it truly make a difference to dispatch the non-parallel parts, syscalls, etc. from the GPU to the CPU, instead of dispatching the parallel parts of the code from the CPU to the GPU?

From the "Announcing VectorWare" page:

> Even after opting in, the CPU is in control and orchestrates work on the GPU.

Isn't it better to let the CPU stay in control and orchestrate things, since GPUs have much smaller, dumber cores?

> Furthermore, if you look at the software kernels that run on the GPU they are simplistic with low cyclomatic complexity.

Again, there's an obvious reason why people don't put branch-y code on the GPU.

Genuinely curious what I'm missing.

Not OP, but I'm currently making a city-builder computer game with a large procedurally generated world. The terrain height at any point in the world is defined by a function that takes a small number of constant parameters plus a horizontal position in the world and returns the height of the terrain at that position.

I need the heights on the GPU so I can modify the terrain meshes to fit the terrain. I need the heights on the CPU so I can know when the player is clicking the terrain and where to place things.

Rather than generating a heightmap on the CPU and passing a large heightmap texture to the GPU, I have implemented identical height-generating functions in Rust (CPU) and WebGL (GPU). As you might imagine, it's very easy for these to diverge, so I have to maintain a large set of tests that verify the generated heights are identical between the two implementations.

Being able to write this implementation once and run it on both the CPU and the GPU would give me much better guarantees that the results will be the same. (Because of architecture differences and floating-point handling the results will never be perfectly identical, but I just need them to be within an acceptable tolerance.)
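A minimal sketch of the "write once, run on both" idea, shown in CUDA C++ rather than the Rust/WebGL pair described above (the formula, parameters, and names here are placeholders, not the game's real ones): a function marked `__host__ __device__` compiles for both the CPU and the GPU from a single definition, which is exactly the guarantee the duplicated implementations lack.

```cuda
#include <cmath>

// Placeholder height function, written once and compiled for both host and
// device. The formula and parameter names are illustrative only.
__host__ __device__ inline float terrain_height(float seed, float freq,
                                                float x, float z) {
    return sinf(x * freq + seed) * cosf(z * freq + seed) * 10.0f;
}

// Tolerance check for comparing a CPU-computed height against one read back
// from the GPU, since bit-exact equality across architectures isn't guaranteed.
__host__ __device__ inline bool heights_match(float a, float b, float tol) {
    return fabsf(a - b) <= tol;
}
```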

  • That's a good application but likely not one requiring a full standard library on the GPU? Procedurally generated data on GPU isn't uncommon AFAIK. It wasn't when I was dabbling in GPGPU stuff ~10 years ago.

    If you wrote it in OpenCL, or via Intel's libraries, or via Torch or ArrayFire or whatever, you could dispatch it to both the CPU and the GPU at will.

The killer app here is likely LLM inference loops. Currently you pay a PCIe latency penalty for every single token generated because the CPU has to handle the sampling and control logic. Moving that logic to the GPU and keeping the whole generation loop local avoids that round trip, which turns out to be a major bottleneck for interactive latency.
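A rough sketch of the loop being described, with a toy forward kernel and a greedy sampler standing in for a real model (the kernel, vocabulary size, and sampling logic are all placeholders): every generated token costs a host-to-device copy of the token, a kernel dispatch, and a device-to-host copy of the logits so the CPU can run the sampling and control logic.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder for the real model: fills `logits` for the current token.
__global__ void forward_step(const int* token, float* logits, int vocab) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < vocab) logits[i] = 0.0f;
}

// Placeholder sampler (argmax); real systems do top-k / top-p / grammar etc.
int sample_on_host(const std::vector<float>& logits) {
    int best = 0;
    for (int i = 1; i < (int)logits.size(); ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}

void generate(int vocab, int max_new_tokens) {
    int* d_token;
    float* d_logits;
    cudaMalloc(&d_token, sizeof(int));
    cudaMalloc(&d_logits, vocab * sizeof(float));
    std::vector<float> h_logits(vocab);
    int token = 0;

    for (int t = 0; t < max_new_tokens; ++t) {
        // Ship the previously sampled token back to the device...
        cudaMemcpy(d_token, &token, sizeof(int), cudaMemcpyHostToDevice);
        forward_step<<<(vocab + 255) / 256, 256>>>(d_token, d_logits, vocab);
        // ...and pull the logits across PCIe so the CPU can sample. This copy
        // also synchronizes, so the GPU sits idle until the next iteration.
        cudaMemcpy(h_logits.data(), d_logits, vocab * sizeof(float),
                   cudaMemcpyDeviceToHost);
        token = sample_on_host(h_logits);
    }

    cudaFree(d_token);
    cudaFree(d_logits);
}
```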

  • I don't know what the pros are doing, but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this; it's just some logic.

    • Raw CUDA works for the heavy lifting but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.

  • Turns out how? Where are the numbers?

    • It is less about the raw transfer speed and more about the synchronization and kernel-launch overheads. If you profile a standard inference loop with a batch size of 1, you see the GPU spending a lot of time idle, waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist (a minimal capture-and-replay sketch follows at the end of this thread), but moving the control flow entirely to the device is the cleaner solution.

      2 replies →
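For reference, a minimal sketch of the capture-and-replay pattern mentioned above, using the CUDA graph API (the `step_a`/`step_b` kernels, sizes, and token count are placeholders; the instantiate call uses the pre-CUDA-12 signature): the per-token sequence of kernel launches is captured once and then replayed with a single `cudaGraphLaunch` per token, cutting the CPU-side launch overhead, though the sampling round trip itself only disappears if that logic also moves on-device.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real per-token work.
__global__ void step_a(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void step_b(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, max_new_tokens = 128;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the per-token launch sequence once...
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step_a<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    step_b<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // CUDA 12 takes a flags arg instead

    // ...then replay it with one cheap launch per token instead of one
    // CPU-side dispatch per kernel.
    for (int t = 0; t < max_new_tokens; ++t) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```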