
Comment by shihab

11 hours ago

To the author (or anyone from the VectorWare team): can you please give me, admittedly a skeptic, a motivating example of a "GPU-native" application?

That is, where does it truly make a difference to dispatch the non-parallel parts, syscalls, etc. from the GPU to the CPU, instead of dispatching the parallel parts of the code from the CPU to the GPU?

From the "Announcing VectorWare" page:

> Even after opting in, the CPU is in control and orchestrates work on the GPU.

Isn't it better to let the CPU stay in control and orchestrate things, since GPUs have much smaller, dumber cores?

> Furthermore, if you look at the software kernels that run on the GPU they are simplistic with low cyclomatic complexity.

Again, there's an obvious reason why people don't put branch-y code on the GPU.

Genuinely curious what I'm missing.

Not OP, but I'm currently making a city-builder computer game with a large procedurally generated world. The terrain height at any point in the world is defined by a function that takes a small number of constant parameters plus a horizontal position in the world and returns the height of the terrain at that position.

I need the heights on the GPU so I can modify the terrain meshes to fit the terrain. I need the heights on the CPU so I can know when the player is clicking the terrain and where to place things.

Rather than generating a heightmap on the CPU and passing a large heightmap texture to the GPU, I have implemented identical height-generating functions in Rust (CPU) and WebGL (GPU). As you might imagine, it's very easy for these to diverge, so I have to maintain a large set of tests that verify the generated heights are identical between the two implementations.

Being able to write this implementation once and run it on both the CPU and the GPU would give me much better guarantees that the results will be the same. (Because of architecture differences and floating-point handling the results will never be perfectly identical, but I just need them to be within an acceptable tolerance.)
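A minimal sketch of the "write once, run on both" idea, shown in CUDA C++ rather than the Rust/WebGL pair described above (the formula, parameters, and names here are placeholders, not the game's real ones): a function marked `__host__ __device__` compiles for both the CPU and the GPU from a single definition, which is exactly the guarantee the duplicated implementations lack.

```cuda
#include <cmath>

// Placeholder height function, written once and compiled for both host and
// device. The formula and parameter names are illustrative only.
__host__ __device__ inline float terrain_height(float seed, float freq,
                                                float x, float z) {
    return sinf(x * freq + seed) * cosf(z * freq + seed) * 10.0f;
}

// Tolerance check for comparing a CPU-computed height against one read back
// from the GPU, since bit-exact equality across architectures isn't guaranteed.
__host__ __device__ inline bool heights_match(float a, float b, float tol) {
    return fabsf(a - b) <= tol;
}
```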

  • That's a good application but likely not one requiring a full standard library on the GPU? Procedurally generated data on GPU isn't uncommon AFAIK. It wasn't when I was dabbling in GPGPU stuff ~10 years ago.

    If you wrote it in OpenCL, or via Intel's libraries, or via Torch or ArrayFire or whatever, you could dispatch it to both the CPU and the GPU at will.

The killer app here is likely LLM inference loops. Currently you pay a PCIe latency penalty for every single token generated because the CPU has to handle the sampling and control logic. Moving that logic to the GPU and keeping the whole generation loop local avoids that round trip, which turns out to be a major bottleneck for interactive latency.
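A rough sketch of the loop being described, with a toy forward kernel and a greedy sampler standing in for a real model (the kernel, vocabulary size, and sampling logic are all placeholders): every generated token costs a host-to-device copy of the token, a kernel dispatch, and a device-to-host copy of the logits so the CPU can run the sampling and control logic.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Placeholder for the real model: fills `logits` for the current token.
__global__ void forward_step(const int* token, float* logits, int vocab) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < vocab) logits[i] = 0.0f;
}

// Placeholder sampler (argmax); real systems do top-k / top-p / grammar etc.
int sample_on_host(const std::vector<float>& logits) {
    int best = 0;
    for (int i = 1; i < (int)logits.size(); ++i)
        if (logits[i] > logits[best]) best = i;
    return best;
}

void generate(int vocab, int max_new_tokens) {
    int* d_token;
    float* d_logits;
    cudaMalloc(&d_token, sizeof(int));
    cudaMalloc(&d_logits, vocab * sizeof(float));
    std::vector<float> h_logits(vocab);
    int token = 0;

    for (int t = 0; t < max_new_tokens; ++t) {
        // Ship the previously sampled token back to the device...
        cudaMemcpy(d_token, &token, sizeof(int), cudaMemcpyHostToDevice);
        forward_step<<<(vocab + 255) / 256, 256>>>(d_token, d_logits, vocab);
        // ...and pull the logits across PCIe so the CPU can sample. This copy
        // also synchronizes, so the GPU sits idle until the next iteration.
        cudaMemcpy(h_logits.data(), d_logits, vocab * sizeof(float),
                   cudaMemcpyDeviceToHost);
        token = sample_on_host(h_logits);
    }

    cudaFree(d_token);
    cudaFree(d_logits);
}
```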

  • I don't know what the pros are doing, but I'd be a bit shocked if it isn't already done this way in real production systems. And it doesn't feel like porting the standard library is necessary for this; it's just some logic.

    • Raw CUDA works for the heavy lifting but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference and having standard library abstractions seems pretty important to keep that logic from becoming unmaintainable.

  • Turns out how? Where are the numbers?

    • It is less about the raw transfer speed and more about the synchronization and kernel-launch overheads. If you profile a standard inference loop with a batch size of 1, you see the GPU spending a lot of time idle, waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist (a minimal capture-and-replay sketch follows at the end of this thread), but moving the control flow entirely to the device is the cleaner solution.

      2 replies →
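For reference, a minimal sketch of the capture-and-replay pattern mentioned above, using the CUDA graph API (the `step_a`/`step_b` kernels, sizes, and token count are placeholders; the instantiate call uses the pre-CUDA-12 signature): the per-token sequence of kernel launches is captured once and then replayed with a single `cudaGraphLaunch` per token, cutting the CPU-side launch overhead, though the sampling round trip itself only disappears if that logic also moves on-device.

```cuda
#include <cuda_runtime.h>

// Placeholder kernels standing in for the real per-token work.
__global__ void step_a(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void step_b(float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20, max_new_tokens = 128;
    float* d_x;
    cudaMalloc(&d_x, n * sizeof(float));

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Capture the per-token launch sequence once...
    cudaGraph_t graph;
    cudaGraphExec_t exec;
    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    step_a<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    step_b<<<(n + 255) / 256, 256, 0, stream>>>(d_x, n);
    cudaStreamEndCapture(stream, &graph);
    cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);  // CUDA 12 takes a flags arg instead

    // ...then replay it with one cheap launch per token instead of one
    // CPU-side dispatch per kernel.
    for (int t = 0; t < max_new_tokens; ++t) {
        cudaGraphLaunch(exec, stream);
    }
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(stream);
    cudaFree(d_x);
    return 0;
}
```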