Comment by storystarling
7 hours ago
The killer app here is likely LLM inference loops. Currently you pay a PCIe latency penalty for every single token generated because the CPU has to handle the sampling and control logic. Moving that logic to the GPU and keeping the whole generation loop local avoids that round trip, which turns out to be a major bottleneck for interactive latency.
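To make the shape of that concrete, here's a toy sketch (decode_step and the greedy argmax are placeholders I made up, not a real model or sampler): the per-token control flow, the "sampling", and the stopping check all live inside a single kernel launch, so the host synchronizes once per sequence instead of once per token.

    // Toy on-device generation loop: control flow and "sampling" stay on the
    // GPU, so the host launches once instead of once per token.
    // decode_step and the greedy argmax are placeholders, not a real model.
    #include <cstdio>
    #include <cuda_runtime.h>

    __device__ float decode_step(int prev_token, int candidate) {
        // Stand-in for a real forward pass producing a logit per candidate.
        return (float)((prev_token * 31 + candidate * 7) % 97);
    }

    __global__ void generate_on_device(int *tokens, int max_tokens, int vocab, int eos) {
        int cur = tokens[0];
        for (int t = 1; t < max_tokens; ++t) {
            // Greedy "sampling": pick the candidate with the largest toy logit.
            int best = 0;
            float best_logit = decode_step(cur, 0);
            for (int c = 1; c < vocab; ++c) {
                float l = decode_step(cur, c);
                if (l > best_logit) { best_logit = l; best = c; }
            }
            tokens[t] = best;
            cur = best;
            if (best == eos) break;   // stopping criterion also lives on-device
        }
    }

    int main() {
        const int max_tokens = 32, vocab = 128, eos = 0;
        int *tokens;
        cudaMallocManaged(&tokens, max_tokens * sizeof(int));
        for (int t = 0; t < max_tokens; ++t) tokens[t] = 0;
        tokens[0] = 1;                       // "prompt" = single start token
        generate_on_device<<<1, 1>>>(tokens, max_tokens, vocab, eos);
        cudaDeviceSynchronize();             // one sync for the whole sequence
        for (int t = 0; t < max_tokens; ++t) printf("%d ", tokens[t]);
        printf("\n");
        cudaFree(tokens);
        return 0;
    }

A real implementation obviously wants more than one thread and a real sampler; the sketch is only about where the loop lives.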
I don't know what the pros are doing, but I'd be a bit shocked if real production systems don't already do it this way. And porting the standard library doesn't feel necessary for this; it's just some control logic.
Raw CUDA works for the heavy lifting, but I suspect it gets messy once you implement things like grammar constraints or beam search. You end up with complex state machines during inference, and standard library abstractions seem pretty important for keeping that logic maintainable.
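To give a made-up but concrete example of the state I mean: even a trivial grammar constraint is a little state machine that has to be advanced and consulted on every decode step. Nothing below comes from a real library; it's just the kind of object you end up wanting to write with ordinary abstractions rather than hand-rolled device code.

    // Invented example: a tiny state machine enforcing "digits, then one '.',
    // then digits" over a toy 12-token vocab (0-9 = digits, 10 = '.', 11 = EOS).
    // In a real sampler, disallowed tokens would be masked out of the logits.
    #include <cstdio>

    struct GrammarState {
        int state = 0;   // 0 = integer part, 1 = fractional part

        // May this token be emitted from the current state?
        __host__ __device__ bool allows(int tok) const {
            bool is_digit = tok < 10, is_dot = tok == 10, is_eos = tok == 11;
            if (state == 0) return is_digit || is_dot;   // no EOS before '.'
            return is_digit || is_eos;                   // no second '.'
        }

        __host__ __device__ void advance(int tok) {
            if (tok == 10) state = 1;
        }
    };

    int main() {
        GrammarState g;
        int generated[] = {3, 1, 10, 10, 4, 11};   // the second '.' gets rejected
        for (int tok : generated) {
            printf("tok %2d allowed=%d\n", tok, (int)g.allows(tok));
            g.advance(tok);
        }
        return 0;
    }

Multiply that by beam search bookkeeping and per-sequence stop conditions and the "just some logic" part grows quickly.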
Turns out how? Where are the numbers?
It is less about the raw transfer speed and more about the synchronization and kernel launch overheads. If you profile a standard inference loop with a batch size of 1 you see the GPU spending a lot of time idle waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist, but moving the control flow entirely to the device is the cleaner solution.
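A toy way to put a number on it yourself (results vary a lot by GPU, driver, and OS, and the kernel below is deliberately near-empty rather than a real decode step): launch a tiny kernel in a loop and synchronize after every launch, the way a naive batch-1 decode loop does, and time the host side.

    // Toy measurement of per-step launch + synchronization overhead: the kernel
    // does almost nothing, so the measured time is dominated by dispatch and the
    // host<->device round trip, not by compute.
    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void tiny_step(float *x) { x[0] += 1.0f; }

    int main() {
        const int steps = 1000;
        float *x;
        cudaMalloc(&x, sizeof(float));
        cudaMemset(x, 0, sizeof(float));

        // Warm up so context creation and first-launch costs aren't counted.
        tiny_step<<<1, 1>>>(x);
        cudaDeviceSynchronize();

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < steps; ++i) {
            tiny_step<<<1, 1>>>(x);
            cudaDeviceSynchronize();   // what a naive per-token host loop does
        }
        auto t1 = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("launch + sync per step: %.1f us\n", us / steps);
        cudaFree(x);
        return 0;
    }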
I'm not convinced. (A bit of advice: if you wish to make a statement about performance, always start by measuring things. Then when somebody asks you for proof or data, you'll already have it.) If what you're saying were true it would be a big deal; unfortunately, it isn't.
Dispatch has overheads, but they're largely insignificant. Where they otherwise would be significant:
1. Fused kernels exist
2. CUDA graphs (and other forms of work-submission pipelining) exist
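For completeness, point 2 looks roughly like this in practice: capture a fixed sequence of per-step launches into a graph once, then replay the whole thing with a single launch. Toy kernel again, standing in for whatever fused decode step you'd actually run.

    // Sketch: amortize per-kernel launch overhead by capturing a fixed sequence
    // of launches into a CUDA graph once, then replaying it with one call.
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void tiny_step(float *x) { x[0] += 1.0f; }

    int main() {
        float *x;
        cudaMalloc(&x, sizeof(float));
        cudaMemset(x, 0, sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture 16 "decode steps" into a graph instead of launching them live.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int i = 0; i < 16; ++i)
            tiny_step<<<1, 1, 0, stream>>>(x);
        cudaStreamEndCapture(stream, &graph);

        // cudaGraphInstantiate's signature changed across CUDA versions; the
        // WithFlags variant (CUDA 11.4+) is used here.
        cudaGraphExec_t exec;
        cudaGraphInstantiateWithFlags(&exec, graph, 0);

        // One graph launch now replays all 16 kernels with far less CPU work.
        for (int rep = 0; rep < 4; ++rep)
            cudaGraphLaunch(exec, stream);
        cudaStreamSynchronize(stream);

        float host_x;
        cudaMemcpy(&host_x, x, sizeof(float), cudaMemcpyDeviceToHost);
        printf("x = %.0f (expected 64)\n", host_x);

        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudaFree(x);
        return 0;
    }

The usual catch is that the captured work is fixed, so data-dependent control flow (stopping on EOS, variable lengths) still has to be handled outside the graph or via newer conditional-node support.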