Comment by storystarling
8 hours ago
It is less about the raw transfer speed and more about the synchronization and kernel launch overheads. If you profile a standard inference loop with a batch size of 1, you see the GPU sitting idle much of the time, waiting for the CPU to dispatch the next command. That is why optimizations like CUDA graphs exist, but moving the control flow entirely to the device is the cleaner solution.
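Roughly, the CUDA-graph route means capturing a fixed sequence of small launches once and then replaying it with a single dispatch per step. A minimal sketch of that idea (the kernel, sizes, and layer count here are made up for illustration):

    #include <cuda_runtime.h>

    // Hypothetical stand-in for one small step of a batch-1 decode loop.
    __global__ void decode_step(float* x) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        x[i] = x[i] * 0.5f + 1.0f;
    }

    int main() {
        const int n = 1024;
        float* d_x;
        cudaMalloc(&d_x, n * sizeof(float));

        cudaStream_t stream;
        cudaStreamCreate(&stream);

        // Capture the fixed sequence of small launches into a graph once.
        cudaGraph_t graph;
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        for (int layer = 0; layer < 32; ++layer) {
            decode_step<<<n / 256, 256, 0, stream>>>(d_x);
        }
        cudaStreamEndCapture(stream, &graph);

        // CUDA 12 signature; older toolkits take error-node/log-buffer args instead of flags.
        cudaGraphExec_t exec;
        cudaGraphInstantiate(&exec, graph, 0);

        // Replay: one host-side launch per iteration instead of 32.
        for (int token = 0; token < 1000; ++token) {
            cudaGraphLaunch(exec, stream);
        }
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(exec);
        cudaGraphDestroy(graph);
        cudaStreamDestroy(stream);
        cudaFree(d_x);
        return 0;
    }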
I'm not convinced. (A bit of advice: if you want to make a claim about performance, always start by measuring. Then, when somebody asks you for proof/data, you'll already have it.) If what you're saying were true it would be a big deal; unfortunately, it isn't.
Dispatch has overhead, but it's largely insignificant. In the cases where it would otherwise matter:
1. Fused kernels exist
2. CUDA graphs (and other forms of work-submission pipelining) exist
That said, CUDA graphs are pretty slow at synchronizing things.
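If you want to actually put a number on the raw launch cost, a minimal sketch is enough (trivial kernel, arbitrary counts; assumes an otherwise idle GPU):

    #include <chrono>
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void noop_kernel(float* x) { x[threadIdx.x] += 1.0f; }

    int main() {
        float* d_x;
        cudaMalloc(&d_x, 256 * sizeof(float));

        const int launches = 10000;
        noop_kernel<<<1, 256>>>(d_x);   // warm-up
        cudaDeviceSynchronize();

        // Wall-clock over many tiny launches; with a trivial kernel this is
        // mostly launch and queueing overhead rather than compute.
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < launches; ++i) {
            noop_kernel<<<1, 256>>>(d_x);
        }
        cudaDeviceSynchronize();
        auto t1 = std::chrono::steady_clock::now();

        double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
        printf("avg cost per tiny launch: %.2f us\n", us / launches);

        cudaFree(d_x);
        return 0;
    }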