Comment by imtringued

7 hours ago

Considering that we live in the age of megakernels, where the cost of CPU->GPU->CPU data transfer and kernel launch overhead are ever bigger performance bottlenecks, I would have expected more enthusiasm in this comment section.

Surely there is some value in the ability to test your code on the CPU for logic bugs with printf/logging, easy breakpoints, etc., and then run it on the GPU for speed? [0]
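
A minimal sketch of what I mean, using CUDA's __host__ __device__ qualifiers (the function and kernel names are made up for illustration): the same core logic compiles into both a CPU reference path you can printf and step through in a debugger, and a GPU kernel that runs it at speed.

    // Core logic compiled for both CPU and GPU. The CPU build gets
    // printf, asserts, and ordinary breakpoints; the GPU build runs
    // the exact same code path.
    __host__ __device__ float clamped_relu(float x, float cap) {
        float y = x > 0.0f ? x : 0.0f;
        return y < cap ? y : cap;
    }

    __global__ void activate(float *data, int n, float cap) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = clamped_relu(data[i], cap);
    }

    // CPU reference loop: set a breakpoint here, or printf intermediate
    // values, and you are debugging the same logic the kernel runs.
    void activate_cpu(float *data, int n, float cap) {
        for (int i = 0; i < n; i++) data[i] = clamped_relu(data[i], cap);
    }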

Surely there is some value in being able to manage KV caches and perform continuous batching, prefix caching, and the like directly on the GPU, through GPU-side memory allocations?
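
A rough sketch of the kind of thing I have in mind, assuming a fixed-size page pool carved out of one big cudaMalloc up front (the KvPool type and names are hypothetical, and freeing/recycling pages is omitted):

    // Fixed-size block pool for KV-cache pages, living entirely in
    // device memory. Decode threads grab pages with an atomic bump
    // cursor; no round trip to the host allocator on the hot path.
    struct KvPool {
        char        *base;       // device pointer to the pool
        size_t       page_bytes; // bytes per KV page
        unsigned int num_pages;
        unsigned int next;       // bump cursor, touched only via atomics
    };

    __device__ void *kv_alloc_page(KvPool *p) {
        unsigned int idx = atomicAdd(&p->next, 1u);
        if (idx >= p->num_pages) return nullptr; // pool exhausted
        return p->base + (size_t)idx * p->page_bytes;
    }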

Surely there is some value in being able to send out just the newly generated tokens from the GPU kernel via a quick network call instead of waiting for all sessions in the current batch to finish generating their tokens?
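
One way to sketch this without GPU-initiated networking hardware: the kernel publishes each token into mapped, host-pinned memory (allocated with cudaHostAlloc and cudaHostAllocMapped), and a host thread polls the ready flags and forwards tokens over the network immediately, instead of waiting on a full-batch synchronize. TokenSlot and publish_token are my own names, not from any library:

    // One slot per sequence in host-pinned, GPU-visible memory.
    struct TokenSlot {
        int          token;
        unsigned int step;
        volatile unsigned int ready; // the host thread polls this
    };

    __device__ void publish_token(TokenSlot *slots, int seq, int tok,
                                  unsigned int step) {
        slots[seq].token = tok;
        slots[seq].step  = step;
        __threadfence_system();  // make the writes visible to the CPU
        slots[seq].ready = 1u;   // host can ship this token right away
    }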

Surely there is some value in being able to load model parameters from the file system directly into the GPU?
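
This one already exists in vendor form: NVIDIA's GPUDirect Storage exposes roughly this through the cuFile API. A minimal sketch, with error handling omitted and load_weights being my own wrapper name:

    #include <fcntl.h>
    #include <unistd.h>
    #include <cufile.h>

    // Read a weight shard straight from NVMe into device memory,
    // skipping the bounce buffer through host RAM.
    void load_weights(const char *path, void *dev_ptr, size_t nbytes) {
        cuFileDriverOpen();
        int fd = open(path, O_RDONLY | O_DIRECT);

        CUfileDescr_t descr = {};
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

        CUfileHandle_t fh;
        cuFileHandleRegister(&fh, &descr);
        cuFileBufRegister(dev_ptr, nbytes, 0);

        cuFileRead(fh, dev_ptr, nbytes, /*file_offset=*/0, /*dev_offset=*/0);

        cuFileBufDeregister(dev_ptr);
        cuFileHandleDeregister(fh);
        close(fd);
        cuFileDriverClose();
    }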

You could argue that I am too optimistic, but seemingly everyone here is stuck on the idea of running existing CPU code on the GPU without ever attempting to optimize the bottlenecks, rather than interspersing GPU-heavy code with less GPU-heavy code. It's all or nothing to you guys.

[0] Assuming that people won't write GPU-optimized code at all is arguing in bad faith, because the argument I am presenting here is that you test your GPU-first code on the CPU rather than pushing CPU-first code onto the GPU.