Comment by petermcneeley
5 months ago
> Because newer GPUs have independent thread scheduling

I assume you mean at the warp level. The threads are not truly independent, and there are many shaders you can write that demonstrate this.
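For illustration (my sketch, not something from the comment), here is a minimal CUDA kernel in the spirit of those shaders. When lanes of a single warp take different branches, the divergent paths are serialized, so every lane tends to observe roughly the combined cost of both paths rather than only its own:

```cuda
// Minimal sketch (assumption: CUDA, one warp of 32 threads) showing warp-level
// coupling: divergent branches within a warp are serialized, so each lane's
// measured time reflects both paths, not just the one it took.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void divergence_demo(long long *cycles, float *out) {
    float v = threadIdx.x;
    long long start = clock64();
    if ((threadIdx.x & 31) < 16) {
        for (int i = 0; i < 10000; ++i) v = v * 1.000001f + 1.0f;  // path A
    } else {
        for (int i = 0; i < 10000; ++i) v = v * 0.999999f + 2.0f;  // path B
    }
    long long end = clock64();
    cycles[threadIdx.x] = end - start;
    out[threadIdx.x] = v;  // keep the work observable so it is not optimized away
}

int main() {
    long long *d_cycles, h_cycles[32];
    float *d_out;
    cudaMalloc(&d_cycles, 32 * sizeof(long long));
    cudaMalloc(&d_out, 32 * sizeof(float));
    divergence_demo<<<1, 32>>>(d_cycles, d_out);
    cudaMemcpy(h_cycles, d_cycles, sizeof(h_cycles), cudaMemcpyDeviceToHost);
    // Expect every lane to report roughly the same (summed) cycle count.
    for (int i = 0; i < 32; ++i) printf("lane %2d: %lld cycles\n", i, h_cycles[i]);
    cudaFree(d_cycles);
    cudaFree(d_out);
    return 0;
}
```

If the lanes were genuinely independent you would expect each half-warp to pay only for its own branch; in practice every lane typically reports something close to the sum of both loop costs.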
I agree that statically proving that something like a sync is unnecessary can only be a good thing.
The question of why you wouldn't simply take your GPU code and transpile it to CPU code is really a question of what you lost in writing the GPU code in the first place. If you are talking about ML work, most of it is expressed as a bunch of matrix operations that translate to GPUs with low impedance. But other kinds of operations might be better expressed directly as CPU code (any serial operations). And going from CPU to GPU, the loss, as you have pointed out, is probably in the synchronization.
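To make the "serial operations" point concrete (my example, not the commenter's): a loop-carried recurrence such as a first-order IIR filter is a natural one-line CPU loop, while on a GPU each element depends on the previous one, so expressing it efficiently means restructuring it as a scan and paying for synchronization between steps.

```cpp
// Minimal sketch of a serial recurrence that is trivial as CPU code:
//   y[i] = x[i] + a * y[i-1]
// On a GPU this loop-carried dependency would have to be rewritten
// (e.g. as a parallel scan) and synchronized between stages.
#include <cstdio>
#include <vector>

void iir_cpu(const std::vector<float>& x, std::vector<float>& y, float a) {
    float prev = 0.0f;
    for (size_t i = 0; i < x.size(); ++i) {
        prev = x[i] + a * prev;  // each step depends on the previous result
        y[i] = prev;
    }
}

int main() {
    std::vector<float> x(8, 1.0f), y(8);
    iir_cpu(x, y, 0.5f);
    for (float v : y) printf("%g ", v);
    printf("\n");
    return 0;
}
```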