Comment by vimarsh6739
4 hours ago
One of the more subtle aspects of retargeting GPU code to run on the CPU is that GPUs expose fine-grained (read: block-level and warp-level) explicit synchronization mechanisms. CPUs have no direct equivalent, so additional care has to be taken to handle them. One example of work that tackles this is https://arxiv.org/pdf/2207.00257 .
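For concreteness, here's roughly the kind of kernel that needs this care (a toy block-wide reduction; illustrative, not taken from the paper):

    // Toy CUDA kernel: block-wide sum into shared memory, blockDim.x == 256.
    // The __syncthreads() calls are block-level barriers with no direct
    // CPU equivalent once the block is mapped onto CPU threads.
    __global__ void blockSum(const float* in, float* out) {
        __shared__ float buf[256];
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                        // all 256 threads must arrive
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();                    // barrier inside a loop
        }
        if (tid == 0) out[blockIdx.x] = buf[0];
    }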
Interestingly, in the same work, contrary to what you'd expect, transpiling GPU code to run on the CPU gives ~76% speedups on HPC workloads compared to a hand-optimized multi-core CPU implementation on Fugaku (a CPU-only supercomputer), after accounting for these differences in synchronization.
A single CPU thread is best treated as, essentially, a warp executing 4 SIMD vectors in parallel. The naive implementation of __syncthreads() would then be an atomic barrier shared across all the CPU threads that together make up one GPU workgroup.
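A minimal sketch of that naive scheme, assuming one CPU thread per emulated GPU thread and a hand-rolled sense-reversing barrier shared by the workgroup (all names here are hypothetical):

    // Naive emulation of __syncthreads(): every emulated GPU thread gets
    // its own CPU thread, and the workgroup shares one atomic barrier.
    #include <atomic>
    #include <functional>
    #include <thread>
    #include <vector>

    struct BlockBarrier {                 // sense-reversing barrier
        explicit BlockBarrier(int n) : count(n), size(n) {}
        void sync() {                     // plays the role of __syncthreads()
            bool my_sense = !sense.load(std::memory_order_relaxed);
            if (count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                count.store(size, std::memory_order_relaxed);
                sense.store(my_sense, std::memory_order_release); // last arriver releases everyone
            } else {
                while (sense.load(std::memory_order_acquire) != my_sense)
                    std::this_thread::yield();                    // spin until released
            }
        }
        std::atomic<int> count;
        std::atomic<bool> sense{false};
        const int size;
    };

    // Launch one workgroup: 'kernel' calls bar.sync() wherever the
    // original GPU code called __syncthreads().
    void runBlock(int blockDim, void (*kernel)(int, BlockBarrier&)) {
        BlockBarrier bar(blockDim);
        std::vector<std::thread> ts;
        for (int tid = 0; tid < blockDim; ++tid)
            ts.emplace_back(kernel, tid, std::ref(bar));
        for (auto& t : ts) t.join();
    }

Each bar.sync() is a full cross-thread rendezvous, which is where most of the overhead of the naive approach comes from.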
Looks like this entire paper is just about how to move/remove these barriers.
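For what it's worth, the classic way GPU-to-CPU compilers remove such a barrier (POCL-style "thread loops"; I'd guess the paper's transformation is in this family, but that's an assumption on my part) is loop fission: run the whole block on one CPU thread and split the kernel at the barrier, so finishing the first loop *is* the synchronization:

    // Barrier removal by loop fission; illustrative names only.
    // Original kernel shape:
    //   buf[tid] = in[tid];  __syncthreads();  out[tid] = buf[tid ^ 1];
    constexpr int BLOCK = 256;

    void blockOnCPU(const float* in, float* out) {
        float buf[BLOCK];                         // was __shared__
        for (int tid = 0; tid < BLOCK; ++tid)     // region before the barrier
            buf[tid] = in[tid];
        // __syncthreads() is gone: loop 1 completing is the barrier.
        for (int tid = 0; tid < BLOCK; ++tid)     // region after the barrier
            out[tid] = buf[tid ^ 1];
    }

The catch is that any per-thread value live across a barrier has to be spilled to an array indexed by tid, which is part of what makes moving/removing barriers non-trivial in general.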