Comment by vimarsh6739
4 hours ago
One of the more subtle aspects of retargeting GPU code to run on the CPU is that GPUs expose fine-grained (read: block-level and warp-level) explicit synchronization mechanisms. CPUs have no direct equivalent, so additional care has to be taken to handle them. One example of work that tackles this is https://arxiv.org/pdf/2207.00257 .
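For concreteness, here's roughly the kind of kernel that needs this care (a toy block-wide reduction; illustrative, not taken from the paper):

    // Toy CUDA kernel: block-wide sum into shared memory, blockDim.x == 256.
    // The __syncthreads() calls are block-level barriers with no direct
    // CPU equivalent once the block is mapped onto CPU threads.
    __global__ void blockSum(const float* in, float* out) {
        __shared__ float buf[256];
        int tid = threadIdx.x;
        buf[tid] = in[blockIdx.x * blockDim.x + tid];
        __syncthreads();                        // all 256 threads must arrive
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s) buf[tid] += buf[tid + s];
            __syncthreads();                    // barrier inside a loop
        }
        if (tid == 0) out[blockIdx.x] = buf[0];
    }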
Interestingly, in the same work, contrary to what you'd expect, transpiling GPU code to run on the CPU gives ~76% speedups on HPC workloads compared to a hand-optimized multi-core CPU implementation on Fugaku (a CPU-only supercomputer), after accounting for these differences in synchronization.
A single CPU thread is best treated as, essentially, a warp executing 4 SIMD vectors in parallel. The naive implementation of __syncthreads() would then be an atomic barrier shared across all the CPU threads that together make up one GPU workgroup.
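A minimal sketch of that naive scheme, assuming one CPU thread per emulated GPU thread and a hand-rolled sense-reversing barrier shared by the workgroup (all names here are hypothetical):

    // Naive emulation of __syncthreads(): every emulated GPU thread gets
    // its own CPU thread, and the workgroup shares one atomic barrier.
    #include <atomic>
    #include <functional>
    #include <thread>
    #include <vector>

    struct BlockBarrier {                 // sense-reversing barrier
        explicit BlockBarrier(int n) : count(n), size(n) {}
        void sync() {                     // plays the role of __syncthreads()
            bool my_sense = !sense.load(std::memory_order_relaxed);
            if (count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                count.store(size, std::memory_order_relaxed);
                sense.store(my_sense, std::memory_order_release); // last arriver releases everyone
            } else {
                while (sense.load(std::memory_order_acquire) != my_sense)
                    std::this_thread::yield();                    // spin until released
            }
        }
        std::atomic<int> count;
        std::atomic<bool> sense{false};
        const int size;
    };

    // Launch one workgroup: 'kernel' calls bar.sync() wherever the
    // original GPU code called __syncthreads().
    void runBlock(int blockDim, void (*kernel)(int, BlockBarrier&)) {
        BlockBarrier bar(blockDim);
        std::vector<std::thread> ts;
        for (int tid = 0; tid < blockDim; ++tid)
            ts.emplace_back(kernel, tid, std::ref(bar));
        for (auto& t : ts) t.join();
    }

Each bar.sync() is a full cross-thread rendezvous, which is where most of the overhead of the naive approach comes from.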
Looks like this entire paper is just about how to move/remove these barriers.
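For what it's worth, the classic way GPU-to-CPU compilers remove such a barrier (POCL-style "thread loops"; I'd guess the paper's transformation is in this family, but that's an assumption on my part) is loop fission: run the whole block on one CPU thread and split the kernel at the barrier, so finishing the first loop *is* the synchronization:

    // Barrier removal by loop fission; illustrative names only.
    // Original kernel shape:
    //   buf[tid] = in[tid];  __syncthreads();  out[tid] = buf[tid ^ 1];
    constexpr int BLOCK = 256;

    void blockOnCPU(const float* in, float* out) {
        float buf[BLOCK];                         // was __shared__
        for (int tid = 0; tid < BLOCK; ++tid)     // region before the barrier
            buf[tid] = in[tid];
        // __syncthreads() is gone: loop 1 completing is the barrier.
        for (int tid = 0; tid < BLOCK; ++tid)     // region after the barrier
            out[tid] = buf[tid ^ 1];
    }

The catch is that any per-thread value live across a barrier has to be spilled to an array indexed by tid, which is part of what makes moving/removing barriers non-trivial in general.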