Comment by david-gpu

17 hours ago

Look at my user profile. Divergence in modern NVIDIA GPUs does not work the way you think it does. A separate program counter per thread does not mean that each thread issues a different instruction on every clock. See section 3.2.2.1 of https://docs.nvidia.com/cuda/cuda-programming-guide/03-advan...

Of course divergence is sometimes unavoidable. That is why GPUs support it. But substantially divergent code comes at a significant cost.
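To make the cost concrete, here is a toy model of one warp hitting an if/else. It is a simplification under stated assumptions, not NVIDIA's actual scheduler: a warp of 32 lanes issues one instruction at a time for whichever lanes are active, and when lanes disagree on a branch each taken path is issued in turn with the other lanes masked off. The names `if_else_cost` and `BranchCost` are hypothetical, and the model ignores interleaving order, reconvergence details and latency hiding.

```python
"""Toy model of SIMT branch divergence within a single warp.

Assumption (simplified, NOT NVIDIA's real scheduler): the warp issues one
instruction per step for all currently active lanes; divergent paths are
issued one after the other with non-participating lanes masked off.
"""
from dataclasses import dataclass

WARP_SIZE = 32


@dataclass
class BranchCost:
    issue_slots: int      # instructions the warp as a whole must issue
    active_lane_ops: int  # lane-instructions that actually do useful work

    @property
    def utilization(self) -> float:
        # Fraction of issued lane slots that were doing useful work.
        if self.issue_slots == 0:
            return 0.0
        return self.active_lane_ops / (self.issue_slots * WARP_SIZE)


def if_else_cost(takes_if: list[bool], if_len: int, else_len: int) -> BranchCost:
    """Cost of an if/else where lane i takes the 'if' path when takes_if[i]."""
    if len(takes_if) != WARP_SIZE:
        raise ValueError("expected one branch predicate per lane")
    n_if = sum(takes_if)
    n_else = WARP_SIZE - n_if
    slots = 0
    useful = 0
    if n_if:    # issue the 'if' path; 'else' lanes sit idle
        slots += if_len
        useful += n_if * if_len
    if n_else:  # issue the 'else' path; 'if' lanes sit idle
        slots += else_len
        useful += n_else * else_len
    return BranchCost(slots, useful)


if __name__ == "__main__":
    uniform = if_else_cost([True] * WARP_SIZE, 10, 10)
    split = if_else_cost([lane % 2 == 0 for lane in range(WARP_SIZE)], 10, 10)
    print(uniform.issue_slots, uniform.utilization)  # 10 1.0
    print(split.issue_slots, split.utilization)      # 20 0.5
```

In this model a warp where every lane agrees issues 10 instructions at full utilization, while an even/odd split does the same useful work in 20 issue slots at half utilization. Per-thread program counters change which path can be scheduled when, but not the fact that the idle lanes are wasted.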