Comment by dist-epoch

1 year ago

In CUDA you don't really manage the individual compute units: you launch a kernel, and the driver takes care of distributing the work across the compute cores and managing the data flows between them.

When programming CPUs, however, you control and manage the individual threads. Of course, there are libraries that can do that for you, but fundamentally it's a different model.

The GPU equivalent of a single CPU "hardware thread" is called a "warp" (NVIDIA) or a "wavefront" (AMD). GPUs can run many warps/wavefronts per compute unit, switching between them to hide memory-access latency. A CPU core can do this with two hardware threads via Hyper-Threading (2-way SMT), and some CPUs have 4-way SMT, but GPUs push that quite a bit further.

What you say has nothing to do with CPU vs. GPU, or with CUDA, which is basically equivalent to the older OpenMP.

When you have a set of concurrent threads, each thread may run a different program. There are many applications where this is necessary, but such applications are hard to scale to very high levels of concurrency, because each thread must be handled individually by the programmer.

Another case is when all the threads run the same program, but on different data. This is equivalent to a concurrent execution of a "for" loop, which is always possible when the iterations are independent.
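A minimal sketch of that idea in plain C++: the iterations of an independent "for" loop are split across hardware threads. The helper name `parallel_for` is my own, not a standard or library API.

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Run body(i) for every i in [0, n), splitting the index range into one
// contiguous chunk per worker thread. Correct only when the iterations
// are independent, exactly as described above.
inline void parallel_for(std::size_t n,
                         const std::function<void(std::size_t)>& body,
                         unsigned num_threads = std::thread::hardware_concurrency()) {
    if (num_threads == 0) num_threads = 1;
    std::vector<std::thread> workers;
    std::size_t chunk = (n + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        std::size_t lo = static_cast<std::size_t>(t) * chunk;
        std::size_t hi = std::min(n, lo + chunk);
        if (lo >= hi) break;
        workers.emplace_back([lo, hi, &body] {
            for (std::size_t i = lo; i < hi; ++i) body(i);
        });
    }
    for (auto& w : workers) w.join();  // all iterations done on return
}
```

Usage is then a one-liner, e.g. `parallel_for(v.size(), [&](std::size_t i){ v[i] *= 2; });` — the same shape as an OpenMP "parallel for" or a CUDA kernel launch.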

The execution of such a set of threads that run the same program has been called a "parallel DO instruction" by Melvin E. Conway in 1963, an "array of processes" by C. A. R. Hoare in 1978, "replicated parallel" in the Occam programming language in 1985, SPMD around the same time, "PARALLEL DO" in the OpenMP Fortran extension in 1997, "parallel for" in the OpenMP C/C++ extension in 1998, and "kernel execution" in CUDA, which also introduced the superfluous acronym SIMT to describe it.

When a problem can be solved by a set of concurrent threads that all run the same program, it is much simpler to scale the parallelism to extremely high levels, and the parallel execution can usually be scheduled by a compiler or by a hardware controller without the programmer having to be concerned with the details.

There is no inherent difficulty in making a compiler that provides exactly the same programming model as CUDA but generates code for a CPU rather than a GPU. Such compilers exist, e.g. ispc, which is mentioned in the parent article.
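To make the "same programming model" claim concrete, here is a hedged sketch of a CUDA-style SAXPY written for a CPU: the "kernel" is an ordinary function of a global index, and the "launch" is just a loop over the index space, which a compiler like ispc (or the `parallel_for` approach above) is free to spread over SIMD lanes and threads. The function names are illustrative, not ispc's actual API.

```cpp
#include <cstddef>

// CUDA-style kernel: one logical thread per element, identified by its
// global index i (the analogue of blockIdx.x * blockDim.x + threadIdx.x).
inline void saxpy_kernel(std::size_t i, float a, const float* x, float* y) {
    y[i] = a * x[i] + y[i];
}

// "Kernel launch": execute the kernel once per logical thread. On a GPU
// the driver distributes these instances across compute units; on a CPU
// a compiler can map them onto SIMD lanes and OS threads instead.
inline void launch_saxpy(std::size_t n, float a, const float* x, float* y) {
    for (std::size_t i = 0; i < n; ++i)
        saxpy_kernel(i, a, x, y);
}
```

Note that the per-element kernel body is identical to what one would write in CUDA C++; only the launch mechanism differs.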

The difference between GPUs and CPUs is that the former appear to have some extra hardware support for what you describe as "distributing that to the compute cores and managing the data flows between them". However, nobody can tell exactly what this extra hardware does or whether it really matters, because it is a part of the GPUs that has never been publicly documented by the GPU vendors.

From the point of view of the programmer, this possible hardware advantage of the GPUs does not really matter, because there are plenty of programming-language extensions for parallelism and libraries that can take care of the details of thread spawning and work distribution over SIMD lanes, regardless of whether the target is a CPU or a GPU.

Whenever you write a program equivalent to a "parallel for", which is the same as writing for CUDA, you do not manage individual threads: what you write, the "kernel" in CUDA lingo, can be executed by thousands of threads, on a CPU as well as on a GPU. A desktop CPU like the Ryzen 9 9950X has the same product of threads and SIMD lanes as a big integrated GPU (obviously, discrete GPUs can be many times bigger).
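For the threads-times-lanes product, a back-of-the-envelope check, assuming the commonly cited figures for the 9950X (16 cores, 2-way SMT, 512-bit AVX-512 vectors holding 16 fp32 lanes each); these numbers are taken from public spec sheets, not measured:

```cpp
// All figures below are assumptions about the Ryzen 9 9950X, not
// guaranteed specs.
constexpr int cores = 16;
constexpr int smt_threads_per_core = 2;               // 32 hardware threads
constexpr int fp32_lanes_per_thread = 512 / 32;       // AVX-512 vector / fp32 = 16
constexpr int total_fp32_lanes =
    cores * smt_threads_per_core * fp32_lanes_per_thread;  // 32 * 16 = 512
```

512 fp32 lanes is indeed in the same ballpark as the ALU count of a large integrated GPU.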