Comment by codedokode
11 days ago
I think it is possible to run CPU code on GPU (including a whole OS), because a GPU has registers, memory, arithmetic and branch instructions, and that should be enough. However, it would be able to use only several cores out of the many thousands, because GPU cores are effectively wide SIMD cores grouped into clusters, and CPU-style code would use only a single SIMD lane. Am I wrong?
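A toy Python model of the point about single-lane use (the 32-lane warp width matches Nvidia-style hardware; the rest is illustrative):

```python
# Toy model of SIMT execution, not any real GPU's scheduler.
WARP_WIDTH = 32  # SIMD lanes per warp on Nvidia-style hardware

def warp_utilization(active_lanes: int) -> float:
    """Fraction of a warp's lanes doing useful work."""
    return active_lanes / WARP_WIDTH

# Data-parallel GPU code keeps every lane busy...
assert warp_utilization(32) == 1.0
# ...but CPU-style scalar code is one logical thread, i.e. one lane.
print(f"scalar code uses {warp_utilization(1):.1%} of a warp")  # 3.1%
```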
Given enough time, we'll all loop back around to the Xeon Phi: https://en.wikipedia.org/wiki/Xeon_Phi
It was ahead of its time!
When I was in grad school I tried getting my hands on a phi, it seemed impossible.
Xeon Phi was so cool. I wanted to use the ones we had so much... but couldn't find any applications that would benefit enough to make it worth the effort. I guess that's why it died lol.
This seems correct to me. Of course you'd need to build a CPU emulator to run CPU code. A single GPU core is apparently about 100x slower than a single CPU core. With emulation a 1000x slowdown might be expected. So with a lot of handwaving, expect performance similar to a 4 MHz processor.
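The back-of-envelope math, where every figure is an assumption from the handwaving above:

```python
# Back-of-envelope estimate; all three inputs are hand-wavy assumptions.
cpu_clock_hz   = 4e9   # a ~4 GHz CPU core
gpu_core_ratio = 100   # assume one GPU core is ~100x slower than a CPU core
emu_overhead   = 10    # assume ~10x further slowdown from software emulation

effective_hz = cpu_clock_hz / (gpu_core_ratio * emu_overhead)
print(f"~{effective_hz / 1e6:.0f} MHz-equivalent")  # ~4 MHz-equivalent
```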
Obviously code designed for a GPU is much faster. You could probably build a reasonable OS that runs on the GPU.
You don't need an emulator, you can compile into GPU machine code.
GPUs "having" thousands of cores is just silly marketing newspeak.
They rebranded SIMD lanes as "cores". For example, Nvidia 5000-series GPUs have 50-170 SMs, which are the equivalent of CPU cores there. So more than desktop CPUs, fewer than the bigger server CPUs. By this math, each AVX-512 CPU core has 16-64 "gpu cores".
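That accounting, as arithmetic. The SM figure is from the comment; the 128 lanes per SM and the dual-FMA AVX-512 core are my assumptions:

```python
# Counting "cores" the way GPU marketing does.
sm_count          = 170  # SMs on a top-end Nvidia 5000-series part
cuda_cores_per_sm = 128  # fp32 lanes marketed as "CUDA cores" (assumption)
print(sm_count * cuda_cores_per_sm)  # 21760 marketed "cores"

# The same accounting applied to one AVX-512 CPU core:
lanes         = 512 // 32  # 16 fp32 lanes per 512-bit register
fma_ports     = 2          # two FMA units on many server cores (assumption)
flops_per_fma = 2          # an FMA counts as two FLOPs
print(lanes, "to", lanes * fma_ports * flops_per_fma)  # 16 to 64 "gpu cores"
```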
170 compute units is still a crapload of 'em for a non-server platform with non-server platform requirements, so the broad "lots of cores" point is still true, just highly overstated as you said. Plus those cores are running the equivalent of n-way SMT processing, which gives you an even higher crapload of logical threads. AND these logical threads can also access very wide SIMD when relevant, which even early Intel E-cores couldn't. All of that absolutely matters.
Each SM can typically schedule 4 warps so it’s more like 400 “cores” each with 1024-bit SIMD instructions. If you look at it this way, they clearly outclass CPU architectures.
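The figures behind that claim (all taken from or implied by the comment; the 32-thread warp and 32-bit lanes are the standard Nvidia assumption):

```python
# Per-warp view of an SM, using the comment's numbers.
threads_per_warp = 32
bits_per_lane    = 32
print(threads_per_warp * bits_per_lane)  # 1024-bit SIMD per warp

sms          = 100  # SM count implied by the "400 cores" figure (assumption)
warps_per_sm = 4    # concurrently scheduled warps per SM
print(sms * warps_per_sm)  # 400 warp-sized "cores"
```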
This level corresponds to SMT in CPUs, I gather. So you could argue your 192-core EPYC server CPU has 384 "vCPUs", since execution resources per core are overprovisioned and when execution blocks waiting on, e.g., memory, another thread can run in its place. As Intel and AMD only do 2-way SMT, this doesn't make the numbers go up as much.
The single GPU warp is both beefier and wimpier than the SMT thread: it's in-order and barely superscalar, whereas on the CPU side it's a wide-superscalar, big-window OoO brainiac. But on the other hand the SM has wider SIMD execution resources, and there's enough throughput for several warps without blocking.
A major difference is how the execution resources are tuned to the expected workloads. CPUs run application code that likes big low-latency caches and high single-thread performance on branchy integer code, but it doesn't pay to put in execution resources for maximizing AVX-512 FP math instructions per cycle or for increasing memory bandwidth indefinitely.
Merely misled by marketing. The x64 arch has 512-bit registers and a hundred or so cores. The GPU arch has 1024-bit registers and a few hundred SMs or CUs, the SM/CU being the thing equivalent to an x64 core.
The software stacks running on them are very different but the silicon has been converging for years.