Comment by fulafel
12 days ago
GPUs having have thousands of cores is just a silly marketing newspeak.
They rebranded SIMD lanes "cores". For eaxmple Nvidia 5000 series GPUs have 50-170 SMs which are the equivalent of cpu cores there. So a more than desktops, less than bigger server CPUs. By this math each avx-512 cpu core has 16-64 "gpu cores".
170 compute units is still a crapload of em for a non-server platform with non-server platform requirements. so the broad "lots of cores" point is still true, just highly overstated as you said. plus those cores are running the equivalent of n-way SMT processing, which gives you an even higher crapload of logical threads. AND these logical threads can also access very wide SIMD when relevant, which even early Intel E-cores couldn't. All of that absolutely matters.
Each SM can typically schedule 4 warps so it’s more like 400 “cores” each with 1024-bit SIMD instructions. If you look at it this way, they clearly outclass CPU architectures.
This level corresponds to SMT in CPUs I gather. So you can argue your 192 core EPYC server cpu has 384 "vCPUs" since execution resources per core are overprovisioned and when execution blocks waiting for eg memory another thread can run in its place. As Intel and AMD only do 2-way SMT this doesn't make the numbers go up as much.
The single GPU warp is both beefier and wimpier than the SMT thread: they're in-order barely superscalar, whereas on CPU side it's wide superscalar big-window OoO brainiac. But on the other hand the SM has wider SIMD execution resources and there's enough througput for several warps without blocking.
A major difference is how the execution resources are tuned to the expected workloads. CPU's run application code that likes big low latency caches and high single thread performance on branchy integer code, but it doesn't pay to put in execution resources for maximizing AVX-512 FP math instructions per cycle or increasing memory bandwidth indefinitely.
Right, but the CPU does not have a matrix multiply unit or high bandwidth memory.
2 replies →