Comment by petermcneeley
6 days ago
Great work. Nice aesthetic.
"These groups of threads, known as warps , are switched out on a per clock cycle basis — roughly one nanosecond. CPU thread context switches, on the other hand, take few hundred to a few thousand clock cycles"
I would note that Intel's SMT does do something very similar (2 hw threads). Others, like the Xeon Phi, would round-robin 4 threads on a single core.
SMT isn't really that, is it?
SMT allows for concurrent execution of both threads (hence an independent front end, especially for fetch and decode), and certain core resources are statically partitioned, unlike a warp being scheduled on an SM.
I'm not a graphics expert but warps seem closer to run-time/dynamic VLIW than SMT.
In actual implementations they are very much like very wide SIMD on a CPU core. Each HW thread corresponds to a different warp, since each warp can execute different instructions.
This mapping is so close that translation from GPU to CPU is relatively easy and performant.
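A rough illustration of that mapping, as a toy sketch in Python rather than how any real GPU-to-CPU translator works. The kernel body, `warp_step`, and `run_kernel` are made up for the example; the only hardware fact assumed is the 32-lane warp width used on NVIDIA GPUs. The point is that a per-thread branch turns into a per-lane mask over one wide vector operation:

```python
WARP_SIZE = 32  # lanes per warp on NVIDIA hardware; used here as the SIMD width


def warp_step(values):
    """Run one hypothetical kernel body for a whole warp in lockstep.

    Per-thread kernel being emulated:
        if x < 0: x = -x
        else:     x = 2 * x
    """
    assert len(values) == WARP_SIZE
    # One predicate bit per lane, like a SIMD mask register.
    mask = [x < 0 for x in values]
    # Both sides of the branch are computed for every lane;
    # the mask decides which result each lane keeps.
    negated = [-x for x in values]
    doubled = [2 * x for x in values]
    return [n if m else d for m, n, d in zip(mask, negated, doubled)]


def run_kernel(data):
    """Process the data one warp-sized chunk at a time.

    Each chunk stands in for one warp, i.e. one HW thread's worth of SIMD work.
    """
    out = []
    for start in range(0, len(data), WARP_SIZE):
        out.extend(warp_step(data[start:start + WARP_SIZE]))
    return out


data = list(range(-32, 32))  # exactly two warps' worth of elements
result = run_kernel(data)
```

Here the first chunk takes the "negate" side in every lane and the second takes the "double" side in every lane, so neither warp diverges. A chunk with mixed signs would still run both sides and just select per lane, which is the cost of divergence in either model.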
Thanks!
> intels SMT does do something very similar (2 hw threads)
Yeah, that's a good point. One thing I learned from looking at both hardware stacks more closely was that they aren't as different as they seem at first -- lots of the same ideas or techniques get used, but in different ways.