Comment by petermcneeley

9 months ago

Great work. Nice aesthetic.

"These groups of threads, known as warps , are switched out on a per clock cycle basis — roughly one nanosecond. CPU thread context switches, on the other hand, take few hundred to a few thousand clock cycles"

I would note that Intel's SMT does something very similar (2 HW threads). Others, like the Xeon Phi, would round-robin 4 threads on a single core.
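To make the round-robin idea concrete, here is a toy model in Python. It is only a sketch under made-up assumptions: the core issues at most one instruction per cycle, a thread cannot issue again until its previous instruction's latency has elapsed, and the 4-cycle latency is arbitrary. The function name `simulate` is hypothetical, and this is not meant to describe how the Xeon Phi or any GPU scheduler is actually built.

```python
def simulate(programs):
    """Toy round-robin issue model.

    programs: one list per hardware thread; each entry is the latency
    (in cycles) of that thread's next instruction.
    Returns the cycle at which the last result becomes available.
    """
    n = len(programs)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # earliest cycle each thread may issue again
    cycle = 0
    last = -1             # thread that issued most recently
    finish = 0
    while any(pc[i] < len(programs[i]) for i in range(n)):
        # Scan threads in round-robin order, starting after the last issuer,
        # and issue from the first one that is ready this cycle.
        for k in range(1, n + 1):
            i = (last + k) % n
            if pc[i] < len(programs[i]) and ready_at[i] <= cycle:
                ready_at[i] = cycle + programs[i][pc[i]]
                finish = max(finish, ready_at[i])
                pc[i] += 1
                last = i
                break
        cycle += 1        # an idle cycle if no thread was ready
    return finish


one_thread = simulate([[4, 4, 4, 4]])          # mostly stalled on latency
four_threads = simulate([[4, 4, 4, 4]] * 4)    # stalls hidden by other threads
print(one_thread, four_threads)
```

In this model one thread needs 16 cycles for its four instructions, while four threads finish four times the work in 19 cycles, because some other thread is always ready to issue while the rest wait.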

SMT isn't really that, is it?

SMT allows for concurrent execution of both threads (hence an independent front end, especially for fetch and decode), and certain core resources are statically partitioned, unlike a warp being scheduled on an SM.

I'm not a graphics expert but warps seem closer to run-time/dynamic VLIW than SMT.

  • In the actual implementation, they are very much like very wide SIMD on a CPU core. Each HW thread is a different warp, as each warp can execute different instructions.

    This mapping is so close that translation from GPU to CPU is relatively easy and performant.
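A rough way to picture that mapping, sketched in Python: the same kernel is written once from a single GPU thread's point of view and once as a warp-wide operation, where the branch turns into a mask and a blend, as it would with SIMD on a CPU. Everything here is an illustrative assumption: `WARP_SIZE = 8`, the kernel body, and the function names are all made up, and plain lists stand in for vector registers.

```python
WARP_SIZE = 8  # assumption: kept small for readability


def kernel_per_thread(tid, x):
    """GPU-style code: what one thread does with its own element."""
    if x[tid] < 0:
        return -x[tid]
    return 2 * x[tid]


def kernel_per_warp(lanes, x):
    """The same kernel as wide SIMD: one operation covers every lane."""
    vals = [x[t] for t in lanes]          # vector load
    mask = [v < 0 for v in vals]          # vector compare replaces the branch
    then_side = [-v for v in vals]        # both sides run for all lanes
    else_side = [2 * v for v in vals]
    # Blend: each lane keeps the side its mask bit selects.
    return [a if m else b for m, a, b in zip(mask, then_side, else_side)]


def launch(x):
    """Walk the data one warp at a time; each warp plays the role of one
    CPU hardware thread running SIMD code."""
    out = []
    for start in range(0, len(x), WARP_SIZE):
        lanes = range(start, min(start + WARP_SIZE, len(x)))
        out.extend(kernel_per_warp(lanes, x))
    return out


data = [3, -1, 0, -7, 5, 2, -2, 8, -9, 4]
assert launch(data) == [kernel_per_thread(t, data) for t in range(len(data))]
```

The final assertion checks that the warp-wide form produces the same values as running the per-thread kernel for every index, which is the property a GPU-to-CPU translation has to preserve.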

Thanks!

> Intel's SMT does do something very similar (2 hw threads)

Yeah, that's a good point. One thing I learned from looking at both hardware stacks more closely was that they aren't as different as they seem at first -- lots of the same ideas or techniques get used, but in different ways.