Comment by bassp

9 months ago

I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).

> I was taught that you want, usually, more threads per block > than each SM can execute, because SMs context switch between > threads (fancy hardware multi threading!) on memory read > stalls to achieve super high throughput.

You were taught wrong...

First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can up to issue up to 4 instructions, one for each of 4 warps per cycle (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".

Sometimes, more than 432 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:

Amount of shared memory used per warp

* Makeup of the instructions to be executed

* Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).

  • Sorry if I was sloppy with my wording, instruction issuance is what I meant :)

    I thought that warps weren't issued instructions unless they were ready to execute (ie had all the data they needed to execute the next instruction), and that therefore it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once so that the warp scheduler can issue instructions to one warp while another waits on a memory read. Is that not true?

    • > warps weren't issued instructions unless they were ready to execute

      This is true, but after they've been issued, it still takes a while for the execution to conclude.

      > it was a best practice, in most (not all) cases to have more threads per block than the SM can execute at once

      Just replace "most" with "some". It really depends on what kind of kernel you're writing.

Pretty sure CUDA will limit your thread count to hardware constraints? You can’t just request a million threads.