The Missing Nvidia GPU Glossary

6 days ago (modal.com)

The weird part of the programming model is that threadblocks don't map 1:1 to warps or SMs. A single threadblock executes on a single SM, but each SM has multiple warps, and the threadblock could be the size of a single warp, or larger than the combined thread count of all warps in the SM.

So, how large do you make your threadblocks to get optimal SM/warp scheduling? Well, it "depends" on resource usage, divergence, etc. Basically: run it, profile, switch the threadblock size, profile again, and so on. Repeat on every GPU/platform (if you're programming for multiple GPU platforms and not just CUDA, like games do). It's a huge pain, and very sensitive to code changes.

People new to GPU programming ask me "how big do I make the threadblock size?" and I tell them go with 64 or 128 to start, and then profile and adjust as needed.

Two articles on the AMD side of things:

https://gpuopen.com/learn/occupancy-explained

https://gpuopen.com/learn/optimizing-gpu-occupancy-resource-...
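
To make the "it depends on resource usage" point concrete, here is a rough sketch of the occupancy arithmetic. The per-SM limits in `SMLimits` are illustrative assumptions, not the numbers for any particular GPU, and real hardware also rounds register and shared memory allocations to a granularity that this ignores.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA hardware


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits. Defaults are illustrative assumptions, not a specific GPU."""
    max_threads: int = 2048
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 96 * 1024


def resident_blocks(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Blocks that fit on one SM at once: the tightest of the resource limits."""
    warps_per_block = -(-threads_per_block // WARP_SIZE)  # ceiling division
    threads_rounded = warps_per_block * WARP_SIZE  # a partial warp still costs a whole warp
    limits = [
        sm.max_blocks,
        sm.max_threads // threads_rounded,
        sm.registers // (regs_per_thread * threads_rounded),
    ]
    if smem_per_block > 0:
        limits.append(sm.shared_mem_bytes // smem_per_block)
    return min(limits)


def occupancy(threads_per_block, regs_per_thread, smem_per_block, sm=SMLimits()):
    """Fraction of the SM's thread slots filled by resident warps."""
    blocks = resident_blocks(threads_per_block, regs_per_thread, smem_per_block, sm)
    warps_per_block = -(-threads_per_block // WARP_SIZE)
    return blocks * warps_per_block * WARP_SIZE / sm.max_threads


# Sweep block sizes for a hypothetical kernel: 40 registers/thread, 8 KiB shared memory/block.
for size in (32, 64, 128, 256, 512, 1024):
    print(size, occupancy(size, regs_per_thread=40, smem_per_block=8 * 1024))
```

Under these assumed limits, the hypothetical kernel reaches 0.75 occupancy at 128, 256 and 512 threads per block but only 0.5 at 1024, and small blocks are capped by shared memory instead. Occupancy is only a proxy for performance, so this narrows the search rather than replacing the profiling.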

  • I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

    There are, ofc, other concerns like register pressure that could affect the calculus, but if an SM is waiting on a memory read to proceed and doesn’t have any other threads available to run, you’re probably leaving perf on the table (iirc).

    • > I was taught that you want, usually, more threads per block than each SM can execute, because SMs context switch between threads (fancy hardware multi threading!) on memory read stalls to achieve super high throughput.

      You were taught wrong...

      First, "execution" on an SM is a complex pipelined thing, like on a CPU core (except without branching). If you mean instruction issues, an SM can issue up to 4 instructions per cycle, one for each of 4 warps (on NVIDIA hardware for the last 10 years). But - there is no such thing as an SM "context switch between threads".

      Sometimes, more than 4 × 32 = 128 threads is a good idea. Sometimes, it's a bad idea. This depends on things like:

      * Amount of shared memory used per warp

      * Makeup of the instructions to be executed

      * Register pressure, like you mentioned (because once you exceed 256 threads per block, the number of registers available per thread starts to decrease).
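
The register bullet above can be checked with a few lines of arithmetic. This is a sketch that assumes a 65,536-entry register budget per block and a 255-register ceiling per thread; both are typical of recent NVIDIA GPUs but should be checked against the device you actually target.

```python
REGISTERS_PER_BLOCK = 65536  # assumed: 32-bit registers available to one block
MAX_REGS_PER_THREAD = 255    # assumed: architectural ceiling per thread


def max_regs_per_thread(threads_per_block):
    """Upper bound on registers each thread can get at a given block size."""
    return min(MAX_REGS_PER_THREAD, REGISTERS_PER_BLOCK // threads_per_block)


for size in (128, 256, 512, 1024):
    print(size, max_regs_per_thread(size))
```

Up to 256 threads per block the per-thread ceiling is the binding limit; at 512 threads the bound drops to 128 registers, and at 1024 threads to 64.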

  • 100% -- there's basically no substitute for benchmarking! I find the empiricism kind of comforting, coming from a research science background.

    IIUC, even cuBLAS basically just uses a bunch of heuristics that are mostly derived from benchmarking to decide which kernels to use.
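
A minimal sketch of what "decide by benchmarking" can look like: time each candidate configuration several times and keep the one with the lowest median. The `run` callable and the candidate list are hypothetical stand-ins; for a real GPU kernel, `run` has to block until the device has finished, because launches are asynchronous.

```python
import statistics
import time


def time_once(run, config):
    """Wall-clock seconds for a single call of run(config)."""
    start = time.perf_counter()
    run(config)
    return time.perf_counter() - start


def pick_fastest(run, candidates, repeats=5, timer=time_once):
    """Return (best_config, {config: median_seconds}) over the candidates."""
    medians = {}
    for config in candidates:
        timer(run, config)  # warm-up call, result discarded
        medians[config] = statistics.median(timer(run, config) for _ in range(repeats))
    best = min(medians, key=medians.get)
    return best, medians
```

The median is used rather than the mean so that a single slow outlier run does not decide the result. The `timer` parameter exists only so that a device-side timer can be swapped in for wall-clock time.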

  • > It's a huge pain, and very sensitive to code changes.

    Optimization is very often like that. Making things generic, uniform and simple typically has a performance penalty - and you use your GPU because you care about that stuff.

  • Sounds like the sort of thing that would lend itself to runtime optimization.

    • I'm not too informed on the details, but iirc drivers _do_ try to optimize shaders in the background and then, when ready, swap in a better version. But I doubt they do stuff like change the threadgroup size: the programmer might assume a certain size, and their shader would break if it changed. Also, drivers doing background work means unpredictable performance and stuttering, which developers really don't like.

      Someone correct me if I'm wrong, maybe drivers don't do this anymore.

It would be nice if this also included terms that are often used by Nvidia that apparently come from computer architecture (?) but are basically foreign to software engineers, like “scoreboard” or “math pipe”.

FINALLY. Nvidia's always been pretty craptacular when it comes to their documentation. It's really hard to read unless you already know their internal names for, well, just about everything.

  • Nvidia isn't very big on open source either. Most CUDA libraries are still closed source. I think this might eventually be their downfall, because people want to know what they are working with. For example, with PyTorch, I can profile the library against my use case and then decide to modify the official library to get some bespoke optimization. With CUDA, if I need to do that, I need to start from scratch and guess whether the library behind the API already has such optimizations.

    • NVIDIA does have a bunch of FOSS libraries - like CUB and Thrust (now part of CCCL). But they tend to suffer from "not invented here" syndrome [1], so they seem to avoid collaboration on FOSS they don't manage/control by themselves.

      I have a bit of a chip on my shoulder here, since I've been trying to pitch my Modern C++ API wrappers to them for years, and even though I've gotten some appreciative comments from individuals, they have shown zero interest.

      https://github.com/eyalroz/cuda-api-wrappers/

      There is also their driver, which is supposedly "open source", but actually none of the logic is exposed to you. Their runtime library is closed too, their management utility (nvidia-smi), their LLVM-based compiler, their profilers, their OpenCL stack :-(

      I must say they do have relatively extensive documentation, even if it doesn't cover everything.

      [1] - https://en.wikipedia.org/wiki/Not_invented_here

Oh hey, I wrote this!

Thanks for sharing it.

  • Looks nice. I'm not sure if this is the place for it, but what I am always searching for is a very concise table of the different GPUs available with approximate compute power and costs. Lists such as Wikipedia's [1] are way too complicated.

    [1] https://en.wikipedia.org/wiki/List_of_Nvidia_graphics_proces...

    • Yeah, there's a tension between showing enough information to be useful for driving decisions and hiding enough information.

      For example, "compute capability" sounds like it'd be what you need, but it's actually more of a software versioning index :(

      Was thinking of splitting the difference by collecting up the quoted arithmetic (FLOP/s) and memory bandwidths from the manufacturer datasheets. But there's caveats there too, e.g. the dreaded "With sparsity" asterisk on the Tensor Core FLOP/s of recent generations.

  • Thank you for this.

    Any chance you could just make it a single long webpage (as opposed to making me click through one page at a time)?

    For some reason on my iPad the links don’t always work the first time I click them.

  • Great work. Nice aesthetic.

    "These groups of threads, known as warps , are switched out on a per clock cycle basis — roughly one nanosecond. CPU thread context switches, on the other hand, take few hundred to a few thousand clock cycles"

    I would note that Intel's SMT does do something very similar (2 hw threads). Others, like the Xeon Phi, would round-robin 4 threads on a single core.

    • SMT isn't really that, is it?

      SMT allows for concurrent execution of both threads (thus an independent front-end, especially for fetch and decode), and certain core resources are statically partitioned, unlike a warp being scheduled on an SM.

      I'm not a graphics expert but warps seem closer to run-time/dynamic VLIW than SMT.

    • Thanks!

      > Intel's SMT does do something very similar (2 hw threads)

      Yeah that's a good point. One thing I learned from looking at both hardware stacks more closely was that they aren't as different as they seem at first -- lots of the same ideas or techniques are used, but in different ways.

  • Thanks! As an old (retired) programmer I was hoping a good intro to GPUs would turn up. Now, I don't suppose you could add 'ink on paper' to the color options? Gray on light gray, with medium gray highlighting, is hard on old eyes. While I never want to see P7 phosphor green again. And I suppose a zipfile of the whole thing, for local reading and archive, would be out of the question?

Really great work. A suggestion for a next post: estimating the VRAM requirements for running models locally. Especially with different architectures and different quants, it always gets confusing, and even online calculators give different answers. I haven't found a really good deep dive on this yet.
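
For what it's worth, the usual back-of-envelope version of that VRAM calculation is weights plus KV cache plus some slack. The sketch below assumes a standard transformer with a per-layer key/value cache; the `overhead` factor is a guess rather than a measured number, and real quantization formats store extra scale data, so the effective bits per weight are usually a bit higher than the nominal figure.

```python
GIB = 1024 ** 3


def weight_bytes(n_params, bits_per_weight):
    """Bytes needed to hold the weights at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2, batch=1):
    """Bytes for the KV cache; the leading 2 is one key and one value tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value * batch


def estimate_vram_gib(n_params, bits_per_weight, n_layers, n_kv_heads, head_dim,
                      context_len, overhead=1.1):
    """Rough total in GiB; overhead is an assumed fudge factor for activations and buffers."""
    total = weight_bytes(n_params, bits_per_weight) + kv_cache_bytes(
        n_layers, n_kv_heads, head_dim, context_len)
    return total * overhead / GIB


# Hypothetical 7B-parameter model at 4 bits per weight with a 4096-token context.
print(round(estimate_vram_gib(7_000_000_000, 4, 32, 8, 128, 4096), 2))
```

In this hypothetical configuration the weights dominate at 3.5 GB, while the fp16 KV cache adds exactly 0.5 GiB; at long contexts or with many KV heads the balance shifts toward the cache, which is one reason calculators with different assumptions disagree.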

Incredible work, thank you so much! This will hopefully break down more barriers to entry for newcomers wanting to work with GPUs!

  • Thanks for the kind words! I still feel like one of those newcomers myself :)

    Now that so many more people are running workloads, including critical ones, on GPUs, it feels much more important that a base level of knowledge and intuition is broadly disseminated -- kinda like how most engineers basically grok database index management, even if they couldn't write a high-performance B+ tree from scratch. Hope this document helps that along!

Is there a plain text / markdown / html version?

  • I would also like to see a PDF that has all the text in one place, presented linearly. This looks like a very worthwhile read, but waiting a few seconds for two paragraphs to load is a very frustrating user experience.

    • A few seconds is way longer than we intended! When I click around, all pages after the first load in milliseconds.

      Do you have any script blockers, browser cache settings, or extensions that might mess with navigation?

      > would also like to see a PDF that has all the text in one place, presented linearly

      Yeah, good idea! I think a PDF with links so that it's still easy to cross-reference terms would get the best of both worlds.
