Comment by dragontamer

1 year ago

There's a lot here that seems to misunderstand GPUs and SIMD.

Note that raytracing is a very dynamic problem, where the GPU isn't sure if a ray hits a geometry or if it misses. When it hits, the ray needs to bounce, possibly multiple times.

Various implementations of raytracing exist: recursion, dynamic parallelism, whatever. It's all there.

Now the software / compilers aren't ready (outside of specialized situations like Microsoft's DirectX Raytracing, which compiles down to a very intriguing threading model). But what was accomplished with DirectX can be done in other situations.

-------

Connection Machine is before my time, but there's no way I'd consider that 80s hardware to be comparable to AVX2 let alone a modern GPU.

Connection Machine was a 1-bit computer for crying out loud, just up to 65,536 of them in parallel.

Xeon Phi (roughly 70 Intel Atom cores) is slower and weaker than a modern 192-core EPYC chip.

-------

Today's machines are better. A lot better than the past machines. I cannot believe any serious programmer would complain about the level of parallelism we have today and wax poetic about historic and archaic computers.

The problems I'm having are very different from those of raytracing. Sure, it's dynamic, but at a fine granularity, so the problems you run into are divergence, and often also wanting function pointers, which don't work well in a SIMT model. By contrast, the way I'm doing 2D there's basically no divergence (monoids are cool that way), but there is a need to schedule dynamically at a coarser (workgroup) level.
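
A minimal CPU-side sketch of the monoid point, assuming nothing about the actual 2D pipeline (the `monoid_scan` name and the double-buffering are illustrative only): when the per-element work is a single associative combine, every lane executes the identical instruction sequence at every step and only the data differs, so there is nothing to diverge on.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hillis-Steele inclusive scan over an arbitrary monoid `op`.
// Every "lane" runs the identical combine at every step; only the data
// differs, which is why monoid-structured passes have no SIMT divergence.
template <typename T, typename Op>
std::vector<T> monoid_scan(std::vector<T> lanes, Op op) {
    for (std::size_t stride = 1; stride < lanes.size(); stride *= 2) {
        std::vector<T> next = lanes;                    // ping-pong buffer, as in shared memory
        for (std::size_t i = stride; i < lanes.size(); ++i)
            next[i] = op(lanes[i - stride], lanes[i]);  // all lanes in lockstep
        lanes = std::move(next);
    }
    return lanes;
}
```

With `+` this is an ordinary prefix sum; the same code path handles min, matrix products, or clip/blend state, with no branching anywhere.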

But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.

The problem with the GPU raytracing work is that they built hardware and driver support for the specific problem, rather than more general primitives on which you could build not only raytracing but other applications. The same story goes for video encoding. Continuing that direction leads to unmanageable complexity.

Of course today's machines are better, they have orders of magnitude more transistors, and crystallize a ton of knowledge on how to build efficient, powerful machines. But from a design aesthetic perspective, they're becoming junkheaps of special-case logic. I do think there's something we can learn from the paths not taken, even if, quite obviously, it doesn't make sense to simply duplicate older designs.

  • > But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.

    True. Allocation just seems to be a "forced sequential" operation, a "stop the world, figure out what RAM is available" kind of thing.

    If you can work with pre-allocated buffers, then GPUs work by reading from lists (consume operations) and outputting to lists (append operations). This can be done with gather / scatter, or more precisely stream expansion and stream compaction, in a grossly parallel manner.
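
    To make the stream-compaction half concrete, here's a minimal serial sketch (the `compact` name is mine): flag survivors, exclusive-scan the flags to assign each survivor a dense output slot, then scatter. On a GPU all three phases are data-parallel writes into a pre-allocated buffer, with no per-element allocation.

```cpp
#include <cstddef>
#include <vector>

// Stream compaction into a pre-sized output buffer: exclusive-scan the
// keep() flags to give each surviving element a dense output slot, then
// scatter. Both loops are fully data-parallel on a GPU.
std::vector<int> compact(const std::vector<int>& in, bool (*keep)(int)) {
    std::vector<std::size_t> slot(in.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {  // exclusive prefix sum (a parallel scan on GPU)
        slot[i] = count;
        if (keep(in[i])) ++count;
    }
    std::vector<int> out(count);                   // the pre-allocated "append buffer"
    for (std::size_t i = 0; i < in.size(); ++i)    // scatter survivors to their slots
        if (keep(in[i])) out[slot[i]] = in[i];
    return out;
}
```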

    ---------

    If that's not enough "memory management" for you, then yeah, the CPU is the better device to work with. At which point I again point back to the 192-core EPYC Zen5c example: we have grossly parallel CPUs today if you need them, just a few clicks away to rent from cloud providers like Amazon or Azure.

    GPUs are good at certain things (and I consider them the pinnacle of "Connection Machine" style programming; today's GPUs are just far more parallel, far easier to program, and far faster than the old 1980s stuff).

    Some problems cannot be split up (ex: web requests are so unique I cannot imagine they'd ever be programmed into a GPU due to their divergence). However CPUs still exist for that.

  • > But the biggest problem I'm having is management of buffer space for intermediate objects

    My advice for right now (barring new APIs), if you can get away with it, is to pre-allocate a large scratch buffer for as big of a workload as you will have over the program's life, and then have shaders virtually sub-allocate space within that buffer.
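
    A sketch of that sub-allocation scheme (the `ScratchArena` layout and names are illustrative, not a specific API): one big buffer sized up front, plus an atomic bump counter, which on the GPU would be an atomicAdd against a counter in device memory.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>

// One scratch buffer sized up front; shaders "allocate" by atomically
// bumping a shared offset. There is no free(): the whole arena is reset
// between dispatches/frames, which is what makes it GPU-friendly.
struct ScratchArena {
    std::size_t capacity;
    std::atomic<std::size_t> head{0};

    // Returns a byte offset into the big buffer, or SIZE_MAX on exhaustion
    // (callers must check -- overflowing silently is how you corrupt frames).
    std::size_t alloc(std::size_t bytes, std::size_t align = 16) {
        std::size_t rounded = (bytes + align - 1) & ~(align - 1);
        std::size_t off = head.fetch_add(rounded);
        return off + rounded <= capacity ? off : SIZE_MAX;
    }
    void reset() { head.store(0); }  // recycle the arena for the next workload
};
```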

  • Agreed, there are two different problems being described here.

    1. Divergence of threads within a workgroup/SM/whatever

    2. Dynamically scheduling new workloads (i.e. dispatches, draws, etc) in response to the output of a previous workload

    Raytracing is problem #1 (and has its own solutions, like shader execution reordering), while Raph is talking about problem #2.

    • > Raytracing is problem #1 (and has its own solutions, like shader execution reordering)

      The "solution" to Raytracing (ignoring hardware acceleration like shader reordering), is stream compaction and stream expansion.

          if (ray_hit) {
              push(hits_array, currentRay);
          } else {
              push(miss_array, currentRay);
          }
      

      If you are willing to have lots of loops inside a shader (not always possible due to Windows's roughly 2 second watchdog timeout), you can write while(hits_array is not empty) style code, allowing your 1024-thread workgroup to keep processing the hits queue, handling all of the rays' bounces without true recursion.
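
      A serial sketch of that drain-the-queue pattern (names are mine; real shader code would use append/consume buffers rather than std::vector): bounced rays go back onto a worklist instead of a call stack, and the loop runs until the list empties or a bounce cap is hit.

```cpp
#include <vector>

struct Ray { int depth; int id; };

// Worklist "recursion": hits re-enqueue a bounced ray; misses (or rays at
// the bounce cap) write their result. The outer loop runs until no live
// rays remain -- the while(hits_array is not empty) pattern.
int trace_all(std::vector<Ray> work, bool (*hits)(const Ray&), int max_depth) {
    int shaded = 0;
    while (!work.empty()) {
        std::vector<Ray> next;                        // next wavefront's hit queue
        for (const Ray& r : work) {
            if (hits(r) && r.depth + 1 < max_depth)
                next.push_back({r.depth + 1, r.id});  // bounce: push, don't recurse
            else
                ++shaded;                             // terminate: write the result
        }
        work = std::move(next);
    }
    return shaded;
}
```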

      --------

      The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:

          if (func1 needs to be called next){ 
              push(func1, dataToContinue);
          } else if (func2 needs to be called next){ 
              push(func2, dataToContinue);
          } else if (func3 needs to be called next){ 
              push(func3, dataToContinue);
          } else if (func4 needs to be called next){ 
              push(func4, dataToContinue);
          } else if (func5 needs to be called next){ 
              push(func5, dataToContinue);
          }
      

      Now of course we can't grow "too far"; GPUs can't handle divergence very well. But for "small" numbers of next-arrays and "small" amounts of divergence (i.e. I'm assuming that func1 is the most common here, 80%+ or so, so that the buffers remain full), this technique works.
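
      Binning into per-continuation queues can be sketched like this (a serial stand-in; on a GPU each push would be an atomic append, and each queue then runs as its own uniform dispatch, so lanes within a dispatch never diverge):

```cpp
#include <array>
#include <vector>

enum Next { FUNC1, FUNC2, FUNC3, NUM_FUNCS };

// Sort work items into one queue per continuation. Each queue is then
// launched as a separate dispatch whose threads all run the same function,
// trading one big divergent kernel for several uniform ones.
std::array<std::vector<int>, NUM_FUNCS>
bin_by_continuation(const std::vector<int>& items, Next (*next_of)(int)) {
    std::array<std::vector<int>, NUM_FUNCS> queues;
    for (int item : items)  // GPU: atomicAdd a per-queue counter, then scatter
        queues[next_of(item)].push_back(item);
    return queues;
}
```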

      If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (ex: any HTTP server code will be awful on GPUs) and you're forced to use a CPU.

Having talked to many engineers using distributed compute today, my impression is that they think (single-node) parallel compute hasn't changed much since ~2010 or so.

It's quite frustrating, and exacerbated by frequent intro-level CUDA blog posts which often just repeat what they've read.

re: raytracing, this might be crazy but, do you think we could use RT cores to accelerate control flow on the GPU? That would be hilarious!

  • RT cores? No. Too primitive and specific.

    But there does seem to be a generalization here of the raytracing software ecosystem. I don't know how much software / hardware needs to advance here, but we are at the point where Intel RT cores are passing stack pointers / instruction pointers between shaders (!!!). Yes, through specialist hardware, but surely this can be generalized into something awesome in the future?

    ------

    For now, I'm happy with stream expansion / stream compaction and looping over consume buffers and producer/append buffers.