Comment by raphlinus

1 year ago

The problems I'm having are very different from those for raytracing. Sure, it's dynamic, but at a fine granularity, so the problems you run into are divergence, and often also wanting function pointers, which don't work well in a SIMT model. By contrast, the way I'm doing 2D there's basically no divergence (monoids are cool that way), but there is a need to schedule dynamically at a coarser (workgroup) level.

But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.

The problem with the GPU raytracing work is that they built hardware and driver support for the specific problem, rather than more general primitives on which you could build not only raytracing but other applications. The same story goes for video encoding. Continuing that direction leads to unmanageable complexity.

Of course today's machines are better, they have orders of magnitude more transistors, and crystallize a ton of knowledge on how to build efficient, powerful machines. But from a design aesthetic perspective, they're becoming junkheaps of special-case logic. I do think there's something we can learn from the paths not taken, even if, quite obviously, it doesn't make sense to simply duplicate older designs.

> But the biggest problem I'm having is management of buffer space for intermediate objects. That's not relevant to the core of raytracing because you're fundamentally just accumulating an integral, then writing out the answer for a single pixel at the end.

True. Allocation just seems to be a "forced sequential" operation, a "stop the world, figure out what RAM is available" kind of thing.

If you can work with pre-allocated buffers, then GPUs work by reading from lists (consume operations), and then outputting to lists (append operations). Which can be done with gather / scatter, or more precisely stream-expansion and stream-compaction in a grossly parallel manner.

---------

If that's not enough "memory management" for you, then yeah, the CPU is the better device to work with. At which point I'd again point back to the 192-core EPYC Zen5c example: we have massively parallel CPUs today if you need them, just a few clicks away to rent from cloud providers like Amazon or Azure.

GPUs are good at certain things (and I consider them the pinnacle of "Connection Machine" style programming; today's GPUs are just far more parallel, far easier to program, and far faster than the old 1980s stuff).

Some problems cannot be split up (ex: web requests are so unique I cannot imagine they'd ever be programmed onto a GPU, due to their divergence). However, CPUs still exist for that.

> But the biggest problem I'm having is management of buffer space for intermediate objects

My advice for right now (barring new APIs), if you can get away with it, is to pre-allocate a large scratch buffer for as big of a workload as you will have over the program's life, and then have shaders virtually sub-allocate space within that buffer.
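A minimal sketch of that sub-allocation scheme, assuming a simple bump-pointer discipline that gets reset each frame (the `ScratchAllocator` class and its names are illustrative; on the GPU the cursor would be a single counter in a buffer, bumped with atomicAdd so many threads can allocate concurrently):

```python
# Sketch of virtually sub-allocating inside one pre-allocated scratch buffer.
# On the GPU, "cursor" would be a uint in a device buffer and the bump would
# be atomicAdd, so thousands of threads can grab regions concurrently.

class ScratchAllocator:
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.cursor = 0  # next free byte offset within the scratch buffer

    def alloc(self, nbytes):
        """Reserve nbytes; returns the offset, or None if scratch is full."""
        offset = self.cursor          # GPU: offset = atomicAdd(&cursor, nbytes)
        if offset + nbytes > self.capacity:
            return None               # out of scratch space for this workload
        self.cursor += nbytes
        return offset

    def reset(self):
        """Free everything at once, e.g. at the start of each frame."""
        self.cursor = 0

scratch = ScratchAllocator(1 << 20)   # 1 MiB scratch, sized for the worst case
a = scratch.alloc(256)                # first region starts at offset 0
b = scratch.alloc(1024)               # next region starts right after it
```

The trade-off is that nothing is individually freed: you size the buffer for the worst case over the program's life and reclaim everything in bulk with a reset, which is exactly why this sidesteps the "stop the world" problem above.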

Agreed, there are two different problems being described here.

1. Divergence of threads within a workgroup/SM/whatever

2. Dynamically scheduling new workloads (i.e. dispatches, draws, etc) in response to the output of a previous workload

Raytracing is problem #1 (and has its own solutions, like shader execution reordering), while Raph is talking about problem #2.

  • > Raytracing is problem #1 (and has its own solutions, like shader execution reordering)

    The "solution" to Raytracing (ignoring hardware acceleration like shader reordering), is stream compaction and stream expansion.

        if (ray hit) {
            push(hits_array, currentRay);
        } else {
            push(miss_array, currentRay);
        }
    

    If you are willing to have lots of loops inside of a shader (not always possible due to Windows's 2 second GPU watchdog timeout), you can write while(hits_array is not empty) style code, allowing your 1024-thread workgroup to keep consuming all of the hits and efficiently processing all of the rays recursively.

    --------

    The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:

        if (func1 needs to be called next) {
            push(func1, dataToContinue);
        } else if (func2 needs to be called next) {
            push(func2, dataToContinue);
        } else if (func3 needs to be called next) {
            push(func3, dataToContinue);
        } else if (func4 needs to be called next) {
            push(func4, dataToContinue);
        } else if (func5 needs to be called next) {
            push(func5, dataToContinue);
        }
    

    Now of course we can't grow "too far"; GPUs can't handle divergence very well. But for "small" numbers of next-arrays and "small" amounts of divergence (i.e., assuming func1 is the most common here, 80%+ or so, so that the buffers remain full), this technique works.

    If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (ex: any HTTP server code will be awful on GPUs) and you're forced to use a CPU.
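    A CPU-side sketch of the two ideas above combined: per-function "next" queues standing in for function pointers, drained in a while(not empty) loop. (Everything here is illustrative; on a real GPU each queue would be a device buffer with an atomic append counter, and each drain pass would be one coherent dispatch per non-empty queue.)

    ```python
    # Sketch: queues replace function pointers, and a drain loop replaces
    # recursion. Each pass processes every queued item, and items push their
    # continuations into the queue of whichever function should run next.

    def shade(x):
        # Placeholder: rays with energy left spawn one follow-up bounce.
        return [] if x <= 1 else [('bounce', x - 1)]

    def bounce(x):
        # Placeholder: a bounce always continues into another shade pass.
        return [('shade', x)]

    funcs = {'shade': shade, 'bounce': bounce}
    queues = {'shade': [3], 'bounce': []}       # initial work items
    processed = 0

    while any(queues.values()):                 # while(queues are not empty)
        next_queues = {name: [] for name in funcs}
        for name, items in queues.items():      # one coherent dispatch per queue
            for item in items:                  # one GPU thread per item
                processed += 1
                for nxt_name, nxt_item in funcs[name](item):
                    next_queues[nxt_name].append(nxt_item)  # push(funcN, data)
        queues = next_queues
    ```

    Because every item in a given queue runs the same function, each dispatch is divergence-free; the divergence cost is paid only once, at the moment of routing items into queues.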