Comment by jms55

1 year ago

Agreed, there are two different problems being described here.

1. Divergence of threads within a workgroup/SM/whatever

2. Dynamically scheduling new workloads (i.e. dispatches, draws, etc) in response to the output of a previous workload

Raytracing is problem #1 (and has its own solutions, like shader execution reordering), while Raph is talking about problem #2.


dragontamer 1 year ago

> Raytracing is problem #1 (and has its own solutions, like shader execution reordering)

The "solution" to raytracing (ignoring hardware acceleration like shader execution reordering) is stream compaction and stream expansion: each ray gets pushed onto a dense array according to what happened to it.

    if (rayHit) {
        push(hits_array, currentRay);
    } else {
        push(miss_array, currentRay);
    }

If you are willing to run long loops inside a shader (not always possible because of Windows's default 2-second GPU timeout), you can write while (hits_array is not empty) style code, so that your 1024-thread workgroup keeps pulling rays off the hit array and processes all of them efficiently, with every thread doing the same kind of work.
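To make that loop concrete, here is a minimal CPU-side sketch in C++. Everything named here (Ray, ray_hit, compact, trace_all) is made up for illustration, and the "hit" test is just a bounce counter, not real intersection code. On an actual GPU each push would be something like an atomic append or a prefix-sum scatter into a preallocated buffer, not a std::vector push_back.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical ray record; bounces_left stands in for real scene state.
struct Ray {
    int id;
    int bounces_left;
};

// Stand-in for the intersection test: a ray "hits" while it has bounces left.
bool ray_hit(const Ray& r) { return r.bounces_left > 0; }

// One compaction pass: every input ray is pushed onto exactly one dense array.
void compact(const std::vector<Ray>& in,
             std::vector<Ray>& hits,
             std::vector<Ray>& misses) {
    const std::size_t before = hits.size() + misses.size();
    for (const Ray& r : in) {
        if (ray_hit(r)) {
            hits.push_back(Ray{r.id, r.bounces_left - 1});  // needs another pass
        } else {
            misses.push_back(r);                            // done bouncing
        }
    }
    // Invariant: no ray is dropped or duplicated by the pass.
    assert(hits.size() + misses.size() == before + in.size());
}

// The while (hits_array is not empty) loop: keep reprocessing the compacted
// hit array until nothing is left. Returns the number of passes taken.
int trace_all(std::vector<Ray> rays, std::vector<Ray>& finished) {
    int passes = 0;
    while (!rays.empty()) {
        std::vector<Ray> hits;
        compact(rays, hits, finished);
        rays.swap(hits);  // next pass only sees rays that still have work
        ++passes;
    }
    return passes;
}
```

The point of the swap is that each pass iterates over a dense array of rays that all need the same treatment, instead of a sparse array where most lanes would sit idle.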

--------

The important tidbit is that this technique generalizes. If you have 5 functions that need to be "called" after your current processing, then it becomes:

    if (func1 needs to be called next){ 
        push(func1, dataToContinue);
    } else if (func2 needs to be called next){ 
        push(func2, dataToContinue);
    } else if (func3 needs to be called next){ 
        push(func3, dataToContinue);
    } else if (func4 needs to be called next){ 
        push(func4, dataToContinue);
    } else if (func5 needs to be called next){ 
        push(func5, dataToContinue);
    }

Now of course this can't grow "too far": GPUs don't handle divergence very well. But for a "small" number of next-arrays and a "small" amount of divergence (i.e., I'm assuming that func1 is by far the most common case, like 80%+, so that the buffers remain full), this technique works.
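A rough CPU-side sketch of the multi-queue version in C++, under the same caveats: Item, next_func, apply_func, route and run are hypothetical names, the routing rule is an arbitrary toy, and three queues stand in for the five above. On a GPU each queue would be a buffer with an atomic counter, and each batch would be its own dispatch or loop iteration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kNumFuncs = 3;  // a "small" number of next-arrays

struct Item { int value; };

using Queues = std::array<std::vector<Item>, kNumFuncs>;

// Toy routing rule: which function does this item need next? -1 means finished.
int next_func(const Item& it) {
    if (it.value <= 0) return -1;
    if (it.value % 2 == 0) return 0;  // func0: halve
    if (it.value % 3 == 0) return 1;  // func1: divide by three
    return 2;                         // func2: decrement
}

// The "functions" themselves; each strictly shrinks a positive value.
Item apply_func(std::size_t f, Item it) {
    assert(f < kNumFuncs);
    switch (f) {
        case 0:  return Item{it.value / 2};
        case 1:  return Item{it.value / 3};
        default: return Item{it.value - 1};
    }
}

// The if / else-if chain: push the item onto the queue of its next function.
void route(const Item& it, Queues& queues, std::vector<Item>& done) {
    const int f = next_func(it);
    if (f < 0) {
        done.push_back(it);
    } else {
        queues[static_cast<std::size_t>(f)].push_back(it);
    }
}

// Drain the queues. Each batch holds items for one function only, so all
// work inside a batch follows the same code path. Returns batches executed.
std::size_t run(const std::vector<Item>& input, std::vector<Item>& done) {
    Queues queues;
    for (const Item& it : input) route(it, queues, done);

    std::size_t batches = 0;
    bool any = true;
    while (any) {
        any = false;
        for (std::size_t f = 0; f < kNumFuncs; ++f) {
            if (queues[f].empty()) continue;
            std::vector<Item> batch;
            batch.swap(queues[f]);  // take the whole queue as one batch
            for (const Item& it : batch) route(apply_func(f, it), queues, done);
            ++batches;
            any = true;
        }
    }
    return batches;
}
```

If one queue gets most of the traffic, its batches stay large and the hardware stays busy. With many thinly populated queues you end up launching lots of tiny batches, which is where the approach stops paying off.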

If you have more divergence than that, then you need to think more carefully about how to continue. Maybe GPUs are a bad fit (e.g., any HTTP server code will be awful on GPUs) and you're forced to use a CPU.