Comment by Athas

2 hours ago

I have never used Qualcomm's OpenCL driver, but it is not unknown to get the NVIDIA driver into a state where some kernel is stuck running, or some memory remains allocated long after the originating process has terminated. This is usually down to application bugs, sure - but no application bug should be able to wedge the driver. Code will certainly be buggy while GPU kernels are being developed, and hence the driver should be robust. For that matter, maybe I am running untrusted GPU code, and anytime the driver gets into a weird or stuck state, I am uneasy that it might not be many steps away from an exploitable situation. We don't accept this in CPU operating systems, so why should it be acceptable for GPUs? We are talking about unprivileged code - nothing runs as root. Ever since I first got into GPGPU programming (about 2012), I have noticed that GPU drivers are far less robust in the face of buggy code than what I was accustomed to on CPUs.

It is also common in my experience for buggy GPU code to crash the display if the GPU is simultaneously used to drive a monitor. This usually happens with kernels that go into infinite loops or run into out-of-memory conditions.

It is my understanding that modern GPU drivers even have watchdog systems that notice when they get stuck and forcibly reboot them, which to me is mere symptom treatment.