Comment by david-gpu

3 hours ago

How do you determine that the bugs you run into are located in the Nvidia drivers and libraries?

Way back when I wrote the OpenCL driver at Qualcomm, we would frequently get bug reports from customers complaining about our code. During my tenure, every single one of them was root-caused as an application bug. That is unsurprising, considering that our code was backed by an extensive test suite and theirs wasn't.

Not to say that our code was perfect, of course. But people have a tendency to blame GPU drivers when the problem often lies elsewhere.

I have never used Qualcomm's OpenCL driver, but it is not unheard of to get the NVIDIA driver into a state where some kernel is stuck running, or some memory remains allocated long after the originating process has terminated. This is usually down to application bugs, sure - but no application bug should be able to wedge the driver. Code under development will certainly be buggy, GPU kernels included, and hence the driver should be robust. For that matter, maybe I am running untrusted GPU code, and any time the driver gets into a weird or stuck state, I am uneasy that it might not be many steps away from an exploitable situation. We don't accept this in CPU operating systems, so why should it be acceptable for GPUs? We are talking about unprivileged code - nothing runs as root. Ever since I first got into GPGPU programming (around 2012), I have noticed that GPU drivers are far less robust in the face of buggy code than I was accustomed to.

It is also common in my experience for buggy GPU code to crash the display if the GPU is simultaneously used to drive a monitor. This usually happens with kernels that go into infinite loops, or under out-of-memory conditions.

It is my understanding that modern GPU drivers even have watchdog systems that notice when the GPU gets stuck and forcibly reboot it, which to me merely treats the symptom.

If you're big enough, you can get direct access to Nvidia engineers. They are usually transparent when they find out the bug was in their software, and they will send you a patched version to try to resolve the issue.