Comment by connicpu

4 hours ago

When you run CUDA at scale, dealing with Nvidia driver and library bugs takes up a disgustingly large percentage of engineer time. I don't know a lot of people who would be looking forward to relying on more Nvidia libraries.

How do you determine that the bugs you run into are located in the Nvidia drivers and libraries?

Way back when I wrote the OpenCL driver at Qualcomm, we would frequently get bug reports from customers complaining about our code. During my tenure, every single one of them was root-caused as an application bug. Unsurprisingly, considering that our code was backed by an extensive test suite and their code wasn't.

Not to say that our code was perfect, of course. But people have a tendency to blame GPU drivers when the problem often lies elsewhere.

  • If you're big enough, you can get direct access to Nvidia engineers. They are usually transparent when they find that the bug was in their software, and they send you a patched version to try to resolve the issue.