I'm not too informed on the details, but IIRC drivers _do_ try to optimize shaders in the background, and then swap in the better version when it's ready. But I doubt they do things like change the threadgroup size; the programmer might assume a certain size, and their shader would break if it changed. Also, drivers doing background work means unpredictable performance and stuttering, which developers really don't like.
Someone correct me if I'm wrong; maybe drivers don't do this anymore.
Well, if the user isn't going to be sharing the GPU with another task, then you could push things back to install time. In other words: at install time you benchmark the relevant shaders, rewrite as necessary, recompile, and save the results accordingly. Now the user has a version of your shaders optimized for their particular configuration. Since installation times are already somewhat lengthy anyway, you can be reasonably certain that no one will miss an extra minute or two spent on benchmarks, especially if it results in installing optimized code.
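As a rough sketch of what that install-time pass might look like (Python, with `compile_shader` and `dispatch` as hypothetical stand-ins for whatever GPU API you're actually using):

```python
import json
import time

# Hypothetical install-time autotuner: benchmark each shader variant
# (here, different threadgroup sizes) and persist the winner so every
# later launch skips the benchmark entirely.
THREADGROUP_SIZES = [32, 64, 128, 256]

def benchmark(dispatch, iterations=100):
    """Average wall-clock time of one dispatch over `iterations` runs."""
    start = time.perf_counter()
    for _ in range(iterations):
        dispatch()
    return (time.perf_counter() - start) / iterations

def autotune(compile_shader, sizes=THREADGROUP_SIZES):
    # compile_shader(threadgroup_size=n) is assumed to return a callable
    # that launches the compiled kernel.
    timings = {size: benchmark(compile_shader(threadgroup_size=size))
               for size in sizes}
    best = min(timings, key=timings.get)
    # Save the choice alongside the installed shaders.
    with open("shader_config.json", "w") as f:
        json.dump({"threadgroup_size": best}, f)
    return best
```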
I'm coming from the neural network world rather than the shader world, but I'd say you're absolutely right!
Right now, NNs and their workloads are changing quickly enough that people tend to prefer runtime optimization (like the dynamic/JIT compilation provided by PyTorch's compiler), but when you're confident you understand the workload and have the know-how, you can do static compilation (e.g. with ONNX or TensorRT).
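For anyone who hasn't seen the two styles side by side, here's a minimal sketch (the model and shapes are just placeholders):

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
example = torch.randn(1, 128)

# Runtime optimization: torch.compile specializes kernels lazily, based
# on the shapes and dtypes it actually observes at the first call.
fast_model = torch.compile(model)
fast_model(example)  # first call triggers compilation

# Static compilation: export a fixed graph ahead of time; a runtime like
# TensorRT can then optimize the exported model offline.
torch.onnx.export(model, (example,), "model.onnx")
```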
I work on a serverless infrastructure product that gets used for NN inference on GPUs, so we're very interested in ways to amortize as much of that compilation and configuration work as possible. Maybe someday we'll even have something like what Redshift has in their query engine -- pre-compiled binaries cached across users.
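A cross-user cache like that could be keyed on the kernel source plus the target architecture, so one tenant's compilation gets reused by anyone with the same GPU. A hedged sketch, with `compile_fn` standing in for the real compiler:

```python
import hashlib
import pathlib

CACHE_DIR = pathlib.Path("/var/cache/kernels")  # shared across tenants

def cache_key(source: str, gpu_arch: str) -> str:
    # Same source + same architecture => same binary, so hash both.
    return hashlib.sha256(f"{gpu_arch}\n{source}".encode()).hexdigest()

def get_or_compile(source: str, gpu_arch: str, compile_fn) -> bytes:
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / cache_key(source, gpu_arch)
    if path.exists():
        return path.read_bytes()           # cache hit: skip compilation
    binary = compile_fn(source, gpu_arch)  # expensive, done once per key
    path.write_bytes(binary)
    return binary
```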
This reminds me of the dreaded Vulkan shader compilation dialog you get when you try to play some games after a driver update.
This is how autotuning often works, yes.
But which programming languages are most amenable to automatic runtime optimization?
Should we go back to FORTRAN?
The sad answer is... probably none of them. Runtime optimization has always been one of those things that sends most programmers running away screaming, and those who make languages never seem to come from the ranks of those who understand the clear utility of it.
Squeak Smalltalk has several automatic runtime optimizations and compilers: a JIT, a parallel load-balancing compiler [1], an adaptive compiler [2], and a metacircular simulator and bytecode virtual machine written in itself that lets you do runtime optimisations on GPUs. The bytecodes are, of course, replaced with native GPU instructions at runtime.
There are dozens of scientific papers, and active research is still being done [1].
I've worked on automatic parallel runtime optimizations and adaptive compilers since 1981. We make reconfigurable hardware (chips and wafers) that also optimises at runtime.
Truffle/GraalVM is very rigid and overly complicated [6].
With a meta-compiler like OMeta [5] or Ohm, we can give any programming language runtime adaptive compilation for GPUs [3][4].
I'm currently adapting my adaptive compiler to the Apple Silicon M4 GPU and Neural Engine to unlock the trillions of operations per second these chips can do.
I can adapt them to more NVIDIA GPUs with the information from the website in the title. Thank you very much, charles_irl! I would love to be able to save the whole website as a single PDF.
I can optimise your GPU software a lot with my adaptive compilers. It will cost less than 100K in labour to speed up your GPU code by a factor of 4-8 at least; sometimes I see a 30-50x speedup.
[1] https://www.youtube.com/watch?v=wDhnjEQyuDk
[2] https://www.youtube.com/watch?v=CfYnzVxdwZE
[3] https://tinlizzie.org/~ohshima/shadama2/
[4] https://github.com/yoshikiohshima/Shadama
[5] http://www.tinlizzie.org/ometa/
[6] https://github.com/NVIDIA/grcuda