Comment by jroesch
2 years ago
Long-time compiler hacker/engineer and compiler/programming language PhD here: all great points. Worth saying out loud that many of the reasons this stuff is slow are not bad code; it's that many of the best algorithms are in higher complexity classes and just scale poorly with program/translation-unit size. For example, when I was working on `rustc`, a big challenge was the heavy reliance on inlining: inlining has a cost, and it also increases program size, making everything else take longer as well.
I feel like Go already went through a whole saga of this where the community started with "LLVM and SSA are bad and slow", then a few years go by and they end up building their own SSA IR and spending a bunch of time trying to bring compilation time closer to what it was before as it made everything much slower.
> I feel like Go already went through a whole saga of this where the community started with "LLVM and SSA are bad and slow"
I've been a contributor since the Go compiler was a tree-based C program and I've never heard anyone say that. What they said (and it's in the Go FAQ page) is: "At the beginning of the project we considered using LLVM for gc but decided it was too large and slow to meet our performance goals." [1]
If you're building a language with the explicit goal of fast compilation, it's objectively true that starting out with LLVM is not the best approach. You'll get incredible runtime performance of the generated code from day one, but NOT fast compilation. The Go makers chose a different tradeoff.
> and they end up building their own SSA IR
They switched to an SSA IR because it was a good idea to begin with, after an initial phase with the tree-based prototype. I've also never heard anyone argue that "SSA is bad", despite what you claim. The first compiler was tree-based because they reused a simple tree-based C compiler from Plan 9.
> building their own SSA IR and spending a bunch of time trying to bring compilation time closer to what it was before as it made everything much slower
The new compiler was ported to Go (machine-rewritten from C), and that's the main reason it was ~2x slower than the old compiler. It's not due to the switch to an SSA IR.
[1] https://go.dev/doc/faq#Implementation
> Worth saying out loud that many reasons why this stuff is slow is not due to bad code, its due to the fact that many of the best algorithms are in higher complexity classes and just scale poorly with program/translation unit size
Yes, this is totally true.
It's also possible to affect this in theory, but it's hard in practice. Usually you shoot for making it O(N^2) (or whatever) where N is the number of variables you want to try to optimize, instead of N being the number of blocks.
The complete GVN and GVN-PRE in LLVM is theoretically N^3 or N^4, but as engineered it's often much faster (while doing more optimization), or at least not more than a few percent slower than the existing O(N^2) GVN. It achieves this by being sparser in most cases. The old GVN has to iterate the algorithm, and iterates non-sparsely: everything is reprocessed because it doesn't know what can change. The new one reprocesses only the things that could have changed, using fine-grained dependency tracking (something akin to how sparse constant propagation works). This is often the best you can do.
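To make the sparse-iteration idea concrete, here's a minimal sketch in the style of sparse conditional constant propagation. This is NOT LLVM's actual GVN code; the tiny IR (tuples of `(dest, op, operands)`), the lattice sentinels, and all the names are made up for illustration. The point is the worklist: instead of re-running the whole analysis whenever anything changes, we keep explicit def-use edges and revisit only the instructions whose inputs actually changed.

```python
from collections import deque

def sparse_const_prop(instrs):
    """Hypothetical IR: instrs is a list of (dest, op, operands) tuples.
    'const' carries a literal; 'add'/'mul' read prior values by name."""
    TOP = object()     # no information yet
    BOTTOM = object()  # known not-a-constant
    value = {dest: TOP for dest, _, _ in instrs}

    # Build explicit def -> users edges: the fine-grained dependency graph.
    uses = {}
    for ins in instrs:
        for a in ins[2]:
            if isinstance(a, str):
                uses.setdefault(a, []).append(ins)

    def evaluate(ins):
        dest, op, args = ins
        if op == "const":
            return args[0]
        vals = [value[a] if isinstance(a, str) else a for a in args]
        if any(v is BOTTOM for v in vals):
            return BOTTOM
        if any(v is TOP for v in vals):
            return TOP
        return {"add": lambda x, y: x + y,
                "mul": lambda x, y: x * y}[op](*vals)

    work = deque(instrs)
    while work:
        ins = work.popleft()
        dest = ins[0]
        new = evaluate(ins)
        if new != value[dest]:               # only when a fact changes...
            value[dest] = new
            work.extend(uses.get(dest, []))  # ...revisit just the dependents
    return {d: v for d, v in value.items() if v is not TOP and v is not BOTTOM}
```

The dense version would re-evaluate every instruction on every pass; here an instruction is re-examined only when one of its operands' lattice values moves, which is what keeps the iteration close to linear in the number of def-use edges.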
I will say it's possible to affect these time bounds in theory, because if you go down the single static rabbit hole, you realize it's more generally applicable. Kenny (and others) proved this in his thesis about sparse dataflow; that was the whole point. SSA is just one example of a form that enables things to be linear time (and in fact, you can do it without SSA at all, using interval methods and the like). There are papers that show this about other SS* forms, with examples, but they were mostly (sadly) ignored.
Concretely, for most optimizations like PRE/etc, there are linear-time single static transforms of the IR that will make the optimization linear time (or at least remove a factor of N for you) by explicitly exposing the dataflow in a way that matches what the optimization algorithm wants.
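To picture what such a transform might look like, here is a toy sketch of one of these forms, "single static use": every value with several uses is split so each use reads its own fresh copy, pushing all the fan-out into trivial `copy` instructions. This is an illustration on a made-up tuple IR, not any real compiler's pass; the naming scheme (`a.1`, `a.2`, ...) is invented here.

```python
def to_single_static_use(instrs):
    """Hypothetical IR: instrs is a list of (dest, op, operands) tuples.
    After the transform, every non-copy instruction reads operands that
    have exactly one use, so a pass can walk def->use edges with no fan-out."""
    use_count = {}
    for _, _, args in instrs:
        for a in args:
            if isinstance(a, str):
                use_count[a] = use_count.get(a, 0) + 1

    seen = {}   # how many copies of each multiply-used value we've minted
    out = []
    for dest, op, args in instrs:
        new_args = []
        for a in args:
            if isinstance(a, str) and use_count[a] > 1:
                seen[a] = seen.get(a, 0) + 1
                copy = f"{a}.{seen[a]}"           # fresh name for this one use
                out.append((copy, "copy", (a,)))  # fan-out lives in copies only
                new_args.append(copy)
            else:
                new_args.append(a)
        out.append((dest, op, tuple(new_args)))
    return out
```

The transform itself is one linear walk over the instructions; the cost the parent comment points out is that you'd have to rerun a rewrite like this (or one of its siblings for exposed uses/defs) before each optimization that wants its particular shape, and the inserted copies are not free.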
This is awesome in theory, like really cool when you think about it (if you are a compiler nerd), but also completely impractical (at least, AFAIK). Rewriting the IR into the exact form each optimization requires (single static use, single static exposed uses, single static upwards-exposed uses, single static exposed defs, etc.) is much more expensive in practice than well-engineered, higher-complexity, "N in the number of variables" optimizations.
Equality saturation stands more of a chance, IMHO.