Comment by dcrazy

1 day ago

One advantage of contemporary bytecode implementations is that many optimizations can occur in the “middle end”—which is to say on the IR itself, before lowering to ISA.

Yes, many optimizations can be done at the vendor-neutral IR level, but my point is that on GPUs those tend to be some of the computationally cheaper ones. In my experience, the vast majority of the compiler's time was spent at levels lower than that: register allocation (on GPUs, "registers" are normally shared by all resident waves, so there's a trade-off between using fewer registers and allowing more waves), or reordering instructions to hide latency from asynchronous units and higher-latency instructions. And all of those are very hardware-specific.
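The register/occupancy trade-off can be sketched with a quick back-of-the-envelope calculation. The numbers below (register file size, wave width, wave cap) are illustrative assumptions, not any real GPU's specs:

```python
# Hedged sketch of the register-allocation trade-off described above.
# All hardware parameters are assumptions for illustration only.

REGISTER_FILE_PER_CU = 65536   # assumed 32-bit registers shared by all waves on one compute unit
THREADS_PER_WAVE = 32          # assumed wave (warp) width
MAX_WAVES = 64                 # assumed hardware cap on resident waves

def resident_waves(regs_per_thread: int) -> int:
    """How many waves fit when each thread needs `regs_per_thread` registers."""
    regs_per_wave = regs_per_thread * THREADS_PER_WAVE
    return min(MAX_WAVES, REGISTER_FILE_PER_CU // regs_per_wave)

# Fewer registers per thread -> more resident waves -> more latency hiding,
# but potentially more spills; more registers -> fewer waves to switch between.
for r in (32, 64, 128, 255):
    print(f"{r:3d} regs/thread -> {resident_waves(r):2d} waves")
```

So under these assumed numbers, doubling per-thread register use from 64 to 128 halves the number of waves available to hide latency, which is exactly the kind of global, hardware-specific cost model a vendor-neutral middle end can't see.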

It's a classic example of the "first 50%" being relatively easy: an "optimizing" compiler can get pretty good results with fairly simple constant propagation, inlining, and dead-code elimination. But that second 50% takes so much more effort.
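To make the "easy first 50%" concrete, here is a toy sketch of constant propagation plus dead-code elimination over an invented three-address IR (the tuple format and opcodes are assumptions for illustration, not any real compiler's IR):

```python
# Hedged sketch: constant propagation + dead-code elimination on a toy IR.
# Each instruction is (dest, op, a, b); operands are variable names or ints.

def optimize(ir, live_out):
    env = {}   # variables known to hold constants
    out = []
    for dest, op, a, b in ir:
        a = env.get(a, a)  # propagate known constants into operands
        b = env.get(b, b)
        if op == "const":
            env[dest] = a
        elif isinstance(a, int) and isinstance(b, int):
            # both operands constant: fold at compile time
            env[dest] = {"add": a + b, "mul": a * b}[op]
        else:
            out.append((dest, op, a, b))
    # dead-code elimination: walk backwards, keep only instructions
    # whose results are live (needed by live_out or a kept instruction)
    live = set(live_out)
    kept = []
    for dest, op, a, b in reversed(out):
        if dest in live:
            kept.append((dest, op, a, b))
            live |= {v for v in (a, b) if isinstance(v, str)}
    kept.reverse()
    return kept

program = [
    ("c1", "const", 4, None),
    ("c2", "const", 2, None),
    ("t0", "mul", "c1", "c2"),    # folds to 8
    ("t1", "add", "x", "t0"),     # becomes add x, 8
    ("t2", "add", "x", "x"),      # dead: t2 is never used
]
print(optimize(program, live_out=["t1"]))
```

A few dozen lines already fold `4 * 2`, propagate the result, and drop the dead instruction; none of it needs to know anything about the target hardware, which is why passes like these live comfortably in a vendor-neutral middle end.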