Comment by imtringued

1 day ago

VLIW isn't bottlenecked on the lack of a sufficiently advanced compiler. That's actually nonsense. VLIW is non-viable because it exposes micro-architectural details of the CPU in the ISA. Taking advantage of each new CPU generation requires a new version of the ISA, and every application has to be recompiled for the specific CPU you have.

In Arm and x86 land you get CPUs that run even old code around 10% faster with every CPU generation. Meanwhile, with AMD's NPUs, say, XDNA1 land is not the same as XDNA2 land: you have to rewrite your algorithms if you want things to get faster. The promised VLIW benefits themselves work as intended. You can run loads, stores and vector operations all in the same cycle without any issue.

Then there is the crazy world of TTAs (transport triggered architectures), which take the VLIW stuff and crank it up to extreme levels. Instead of having a conventional operation-based ISA, you directly control the buses connecting the function units: the program is a list of data transfers, and an operation happens as a side effect of moving data into a unit's trigger port. If you have multiple buses, you can perform multiple transfers in parallel. You can build a custom set of function units specifically for your application, and the compiler will automatically turn your C code into transfer instructions for any design you can come up with.
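To make the "program the buses" idea concrete, here is a toy sketch in Python. Everything in it (`Adder`, `RegisterFile`, `run`, the port names) is made up for illustration and is not how any real TTA toolchain represents programs. Each instruction is a list of moves, one per bus, and the add fires when something is written to the adder's trigger port. One simplifying assumption: all sources are read first, then writes are applied in bus order.

```python
class RegisterFile:
    """Hypothetical register file; ports are just register indices."""
    def __init__(self, values):
        self.regs = list(values)

    def read(self, port):
        return self.regs[port]

    def write(self, port, value):
        self.regs[port] = value


class Adder:
    """Hypothetical function unit: writing 'trigger' fires the addition."""
    def __init__(self):
        self.operand = 0
        self.result = 0

    def read(self, port):
        if port != "result":
            raise KeyError(port)
        return self.result

    def write(self, port, value):
        if port == "operand":
            self.operand = value
        elif port == "trigger":
            self.result = self.operand + value
        else:
            raise KeyError(port)


def run(units, program):
    """Execute one instruction (a list of moves) per cycle; return cycle count."""
    for instruction in program:
        # Read every source first so parallel moves see start-of-cycle state.
        values = [units[su].read(sp) for (su, sp), _ in instruction]
        # Toy assumption: writes are applied in bus order.
        for value, (_, (du, dp)) in zip(values, instruction):
            units[du].write(dp, value)
    return len(program)


# r2 = r0 + r1 with two buses: both operands move in the same cycle.
two_bus = [
    [(("rf", 0), ("add", "operand")), (("rf", 1), ("add", "trigger"))],
    [(("add", "result"), ("rf", 2))],
]

# The same computation with a single bus: one move per cycle.
one_bus = [
    [(("rf", 0), ("add", "operand"))],
    [(("rf", 1), ("add", "trigger"))],
    [(("add", "result"), ("rf", 2))],
]

for name, program in (("one_bus", one_bus), ("two_bus", two_bus)):
    units = {"rf": RegisterFile([3, 4, 0]), "add": Adder()}
    cycles = run(units, program)
    print(name, cycles, units["rf"].regs)
```

The second bus saves one cycle here, but the result move still has to wait for the add, which is a small preview of why extra buses stop helping.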

Now the obvious first idea is to just crank up the number of buses from 1 to 2 to 4 to 8, but then you notice that the number of cycles needed to run your algorithm doesn't go down as quickly as you'd hoped. There are a number of reasons for this, but if I had to choose a reason that favors dynamic scheduling, it would be that most sequential programming languages, especially C, don't expose enough parallelism for the compiler to exploit at compile time. If you could prove to the compiler that two functions f and g do not mutate the same data (local mutable state is allowed), then the TTA compiler could produce a mixed instruction stream that blends both functions so they execute in parallel rather than sequentially. This is similar to SMT, except that the overhead is zero and you can run nano-threads that last only a few nanoseconds.
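A rough way to see both effects is a toy list scheduler, sketched below in Python. The names `schedule` and `chain` are hypothetical, and the model is deliberately crude: every transfer takes one cycle, each bus carries one transfer per cycle, and f and g are each modeled as a strictly sequential chain of six dependent transfers.

```python
def schedule(deps, buses):
    """Greedy list scheduling; returns the number of cycles used.

    deps maps each transfer to the set of transfers that must have
    completed in an earlier cycle.
    """
    if buses < 1:
        raise ValueError("need at least one bus")
    done = set()
    cycles = 0
    while len(done) < len(deps):
        ready = [m for m in deps if m not in done and deps[m] <= done]
        if not ready:
            raise ValueError("dependency cycle")
        done.update(ready[:buses])  # at most one transfer per bus per cycle
        cycles += 1
    return cycles


def chain(name, length):
    """A strictly sequential chain: transfer i depends on transfer i-1."""
    return {
        f"{name}{i}": ({f"{name}{i - 1}"} if i else set())
        for i in range(length)
    }


f = chain("f", 6)
g = chain("g", 6)
both = {**f, **g}  # f and g proven independent, so they can be blended

for buses in (1, 2, 4, 8):
    print(buses, schedule(f, buses), schedule(both, buses))
# prints:
# 1 6 12
# 2 6 6
# 4 6 6
# 8 6 6
```

On its own, f takes 6 cycles no matter how many buses you add, because the dependency chain is the limit. Blending f and g fills the second bus and halves the combined time from 12 to 6 cycles, and after that more buses do nothing unless you find more independent work.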