SIMD City: Auto-Vectorisation

2 months ago (xania.org)

19 comments

brewmarche

Auto-vectorization is consistently one of the least predictable optimization passes, which is rather awful, since when it doesn't trigger, your functions are suddenly >3x slower. This drives people to more explicit SIMD coding, from direct assembly like in FFmpeg to wrappers providing some cross-platform support like Google's Highway.

It's just really hard to detect and exploit profitable and safe vectorization opportunities. The theory behind some of the optimizers is beautiful, though: https://en.wikipedia.org/wiki/Polytope_model
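
To make the "safe opportunity" problem a bit more concrete, here is a minimal C sketch (the function names are made up for illustration). In the first loop every iteration is independent, and `restrict` promises that the arrays do not overlap, so an optimizing compiler is usually able to vectorize it. The second loop reads what the previous iteration wrote, so a straightforward vectorizer will typically leave it scalar. Whether either one actually vectorizes depends on the compiler, flags and target.

```c
#include <stddef.h>

/* Hypothetical example: each out[i] depends only on a[i] and b[i].
 * `restrict` tells the compiler the three arrays never overlap, so it
 * does not need a runtime aliasing check before using SIMD. */
void add_arrays(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}

/* Hypothetical example: iteration i reads the value iteration i-1 just
 * wrote (a loop-carried dependency), so the iterations cannot simply be
 * executed four or eight at a time. */
void prefix_sum(float *data, size_t n)
{
    for (size_t i = 1; i < n; i++)
        data[i] += data[i - 1];
}
```

Both functions are ordinary scalar C; the difference is only in what the compiler is allowed to assume about them.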

drob518 1 month ago
I’m always shocked at what the compiler is able to deduce wrt vectorization. When it works, it’s magical.
- dwattttt 1 month ago
  
  In the abstract, it's the inverse of the argument that "configuration formats should be programming languages"; the more general something can be, the less you can assume about it.
  A way to express the operations you want, without unintentionally expressing operations you don't want, would be much easier to auto-vectorise. I'm not familiar enough with SIMD to give examples, but if a transformation would preserve the operations you want, but observably be different to what you coded, I assume it's not eligible (unless you enable flags that allow a compiler to perform optimisations that produce code that's not quite what you wrote).
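
A floating-point sum is probably the simplest example of this. The sketch below (function names are hypothetical) compares the loop as written with a hand-written stand-in for what a 4-wide vectorized sum would compute: four partial sums that are combined at the end. Floating-point addition is not associative, so the two can return different values, which is why a compiler has to keep the first form unless something like `-ffast-math` gives it permission to reorder.

```c
#include <stddef.h>

/* Sums strictly left to right, exactly as the source code says. */
float sum_sequential(const float *x, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Hypothetical scalar model of a 4-lane SIMD reduction: lane j
 * accumulates elements j, j+4, j+8, ... and the lanes are added at the
 * end. Assumes n is a multiple of 4; any tail elements are ignored. */
float sum_four_lanes(const float *x, size_t n)
{
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t i = 0; i + 3 < n; i += 4)
        for (size_t j = 0; j < 4; j++)
            lane[j] += x[i + j];
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

For the input `{1e8f, 1, 1, 1, -1e8f, 1, 1, 1}`, assuming arithmetic is evaluated in IEEE single precision, `sum_sequential` returns 3 (the first three 1s are absorbed by 1e8) while `sum_four_lanes` returns 6 (the two large values cancel inside one lane first). Neither is "wrong", but they are observably different.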
  
gnufx 1 month ago

In most of the cases I've seen where people felt the need for intrinsics, GCC will vectorize it -- at least if it's allowed to use the same potentially-incorrect semantics as the intrinsics version -- and potentially for multiple micro-architectures with GCC's target_clones attribute. GCC's -fopt-... flags can give you a lot of information on vectorization and other optimizations, if maybe couched in somewhat compiler-internal jargon, and other compilers probably do something similar. Vectorizing compilers have existed for 50-ish years, so it's well-established stuff.
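
A minimal sketch of the `target_clones` approach, assuming GCC (or a recent Clang) on x86-64 Linux with glibc. The function name, the macro name and the choice of `"avx2"` are illustrative assumptions. The compiler emits one version of the function per listed target plus a resolver that selects one at run time; the preprocessor guard makes the code fall back to a single plain function everywhere else.

```c
#include <stdlib.h> /* size_t; on glibc this also defines __GLIBC__ */

/* MULTIVERSION is a hypothetical helper macro for this example only. */
#if defined(__GNUC__) && defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__)
#define MULTIVERSION __attribute__((target_clones("avx2", "default")))
#else
#define MULTIVERSION /* no function multi-versioning on this platform */
#endif

/* Plain scalar source. Each clone is compiled (and, if the compiler
 * chooses to, auto-vectorized) for its own instruction set. */
MULTIVERSION
void scale_add(float *restrict dst, const float *restrict src,
               float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Callers just call `scale_add` as usual; the dispatch is invisible at the call site.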
vkazanov 1 month ago
It seems that proper vectorization requires a different kind of language, something similar to CUDA and the like, not a general-purpose scalar kind of language.
I remember Intel had something like it but it went nowhere.
- astrange 1 month ago
  
  That is ispc.
  You don't want "vectorization" though, you either want
  a) a code generation tool that generates exactly the platform-specific code you want and can't silently fail.
  b) at least a fundamentally vectorized language that does "scalarization" instead of the other way round.
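
This is not ispc, but the vector extensions in GCC and Clang give a small taste of option (b) in plain C: you write the arithmetic on vector types, and the compiler lowers it to whatever the target offers, falling back to scalar code where there is no suitable SIMD instruction. A minimal sketch, with made-up type and function names:

```c
/* Four packed floats (16 bytes). `vector_size` is a GCC/Clang
 * extension, not standard C. The name v4f is hypothetical. */
typedef float v4f __attribute__((vector_size(16)));

/* Element-wise a * b + c on all four lanes. The source is already
 * expressed in vector form, so there is no auto-vectorization step
 * that could silently fail to trigger. */
v4f madd4(v4f a, v4f b, v4f c)
{
    return a * b + c;
}
```

Individual lanes can be read with ordinary subscripts, e.g. `r[0]`, which is handy for checking results.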
  
webdevver 1 month ago
i am quietly waiting for the "bitter lesson" to hit compilers: a large language model that speaks in LLVM IR tokens that takes unoptimized IR from the frontend, and spits out an optimized version that works better than any "classical" compiler.
the only thing that might stand in the way is a dependence on reproducibility, but it seems like a weak argument: We already have a long history of people trying to push build reproducibility, and for better or worse they never got traction.
same story with LTO and PGO: I can't think of anyone other than browser and compiler people who are using either (and even they took a long time before they started using them). judged to be more effort than it's worth i guess.
- robertknight 1 month ago
  
  The major constraint is that the compiler needs to guarantee that transformations produce semantically identical results to the unoptimized code, with the exception of undefined behavior or specific opt-outs (e.g. `-ffast-math` rules).
  An ML model can fit into existing compiler pipelines anywhere that heuristics are used though, as an alternative to PGO.
- ultrahax 1 month ago
  
  Us video game folks are big fans of LTO, PGO, FDO, etc.
  
- Earw0rm 1 month ago
  
  How's it going in the other direction - LLMs as disassemblers?
  I tried it a year or so back and was sorta disappointed at the results beyond simple cases, but it feels like an area that could improve rapidly.
- gnufx 1 month ago
  
  Fedora, for instance, is built with LTO, except for some packages which it breaks. I've forgotten the details of where I had to turn it off.

mgaunard 1 month ago

You don't necessarily need to lay out your data in arrays to use SIMD, though it certainly makes things more straightforward.
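
One possible reading of this, as a hedged sketch: a single struct with four float fields can be treated as one 4-wide vector, with no array in sight. The example below assumes the GCC/Clang `vector_size` extension, and the type and function names are made up for illustration.

```c
#include <string.h>

/* Four packed floats; GCC/Clang extension, hypothetical name. */
typedef float v4f __attribute__((vector_size(16)));

/* Hypothetical record type: a lone struct, not an element of an array. */
struct particle { float x, y, z, w; };

/* The memcpy calls below rely on the struct being exactly 16 bytes. */
_Static_assert(sizeof(struct particle) == sizeof(v4f),
               "particle must be four tightly packed floats");

/* Scales all four fields with a single vector multiply. memcpy is the
 * portable way to reinterpret the bytes; compilers normally turn it
 * into a plain vector load and store. */
struct particle scale(struct particle p, float k)
{
    v4f v;
    v4f kk = {k, k, k, k}; /* broadcast the scalar to every lane */
    memcpy(&v, &p, sizeof v);
    v = v * kk;
    memcpy(&p, &v, sizeof v);
    return p;
}
```

Laying the data out in arrays still tends to be simpler once there are many such records, since one field can then fill a whole vector register.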