Comment by saagarjha

1 year ago

While the instructions are different, every platform will have some implementation of the basic operations (load, store, broadcast, etc.), perhaps with a different bit width. With those you can typically write an accelerated baseline implementation. (Sometimes these are autogenerated or use some sort of portable intrinsics, but usually they aren't.) If you want to go past that, things get more complicated, and you will need specialized algorithms for whatever instructions are available.
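
To make the "baseline" idea concrete, here is a rough sketch of how a thin wrapper over load/store/broadcast might look. The names `vec4f`, `vec_load`, `vec_store`, `vec_broadcast`, `vec_add`, `HAVE_VEC4F`, and `add_scalar` are made up for illustration, and it assumes 128-bit SSE2 or NEON with a scalar fallback. Real projects will structure this differently.

```c
#include <stddef.h>

/* Hypothetical wrapper layer: map four basic operations onto whatever
 * 128-bit SIMD the compiler target provides. */
#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128 vec4f;
static inline vec4f vec_load(const float *p)      { return _mm_loadu_ps(p); }
static inline void  vec_store(float *p, vec4f v)  { _mm_storeu_ps(p, v); }
static inline vec4f vec_broadcast(float x)        { return _mm_set1_ps(x); }
static inline vec4f vec_add(vec4f a, vec4f b)     { return _mm_add_ps(a, b); }
#define HAVE_VEC4F 1
#elif defined(__ARM_NEON)
#include <arm_neon.h>
typedef float32x4_t vec4f;
static inline vec4f vec_load(const float *p)      { return vld1q_f32(p); }
static inline void  vec_store(float *p, vec4f v)  { vst1q_f32(p, v); }
static inline vec4f vec_broadcast(float x)        { return vdupq_n_f32(x); }
static inline vec4f vec_add(vec4f a, vec4f b)     { return vaddq_f32(a, b); }
#define HAVE_VEC4F 1
#endif

/* dst[i] = src[i] + k for i in [0, n). The algorithm is written once
 * against the wrappers; only the wrappers differ per platform. */
void add_scalar(float *dst, const float *src, float k, size_t n) {
    size_t i = 0;
#ifdef HAVE_VEC4F
    vec4f vk = vec_broadcast(k);            /* k replicated into all 4 lanes */
    for (; i + 4 <= n; i += 4)
        vec_store(dst + i, vec_add(vec_load(src + i), vk));
#endif
    for (; i < n; i++)                      /* scalar tail (or full fallback) */
        dst[i] = src[i] + k;
}
```

Calling `add_scalar(dst, src, 0.5f, 7)` handles the first four elements with one vector add (when SIMD is available) and the remaining three in the scalar loop. Anything beyond this kind of elementwise loop, such as shuffles, table lookups, or masked operations, is where the per-platform specialization tends to start.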