Comment by ack_complete
2 months ago
AVX(2)'s main advantage is just the 256-bit width, since many of its operations are simply two 128-bit ops side by side (which gets weird for ops like VPALIGNR), and cross-lane operations are expensive. NEON, on the other hand, only supports 128-bit ops, so 256-bit AVX operations need to be split by the emulator.
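To make the VPALIGNR point concrete, here is a rough scalar model in plain C. It is only a sketch of the per-lane behavior, not how any particular emulator implements it, and the function names `palignr128_model` and `vpalignr256_model` are made up for illustration.

```c
#include <stdint.h>

/* Scalar model of 128-bit PALIGNR: treat hi:lo as one 32-byte value,
 * shift it right by `imm` bytes, and keep the low 16 bytes.
 * Bytes shifted in from beyond the 32-byte value are zero. */
void palignr128_model(uint8_t dst[16], const uint8_t hi[16],
                      const uint8_t lo[16], unsigned imm)
{
    uint8_t tmp[16]; /* temporary so dst may alias hi or lo */
    for (unsigned i = 0; i < 16; i++) {
        unsigned j = i + imm;
        tmp[i] = (j < 16) ? lo[j] : (j < 32) ? hi[j - 16] : 0;
    }
    for (unsigned i = 0; i < 16; i++)
        dst[i] = tmp[i];
}

/* Model of the 256-bit form: NOT a 32-byte-wide shift, just the
 * 128-bit operation applied to each 16-byte lane independently. */
void vpalignr256_model(uint8_t dst[32], const uint8_t a[32],
                       const uint8_t b[32], unsigned imm)
{
    palignr128_model(dst,      a,      b,      imm); /* low lane  */
    palignr128_model(dst + 16, a + 16, b + 16, imm); /* high lane */
}
```

With `a` holding bytes 100..131, `b` holding 0..31 and a shift of 4, byte 12 of the result is 100 (from `a[0]`), not 16 (from `b[16]`) as a true 256-bit shift would give. The two halves never see each other's data, which is also why this kind of op splits cleanly into two 128-bit ops.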
I'd expect more of a gain from enabling FMA, but that's assuming the program actually got built to use FMA -- it needs to either use FMA explicitly or be compiled with floating-point relaxations that allow separate multiplies and adds to be contracted into FMAs. Oryon has 4 x 128-bit NEON pipes with 3c latency fadd and 4c latency fmul/fma, so it easily ends up latency-bound unless there are plenty of independent calculations.
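A minimal sketch of what "independent calculations" means in practice, in plain C. Going by the figures above, and assuming each pipe can start one op per cycle (my assumption, not something stated here), it would take roughly 4 pipes x 4 cycles = 16 vector FMAs in flight to keep the FP units busy. The function names and `NUM_CHAINS = 4` are illustrative only, not tuned for any core.

```c
#include <stddef.h>

/* One accumulator: every multiply-add depends on the previous one,
 * so the loop can retire at most one result per add/FMA latency. */
float dot_single_chain(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i]; /* may become an FMA if contraction is allowed */
    return acc;
}

#define NUM_CHAINS 4 /* illustrative; a real kernel might want more */

/* Several accumulators: the chains do not depend on each other, so
 * the hardware can overlap their latencies. */
float dot_multi_chain(const float *a, const float *b, size_t n)
{
    float acc[NUM_CHAINS] = {0.0f};
    size_t i = 0;

    for (; i + NUM_CHAINS <= n; i += NUM_CHAINS)
        for (size_t k = 0; k < NUM_CHAINS; k++)
            acc[k] += a[i + k] * b[i + k];

    /* Combine the partial sums, then handle the leftover elements. */
    float total = 0.0f;
    for (size_t k = 0; k < NUM_CHAINS; k++)
        total += acc[k];
    for (; i < n; i++)
        total += a[i] * b[i];
    return total;
}
```

The second version sums in a different order, so its rounding can differ from the first. That is the same reason a compiler will not make this transformation on its own without relaxed floating-point settings.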