Comment by ack_complete

2 months ago

AVX(2)'s main advantage is its 256-bit width, since many of its operations are simply two concatenated 128-bit ops (weird for ops like VPALIGNR, which shifts within each 128-bit lane rather than across the whole register), and cross-lane operations are expensive. NEON, on the other hand, tops out at 128-bit vectors, so each 256-bit AVX operation needs to be split by the emulator.
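To make the VPALIGNR oddity concrete, here's a rough scalar model in C of how I understand the 256-bit form to behave: two independent 128-bit PALIGNRs, one per lane. The function names are made up for illustration, and this is a sketch of the semantics, not how any emulator actually implements it.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of 128-bit PALIGNR: treat a (high half) and b (low half)
 * as one 32-byte value, shift right by imm bytes, keep the low 16 bytes.
 * Bytes shifted in from beyond the 32-byte value are zero.
 * The name palignr128_model is hypothetical. */
static void palignr128_model(uint8_t dst[16], const uint8_t a[16],
                             const uint8_t b[16], unsigned imm)
{
    uint8_t tmp[32];
    memcpy(tmp, b, 16);       /* low 16 bytes come from b  */
    memcpy(tmp + 16, a, 16);  /* high 16 bytes come from a */
    for (unsigned i = 0; i < 16; i++) {
        unsigned j = i + imm;
        dst[i] = (j < 32) ? tmp[j] : 0;
    }
}

/* Scalar model of 256-bit VPALIGNR: the same operation applied to each
 * 128-bit lane separately. Nothing crosses from the upper lane into the
 * lower one, which is also why it splits cleanly into two 128-bit ops.
 * The name vpalignr256_model is hypothetical. */
static void vpalignr256_model(uint8_t dst[32], const uint8_t a[32],
                              const uint8_t b[32], unsigned imm)
{
    palignr128_model(dst,      a,      b,      imm);  /* lane 0 */
    palignr128_model(dst + 16, a + 16, b + 16, imm);  /* lane 1 */
}
```

With b holding bytes 0..31, a holding bytes 32..63 and imm = 4, byte 12 of the result is 32 (pulled from a's low lane), not 16 as a true 256-bit-wide shift would give.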

I'd expect more of a gain from enabling FMA, but that's assuming the program actually got built to use FMA -- it needs to either use it explicitly or be compiled with floating-point relaxations that allow a multiply and add to be contracted into one. Oryon has 4 x 128-bit NEON pipes with 3c latency fadd and 4c latency fmul/fma, so it easily ends up latency bottlenecked unless there are plenty of independent calculations.
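As a rough illustration of both points, here's a minimal C sketch of a dot product written two ways. The function names and the NACC constant are made up for the example. Whether `acc += a[i] * b[i]` actually becomes an FMA depends on the compiler's contraction setting (`-ffp-contract` on GCC and Clang); calling `fmaf()` from `<math.h>` would be the explicit route.

```c
#include <stddef.h>

/* One accumulator: each multiply-add depends on the previous result, so
 * the loop can't go faster than one multiply-add per fma (or fmul+fadd)
 * latency, no matter how many pipes the core has. */
static float dot_chained(const float *a, const float *b, size_t n)
{
    float acc = 0.0f;
    for (size_t i = 0; i < n; i++)
        acc += a[i] * b[i];   /* may be contracted to an FMA if allowed */
    return acc;
}

/* Hypothetical accumulator count, chosen only to keep the example short. */
#define NACC 4

/* Several accumulators: the NACC dependency chains are independent, so
 * the core can overlap them. Note this reorders the additions, so results
 * can differ from dot_chained in the last bits for general inputs. */
static float dot_split(const float *a, const float *b, size_t n)
{
    float acc[NACC] = {0.0f};
    size_t i = 0;
    for (; i + NACC <= n; i += NACC)
        for (size_t k = 0; k < NACC; k++)
            acc[k] += a[i + k] * b[i + k];

    float sum = 0.0f;
    for (size_t k = 0; k < NACC; k++)
        sum += acc[k];
    for (; i < n; i++)        /* leftover elements */
        sum += a[i] * b[i];
    return sum;
}
```

Back-of-envelope, assuming each of the 4 pipes can start one FMA per cycle: 4 pipes x 4c latency means about 16 independent vector FMAs in flight to keep them all busy, so a single chain like `dot_chained` leaves most of the machine idle.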
