Comment by dzaima
6 hours ago
Unfortunately, that's only vector, and ≤16-bit ints at that, no 32-bit ints; and as the other reply says, nearly non-existent multiply-high which generally makes vectorized div-by-const its own mini-hell (but doing a 2x-width multiply with fixups is still better than the OP 4x-width method).
(...though, x86 does have (v)pmulhw for 16-bit input, so for 16-bit div-by-const the saturating option works out quite well.)
(And, for what it's worth, the lack of 8-bit multiplies on x86 means that the OP method of high-half-of-4x-width-multiply works out nicely for vectorizing dividing 8-bit ints too)
No comments yet
Contribute on Hacker News ↗