
Comment by purplesyringa

13 hours ago

The paper doesn't require a bitshift after multiplication -- it directly uses the high half of the product as the quotient, so it saves at least one tick over the solution you mentioned. And on x86, saturating addition can't be done in a tick and 32->64 zero-extension is implicit, so the distinction is even wider.
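That no-post-shift scheme can be sketched in plain C. This is a sketch of the general round-up recipe (not necessarily the paper's exact constants), assuming a compiler with `unsigned __int128` such as GCC or Clang; `magic` and `div_by_const` are illustrative names:

```c
#include <stdint.h>

/* Round-up magic for dividing a 32-bit n by a constant d (d >= 2):
 * m = ceil(2^64 / d) = floor((2^64 - 1) / d) + 1, which fits in 64 bits. */
static uint64_t magic(uint32_t d) {
    return UINT64_MAX / d + 1;
}

/* The quotient is simply the high 64 bits of the 128-bit product:
 * no post-shift and no fixup, exact for every 32-bit n when d >= 2. */
static uint32_t div_by_const(uint32_t n, uint64_t m) {
    return (uint32_t)(((unsigned __int128)n * m) >> 64);
}
```

The widths here are one concrete instance; the general point is that with a wide enough precomputed reciprocal, the high half of the product already is the quotient.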

> And on x86, saturating addition can't be done in a tick

Perhaps I misunderstand your point, but I am rather sure that in SSE.../AVX... there do exist instructions for saturating addition:

* (V)PADDSB, (V)PADDSW, (V)PADDUSB, (V)PADDUSW

* (V)PHADDSW, (V)PHSUBSW

  • Unfortunately, those are vector-only, and for ≤16-bit ints at that, with no 32-bit variants; and, as the other reply says, multiply-high is nearly non-existent, which generally makes vectorized div-by-const its own mini-hell (though a 2x-width multiply with fixups is still better than the OP's 4x-width method).

    (...though, x86 does have (v)pmulhw for 16-bit input, so for 16-bit div-by-const the saturating option works out quite well.)

    (And, for what it's worth, the lack of 8-bit multiplies on x86 means that the OP method of taking the high half of a 4x-width multiply works out nicely for vectorized division of 8-bit ints too.)

  • On x86, there is no vector instruction to get the upper half of a 64-bit × 64-bit integer product. ARM SVE2 and RISC-V RVV have one; x86 unfortunately does not (and probably won't for a long time, as AVX10 does not add it either).

    • There is AVX-512 IFMA, which recycles the f64 FMA hardware and exists for bignum libraries; it's a 52-bit unsigned multiply that accumulates either the low or the high half of the output into a 64-bit accumulator.

      It's certainly not 64-bit, but it's much more than 32-bit. And it gives you access to the high halves, so you can use it to compute 32x32->64 in vectors, even if only at half the packing density the data would otherwise allow.
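That 32x32->64 reconstruction can be sketched with a scalar model of the two IFMA primitives (vpmadd52luq / vpmadd52huq, i.e. `_mm512_madd52lo_epu64` / `_mm512_madd52hi_epu64`). This is plain C standing in for the per-lane behaviour, assuming a compiler with `unsigned __int128`; the function names are stand-ins, not real intrinsics:

```c
#include <stdint.h>

#define MASK52 ((1ULL << 52) - 1)

/* Model of vpmadd52luq: multiply the low 52 bits of b and c and add
 * the low 52 bits of the 104-bit product to the accumulator a. */
static uint64_t madd52lo(uint64_t a, uint64_t b, uint64_t c) {
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return a + ((uint64_t)p & MASK52);
}

/* Model of vpmadd52huq: same multiply, but add bits 52..103. */
static uint64_t madd52hi(uint64_t a, uint64_t b, uint64_t c) {
    unsigned __int128 p = (unsigned __int128)(b & MASK52) * (c & MASK52);
    return a + (uint64_t)(p >> 52);
}

/* Full 32x32->64 product from the two halves: the product fits in
 * 64 bits, so its bits 52..63 are exactly the low 12 bits that the
 * "high" primitive delivers; glue the pieces back together. */
static uint64_t mul32x32(uint32_t x, uint32_t y) {
    uint64_t lo = madd52lo(0, x, y);
    uint64_t hi = madd52hi(0, x, y);
    return lo | (hi << 52);
}
```

On real hardware each 32-bit input sits zero-extended in a 64-bit lane, which is the "half as packed" cost mentioned above.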