Comment by foltik

13 hours ago

Compilers already optimize division of a 32 bit integer x by a constant d like:

  c = 2^a / d   (rounded up, so truncation errs in the right direction)
  so: d ≈ 2^a / c
  so: x / d = (x * c) / 2^a

And /2^a is equivalent to >>a, so we have:

  x / d = (x * c) >> a

Which is just a multiply and shift, cheaper than integer division.
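To make that concrete, here's a sketch (not actual compiler output, just the shape of the trick) for d = 3, where c = ⌈2^33 / 3⌉ = 0xAAAAAAAB happens to fit in 32 bits, so a 64-bit intermediate product is enough:

```c
#include <stdint.h>

/* Divide a 32-bit unsigned x by 3 via multiply-and-shift.
 * c = ceil(2^33 / 3) = 0xAAAAAAAB, a = 33.
 * This choice of c and a is exact for every 32-bit x. */
static uint32_t div3(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}
```

For divisors like 3 the 32-bit constant is enough; the whole point of what follows is that 7 isn't so lucky.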

The problem is that for some divisors like 7, 14, 21, ... the smallest c that produces an exact result for all 32-bit x is 33 bits wide. That's awkward on 32-bit CPUs: it just barely doesn't fit, but you can account for that extra bit manually with an extra sequence of sub, shift, add.
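For d = 7 that workaround looks something like this (a sketch of the classic sequence; here c = ⌈2^35 / 7⌉ = 2^32 + 0x24924925, so the multiply uses the low 32 bits of c and the implied extra 2^32·x term is folded back in with the sub/shift/add):

```c
#include <stdint.h>

/* Divide a 32-bit unsigned x by 7 using only a 32-bit-result
 * "high half" multiply plus a sub/shift/add fixup.
 * M = 0x24924925 is the low 32 bits of the 33-bit constant. */
static uint32_t div7(uint32_t x) {
    /* high 32 bits of x * M */
    uint32_t t = (uint32_t)(((uint64_t)x * 0x24924925u) >> 32);
    /* We want (x + t) >> 3, but x + t can overflow 32 bits,
     * so compute it as (((x - t) >> 1) + t) >> 2 instead. */
    return (((x - t) >> 1) + t) >> 2;
}
```

That's the extra work the 64-bit version gets to skip.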

What the paper points out is twofold:

1. On 64 bit CPUs the extra sequence isn't required: you can just do a full 64x64 bit multiply and shift the 128 bit product. Back to just a multiply and shift like the original optimization, and already faster than the workaround.
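Sketching that for d = 7 (the 33-bit constant is c = ⌈2^35 / 7⌉ = 0x124924925; `__uint128_t` is a GCC/Clang extension that compiles to a single full-width multiply plus a shift):

```c
#include <stdint.h>

/* Divide a 32-bit unsigned x by 7 with the 33-bit constant held
 * comfortably in a 64-bit register. The product can exceed 64 bits,
 * so take the full 128-bit product before shifting. */
static uint32_t div7_wide(uint32_t x) {
    const uint64_t c = 0x124924925u;  /* ceil(2^35 / 7), 33 bits */
    return (uint32_t)(((__uint128_t)x * c) >> 35);
}
```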

2. Even better, the shift itself can be optimized away. A 64x64 bit multiply stores the 128 bit product in a pair of 64 bit registers, so you get >> 64 essentially for free by just reading the upper register of the pair. That means if you pre-shift the constant to:

  c’ = c << (64 - a) 

Now (x * c’) >> 64 is equivalent to (x * c) >> a, as long as c’ still fits in 64 bits (our 33-bit c shifted by 64 - 35 = 29 is 62 bits, so it does). And the result is just waiting for you in a 64 bit register, no shift needed.
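Concretely, for d = 7 with a = 35 (again a sketch using GCC/Clang's `__uint128_t`; the `>> 64` is just notation for reading the high register of the product, so the compiler emits a single multiply):

```c
#include <stdint.h>

/* Divide a 32-bit unsigned x by 7 with one multiply: pre-shift
 * the constant so the wanted bits land in the high 64-bit register. */
static uint32_t div7_preshift(uint32_t x) {
    /* c' = c << (64 - a): 33 + 29 = 62 bits, still fits in 64 */
    const uint64_t cp = 0x124924925ull << 29;
    /* ">> 64" is free: it's the upper half of the 128-bit product */
    return (uint32_t)(((__uint128_t)x * cp) >> 64);
}
```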

Just one multiply to divide a 32 bit integer by 7. Nice.