Comment by ryao

2 months ago

Here is a short list:

https://graphics.stanford.edu/~seander/bithacks.html

It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.

The signum(x) function that is equivalent to CMP(X, 0) can be done in 3 or 4 instructions depending on your architecture without any comparison operations:

https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/supero...

It is such a famous example, that compilers probably optimize CMP(X, 0) to that, but I have not checked. Coincidentally, the expansion of CMP(X, 0) is on the bit hacks list.

There are a few more superoptimized mathematical operations listed here:

https://www2.cs.arizona.edu/~collberg/Teaching/553/2011/Reso...

Note that the assembly code appears to be for the Motorola 68000 processor and it makes use of flags that are set in edge cases to work.

Finally, there is a list of helpful macros for bit operations that originated in OpenSolaris (as far as I know) here:

https://github.com/freebsd/freebsd-src/blob/master/sys/cddl/...

There used to be an Open Solaris blog post on them, but Oracle has taken it down.

Enjoy!

11 comments

ryao

JdeBP 2 months ago

For an entire book on this stuff, see Henry S. Warren Jr's Hackers Delight. The "three valued compare function" is in chapter 2, for example.

eru 2 months ago

> It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.

I guess this only applies when the compiler knows what version of > you are using?

Eg it might not work in C++ when < and > are overloaded for eg strings?

ryao 2 months ago
My comment had been meant for C, but it should apply to C++ too even when operator overloading is used, provided the comparisons are simple and inlined. If you add overloads for the > and < operators in your string example to a place where they would inline, and the overload compares .length(), this should simplify. For example, godbolt shows that CMP(X, Y) == 0 is optimized to one mov instruction and one cmp instruction despite operator overloads when I implement your string example:
https://godbolt.org/z/nGbPhz86q
If you did not inline the operator overloads and had them in another compilation unit, do not expect this to simplify (unless you use LTO).
If you have compound comparators in the operator overloads (such that on equality in one field, it considers a second for a tie breaker), I would not expect it to simplify, although the compiler could surprise me.
- eru 2 months ago
  
  I was more thinking of eg lexicographic comparisons of strings, not just comparing by length.
  Yes, if you have a smart enough compiler, or a simple enough comparison, this will simplify.
  
  2 replies →
trollbridge 2 months ago
The compiler would resolve that before the optimiser.
- eru 2 months ago
  
  I'd like to see that resolved for eg lexicographic comparison of strings.

kmoser 2 months ago

There's also this classic: https://en.wikipedia.org/wiki/Fast_inverse_square_root

ryao 2 months ago

That is an approximation. If approximations are acceptable, then here is a trick you might like. In loops that call cosf(i * C) and/or sinf(i * C), where i is incremented by 1 on each iteration and C is some constant expression, you can call cosf() and sinf() once (or twice if i starts at something other than 0 or 1) outside of the loop and use the angle addition formula to do accumulation via multiplication and addition inside the loop. The loop will run significantly faster.
Even if you only need one of cosf() or sinf(), many CPUs calculate both values at the same time, so taking the other is free. If you only need single precision values, you can do this in double precision to avoid much of the errors you would get by doing this in single precision.
This trick can be used to accelerate the RoPE relative positional encoding calculations used in inference for llama 3 and likely others. I have done this and seen a measurable speed up, although these calculations are such a small part of inference that it was a small improvement.