
Comment by qingcharles

8 hours ago

I love these incomprehensible magic number optimizations. Every time I see one I wonder how many optimizations like this we missed back in the old days when we were writing all our inner loops in assembly?

Does anyone have a collection of these things?

Here is a short list:

https://graphics.stanford.edu/~seander/bithacks.html

It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.
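As a minimal sketch of how the macro might be used with a UNIX-style comparator, here it is plugged into qsort. The names cmp_int and sort_ints are made up for illustration; only CMP itself comes from the comment above.

```c
#include <stdlib.h>

/* Three-way comparison: evaluates to -1, 0, or 1.
   Each argument is evaluated twice, so avoid side effects like CMP(i++, j). */
#define CMP(X, Y) (((X) > (Y)) - ((X) < (Y)))

/* qsort-style comparator for int (hypothetical name).
   Unlike the common "return x - y" shortcut, this cannot overflow. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    return CMP(x, y);
}

/* Hypothetical helper: sort an int array in ascending order. */
static void sort_ints(int *v, size_t n)
{
    qsort(v, n, sizeof v[0], cmp_int);
}
```

The subtraction shortcut breaks on pairs such as INT_MAX and a negative number, which is the main practical reason to prefer the two-comparison form.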

The signum(x) function that is equivalent to CMP(X, 0) can be done in 3 or 4 instructions depending on your architecture without any comparison operations:

https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/supero...

It is such a famous example that compilers probably optimize CMP(X, 0) to that, but I have not checked. Coincidentally, the expansion of CMP(X, 0) is on the bit hacks list.
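For a rough idea of what a comparison-free signum can look like in C, here is a well-known shift-based formulation. It is not necessarily the instruction sequence from the link above, and it assumes that right-shifting a negative int32_t is an arithmetic shift, which C leaves implementation-defined (mainstream compilers do it that way). The name signum32 is made up for this sketch.

```c
#include <stdint.h>

/* Comparison-free signum for 32-bit integers: returns -1, 0, or 1.
   Assumption: >> on a negative int32_t is an arithmetic (sign-extending) shift. */
static int32_t signum32(int32_t x)
{
    /* All ones (-1) if x is negative, otherwise 0. */
    int32_t neg = x >> 31;

    /* 1 if the two's-complement negation of x has its top bit set,
       i.e. x > 0 or x == INT32_MIN; otherwise 0.
       The negation is done in unsigned arithmetic to avoid signed overflow. */
    int32_t pos = (int32_t)((0u - (uint32_t)x) >> 31);

    /* For INT32_MIN both parts are set, and -1 | 1 is still -1. */
    return neg | pos;
}
```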

There are a few more superoptimized mathematical operations listed here:

https://www2.cs.arizona.edu/~collberg/Teaching/553/2011/Reso...

Note that the assembly code appears to be for the Motorola 68000 processor and it makes use of flags that are set in edge cases to work.

Finally, there is a list of helpful macros for bit operations that originated in OpenSolaris (as far as I know) here:

https://github.com/freebsd/freebsd-src/blob/master/sys/cddl/...

There used to be an OpenSolaris blog post on them, but Oracle has taken it down.

Enjoy!

  • For an entire book on this stuff, see Henry S. Warren Jr.'s Hacker's Delight. The "three valued compare function" is in chapter 2, for example.

We didn't miss them. In those days they weren't optimizations. Multiplications were really expensive.

  • Multiplications of this word length, one should clarify. It's not that multiplication was an inherently more expensive or different operation back then (assuming from context here that the "old days" of coding inner loops in assembly language pre-date even the 32-bit ALU era). Binary multiplication has not changed in millennia. Ancient Egyptians were using the same binary integer multiplication logic 5 millennia ago as ALUs do today.

    It was that generally the fast hardware multiplication operations in ALUs didn't have very many bits in the register word length, so multiplications of wider words had to be done with library functions that did long multiplication in (say) base 256.

    So this code in the headlined article would not be "three instructions" but three calls to internal helper library functions used by the compiler for long-word multiplication, comparison, and bitwise AND. That is not markedly more optimal than three internal helper function calls for the three original modulo operations, and in fact less optimal than the bit-twiddled modulo-powers-of-2 version found halfway down the headlined article, which would only need to check the least significant byte and would not call library functions for two of the 32-bit modulo operations.

    Bonus points to anyone who remembers the helper function names in Microsoft BASIC's runtime library straight off the top of xyr head. It is probably a good thing that I finally seem to have forgotten them. (-: They all began with "B$" as I recall.
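For illustration, the base-256 long multiplication mentioned above might look roughly like this in C. This is a sketch of the general schoolbook technique, not the code of any particular runtime library; the function name mul32_base256 is made up, and it assumes the only hardware multiply available is 8x8 -> 16 bits.

```c
#include <stdint.h>

/* 32x32 -> 32-bit multiplication built only from 8x8 -> 16-bit products,
   schoolbook style with base-256 "digits" (hypothetical helper). */
static uint32_t mul32_base256(uint32_t a, uint32_t b)
{
    uint8_t r[4] = {0, 0, 0, 0};   /* result digits, least significant first */

    for (int i = 0; i < 4; i++) {
        uint8_t ai = (uint8_t)(a >> (8 * i));
        uint16_t carry = 0;

        /* Digits that would land beyond bit 31 are discarded, matching
           ordinary truncating 32-bit multiplication. */
        for (int j = 0; i + j < 4; j++) {
            uint8_t bj = (uint8_t)(b >> (8 * j));

            /* Worst case is 255*255 + 255 + 255 = 65535, so 16 bits suffice. */
            uint16_t t = (uint16_t)(ai * bj + r[i + j] + carry);

            r[i + j] = (uint8_t)t;   /* low byte stays in this digit */
            carry = t >> 8;          /* high byte carries into the next one */
        }
    }

    return (uint32_t)r[0]
         | ((uint32_t)r[1] << 8)
         | ((uint32_t)r[2] << 16)
         | ((uint32_t)r[3] << 24);
}
```

That is up to ten small multiplies plus the carry bookkeeping for a single 32-bit product, which gives a feel for why a "three instruction" trick was no bargain on such hardware.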

  • Related, Computerphile had a video a few months ago where they try to put compute time relative to human time, similar to the way one might visualize an atom by making the proton the size of a golfball. I think it can help put some costs into perspective and really show why branching matters, as well as the great engineering done to hide some of the slowdowns. But definitely some things are being masked simply by the sheer speed of the clock (like how the small size of a proton hides how empty an atom is).

      https://youtube.com/watch?v=PpaQrzoDW2I

  • and divides were worse. (1 cycle add, 10 cycle mult, 60 cycle div)