> return ((y * 1073750999) & 3221352463) <= 126976;
> How does this work? The answer is surprisingly complex.
I don't think anyone is surprised by the complexity of any explanation for that algorithm :D
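For reference, here is a quick harness (my own sketch, not from the article) that brute-force checks the quoted formula against the straightforward Gregorian rule over a modest range; the article spells out the exact range of years for which the magic constants are claimed to hold:

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* The magic-constant check quoted above (constants from the article). */
    static bool is_leap_fast(uint32_t y) {
        return ((y * 1073750999u) & 3221352463u) <= 126976u;
    }

    /* Straightforward Gregorian rule, for comparison. */
    static bool is_leap_naive(uint32_t y) {
        return (y % 4 == 0) && (y % 100 != 0 || y % 400 == 0);
    }

    int main(void) {
        for (uint32_t y = 0; y <= 9999; y++) {  /* conservative sub-range */
            if (is_leap_fast(y) != is_leap_naive(y)) {
                printf("mismatch at %u\n", y);
                return 1;
            }
        }
        printf("ok\n");
        return 0;
    }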
The original function is likely only going to be 3 instructions (xor, test, jne), and only 1 of these depends on a previous instruction. In the "fast" version from the article there are 4 instructions, each depending on the previous one. I'm not surprised it lost in the benchmark.
I love these incomprehensible magic number optimizations. Every time I see one I wonder how many optimizations like this we missed back in the old days when we were writing all our inner loops in assembly?
Does anyone have a collection of these things?
Here is a short list:
https://graphics.stanford.edu/~seander/bithacks.html
It is not on the list, but #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y))) is an efficient way to do generic comparisons for things that want UNIX-style comparators. If you compare the output against 0 to check for some form of greater than, less than or equality, the compiler should automatically simplify it. For example, CMP(X, Y) > 0 is simplified to (X > Y) by a compiler.
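As a minimal sketch of how that plugs into a UNIX-style comparator (the wrapper below is mine, not part of the list):

    #include <stdlib.h>

    #define CMP(X, Y) (((X) > (Y)) - ((X) < (Y)))

    /* qsort(3) comparator for ints built on CMP.
       Avoids the classic overflow bug in "return a - b;". */
    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a;
        int y = *(const int *)b;
        return CMP(x, y);
    }

    /* usage: qsort(values, n, sizeof values[0], cmp_int); */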
The signum(x) function that is equivalent to CMP(X, 0) can be done in 3 or 4 instructions depending on your architecture without any comparison operations:
https://www.cs.cornell.edu/courses/cs6120/2022sp/blog/supero...
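One common branchless formulation, as a sketch (it assumes 32-bit two's-complement integers and an arithmetic right shift on signed values, which C leaves implementation-defined):

    #include <stdint.h>

    /* -1 for negative x, 0 for zero, +1 for positive x, with no comparisons. */
    static int32_t signum(int32_t x) {
        /* x >> 31 is -1 when the sign bit is set, else 0 (arithmetic shift).
           (0u - (uint32_t)x) >> 31 is 1 whenever negating x sets the sign bit,
           i.e. for every positive x (and for INT32_MIN, where OR-ing with -1
           still yields -1). */
        return (x >> 31) | (int32_t)((0u - (uint32_t)x) >> 31);
    }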
It is such a famous example that compilers probably optimize CMP(X, 0) to it, but I have not checked. Coincidentally, the expansion of CMP(X, 0) is on the bit hacks list.
There are a few more superoptimized mathematical operations listed here:
https://www2.cs.arizona.edu/~collberg/Teaching/553/2011/Reso...
Note that the assembly code appears to be for the Motorola 68000 processor, and it relies on flags that are set in edge cases in order to work.
Finally, there is a list of helpful macros for bit operations that originated in OpenSolaris (as far as I know) here:
https://github.com/freebsd/freebsd-src/blob/master/sys/cddl/...
There used to be an Open Solaris blog post on them, but Oracle has taken it down.
Enjoy!
We didn't miss them. In those days they weren't optimizations. Multiplications were really expensive.
Related: Computerphile had a video a few months ago where they try to put compute time in terms of human time, similar to the way one might visualize an atom by making the proton the size of a golf ball. I think it can help put some costs into perspective and really show why branching matters, as well as the great engineering done to hide some of the slowdowns. But some costs are definitely being masked simply by the sheer speed of the clock (like how the small size of a proton hides how empty an atom is).
and divides were worse. (1 cycle add, 10 cycle mult, 60 cycle div)
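For contrast, modern compilers strength-reduce division by a constant into a multiply and a shift, so the slow divide rarely shows up at all. A sketch of the transformation for unsigned division by 100 (the magic pair below is a standard Granlund-Montgomery-style constant, chosen by me as an example rather than taken from the thread):

    #include <stdint.h>

    /* Equal to n / 100 for every 32-bit unsigned n:
       multiply by ceil(2^37 / 100) = 1374389535, then shift right by 37. */
    static uint32_t div100(uint32_t n) {
        return (uint32_t)(((uint64_t)n * 1374389535u) >> 37);
    }

    /* e.g. div100(2024) == 20 */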
there is "Hacker's Delight" by Henry S. Warren, Jr.
https://en.wikipedia.org/wiki/Hacker's_Delight
Looks awesome, thank you :)
You should look at supercompilation.
Part-way through the section on bit-twiddling, I thought to myself "Oh I wonder if we could use a solver here". Lo and behold, I was pleasantly surprised to see the author then take that exact approach. Love the attention to detail in this post!
It's rare to read code that makes me literally laugh out loud. What a delight.
Taking a look at numbers in binary reveals some interesting patterns. Although it seems obvious, it was interesting to me when I realized that all prime numbers except 2 end in 1.
If you need to know a leap year and it's before the year 6000, I made an interactive calculator and visualization [1].
It's >3 machine instructions (and I admire the mathematical tricks included in the post), but it does do thousands of calculations fairly quickly :)
[1] https://calculang.dev/examples-viewer?id=leap-year
At first I thought it would be using the BCD instructions in some way, as shown near the bottom of this article: https://news.ycombinator.com/item?id=8477254
Yeah, I was thinking AAM, but that wouldn't get you to 3.
I'd be impressed if an LLM derived this independently.
https://news.ycombinator.com/item?id=42915723#42924485
This is gold! HN Kino! Taking the hardest problem in the world, date checking, and casually bitflipping the hell out of it. Hats off man :D
This is fast, READABLE, and accurate:
bool is_leap_year(uint32_t y) { // Works for Gregorian years in range [0, 65535] return ((!(y & 3)) && ((y % 25 != 0) || !(y & 15))); }
You commented out your entire function body and the closing }. Also, on 32-bit platforms, it doesn't stop working at 65535.
just a formatting issue on my side, there were \n.
This impl is mentioned in TFA. It's much slower and includes branches.
I'd expect even without optimizations on, there wouldn't be branches in the output for that code.
Interesting cppcon talk related to this (author is cited in TFA): https://www.youtube.com/watch?v=0s9F4QWAl-E
This reminds me of an algo/DS interview I was once giving (in Java), where one of the answers usually was "to be memorised" shit like this, and the interviewer started pestering me to give him those answers (even though I said I don't know and definitely don't recall), and in C at that. As per him, "everyone who coded knew C... at least in college", and he started becoming a bit more hostile. I think it was the first interview that I had ended as an interviewee.
And I call this thing “bit gymnastics”.
Sounds like someone desperate to prove their superiority - now imagine working with them every day.
I tend to be of the opinion that for modern general purpose CPUs in this era, such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.
But do you know what's not free? Memory accesses[1]. So when I'm optimizing things, I focus on making things more cache friendly.
[1] http://gec.di.uminho.pt/discip/minf/ac0102/1000gap_proc-mem_...
The thing about these optimisations (assuming they test as higher performance) is that they can get applied in a library, and then everyone benefits from the speedup that took some hard graft to work out. Very few people bake their own date API nowadays if they can avoid it, since one already exists, and techniques like this just speed up every programme, whether it's on the critical path or not.
That's basically compilers these days. It used to be that you could try and optimize your code, inline things here and there, but these days, you're not going to beat the compiler optimization.
To temper this slightly, these sorts of optimizations are useful on embedded CPUs for device firmware, IOT, etc. I've worked on smart NIC CPUs where cycles were so precious we'd do all kinds of crazy unreadable things.
I suspect most IOT device manufacturers expect/design their device to be landfill before worrying about leap year math. (In my least optimistic moments, I suspect some of them may intentionally implement known broken algorithms that make their eWaste stop working correctly at some point in the near future that's statistically likely to be just beyond the warranty period.)
on the flip side of the topic, trying to do any datetime handling on the edge of embedded compute is going to be wrong 100% of the time anyway
> such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.
I'm amazed by the fact that there is always someone who will say that such optimizations are totally unnecessary.
I'm amazed by the fact that there is always someone who misinterprets Knuth's "premature optimization", reading it as "don't optimize" instead of "pull out the profiler".
Some people have significant positions in CPU manufacturers, so there will always be at least a few.
Please don't.
These things compound. You especially need to consider that typical computer usage involves more than one application at a time. There's a tragedy-of-the-commons issue that's often ignored. It can be ignored if you're optimizing your code (you're minimizing your share!), but it can't be if you're not.
I guarantee you we'd have a lot of faster things if people invested even a little time (these also compound :). Two great examples might be Llama.cpp and FlashAttention. Both of these have had a huge impact on people (among a number of other works) but don't get nearly the same attention as other stuff. These are popular instances, but I promise you there are a million problems like these waiting to be solved. It's just not flashy, but hey, plumbers and garbagemen are pretty critical jobs too.
You haven't refuted the parent comment at all. They asserted that instructions are insignificant, and other things, such as memory accesses, dominate.
It's true that this code was optimized from 2.6ns down to 0.9ns, a saving of 1.7ns, while an L2 cache miss might be 80ns. But 1.7ns is still about 2% of the 80ns, and it's about 70% of the 2.6ns. You don't want to start optimizing by reducing things that are 2% of your cost, but 2% isn't insignificant.
The bigger issue is that probably you don't need to do leap-year checks very often so probably your leap-year check isn't the place to focus unless it's, like, sending a SQL query across a data center or something.
Just because CPU performance is increasing faster than DRAM speeds doesn't mean that CPU performance is "free" while memory is "expensive". One thing that you're ignoring is the impact of caches and prefetching logic which have significantly improved the performance of memory-bound workloads in the last 5-10 years. DRAM might be slow, but if you avoid going out to DRAM...
More broadly, it 100% depends on your workload. You'd be surprised at how many workloads are compute-bound, even today: LLM inference might be memory bound (thus how it's possible to get remotely good performance on a CPU), but training, esp. prefill, is very much not. And once you get out of LLMs, I'd say that most applications of ML tend to be compute bound.
Integer division (and modulo) is not cheap on most CPUs. Along with memory access and branch prediction, it is something worth optimizing for.
And since you are talking about memory: code also goes in memory. Shorter code is more cache friendly.
I don't see a use case where it matters for this particular application (it doesn't mean there isn't) but well targeted micro-optimizations absolutely matter.
This is why linear arrays are the fastest data structure unless proven otherwise.
Quite right. If you use a compiled language (e.g. Go) the difference between the two implementations is indeed negligible.
https://go.dev/play/p/i72xCyhqRkC
> I tend to be of the opinion that for modern general purpose CPUs in this era, such micro-optimizations are totally unnecessary because modern CPUs are so fast that instructions are almost free.
What does this mean? Free? Optimisations are totally unnecessary because... instructions are free?
The implementation in TFA is probably on the order of 5x more efficient than a naive approach. This is time and energy as well. I don't understand what "free" means in this context.
Calendar operations are performed probably trillions of times every second across all types of computers. If you can make them more time- and energy-efficient, why wouldn't you?
If there's a problem with modern software it's too much bloat, not too much optimisation.
GP made an important point that you seemed to have missed: in modern architectures, it’s much more important to minimize memory access than to minimize instructions. They weren’t saying optimization isn’t important, they were describing how to optimize on modern systems.
If this is indeed done trillions of times a second, which I frankly have a hard time believing, then sure, it might be worth it. But on a modern CPU, focusing on an optimization like this is a poor use of developer resources. There are likely several other optimizations related to cache locality that you could find in less time than it would take to do this, and those other optimizations would probably give several orders of magnitude more improvement.
Not to mention that the final code is basically a giant WTF for anybody reading it. It will be an attractive nuisance that people will be drawn to, like moths to a flame, any time there is a bug around calendar operations.
I think about branches a lot too when optimizing
Related
> The world could run on older hardware if software optimization was a priority
https://news.ycombinator.com/item?id=43971464