The reason for writing out all of the x multiplications like that is that I was hoping the compiler detect such a pattern perform an optimization for me. Mat Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.
Horner's form is typically also more accurate, or at least, it is not bit-identical, so the compiler won't do it unless you pass -funsafe-math, and maybe not even then.
Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like
Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
Compilers cannot do this optimization for floating point [1] unless you're compiling with -ffast-math. In general, don't rely on compilers to optimize floating point sub-expressions.
The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.
No, it's not equivalent for floating point, so a compiler won't do it unless you do -fassociative-math (or a superset, such as -ffast-math), in which case all correctness bets are off.
Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
You should never assume the compiler is allowed to reorder floating-point computations like it does with integers. Integer math is exact, within its domain. Floating-point math is not. The IEEE-754 standard knows this, and the compiler knows this.
The compiler is often not allowed to rearrange such operations due to a change in intermediate results. So one would have to activate something like fastmath for this code, but that’s probably not desired for all code, so one has to introduce a small library, and so on. Debug builds may be using different compilation flags, and suddenly performance can become terrible while debugging. Performance can also tank because a new compiler version optimizes differently, etc. So in general I don’t think this advice is true.
Probably for ints unconditionally. For floats in Sesse__'s example without `-ffast-math`, I count 10 muls, 2 muladds, 1 add. With `-ffast-math`, 1 mul, 3 muladds. <https://godbolt.org/z/dPrbfjzEx>
Yeah, I once worked at a place where the compiler team was assigned the unpleasant task of implementing a long list of trigonometry functions. They struggled for many months to get the accuracy that was required of them, and when they did the performance was abysmal compared to the competition.
In hindsight, they probably didn't have anybody with the right background and should have contracted out the job. I certainly didn't have the necessary knowledge, either.
And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like
(10 muls, 3 muladds)
instead of the faster
(1 mul, 3 muladds)
The reason for writing out all of the x multiplications like that is that I was hoping the compiler detect such a pattern perform an optimization for me. Mat Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.
Horner's form is typically also more accurate, or at least, it is not bit-identical, so the compiler won't do it unless you pass -funsafe-math, and maybe not even then.
Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like
smth like that
Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
The common subexpression elimination (CSE) pass in compilers takes care of that.
Compilers cannot do this optimization for floating point [1] unless you're compiling with -ffast-math. In general, don't rely on compilers to optimize floating point sub-expressions.
[1]: https://godbolt.org/z/8bEjE9Wxx
1 reply →
The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.
Not in this case because the dependencies are the same:
Naive: https://godbolt.org/z/Gzf1KM9Tc
Horner's: https://godbolt.org/z/jhvGqcxj1
Still, it's no worse than the naïve formula, which has exactly the same data dependencies and then more.
_Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?
3 replies →
Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?
No, it's not equivalent for floating point, so a compiler won't do it unless you do -fassociative-math (or a superset, such as -ffast-math), in which case all correctness bets are off.
Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)
Isn't that for... readability...?
Not just for speed, Horner can also be essential for numerical stability.
Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
You should never assume the compiler is allowed to reorder floating-point computations like it does with integers. Integer math is exact, within its domain. Floating-point math is not. The IEEE-754 standard knows this, and the compiler knows this.
2 replies →
The compiler is often not allowed to rearrange such operations due to a change in intermediate results. So one would have to activate something like fastmath for this code, but that’s probably not desired for all code, so one has to introduce a small library, and so on. Debug builds may be using different compilation flags, and suddenly performance can become terrible while debugging. Performance can also tank because a new compiler version optimizes differently, etc. So in general I don’t think this advice is true.
Probably for ints unconditionally. For floats in Sesse__'s example without `-ffast-math`, I count 10 muls, 2 muladds, 1 add. With `-ffast-math`, 1 mul, 3 muladds. <https://godbolt.org/z/dPrbfjzEx>
Yeah, I once worked at a place where the compiler team was assigned the unpleasant task of implementing a long list of trigonometry functions. They struggled for many months to get the accuracy that was required of them, and when they did the performance was abysmal compared to the competition.
In hindsight, they probably didn't have anybody with the right background and should have contracted out the job. I certainly didn't have the necessary knowledge, either.