Comment by Sesse__
5 days ago
And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like
A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D
(10 muls, 3 muladds)
instead of the faster
tmp = x * x;
((A * tmp + B) * tmp + C) * tmp + D
(1 mul, 3 muladds)
The reason for writing out all of the x multiplications like that is that I was hoping the compiler detect such a pattern perform an optimization for me. Mat Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.
Horner's form is typically also more accurate, or at least, it is not bit-identical, so the compiler won't do it unless you pass -funsafe-math, and maybe not even then.
Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like
smth like that
Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
The common subexpression elimination (CSE) pass in compilers takes care of that.
Compilers cannot do this optimization for floating point [1] unless you're compiling with -ffast-math. In general, don't rely on compilers to optimize floating point sub-expressions.
[1]: https://godbolt.org/z/8bEjE9Wxx
Right, I totally forgot about floating point non associativity.
The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.
Not in this case because the dependencies are the same:
Naive: https://godbolt.org/z/Gzf1KM9Tc
Horner's: https://godbolt.org/z/jhvGqcxj1
Still, it's no worse than the naïve formula, which has exactly the same data dependencies and then more.
_Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?
For throughput-dominated contexts, evaluation via Horner's rule does very well because it minimizes register pressure and the number of operations required. But the latency can be relatively high, as you note.
There are a few good general options to extract more ILP for latency-dominated contexts, though all of them trade additional register pressure and usually some additional operation count; Estrin's scheme is the most commonly used. Factoring medium-order polynomials into quadratics is sometimes a good option (not all such factorizations are well behaved wrt numerical stability, but it also can give you the ability to synthesize selected extra-precise coefficients naturally without doing head-tail arithmetic). Quadratic factorizations are a favorite of mine because (when they work) they yield good performance in _both_ latency- and throughput-dominated contexts, which makes it easier to deliver identical results for scalar and vectorized functions.
There's no general form "best" option for optimizing latency; when I wrote math library functions day-to-day we just built a table of the optimal evaluation sequence for each order of polynomial up to 8 or so and each microarchitecture and grabbed the one we needed unless there were special constraints that required a different choice.
2 replies →
Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?
No, it's not equivalent for floating point, so a compiler won't do it unless you do -fassociative-math (or a superset, such as -ffast-math), in which case all correctness bets are off.
Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)
Isn't that for... readability...?
Not just for speed, Horner can also be essential for numerical stability.
Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
You should never assume the compiler is allowed to reorder floating-point computations like it does with integers. Integer math is exact, within its domain. Floating-point math is not. The IEEE-754 standard knows this, and the compiler knows this.
Ah, fair point, it has been a while since I've needed fast inexact math.
Though... they are allowed to cache common subexpressions, and my point about dependency chains is quite relevant on modern hardware. So x*x, x*x*x, etc may each be computed once. And since arithmetic operators are left-to-right associative, the rather ugly code, as written, is fast and not as wasteful as it appears.
1 reply →
The compiler is often not allowed to rearrange such operations due to a change in intermediate results. So one would have to activate something like fastmath for this code, but that’s probably not desired for all code, so one has to introduce a small library, and so on. Debug builds may be using different compilation flags, and suddenly performance can become terrible while debugging. Performance can also tank because a new compiler version optimizes differently, etc. So in general I don’t think this advice is true.
Probably for ints unconditionally. For floats in Sesse__'s example without `-ffast-math`, I count 10 muls, 2 muladds, 1 add. With `-ffast-math`, 1 mul, 3 muladds. <https://godbolt.org/z/dPrbfjzEx>