Comment by patchnull

5 days ago

[flagged]

25 comments

patchnull

And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like

  A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D

(10 muls, 3 muladds)

instead of the faster

  tmp = x * x;
  ((A * tmp + B) * tmp + C) * tmp + D

(1 mul, 3 muladds)

def-pri-pub 5 days ago
The reason for writing out all of the x multiplications like that is that I was hoping the compiler detect such a pattern perform an optimization for me. Mat Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.
- pavpanchekha 5 days ago
  
  Horner's form is typically also more accurate, or at least, it is not bit-identical, so the compiler won't do it unless you pass -funsafe-math, and maybe not even then.
anematode 5 days ago
Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like
typedef double v2d __attribute__ ((vector_size (16))); v2d packed = { x, x }; packed = fma(packed, As, Bs); packed = fma(packed, Cs, Ds); // ... return x * packed[0] + packed[1]
smth like that
Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
woadwarrior01 5 days ago
The common subexpression elimination (CSE) pass in compilers takes care of that.
- cmovq 5 days ago
  
  Compilers cannot do this optimization for floating point [1] unless you're compiling with -ffast-math. In general, don't rely on compilers to optimize floating point sub-expressions.
  [1]: https://godbolt.org/z/8bEjE9Wxx
  
  1 reply →
eska 5 days ago
The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.
- cmovq 5 days ago
  
  Not in this case because the dependencies are the same:
  Naive: https://godbolt.org/z/Gzf1KM9Tc
  Horner's: https://godbolt.org/z/jhvGqcxj1
- Sesse__ 5 days ago
  
  Still, it's no worse than the naïve formula, which has exactly the same data dependencies and then more.
  _Can_ you even make a reasonable high-ILP scheme for a polynomial, unless it's of extremely high degree?
  
  3 replies →
arkmm 5 days ago
Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?
- Sesse__ 5 days ago
  
  No, it's not equivalent for floating point, so a compiler won't do it unless you do -fassociative-math (or a superset, such as -ffast-math), in which case all correctness bets are off.
zahlman 5 days ago

Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)
3371 5 days ago

Isn't that for... readability...?
owlbite 5 days ago

Not just for speed, Horner can also be essential for numerical stability.
boothby 5 days ago
Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).
But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
- exmadscientist 5 days ago
  
  You should never assume the compiler is allowed to reorder floating-point computations like it does with integers. Integer math is exact, within its domain. Floating-point math is not. The IEEE-754 standard knows this, and the compiler knows this.
  
  2 replies →
- eska 5 days ago
  
  The compiler is often not allowed to rearrange such operations due to a change in intermediate results. So one would have to activate something like fastmath for this code, but that’s probably not desired for all code, so one has to introduce a small library, and so on. Debug builds may be using different compilation flags, and suddenly performance can become terrible while debugging. Performance can also tank because a new compiler version optimizes differently, etc. So in general I don’t think this advice is true.
- scottlamb 5 days ago
  
  Probably for ints unconditionally. For floats in Sesse__'s example without `-ffast-math`, I count 10 muls, 2 muladds, 1 add. With `-ffast-math`, 1 mul, 3 muladds. <https://godbolt.org/z/dPrbfjzEx>

david-gpu 5 days ago

Yeah, I once worked at a place where the compiler team was assigned the unpleasant task of implementing a long list of trigonometry functions. They struggled for many months to get the accuracy that was required of them, and when they did the performance was abysmal compared to the competition.

In hindsight, they probably didn't have anybody with the right background and should have contracted out the job. I certainly didn't have the necessary knowledge, either.