Faster asin() was hiding in plain sight

5 days ago (16bpp.net)

139 comments

def-pri-pub

While I'm glad to see the OP got a good minimax solution at the end, it seems like the article missed clarifying one of the key points: error waveforms over a specified interval are critical, and if you don't see the characteristic minimax-like wiggle, you're wasting easy opportunity for improvement.

Taylor series in general are a poor choice, and Pade approximants of Taylor series are equally poor. If you're going to use Pade approximants, they should be of the original function.

I prefer Chebyshev approximation: https://www.embeddedrelated.com/showarticle/152.php which is often close enough to the more complicated Remez algorithm.

rockmeamedee 4 days ago

I had no idea, but this "wiggle" is required for an optimal approximation, it's called the "equioscillation property" [https://en.wikipedia.org/wiki/Equioscillation_theorem].
For a polynomial P (of degree n) to approximate a function F on the real numbers with minimal absolute error, the max error value of |P - F| needs to be hit multiple times, (n+2 times to be precise). You need to have the polynomial "wiggle" back and forth between the top of the error bound and the bottom.
And even more surprisingly, this is a necessary _and sufficient_! condition for optimality. If you find a polynomial whose error alternates and it hits its max error bound n+2 times, you know that no other polynomial of degree n can do better, that is the best error bound you can get for degree n.
Very cool!
ogogmad 5 days ago
Chebyshev polynomials cos(n arcos(x)) provide one of the proofs that every continuous function f:[0,1]->R can be uniformly approximated by polynomial functions. Bernstein polynomials provide a shorter proof, but perhaps not the best numerical method: https://en.wikipedia.org/wiki/Bernstein_polynomial#See_also
- AlotOfReading 5 days ago
  
  Those don't guarantee that that they can be well approximated by a polynomial of degree N though, like we have here. You can apply Jackson's inequality to calculate a maximum error bound, but the epsilon for degree 5 is pretty atrocious.
zackmorris 3 days ago

Thanks so much for that!
I've been struggling to curve fit an aerodynamics equation relating Mach number to rocket nozzle exit/entrance area ratio for quite some time. It's a 5th or 6th degree polynomial whose inverse doesn't have a closed-form solution:
https://en.wikipedia.org/wiki/Abel%E2%80%93Ruffini_theorem
But I was able to use a Chebyshev fit that is within a few percent accurate at 3rd degree, and is effectively identical at 4th degree or higher. And 4th degree (quartic) polynomials do have a closed-form solution for their inverse. That lets me step up to higher abstractions for mass flow rate, power, etc without having to resort to tables. At most, I might need to use piecewise-smooth sections, which are far easier to work with since they can just be dropped into spreadsheets, used for derivatives/integrals, etc.
Anyway, I also discovered (ok AI mentioned) that the Chebyshev approximation is based on the discrete cosine transform (DCT):
https://en.wikipedia.org/wiki/Discrete_Chebyshev_transform#R...
https://en.wikipedia.org/wiki/Discrete_cosine_transform#Appl...
That's why it's particularly good at curve-fitting in just 2 or 3 terms. Which is why they use the DCT for image compression in JPG etc:
https://www.mathworks.com/help/images/discrete-cosine-transf...
The secret sauce is that Chebyshev approximation spreads the error as ripples across the function, rather than at its edges like with Taylor series approximation. That helps it fit more intricate curves and arbitrary data points, as well as mesh better with neighboring approximations.

xt00 5 days ago

To be accurate, this is originally from Hastings 1955, Princeton "APPROXIMATIONS FOR DIGITAL COMPUTERS BY CECIL HASTINGS", page 159-163, there are actually multiple versions of the approximation with different constants used. So the original work was done with the goal of being performant for computers of the 1950's. Then the famous Abramowitz and Stegun guys put that in formula 4.4.45 with permission, then the nvidia CG library wrote some code that was based upon the formula, likely with some optimizations.

rerdavies 5 days ago
I ran this down, because I have a particular interest in vectorizable function approximations. Particular those that exploit bit-banging to handle range normalization. (Anyone have a good reference for that?)
Regrettably, this is NOT from Hastings 1955. Hastings provides Taylor series and Chebyshev polynomial approximations. The OP's solution is a Pade approximation, which are not covered at all in Hastings.
- xt00 5 days ago
  
  When you say "this is NOT from Hastings" I had to double check my post again -- I guess you are saying that the Pade approximation is not from Hastings, but the polynomial approximation that the OP referenced from nvidia from A&S and ultimately from Hastings, definitely is in Hastings on page 159 -- I think you were referring to the Pade approximation not being in Hastings, which appears to be true yes. In the article it is interesting that the OP tried taylor expansion and pade approximation, but not the fairly standard "welp lets just fit a Nth order polynomial to the arcsin" which is what Hastings did back in the day.
- adampunk 4 days ago
  
  >Particular those that exploit bit-banging to handle range normalization
  https://userpages.cs.umbc.edu/phatak/645/supl/Ng-ArgReductio...
  That's tiny but weird.

exmadscientist 5 days ago

This line:

> This amazing snippet of code was languishing in the docs of dead software, which in turn the original formula was scrawled away in a math textbook from the 60s.

was kind of telling for me. I have some background in this sort of work (and long ago concluded that there was pretty much nothing you can do to improve on existing code, unless either you have some new specific hardware or domain constraint, or you're just looking for something quick-n-dirty for whatever reason, or are willing to invest research-paper levels of time and effort) and to think that someone would call Abramowitz and Stegun "a math textbook from the 60s" is kind of funny. It's got a similar level of importance to its field as Knuth's Art of Computer Programming or stuff like that. It's not an obscure text. Yeah, you might forget what all is in it if you don't use it often, but you'd go "oh, of course that would be in there, wouldn't it...."

mkbosmans 5 days ago
One of the ways that the classics can be improved is not to take the analytic ideal coefficients and approximate them to the closest floating point number, but rather take those ideal coefficients as a starting point for a search of slightly better ones.
The SLEEF Vectorized Math Library [1] does this and therefore can usually provide accuracy guarantees for the whole floating point range with a polynomial order lower than theory would predict.
Its asinf function [2] is accurate to 1 ULP for all single precision floats, and is similar to the `asin_cg` from the article, with the main difference the sqrt is done on the input of the polynomial instead of the output.
[1] https://sleef.org/ [2] https://github.com/shibatch/sleef/blob/master/src/libm/sleef...
- mkbosmans 5 days ago
  
  I'm sorry, that second reference was actually for the 3.5ULP variant. The 1 ULP is here: https://github.com/shibatch/sleef/blob/master/src/libm/sleef...
wolfi1 5 days ago
Abramowitz/Stegun has been updated 2010 and resides now here: https://dlmf.nist.gov/
- rerdavies 5 days ago
  
  Doesn't seem to be terribly up to date though. It seems to use almost exclusively taylor series, and seems to be completely uninterested in error analysis of any kind. Unless I'm missing something.
  
  1 reply →
bsder 5 days ago

Yeah, if you want something that's somewhat obscure, pull up Cody and Waite "Software Manual for the Elementary Functions".
And, lo and behold, the ASIN implementation is minimax.
def-pri-pub 5 days ago
These are books that my uni courses never had me read. I'm a little shocked at times at how my degree program skimped on some of the more famous texts.
- neutronicus 5 days ago
  
  It is not a textbook, it is an extremely dense reference manual, so that honestly makes sense.
  In physics grad school, professors would occasionally allude to it, and textbooks would cite it ... pretty often. So it's a thing anyone with postgraduate physics education should know exists, but you wouldn't ever be assigned it.
  
  2 replies →
- eesmith 4 days ago
  
  I didn't need Abramowitz and Stegun until grad school. In the 1990s. It was a well-known reference book for people at that level, not a text book.
  For my undergrad the CRC math handbook was enough.
axus 5 days ago

In this case, AI understood history better than the human.

LegionMammal978 5 days ago

In general, I find that minimax approximation is an underappreciated tool, especially the quite simple Remez algorithm to generate an optimal polynomial approximation [0]. With some modifications, you can adapt it to optimize for either absolute or relative error within an interval, or even come up with rational-function approximations. (Though unfortunately, many presentations of the algorithm use overly-simple forms of sample point selection that can break down on nontrivial input curves, especially if they contain small oscillations.)

[0] https://en.wikipedia.org/wiki/Remez_algorithm

herf 5 days ago
They teach a lot of Taylor/Maclaurin series in Math classes (and trig functions are sometimes called "CORDIC" which is an old method too) but these are not used much in actual FPUs and libraries. Maybe we should update the curricula so people know better ways.
- bee_rider 5 days ago
  
  Taylor series makes a lot more sense in a math class, right? It is straightforward and (just for example), when you are thinking about whether or not a series converges in the limit, why care about the quality of the approximation after a set number of steps?
  
  2 replies →
jason_s 5 days ago
Not sure I would call Remez "simple"... it's all relative; I prefer Chebyshev approximation which is simpler than Remez.
- LegionMammal978 5 days ago
  
  Perhaps, but at least I find it very simple for the optimality properties it gives: there is no inherent need to say, "I know that better parameters likely exist, but the algorithm to find them would be hopelessly expensive," as is the case in many forms of minimax optimization.
- stephencanon 5 days ago
  
  Ideally either one is just a library call to generate the coefficients. Remez can get into trouble near the endpoints of the interval for asin and require a little bit of manual intervention, however.

AlotOfReading 5 days ago

I'm pretty sure it's not faster, but it was fun to write:

    float asin(float x) {
      float x2 = 1.0f-fabs(x);
      u32 i = bitcast(x2);
      i = 0x5f3759df - (i>>1);
      float inv = bitcast(i);
      return copysign(pi/2-pi/2*(x2*inv),x);
    }

Courtesy of evil floating point bithacks.

teo_zero 5 days ago
The bad thing about this method is that it's slower than native CPU instructions. The good thing is that the result is very precise for at least 2 values of x, namely 1.0 and -1.0
JK
- AlotOfReading 4 days ago
  
  Yeah, it's an interesting approximation. It follows the shape of the curve pretty well except around 0, much better than low degree polynomials. Polynomial approximations have much better relative error bounds though.
jacquesm 5 days ago
That could do with some subtitles.
- irishcoffee 5 days ago
  
  https://en.wikipedia.org/wiki/Fast_inverse_square_root
def-pri-pub 5 days ago

> floating point bithacks
The forbidden magic
adampunk 5 days ago

// what the fuck
chuckadams 5 days ago

You brought Zalgo. I blame this decade on you.
moffkalast 5 days ago

> float asinine(float x) {
FTFY :P

cmovq 5 days ago

> After all of the above work and that talk in mind, I decided to ask an LLM.

Impressive that an LLM managed to produce the answer from a 7 year old stack overflow answer all on its own! [1] This would have been the first search result for “fast asin” before this article was published.

[1]: https://stackoverflow.com/a/26030435

def-pri-pub 5 days ago
I did see that, but isn't the vast majority of that page talking about acos() instead?
- seanhunter 5 days ago
  
  That’s equivalent right? acos x = pi/2 - asin x
  So if you’ve got one that’s fast you have them both.

scottlamb 5 days ago

Isn't the faster approach SIMD [edit: or GPU]? A 1.05x to 1.90x speedup is great. A 16x speedup is better!

They could be orthogonal improvements, but if I were prioritizing, I'd go for SIMD first.

I searched for asin on Intel's intrinsics guide. They have a AVX-512 instrinsic `_mm512_asin_ps` but it says "sequence" rather than single-instruction. Presumably the actual sequence they use is in some header file somewhere, but I don't know off-hand where to look, so I don't know how it compares to a SIMDified version of `fast_asin_cg`.

https://www.intel.com/content/www/us/en/docs/intrinsics-guid...

TimorousBestie 5 days ago
I don’t know much about raytracing but it’s probably tricky to orchestrate all those asin calls so that the input and output memory is aligned and contiguous. My uneducated intuition is that there’s little regularity as to which pixels will take which branches and will end up requiring which asin calls, but I might be wrong.
- scottlamb 5 days ago
  
  I'd expect it to come down to data-oriented design: SoA (structure of arrays) rather than AoS (array of structures).
  I skimmed the author's source code, and this is where I'd start: https://github.com/define-private-public/PSRayTracing/blob/8...
  Instead of an `_objects`, I might try for a `_spheres`, `_boxes`, etc. (Or just `_lists` still using the virtual dispatch but for each list, rather than each object.) The `asin` seems to be used just for spheres. Within my `Spheres::closest_hit` (note plural), I'd work to SIMDify it. (I'd try to SIMDify the others too of course but apparently not with `asin`.) I think it's doable: https://github.com/define-private-public/PSRayTracing/blob/8...
  I don't know much about ray tracers either (having only written a super-naive one back in college) but this is the general technique used to speed up games, I believe. Besides enabling SIMD, it's more cache-efficient and minimizes dispatch overhead.
  edit: there's also stuff that you can hoist in this impl. Restructuring as SoA isn't strictly necessary to do that, but it might make it more obvious and natural. As an example, this `ray_dir.length_squared()` is the same for the whole list. You'd notice that when iterating over the spheres. https://github.com/define-private-public/PSRayTracing/blob/8...
  
  4 replies →
Am4TIfIsER0ppos 5 days ago
I don't do much float work but I don't think there is a single regular sine instruction only old x87 float stack ones.
I was curious what "sequence" would end up being but my compiler is too old for that intrinsic. Even godbolt didn't help for gcc or clang but it did reveal that icc produced a call https://godbolt.org/z/a3EsKK4aY
- nitwit005 5 days ago
  
  If you click libraries on godbolt, it's pulling in a bunch, including multiple SIMD libraries. You might have to fiddle with the libraries or build locally.
qingcharles 4 days ago

The issue is that the algorithm is only half the story. The implementation (e.g. bytecode) is the other.
I've been trying to find ways to make the original graphics renderer of the CGA version of Elite faster as there have been dozens of little optimizations found over the decades since it was written.
I was buoyed by a video of Super Mario 64/Zelda optimizations where it was pointed out that sometimes an approx calculation of a trig value can be quicker than a table lookup depending on the architecture.
Based on that I had conversations with LLMs over what fast trig algorithms there are, but for 8088 you are cooked most of the time on implementing them at speed.

orangepanda 5 days ago

> Nobody likes throwing away work they've done

I like throwing away work I've done. Frees up my mental capacity for other work to throw away.

WithinReason 5 days ago

> In any graphics application trigonometric functions are frequently used.

Counterpoint from the man himself, "avoiding trigonometry":

https://iquilezles.org/articles/noacos/

djmips 4 days ago

And further to that. https://fgiesen.wordpress.com/2010/10/21/finish-your-derivat...

sixo 5 days ago

It appears that the real lesson here was to lean quite a bit more on theory than a programmer's usual roll-your-own heuristic would suggest.

A fantastic amount of collective human thought has been dedicated to function approximations in the last century; Taylor methods are over 200 years old and unlikely to come close to state-of-the-art.

tzs 4 days ago

In that Padé approximant I think you can save a couple multiplications.

As written it does this:

  n = 1 - 367/714 * x**2
  d = 1 - 81/119 * x**2 + 183/4760 * x**4
  return x * (n/d)

That's got 7 multiplies (I'm counting divide as a multiply) and 3 additions. (I'm assuming the optimizer only computes x^2 once and computes x^2 by squaring x^2, and that all the constants are calculated at compile time).

Replace n/d with 1/(d/n) and then replace d/n with q + r/n where q is the quotient of polynomial d divided by polynomial n and r is the remainder.

This is the result:

  n = 1 - 367/714 * x**2
  q = 1587627/1346890 - 549/7340 * x**2
  r = -240737/1346890
  return x / (q + r/n)

That's got 5 multiplies and 3 additions.

kazinator 4 days ago

The glibc implementation already has tests for several ranges and hacks for them including Taylor series:

https://github.com/lattera/glibc/blob/master/sysdeps/ieee754...

The smallest range is |x| < 1.49011611938477e-8. In this case the routine just returns x, after calling some underflow-checking routine.

So right there, if we neglect this detail in our own wrapper, we may be able to get a speedup, at the cost of sending very small values to the Tayor series.

The next smallest range tested is |x| < 0.125, and after that |x| < 0.5 and 0.75.

The authors are cetainly not missing any brilliant trick hiding in plain sight; they are doing a more assiduous job.

erichocean 5 days ago

Ideal HN content, thanks!

djmips 4 days ago

To the blog poster:

Robin Green is an excellent resource

Faster Math Functions:

https://basesandframes.wordpress.com/wp-content/uploads/2016...

Even faster math functions GDC 2020:

https://www.gdcvault.com/play/1027337/Math-in-Game-Developme...

adampunk 5 days ago

We love to leave faster functions languishing in library code. The basis for Q3A’s fast inverse square root had been sitting in fdlibm since 1986, on the net since 1993: https://www.netlib.org/fdlibm/e_sqrt.c

def-pri-pub 5 days ago
Funny enough that fdlimb implementation of asin() did come up in my research. I believe it might have been more performant in the past. But taking a quick scan of `e_asin.c`, I see it doing something similar to the Cg asin() implementation (but with more terms and more multiplications, which my guess is that it's slower). I think I see it also taking more branches (which could also lead to more of a slowdown).
- adampunk 5 days ago
  
  Yeah Ng’s work in fdlibm is cool and really clever in parts but a lot of branching. Some of the ways they reach correct rounding are…so cool.

Skeime 5 days ago

Wouldn't it also be much better to evaluate the Taylor polynomials using Horner's method, instead? (Maybe C++ can do this automatically, but given that there might be rounding differences, it probably won't.)

tyleo 4 days ago

I had to do an atan() on an slow embedded device once for an autonomous robot competition.

Fastest impl I came up with was rounding and big switch statement.

glitchc 5 days ago

The 4% improvement doesn't seem like it's worth the effort.

On a general note, instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs. So, replacing one with the other will not confer much benefit, as evidenced from the results. They're all typically implemented using LUTs, and it's hard to beat the performance of an optimized LUT, which is basically a multiplexer connected to some dedicated memory cells in hardware.

def-pri-pub 5 days ago
You'd be surprised how it actually is worth the effort, even just a 1% improvement. If you have the time, this is a great talk to listen to: https://www.youtube.com/watch?v=kPR8h4-qZdk
For a little toy ray tracer, it is pretty measly. But for a larger corporation (with a professional project) a 4% speed improvement can mean MASSIVE cost savings.
Some of these tiny improvements can also have a cascading effect. Imagining finding a +4%, a +2% somewhere else, +3% in neighboring code, and a bunch of +1%s here and there. Eventually you'll have built up something that is 15-20% faster. Down the road you'll come across those optimizations which can yield the big results too (e.g. the +25%).
- glitchc 5 days ago
  
  It's a cool talk, but the relevance to the present problem escapes me.
  If you're alluding to gcc vs fbstring's performance (circa 15:43), then the performance improvement is not because fbstring is faster/simpler, but due to a foundational gcc design decision to always use the heap for string variables. Also, at around 16:40, the speaker concedes that gcc's simpler size() implementation runs significantly faster (3x faster at 0.3 ns) when the test conditions are different.
kstrauser 5 days ago
> The 4% improvement doesn't seem like it's worth the effort.
People have gotten PhDs for smaller optimizations. I know. I've worked with them.
> instructions like division and square root are roughly equal to trig functions in cycle count on modern CPUs.
What's the x86-64 opcode for arcsin?
- glitchc 5 days ago
  
  > What's the x86-64 opcode for arcsin?
  Not required. ATAN and SQRTS(S|D) are sufficient, the half-angle approach in the article is the recommended way.
  > People have gotten PhDs for smaller optimizations. I know. I've worked with them.
  I understand the can, not sure about the should. Not trying to be snarky, we just seem to be producing PhDs with the slimmest of justifications. The bar needs to be higher.
  
  1 reply →
- adrian_b 5 days ago
  
  Presumably the poster meant polynomial approximations of trigonometric functions not instructions for trigonometric functions, which are missing in most CPUs, though many GPUs have such instructions.
  x86-64 had instructions for the exponential and logarithmic functions in Xeon Phi, but those instructions have been removed in Skylake Server and the later Intel or AMD CPUs with AVX-512 support.
  However, instructions for trigonometric functions have no longer been added after Intel 80387, and those of 8087 and 80387 are deprecated.
tverbeure 5 days ago

> The 4% improvement doesn't seem like it's worth the effort.
I've spent the past few months improving the performance of some work thing by ~8% and the fun I've been having reminds me of the nineties, when I tried to squeeze every last % of performance out of the 3D graphics engine that I wrote as a hobby.
charcircuit 5 days ago

The effort of typing about 10 words into a LLM is minimal.

coloneljelly 5 days ago

In DSP math, it is common to use Chebyshev polynomial approximation. You can get incredibly precise results within your required bounds.

andyjohnson0 5 days ago

Interesting article. A few years back I implemented a bunch of maths primitives, including trig functions, using Taylor sequences etc, to see how it was done. An interesting challenge, even at the elementary level I was working at.

So this article got me wondering how much accuracy is needed before computing a series beats pre-computed lookup tables and interpolation. Anyone got any relevant experience to share?

How much accuracy does ray tracing require?

peterabbitcook 5 days ago

I am curious, did you check how much your benchmarks moved (time and errors) if at all if you told the compiler to use —-use_fast_math or -ffast-math?

There’s generally not a faster version of inverse trig functions to inline, but it might optimize some other stuff out.

Unrelated to that, I’ve seen implementations (ie julia/base/special/trig) that use a “rational approximation” to asin, did you go down that road at any point?

debo_ 5 days ago

https://bash-org-archive.com/?427792

Ono-Sendai 4 days ago

Here's my fast acos, which I think can be converted to an asin: https://forwardscattering.org/post/66

empiricus 5 days ago

Does anyone knows the resources for the algos used in the HW implementations of math functions? I mean the algos inside the CPUs and GPUs. How they make a tradeoff between transistor number, power consumption, cycles, which algos allow this.

groos 5 days ago
Here's one way to do it.
https://en.wikipedia.org/wiki/CORDIC
- empiricus 4 days ago
  
  Thanks, but this seems to be optimized for the smallest number of gates, so it applies for simple microcontrollers and FPGA, and with limited precision. I was interested in actual state of the art used in modern CPUs and GPUs.

veltas 5 days ago

Just a point that the constexpr/const use in that C++ code makes no difference to output, and is just noise really.

drsopp 5 days ago

Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.

Pannoniae 5 days ago
Microbenchmarks. A LUT will win many of them but you pessimise the rest of the code. So unless a significant (read: 20+%) portion of your code goes into the LUT, there isn't that much point to bother. For almost any pure calculation without I/O, it's better to do the arithmetic than to do memory access.
- jcalvinowens 5 days ago
  
  Locality within the LUT matters too: if you know you're looking up identical or nearby-enough values to benefit from caching, an LUT can be more of a win. You only pay the cache cost for the portion you actually touch at runtime.
  I could imagine some graphics workloads tend compute asin() repeatedly with nearby input values. But I'd guess the locality isn't local enough to matter, only eight double precision floats fit in a cache line.
- hrmtst93837 5 days ago
  
  Cache size and replacement policies can ruin even a well-tuned LUT once your working set grows or other threads spray cache lines so "just use a LUT" quietly turns into "debug the perf cliff" later. If the perf gain disappears under load or with real input sets you realise too late it was just a best-case microbenchmark trick.
groundzeros2015 5 days ago

I don’t want to fill up L1 for sin.
jcalvinowens 5 days ago
Surely the loss in precision of a 32KB LUT for double precision asin() would be unacceptable?
- Jyaif 5 days ago
  
  By interpolating between values you can get excellent results with LUTs much smaller than 32KB. Will it be faster than the computation from op, that I don't know.
  
  4 replies →

varispeed 5 days ago

If you are interested in such "tricks", you should check out the classic Hacker's Delight by Henry Warren

stephc_int13 5 days ago

My favorite tool to experiment with math approximation is lolremez. And you can easily ask your llm to do it for you.

ok123456 5 days ago

Chebyshev approximation for asin is sum(2T_n(x) / (pi*n*n),n), the even terms are 0.

patchnull 5 days ago

[flagged]

Sesse__ 5 days ago
And similarly, entire generations of programmers were never taught Horner's scheme. You can see it in the article, where they write stuff like
A * x * x * x * x * x * x + B * x * x * x * x + C * x * x + D
(10 muls, 3 muladds)
instead of the faster
tmp = x * x; ((A * tmp + B) * tmp + C) * tmp + D
(1 mul, 3 muladds)
- def-pri-pub 5 days ago
  
  The reason for writing out all of the x multiplications like that is that I was hoping the compiler detect such a pattern perform an optimization for me. Mat Godbolt's "Advent of Compiler Optimizations" series mentions some of these cases where the compiler can do more auto-optimizations for the developer.
  
  1 reply →
- anematode 5 days ago
  
  Yep, good stuff. Another nice trick to extract more ILP is to split it into even/odd exponents and then recombine at the end (not sure if this has a name). This also makes it amenable to SLP vectorization (although I doubt the compiler will do this nicely on its own). For example something like
  typedef double v2d __attribute__ ((vector_size (16))); v2d packed = { x, x }; packed = fma(packed, As, Bs); packed = fma(packed, Cs, Ds); // ... return x * packed[0] + packed[1]
  smth like that
  Actually one project I was thinking of doing was creating SLP vectorized versions of libm functions. Since plenty of programs spend a lot of time in libm calling single inputs, but the implementation is usually a bunch of scalar instructions.
- woadwarrior01 5 days ago
  
  The common subexpression elimination (CSE) pass in compilers takes care of that.
  
  2 replies →
- eska 5 days ago
  
  The problem with Horner’s scheme is that it creates a long chain of data dependencies, instead of making full use of all execution units. Usually you’d want more of a binary tree than a chain.
  
  5 replies →
- arkmm 5 days ago
  
  Didn't know this technique had a name, but I would think a modern compiler could make this optimization on its own, no?
  
  1 reply →
- zahlman 5 days ago
  
  Is this outside of what compilers can do nowadays? (Or do they refuse because it's floating-point?)
- 3371 5 days ago
  
  Isn't that for... readability...?
- owlbite 5 days ago
  
  Not just for speed, Horner can also be essential for numerical stability.
- boothby 5 days ago
  
  Thinking about speed like this used to be necessary in C and C++ but these days you should feel free to write the most legible thing (Horner's form) and let the compiler find the optimal code for it (probably similar to Horner's form but broken up to have a shallower dependency chain).
  But if you're writing in an interpreted language that doesn't have a good JIT, or for a platform with a custom compiler, it might be worth hand-tweaking expressions with an eye towards performance and precision.
  
  5 replies →
david-gpu 5 days ago

Yeah, I once worked at a place where the compiler team was assigned the unpleasant task of implementing a long list of trigonometry functions. They struggled for many months to get the accuracy that was required of them, and when they did the performance was abysmal compared to the competition.
In hindsight, they probably didn't have anybody with the right background and should have contracted out the job. I certainly didn't have the necessary knowledge, either.

patchnull 5 days ago

[flagged]

stephencanon 5 days ago
These sorts of approximations (and more sophisticated methods) are fairly widely used in systems programming, as seen by the fact that Apple's asin is only a couple percent slower and sub-ulp accurate (https://members.loria.fr/PZimmermann/papers/accuracy.pdf). I would expect to get similar performance on non-Apple x86 using Intel's math library, which does not seem to have been measured, and significantly better performance while preserving accuracy using a vectorized library call.
The approximation reported here is slightly faster but only accurate to about 2.7e11 ulp. That's totally appropriate for the graphics use in question, but no one would ever use it for a system library; less than half the bits are good.
Also worth noting that it's possible to go faster without further loss of accuracy--the approximation uses a correctly rounded square root, which is much more accurate than the rest of the approximation deserves. An approximate square root will deliver the same overall accuracy and much better vectorized performance.
- Pannoniae 5 days ago
  
  Yeah, the only big problem with approx. sqrt is that it's not consistent across systems, for example Intel and AMD implement RSQRT differently... Fine for graphics, but if you need consistency, that messes things up.
  
  5 replies →
- patchnull 5 days ago
  
  [flagged]
  
  1 reply →
def-pri-pub 5 days ago

I did scan some (major) open source games and graphics related project and found a few of them using `std::asin()`. I plan on submitting some patches.