Comment by drsopp

5 days ago

Did some quick calculations, and at this precision, it seems a table lookup might be able to fit in the L1 cache depending on the CPU model.

Microbenchmarks. A LUT will win many of them but you pessimise the rest of the code. So unless a significant (read: 20+%) portion of your code goes into the LUT, there isn't that much point to bother. For almost any pure calculation without I/O, it's better to do the arithmetic than to do memory access.

  • Locality within the LUT matters too: if you know you're looking up identical or nearby-enough values to benefit from caching, an LUT can be more of a win. You only pay the cache cost for the portion you actually touch at runtime.

    I could imagine some graphics workloads tend compute asin() repeatedly with nearby input values. But I'd guess the locality isn't local enough to matter, only eight double precision floats fit in a cache line.

  • Cache size and replacement policies can ruin even a well-tuned LUT once your working set grows or other threads spray cache lines so "just use a LUT" quietly turns into "debug the perf cliff" later. If the perf gain disappears under load or with real input sets you realise too late it was just a best-case microbenchmark trick.

Surely the loss in precision of a 32KB LUT for double precision asin() would be unacceptable?

  • By interpolating between values you can get excellent results with LUTs much smaller than 32KB. Will it be faster than the computation from op, that I don't know.

    • I experimented a bit with the code. Various tables with different datatypes. There is enough noise from the Monte Carlo to not make a difference if you use smaller data types than double or float. Even dropping interpolation worked fine, and got the speed to be on par with the best in the article, but not faster.

      2 replies →

    • I'm very skeptical you wouldn't get perceptible visual artifacts if you rounded the trig functions to 4096 linear approximations. But I'd be happy to be proven wrong :)