Comment by tripletao

10 hours ago

It would almost always be much, much worse. Practical numerical libraries (whether implemented in hardware or software) contain lots of redundancy, because their goal is to give you an optimized primitive as close as possible to the operation you actually want. For example, the library provides an optimized tan(x) to save you from calling sin(x)/cos(x), because one nasty function evaluation (as a power series, lookup table, CORDIC, etc.) is faster than two nasty function evaluations and a divide.

Of course the redundant primitives aren't free, since they add code size or die area. In choosing how many primitives to provide, the designer of a numerical library aims to make a reasonable tradeoff between that size cost and the speed benefit.

This paper takes that tradeoff to the least redundant extreme because that's an interesting theoretical question, at the cost of transforming commonly-used operations with simple hardware implementations (e.g. addition, multiplication) into computational nightmares. I don't think anyone has found a practical application for their result yet, but that's not the point of the work.

I actually don't think this is true -

Traditional processors, even highly dedicated ones like TMUs in gpus, still require being preconfigured substantially in order to switch between sin/cos/exp2/log2 function calls, whereas a silicon implementation of an 8-layer EML machine could do that by passing a single config byte along with the inputs. If you had a 512-wide pipeline of EML logic blocks in modern silicon (say 5nm), you could get around 1 trillion elementary function evaluations per second on 2.5ghz chip. Compare this with a 96 core zen5 server CPU with AVX-512 which can do about 50-100 billion scalar-equivalent evaluations per second across all cores only for one specific unchanging function.

Take the fastest current math processors: TMUs on a modern gpu: it can calculate sin OR cos OR exp2 OR log2 in 1 cycle per shader unit... but that is ONLY for those elementary functions and ONLY if they don't change - changing the function being called incurs a huge cycle hit, and chaining the calculations also incurs latency hits. An EML coprocessor could do arcsinh(x² + ln(y)) in the same hardware block, with the same latency as a modern cpu can do a single FMA instruction.

  • So to clarify, you think that replacing every multiplication with 24 transcendental function evaluations (12 eml(x, y), each of which evaluates exp(x) and ln(y) plus the subtraction; see the paper's Fig 2) is somehow a win?

    The fact that addition, subtraction, and multiplication run quickly on typical processors isn't arbitrary--those operations map well onto hardware, for roughly the same reasons that elementary school students can easily hand-calculate them. General transcendental functions are fundamentally more expensive in time, die area, and/or power, for the same reasons that elementary school students can't easily hand-calculate them. A primitive where all arithmetic (including addition, subtraction, or negation) involves multiple transcendental function evaluations is not computationally faster, lower-power, lower-area, or otherwise better in any other practical way.

    The comments here are filled with people who seem to be unaware of this, and it's pretty weird. Do CS programs not teach computer arithmetic anymore?