Comment by tripletao

2 hours ago

So to clarify, you think that replacing every multiplication with 24 transcendental function evaluations (12 eml(x, y), each of which evaluates exp(x) and ln(y) plus the subtraction; see the paper's Fig 2) is somehow a win?
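To make the cost accounting concrete, here is a minimal Python sketch, assuming eml(x, y) = exp(x) - ln(y) as described above. The `calls` counter and the helper names are illustrative only. The three-call construction of ln is an identity checked by hand, not necessarily the tree the paper uses, and the 12-call figure for multiplication is the one quoted from Fig 2, not derived here.

```python
import math

calls = {"eml": 0}  # illustrative counter for eml invocations


def eml(x, y):
    """eml(x, y) = exp(x) - ln(y): one exp, one ln, one subtraction."""
    calls["eml"] += 1
    return math.exp(x) - math.log(y)


def exp_via_eml(x):
    # ln(1) = 0, so eml(x, 1) = exp(x): 1 eml call.
    return eml(x, 1.0)


def ln_via_eml(x):
    # eml(1, x)            = e - ln(x)
    # eml(e - ln(x), 1)    = e^e / x
    # eml(1, e^e / x)      = e - (e - ln(x)) = ln(x): 3 eml calls.
    return eml(1.0, eml(eml(1.0, x), 1.0))


calls["eml"] = 0
value = ln_via_eml(2.0)
ln_calls = calls["eml"]

print(value, math.log(2.0))  # the two agree up to floating-point rounding
print(ln_calls, "eml calls =", 2 * ln_calls, "transcendental evaluations")
```

Even a single ln costs three eml calls under this construction, i.e. six transcendental evaluations, before any arithmetic has been done with the result.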

The fact that addition, subtraction, and multiplication run quickly on typical processors isn't arbitrary--those operations map well onto hardware, for roughly the same reasons that elementary school students can easily hand-calculate them. General transcendental functions are fundamentally more expensive in time, die area, and/or power, for the same reasons that elementary school students can't easily hand-calculate them. A primitive where all arithmetic (including addition, subtraction, or negation) involves multiple transcendental function evaluations is not computationally faster, lower-power, lower-area, or better in any other practical way.

The comments here are filled with people who seem to be unaware of this, and it's pretty weird. Do CS programs not teach computer arithmetic anymore?

1 comment

tgtweak 2 hours ago

For basic arithmetic, this is not required, nor would it be faster, nor is it likely to be advantageous for bulk static transcendental functions. Where this becomes interesting is when combining or chaining them, where today each result must come back out to the main process for reconfiguration and then be re-issued.

In practical terms, take Jacobian evaluation (heavily used in weather and combustion simulation): the transcendental calls, mostly exp(-E_a/(R×T)), are the actual clock-cycle bottleneck. The GPU's SFU computes one exp2 at a time per SM. The ALU then has to convert it (exp(x) = exp2(x × log2(e))), multiply by the pre-exponential factor, and accumulate partial derivatives. It's a long serial chain for each reaction rate.
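A minimal Python sketch of the conversion identity mentioned above, exp(x) = exp2(x × log2(e)). The function name `exp_via_exp2` is illustrative, and plain `2.0 ** y` stands in for the hardware exp2 instruction; this only checks the math, not GPU behaviour.

```python
import math

LOG2_E = math.log2(math.e)  # about 1.4427


def exp_via_exp2(x):
    # One multiply (the ALU's job), then one base-2 exponential
    # (the step the comment attributes to the SFU).
    return 2.0 ** (x * LOG2_E)


samples = (-20.0, -1.5, 0.0, 3.0)
for x in samples:
    print(x, exp_via_exp2(x), math.exp(x))
```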

The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.
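For reference, a minimal Python sketch of the Arrhenius rate and its temperature derivative, the sort of term that ends up in the Jacobian. The parameter values below are made up for illustration and do not come from any real mechanism; the function names are hypothetical.

```python
import math

R = 8.314462618  # molar gas constant, J/(mol*K)


def arrhenius(A, n, E_a, T):
    """k(T) = A * T**n * exp(-E_a / (R * T))."""
    power = T ** n                     # the exponentiation
    arg = -E_a / (R * T)               # the multiplication and division
    return A * power * math.exp(arg)   # the exponential, then multiplies


def dk_dT(A, n, E_a, T):
    """Analytic derivative: dk/dT = k * (n / T + E_a / (R * T**2))."""
    return arrhenius(A, n, E_a, T) * (n / T + E_a / (R * T * T))


# Illustrative (made-up) parameters: A, n, E_a in J/mol, T in K.
A, n, E_a, T = 1.0e10, 0.5, 8.0e4, 1500.0
print(arrhenius(A, n, E_a, T), dk_dT(A, n, E_a, T))
```

Each rate needs the full chain of power, divide, exp, and multiply, and the derivative reuses the same exp, which is why the exp call dominates when it is the slow step.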

GPU (PreJacGPU) is currently the state of the art for speed on these simulations - a moderate-width, depth-8 EML machine could process a very complex Jacobian as fast as the GPU can evaluate one exp(). Even on a sub-optimal 250 MHz FPGA, an entire 50x50 Jacobian would take about 3.5 microseconds vs 50 microseconds PER Jacobian on an A100.

If you put that same logic path into an ASIC, you'd be about 20x the FPGA's speed - in the nanoseconds per round. And this is not like building one function into an ASIC; it's general purpose. You just feed it a compiled tree configuration and run your data through it.
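A quick back-of-the-envelope check of those figures in Python. All inputs are the estimates quoted in this comment, not measurements, and the per-entry number assumes every one of the 50x50 entries is actually computed.

```python
fpga_clock_hz = 250e6      # 250 MHz FPGA, as quoted
fpga_jacobian_s = 3.5e-6   # estimated time per 50x50 Jacobian on the FPGA
gpu_jacobian_s = 50e-6     # quoted time per Jacobian on an A100
asic_speedup = 20          # estimated ASIC advantage over the FPGA

fpga_cycles = fpga_jacobian_s * fpga_clock_hz            # about 875 cycles
fpga_vs_gpu = gpu_jacobian_s / fpga_jacobian_s           # about 14.3x
asic_jacobian_ns = fpga_jacobian_s / asic_speedup * 1e9  # about 175 ns

entries = 50 * 50                                        # 2500 entries
cycles_per_entry = fpga_cycles / entries                 # about 0.35

print(f"FPGA cycles per Jacobian: {fpga_cycles:.0f}")
print(f"FPGA vs A100: {fpga_vs_gpu:.1f}x")
print(f"ASIC time per Jacobian: {asic_jacobian_ns:.0f} ns")
print(f"FPGA cycles per entry: {cycles_per_entry:.2f}")
```

So the estimate works out to roughly 875 FPGA cycles per Jacobian, which under the dense assumption means finishing nearly three entries per cycle on average, and about 175 ns per Jacobian on the hypothetical ASIC.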

For anything like linear algebra, which is also used here, you'd delegate to the dedicated math functions on the processor - it wouldn't make sense to do that work on an EML machine.