Comment by tgtweak
4 hours ago
For basic arithmetic, this is not required nor would it be faster, as it is not likely advantageous for bulk static transcendal functions. Where this becomes interesting is when combining them OR when chaining them where today they must come back out to the main process for reconfiguration and then re-issued.
Practical terms: Jacobian (heavily used in weather and combustion simulation): The transcendental calls, mostly exp(-E_a/RT), are the actual clock-cycle bottleneck. The GPU's SFU computes one exp2 at a time per SM. The ALU then has to convert it (exp(x) = exp2(x × log2(e))), multiply by the pre-exponential factor, and accumulate partial derivatives. It's a long serial chain for each reaction rate.
The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.
GPU (PreJacGPU) is currently the state of the art for speed on these simulations - a moderate width 8-depth EML machine could process a very complex Jacobian as fast as the gpu can evaluate one exp(). Even on a sub-optimal 250mhz FPGA, an entire 50x50 Jacobian would be about 3.5 microseconds vs 50 microseconds PER Jacobian on an A100.
If you put that same logic path into an ASIC, you'd be about 20x the fPGA's speed - in the nanoseconds per round. And this is not like you're building one function into an ASIC it's general purpose. You just feed it a compiled tree configuration and run your data through it.
For anything like linear algebra math, which is also used here, you'd delegate that to the dedicated math functions on the processor - it wouldn't make sense to do those in this.
> The core of this is the Arrhenius rate, (A × T^n × exp(-E_a/(R×T))), which involves an exponentiation, a division, a multiplication, and an exponential. On a GPU, that's multiple SFU calls chained with ALU ops. In an EML tree, the whole expression compiles to a single tree that flows through the pipeline in one pass.
I think you're missing the reason why the GPU kicks you out of the fast path when you need that special function. The special function evaluation is fundamentally more expensive in energy, whether that cost is paid in area or time. Evaluation of the special functions with throughput similar to the arithmetic throughput would require much more area for the special functions, which for most computation isn't a good tradeoff. That's why the GPU's designers chose to make your exp2 slow.
Replacing everything with dozens of cascaded special functions makes everything uniform, but it's uniformly much worse. I feel like you're assuming that by parallelizing your "EML tree" in dedicated hardware that problem goes away; but area isn't free in either dollars or power, so it doesn't.
There is a huge market for "its faster" at the cost of efficiency, but I don't think your claim that an EML hardware block would be inherently less inefficient than the same workload running on a GPU. If you think it would be, back it up with some numbers.
A 10-stage EML pipeline would be about the size of an avx-512 instruction block on a modern CPU, in the realm of ~0.1mm2 on a 5nm process node (collectively including the FMA units behind it), at it's entirety about 1% of the CPU die. None of this suggests that even a ~500 wide 10-stage EML pipeline would be consuming anywhere near the power of a modern datacenter GPU (which wastes a lot of it's energy moving things from memory to ALU to shader core...).
Not sure if you're arguing from a hypothetical position or practical one but you seem to be narrowing your argument to "well for simple math it's less efficient" but that's not the argument being made at all.