Comment by janwas
1 year ago
I'm also in the mission-critical camp, with perhaps an interesting counterpoint. If we're focusing on small details (or drowning in incidental complexity), it can be harder to see algorithmic optimizations. Or the friction of changing huge amounts of per-platform code can prevent us from escaping a local minimum.
Example: our new matmul outperforms a well-known library for LLM inference, sometimes even when that library uses AMX and ours uses AVX512BF16. Why? They seem to have some threading bottleneck, or maybe it's something else; it's hard to tell with a JIT involved.
This would not have happened if I had to write per-platform kernels. There are only so many hours in the day. Writing a single implementation using Highway enabled exploring more of the design space, including a new kernel type and an autotuner able to pick not only block sizes, but also parallelization strategies and their parameters.
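To make the autotuner idea concrete, here is a minimal sketch in plain C++ of picking both a block size and a parallelization strategy by measuring each candidate. This is not the actual autotuner and does not use Highway's API: the names `Parallelism`, `Config`, `EnumerateConfigs`, `Autotune`, and `MinSeconds` are all hypothetical, and a real tuner would likely prune the search space rather than try every combination.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

// Hypothetical parallelization strategies; a real kernel may have others.
enum class Parallelism { kRowsOfA, kColsOfB };

// One point in the design space: a block size plus a strategy.
struct Config {
  size_t block_size;
  Parallelism par;
};

// Builds the cross product of block sizes and strategies.
std::vector<Config> EnumerateConfigs(const std::vector<size_t>& block_sizes) {
  std::vector<Config> configs;
  for (size_t bs : block_sizes) {
    configs.push_back({bs, Parallelism::kRowsOfA});
    configs.push_back({bs, Parallelism::kColsOfB});
  }
  return configs;
}

// Returns the minimum wall-clock time in seconds over `reps` runs of `run`.
// Taking the minimum reduces noise from other processes.
template <typename Func>
double MinSeconds(const Func& run, int reps = 3) {
  double best = std::numeric_limits<double>::infinity();
  for (int i = 0; i < reps; ++i) {
    const auto t0 = std::chrono::steady_clock::now();
    run();
    const auto t1 = std::chrono::steady_clock::now();
    best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
  }
  return best;
}

// Exhaustive search: evaluates `measure` once per candidate and returns the
// candidate with the lowest cost. Ties keep the earlier candidate.
Config Autotune(const std::vector<Config>& candidates,
                const std::function<double(const Config&)>& measure) {
  assert(!candidates.empty());
  Config best = candidates[0];
  double best_cost = std::numeric_limits<double>::infinity();
  for (const Config& c : candidates) {
    const double cost = measure(c);
    if (cost < best_cost) {
      best_cost = cost;
      best = c;
    }
  }
  return best;
}
```

In practice `measure` would wrap the kernel itself, for example a lambda that returns `MinSeconds([&] { RunMatMul(c); })`, where `RunMatMul` stands in for whatever single-source kernel is being tuned. The point is that the search loop is written once and covers every target the kernel compiles for.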
Perhaps in a second step, one can then hand-tune some parts, but I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.
> I sure hope a broader exploration precedes micro-optimizing register allocation and calling conventions.
It should be obvious that both are pursued independently whenever it makes sense. The idea that one should precede the other or is more important than the other is simply untrue.
How can tuning be independent of devising the algorithm?
Are you really suggesting writing a variant of a kernel, tuning it to the max, then discovering a new and different way to do it, and then discarding the first implementation? That seems like a lot of wasted effort.