Comment by jcranmer

4 hours ago

I am extremely skeptical that that would be the case. Local stack accesses are practically guaranteed to be L1 cache hits, and if any memory access can be made fast, it's an access to the local stack. The general rule of thumb in performance engineering is that you're optimizing for L2 cache misses if your working set doesn't fit in L2 cache, so overall I'd be shocked if this convoluted calling convention could eke out more than a few percent improvement; even 1% I'm skeptical of.

Meanwhile, making 14-argument functions is going to create a lot of extra work for LLVM in several places that I can think of. For starters, most uses of SmallVector<Value *, N> choose 4 or 8 for N, so there would be a lot of heap-allocating 14-element arrays going on, which would more than eat up the gains you'd be expecting.