← Back to context

Comment by JonChesterfield

2 years ago

Reasonable sketch. This is missing the caller/called save distinction and makes the usual error of assigning a subset of the input registers to output.

It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

Changing ABI with optimization setting interacts really badly with separate compilation.

Shuffling arguments around in bin packing fashion does work but introduces a lot of complexity in the compiler, not sure it's worth it relative to left to right first fit. It also makes it difficult for the developer to predict where arguments will end up.

The general plan of having different calling conventions for addresses that escape than for those that don't is sound. Peeling off a prologue that does the impedance matching works well.

Rust probably should be willing to have a different calling convention to C, though I'm not sure it should be a hardcoded one that every function uses. Seems an obvious thing to embed in the type system to me and allowing developer control over calling convention removes one of the performance advantages of assembly.

> This is missing the caller/called save distinction and makes the usual error of assigning a subset of the input registers to output.

Out of curiosity, what's so problematic about using some input registers as output registers? On the caller's side, you'd want to vacate the output registers between any two function calls regardless. And it occurs pretty widely in syscall conventions, to my binary-golfing detriment.

Is it for the ease of the callee, so that it can set up the output values while keeping the input values in place? That would suggest trying to avoid overlap (by placing the output registers at the end of the input sequence), but I don't see how it would totally contraindicate any overlap.

  • You should use all the input registers as output registers, unless your arch is doing some sliding window thing. The x64 proposal linked uses six to pass arguments in and three to return results. So returning six integers means three in registers, three on the stack, with three registers that were free to use containing nothing in particular.

    • The LLVM calling conventions for x86 only allow returning 3 integer registers, 4 vector registers, and 2 x87 floating point registers (er, stack slots technically because x87 is weird).

      9 replies →

Allowing developer control over calling conventions is also simultaneous with disallowing optimization in the case that Function A calls Function B calls Function C calls Function D etc. but along the way one or more of those functions could have their arguments swapped around to a different convention to reduce overhead. What semantics would preserve such an optimization but allow control? Would it just be illusory?

And in practice assembly has the performance disadvantage of not being subject to most compiler optimizations, often including "introspecting on its operation, determining it is fully redundant, and eliminating it entirely". It's not the 1990s anymore.

In the cases where that kind of optimization is not even possible to consider, though, the only place I'd expect inline assembly to be decisively beaten is using profile-guided optimization. That's the only way to extract more information than "perfect awareness of how the application code works", which the app dev has and the compiler dev does not. The call overhead can be eliminated by simply writing more assembly until you've covered the relevant hot boundaries.

  • If those functions are external you've lost that optimisation anyway. If they're not, the compiler chooses whether to ignore your annotation or not as usual. As is always the answer, the compiler doesn't get to make observable changes (unless you ask it to, fwrong-math style).

    I'd like to specify things like extra live out registers, reduced clobber lists, pass everything on the stack - but on the function declaration or implementation, not having to special case it in the compiler itself.

    Sufficiently smart programmers beat ahead of time compilers. Sufficiently smart ahead of time compilers beat programmers. If they're both sufficiently smart you get a common fix point. I claim that holds for a jit too, but note that it's just far more common for a compiler to rewrite the code at runtime than for a programmer to do so.

    I'd say that assembly programmers are rather likely to cut out parts of the program that are redundant, and they do so with domain knowledge and guesswork that is difficult to encode in the compiler. Both sides are prone to error, with the classes of error somewhat overlapping.

    I think compilers could be a lot better at codegen than they presently are, but the whole "programmers can't beat gcc anymore" idea isn't desperately true even with the current state of the art.

    Mostly though I want control over calling conventions in the language instead of in compiler magic because it scales much better than teaching the compiler about properties of known functions. E.g. if I've written memcpy in asm, it shouldn't be stuck with the C caller save list, and avoiding that shouldn't involve a special case branch in the compiler backend.

> It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.

DWARF doesn't encode bespoke calling conventions at all today.

The bin packing will probably make it slower though, especially in the bool case since it will create dependency chains. For bools on x64, I don‘t think there‘s a better way than first having to get them in a register, shift them and then OR them into the result. The simple way creates a dependency chain of length 64 (which should also incur a 64 cycle penalty) but you might be able to do 6 (more like 12 realistically) cycles. But then again, where do these 64 bools come from? There aren‘t that many registers so you will have to reload them from the stack. Maybe the rust ABI already packs bools in structs this tightly so it‘s work that has to be done anyway but I don‘t know too much about it.

And then the caller will have to unpack everything again. It might be easier to just teach the compiler to spill values into the result space on the stack (in cases the IR doesn‘t already store the result after the computation) which will likely also perform better.

  • Unpacking bools is cheap - to move any bit into a flag is just a single 'test' instruction, which is as good as it gets if you have multiple bools (other than passing each in a separate flag, which is quite undesirable).

    Doing the packing in a tree fashion to reduce latency is trivial, and store→load latency isn't free either depending on the microarchitecture (and at the counts where log2(n) latency becomes significant you'll be at IPC limit anyway). Packing vs store should end up at roughly the same instruction counts too - a store vs an 'or', and exact same amount of moving between flags ang GPRs.

    Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2.

    Where possible it would of course make sense to pass values in separate registers instead of in one, but when the alternative is spilling to stack, packing is still worthy of consideration.

    • > Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2

      I don't have a strong sense of how much more common owned `Option` types are than references, but it's worth noting that if `T` is a reference, `Option<T>` will just use a pointer and treat the null value as `None` under the hood to avoid needing any tag. There are probably other types where this is done as well (maybe `NonZero` integer types?)

      3 replies →

Also, most modern processors will easily forward the store to the subsequent read and has a bunch of tricks for tracking the stack state. So much does putting things in registers help anyway?

  • Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.

    (Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.

    • It’s limited, but in the argument passing context you’re storing to a location that’s almost certainly in L1, and then probably loading it immediately within the called function. So the store will likely take up a store queue slot for just a few cycles before the store retires.

      1 reply →

  • More broadly: processor design has been optimised around C style antics for a long time, trying to optimise the code produced away from that could well inhibit processor tricks in such a way that the result is _slower_ than if you stuck with the "looks terrible but is expected & optimised" status quo

    • Reminds me of Fortran compilers recognising the naive three-nested-loops matrix multiplication and optimising it to something sensible.

  • Register allocation decisions routinely result in multi-percent performance changes, so yes, it does.

    Also, registers help the MachineInstr-level optimization passes in LLVM, of which there are quite a few.