Comment by pizlonator
2 years ago
The main thing you want to do when optimizing the calling convention is measure its perf, not ruminate about what you think is good. Code performs well if it runs fast, not if it looks like it will.
Sometimes, what the author calls bad code is actually the fastest thing you can do for totally not obvious reasons. The only way to find out is to measure the performance on some large benchmark.
One reason why sometimes bad looking calling conventions perform well is just that they conserve argument registers, which makes the register allocator’s life a tad easier.
Another reason is that the CPUs of today are optimized on traces of instructions generated by C compilers. If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow.
Another reason is that inlining is so successful, so calls are a kind of unusual boundary on the hot path. It’s fine to have some jank on that boundary if it makes other things simpler.
Not saying that the changes done here are bad, but I am saying that it’s weird to just talk about what looks like weird code without measuring.
(Source: I optimized calling conventions for a living when I worked on JavaScriptCore. I also optimized other things too but calling conventions are quite dear to my heart. It was surprising how often bad-looking pass-on-the-stack code won on big, real code. Weird but true.)
I very much agree with that, especially since - like you said - code that looks like it will perform well does not always do so.
That being said I'd like to add that in my opinion performance measurement results should not be the only guiding principle.
You said it yourself: "Another reason is that the CPUs of today are optimized [..]"
The important word is "today". CPUs evolved and still do and a calling convention should be designed for the long term.
Sadly, it means that it is beneficial to not deviate too much from what C++ does [1], because it is likely that future processor optimizations will be targeted in that direction.
Apart from that it might be worthwhile to consider general principles that are not likely to change (e.g. conserve argument registers, as you mentioned), to make the calling convention robust and future proof.
[1] It feels a bit strange, when I say that because I think Rust has become a bit too conservative in recent years, when it comes to its weirdness budget (https://steveklabnik.com/writing/the-language-strangeness-bu...). You cannot be better without being different, after all.
The Rust calling convention is actually defined as unstable, so 1.79 is allowed to have a different calling convention than 1.80 and so on. I don't think designing one for the long term is a real concern right now.
I know, but from what I understand there are initiatives to stabilize the ABI, which would also mean stabilizing calling conventions. I read the article in that broader context, even if it does not talk about that directly.
If I remember correctly there is a bit of a difference between an explicit `extern "Rust"` annotation and no explicit calling convention, but I'm not so sure.
Anyway, at least when not using an explicit repr, Rust doesn't even guarantee that the layout of a struct is the same for two repeated builds _with the same compiler and code_. That is very intentional, and I think there is no intent to change it "in general" (though various subsets might be standardized - e.g. `Option<&T> where T: Sized` mapping `None` to a null pointer, allowing you to use it in C FFI, is already a de facto standard). As far as I remember, this is where explicit `extern "Rust"` comes in, to make sure that we can have a prebuilt libstd; it still can change with _any_ compiler version, including patch versions. E.g. a hypothetical 1.100 and 1.100.1 might not have the same unstable Rust calling convention.
> means that it is beneficial to not deviate too much from what C++ does
Or just C.
Reminds me of when I looked up SIMD instructions for searching string views. It was more performant to slap a '\0' on the end and use the null-terminated string instructions than to use the explicit-length string-view search instructions.
Huh, I thought they fixed that (the PCMPISTR? string instructions from SSE4.2 being significantly faster than PCMPESTR?), but looks like the explicit-length version still takes twice as many uops on recent Intel and AMD CPUs. They don’t seem to get much use nowadays anyway, though, what with being stuck in the 128-bit realm (there’s a VEX encoding but that’s basically it).
> and a calling convention should be designed for the long term
...isn't the article just about Rust code calling Rust code? That's a much more flexible situation than calling into operating system functions or into other languages. For calling within the same language a stable ABI is by far not as important as on the 'ecosystem boundaries', and might actually be harmful (see the related drama in the C++ world).
You are right, as Josh Triplett also pointed out above. I was mistaken about the plans to stabilize the ABI.
Yep. Also whether passing in registers is faster or not also depends on the function body. It doesn't make much sense if the first thing the function does is to take the address of the parameter and passes it to some opaque function. Then it needs to be spilled onto the stack anyway.
It would be interesting to see calling convention optimizations based on function body. I think that would be safe for static functions in C, as long as their address is not taken.
Dynamic calling conventions also won't work with dynamic linking
Even when dynamic linking a lot of calls will be internal to each library. So you can either:
1. Use a stable calling convention for external interfaces.
2. Use a stable calling convention for everything, but generate trampolines for external calls.
Swift is actually pretty cool here. It basically does 2. But you can also specify which dependencies are "pinned" so that even if they are dynamically linked they can't be updated without a recompile. Then you can use the unstable calling convention when calling those dependencies.
Your experience is not perfectly transferable. JITs have it easy on this because they've already gathered a wealth of information about the actually-executing-on CPU by the time they generate a single line of assembly. Calls appear on the hot path more often in purely statically compiled code because things like the runtime architectural feature set are not known, so you often reach inlining barriers precisely in the code that you would most like to optimize.
LLVM inlines even more than my JIT does.
The JIT has both more and less information.
It has more information about the program globally. There is no linking or “extern” boundary.
But whereas the AOT compiler can often prove that it knows about all of the calls to a function that could ever happen, the JIT only knows about those that happened in the past. This makes it hard (and sometimes impossible) for the JIT to do the call graph analysis style of inlining that llvm does.
One great example of something I wish my jit had but might never be able to practically do, but that llvm eats for breakfast: “if A calls B in one place and nothing else calls B, then inline B no matter how large it is”.
(I also work on ahead of time compilers, though my impact there hasn’t been so big that I brag about it as much.)
> the JIT only knows about those that happened in the past.
This is typically handled by assuming that all future calls will be representative of past calls. Then you add a (cheap) check for that assumption and fall back to interpreter or an earlier JIT that didn't make that assumption.
This can actually be better than AOT because you may have some incredibly rare error path that creates a second call to the function. But you are better off compiling that function assuming that the error never occurs. In the unlikely case it does occur you can fall back to the slower path and end up faster overall. Unless the AOT compiler wants to emit two specializations of the function, the generated code needs to handle all possible cases, no matter how unlikely.
Of course in practice AOT wins. But there are many interesting edge cases where a JIT can pull off an optimization that an AOT compiler can't do.
The people who write JITs also write a bunch of C++ that gets statically compiled.
And remember that performance can include binary size, not just runtime speed. Current Rust seems to suffer in that regard for small platforms, calling convention could possibly help there wrt Result returns.
The current calling convention is terrible for small platforms, especially when using Result<> in return position. For large enums, the compiler should put the discriminant in a register and the large variants on the stack. As is, you pay a significant code size penalty for idiomatic rust error handling.
There were proposals for optimizing this kind of stuff for C++ in particular for error handling, like:
https://www.open-std.org/jtc1/sc22/wg21/docs/papers/2019/p07...
> Throwing such values behaves as-if the function returned union{R;E;}+bool where on success the function returns the normal return value R and on error the function returns the error value type E, both in the same return channel including using the same registers. The discriminant can use an unused CPU flag or a register
Also a thing you gotta measure.
Passing a lot of stuff in registers causes a register shuffle at call sites and prologues. Hard to predict if that’s better or worse than stack spills without measuring.
> If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow.
The FA is mostly about x86, and Intel indeed did an amazing amount of clever engineering over the decades to allow ugly x86 code to run fast on the silicon you buy.
Still, does your point about the empirical benefit of passing on the stack continue to apply with a transition to register-rich ARMv8 CPUs or RISC-V?
ARM follows its own calling convention, which by default uses registers for both argument and return value passing [1], so these lessons likely do not apply.
[1] https://developer.arm.com/documentation/dui0041/c/ARM-Proced...
Yes.
If you flatten big structs into registers to pass them you have a bad time on armv8.
I tried. That was an llvm experiment. Ahead of time compiler for a modified version of C.
If you want fast, then you probably need to have a different calling convention per call.