The main thing you want to do when optimizing the calling convention is measure its perf, not ruminate about what you think is good. Code performs well if it runs fast, not if it looks like it will.
Sometimes, what the author calls bad code is actually the fastest thing you can do for totally non-obvious reasons. The only way to find out is to measure the performance on some large benchmark.
One reason why sometimes bad looking calling conventions perform well is just that they conserve argument registers, which makes the register allocator’s life a tad easier.
Another reason is that the CPUs of today are optimized on traces of instructions generated by C compilers. If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow.
Another reason is that inlining is so successful that calls are a kind of unusual boundary on the hot path. It’s fine to have some jank on that boundary if it makes other things simpler.
Not saying that the changes done here are bad, but I am saying that it’s weird to just talk about what looks like weird code without measuring.
(Source: I optimized calling conventions for a living when I worked on JavaScriptCore. I also optimized other things too but calling conventions are quite dear to my heart. It was surprising how often bad-looking pass-on-the-stack code won on big, real code. Weird but true.)
I very much agree with that, especially since - like you said - code that looks like it will perform well doesn't always do so.
That being said I'd like to add that in my opinion performance measurement results should not be the only guiding principle.
You said it yourself: "Another reason is that the CPUs of today are optimized [..]"
The important word is "today". CPUs have evolved and still do, and a calling convention should be designed for the long term.
Sadly, it means that it is beneficial to not deviate too much from what C++ does [1], because it is likely that future processor optimizations will be targeted in that direction.
Apart from that it might be worthwhile to consider general principles that are not likely to change (e.g. conserve argument registers, as you mentioned), to make the calling convention robust and future proof.
[1] It feels a bit strange, when I say that because I think Rust has become a bit too conservative in recent years, when it comes to its weirdness budget (https://steveklabnik.com/writing/the-language-strangeness-bu...). You cannot be better without being different, after all.
The Rust calling convention is actually defined as unstable, so 1.79 is allowed to have a different calling convention than 1.80 and so on. I don't think designing one for the long term is a real concern right now.
> means that it is beneficial to not deviate too much from what C++ does
Or just C.
Reminds me of when I looked up SIMD instructions for searching string views. It was more performant to slap a '\0' on the end and use null-terminated string instructions than to use string view search functions.
> and a calling convention should be designed for the long term
...isn't the article just about Rust code calling Rust code? That's a much more flexible situation than calling into operating system functions or into other languages. For calling within the same language a stable ABI is by far not as important as on the 'ecosystem boundaries', and might actually be harmful (see the related drama in the C++ world).
Yep. Also whether passing in registers is faster or not also depends on the function body. It doesn't make much sense if the first thing the function does is to take the address of the parameter and passes it to some opaque function. Then it needs to be spilled onto the stack anyway.
It would be interesting to see calling convention optimizations based on function body. I think that would be safe for static functions in C, as long as their address is not taken.
Your experience is not perfectly transferable. JITs have it easy on this because they've already gathered a wealth of information about the actually-executing-on CPU by the time they generate a single line of assembly. Calls appear on the hot path more often in purely statically compiled code because things like the runtime architectural feature set are not known, so you often reach inlining barriers precisely in the code that you would most like to optimize.
It has more information about the program globally. There is no linking or “extern” boundary.
But whereas the AOT compiler can often prove that it knows about all of the calls to a function that could ever happen, the JIT only knows about those that happened in the past. This makes it hard (and sometimes impossible) for the JIT to do the call graph analysis style of inlining that llvm does.
One great example of something I wish my jit had but might never be able to practically do, but that llvm eats for breakfast: “if A calls B in one place and nothing else calls B, then inline B no matter how large it is”.
(I also work on ahead of time compilers, though my impact there hasn’t been so big that I brag about it as much.)
And remember that performance can include binary size, not just runtime speed. Current Rust seems to suffer in that regard for small platforms, calling convention could possibly help there wrt Result returns.
The current calling convention is terrible for small platforms, especially when using Result<> in return position. For large enums, the compiler should put the discriminant in a register and the large variants on the stack. As is, you pay a significant code size penalty for idiomatic rust error handling.
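The cost being described is easy to see in the type's size alone (these are what current rustc produces; the default repr's layout is not a language guarantee):

```rust
use std::mem::size_of;

fn main() {
    // The whole enum is sized for its largest variant, so returning a
    // Result by value moves the full Err payload even on the Ok path.
    let small = size_of::<Result<u8, ()>>();
    let large = size_of::<Result<u8, [u8; 128]>>();
    println!("small: {small} bytes, large: {large} bytes");
    assert!(large > 128);
}
```

Splitting the discriminant into a register while the large variant lives on the stack would let the common Ok path avoid touching that payload at all.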
Passing a lot of stuff in registers causes a register shuffle at call sites and prologues. Hard to predict if that’s better or worse than stack spills without measuring.
"If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow."
The FA is mostly about x86 and Intel indeed did an amazing amount of clever engineering over decades to allow your ugly x86 code to run fast on their silicon that you buy.
Still, does your point about the empirical benefit of passing on the stack continue to apply with a transition to register rich ARMV8 CPUs or RISC-V?
ARM follows its own calling convention, which by default uses registers for both argument and return value passing [1], so these lessons likely do not apply.
Reasonable sketch. This is missing the caller/callee save distinction and makes the usual error of assigning a subset of the input registers to output.
It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.
Changing ABI with optimization setting interacts really badly with separate compilation.
Shuffling arguments around in bin packing fashion does work but introduces a lot of complexity in the compiler, not sure it's worth it relative to left to right first fit. It also makes it difficult for the developer to predict where arguments will end up.
The general plan of having different calling conventions for addresses that escape than for those that don't is sound. Peeling off a prologue that does the impedance matching works well.
Rust probably should be willing to have a different calling convention to C, though I'm not sure it should be a hardcoded one that every function uses. Seems an obvious thing to embed in the type system to me and allowing developer control over calling convention removes one of the performance advantages of assembly.
> This is missing the caller/callee save distinction and makes the usual error of assigning a subset of the input registers to output.
Out of curiosity, what's so problematic about using some input registers as output registers? On the caller's side, you'd want to vacate the output registers between any two function calls regardless. And it occurs pretty widely in syscall conventions, to my binary-golfing detriment.
Is it for the ease of the callee, so that it can set up the output values while keeping the input values in place? That would suggest trying to avoid overlap (by placing the output registers at the end of the input sequence), but I don't see how it would totally contraindicate any overlap.
You should use all the input registers as output registers, unless your arch is doing some sliding window thing. The x64 proposal linked uses six to pass arguments in and three to return results. So returning six integers means three in registers, three on the stack, with three registers that were free to use containing nothing in particular.
Allowing developer control over calling conventions also means disallowing optimization in the case where Function A calls Function B calls Function C calls Function D etc., but along the way one or more of those functions could have had their arguments swapped around to a different convention to reduce overhead. What semantics would preserve such an optimization but allow control? Would it just be illusory?
And in practice assembly has the performance disadvantage of not being subject to most compiler optimizations, often including "introspecting on its operation, determining it is fully redundant, and eliminating it entirely". It's not the 1990s anymore.
In the cases where that kind of optimization is not even possible to consider, though, the only place I'd expect inline assembly to be decisively beaten is using profile-guided optimization. That's the only way to extract more information than "perfect awareness of how the application code works", which the app dev has and the compiler dev does not. The call overhead can be eliminated by simply writing more assembly until you've covered the relevant hot boundaries.
If those functions are external you've lost that optimisation anyway. If they're not, the compiler chooses whether to ignore your annotation or not as usual. As is always the answer, the compiler doesn't get to make observable changes (unless you ask it to, fwrong-math style).
I'd like to specify things like extra live out registers, reduced clobber lists, pass everything on the stack - but on the function declaration or implementation, not having to special case it in the compiler itself.
Sufficiently smart programmers beat ahead of time compilers. Sufficiently smart ahead of time compilers beat programmers. If they're both sufficiently smart you get a common fix point. I claim that holds for a jit too, but note that it's just far more common for a compiler to rewrite the code at runtime than for a programmer to do so.
I'd say that assembly programmers are rather likely to cut out parts of the program that are redundant, and they do so with domain knowledge and guesswork that is difficult to encode in the compiler. Both sides are prone to error, with the classes of error somewhat overlapping.
I think compilers could be a lot better at codegen than they presently are, but the whole "programmers can't beat gcc anymore" idea isn't desperately true even with the current state of the art.
Mostly though I want control over calling conventions in the language instead of in compiler magic because it scales much better than teaching the compiler about properties of known functions. E.g. if I've written memcpy in asm, it shouldn't be stuck with the C caller save list, and avoiding that shouldn't involve a special case branch in the compiler backend.
> It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.
DWARF doesn't encode bespoke calling conventions at all today.
The bin packing will probably make it slower though, especially in the bool case since it will create dependency chains.
For bools on x64, I don't think there's a better way than first having to get them in a register, shift them and then OR them into the result. The simple way creates a dependency chain of length 64 (which should also incur a 64 cycle penalty) but you might be able to do 6 (more like 12 realistically) cycles.
But then again, where do these 64 bools come from? There aren't that many registers so you will have to reload them from the stack.
Maybe the Rust ABI already packs bools in structs this tightly so it's work that has to be done anyway, but I don't know too much about it.
And then the caller will have to unpack everything again.
It might be easier to just teach the compiler to spill values into the result space on the stack (in cases where the IR doesn't already store the result after the computation), which will likely also perform better.
Unpacking bools is cheap - to move any bit into a flag is just a single 'test' instruction, which is as good as it gets if you have multiple bools (other than passing each in a separate flag, which is quite undesirable).
Doing the packing in a tree fashion to reduce latency is trivial, and store→load latency isn't free either depending on the microarchitecture (and at the counts where log2(n) latency becomes significant you'll be at the IPC limit anyway). Packing vs store should end up at roughly the same instruction counts too - a store vs an 'or', and the exact same amount of moving between flags and GPRs.
Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2.
Where possible it would of course make sense to pass values in separate registers instead of in one, but when the alternative is spilling to stack, packing is still worthy of consideration.
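The linear-vs-tree packing being discussed can be sketched like so (a toy illustration, not anything the compiler emits; `pack_linear` and `pack_tree` are made-up names):

```rust
// Two ways to pack bools into a u64. The linear fold builds one long
// or-chain of depth n; the pairwise reduction has depth log2(n) instead.
fn pack_linear(bits: &[bool]) -> u64 {
    bits.iter()
        .enumerate()
        .fold(0u64, |acc, (i, &b)| acc | ((b as u64) << i))
}

fn pack_tree(bits: &[bool]) -> u64 {
    let mut words: Vec<u64> = bits
        .iter()
        .enumerate()
        .map(|(i, &b)| (b as u64) << i)
        .collect();
    while words.len() > 1 {
        // each pass ORs neighbors together, halving the partial results
        words = words
            .chunks(2)
            .map(|c| c.iter().fold(0u64, |a, &w| a | w))
            .collect();
    }
    words.first().copied().unwrap_or(0)
}

fn main() {
    let bits: Vec<bool> = (0..64).map(|i| i % 3 == 0).collect();
    assert_eq!(pack_linear(&bits), pack_tree(&bits));
    // unpacking one bool back out is a single shift+mask
    assert_eq!((pack_tree(&bits) >> 3) & 1, 1);
}
```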
Also, most modern processors will easily forward the store to the subsequent read and have a bunch of tricks for tracking the stack state. So how much does putting things in registers help anyway?
Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.
(Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.
More broadly: processor design has been optimised around C style antics for a long time, trying to optimise the code produced away from that could well inhibit processor tricks in such a way that the result is _slower_ than if you stuck with the "looks terrible but is expected & optimised" status quo
The C calling convention kind of sucks. True, can't change the C calling convention, but that doesn't make it any less unfortunate.
We should use every available caller-saved register for arguments and return values, but in the traditional SysV ABI, we use only one register (sometimes two) for return values. If you return a struct Point3D { long x, y, z; }, you spill to the stack even though we could damned well put Point3D in rax, rdi, and rsi.
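The Rust equivalent of that struct, for reference (under SysV, a by-value return larger than 16 bytes goes back through a hidden sret pointer rather than registers):

```rust
// 24 bytes of plain data - three registers' worth - yet SysV returns it
// via a caller-provided stack slot because it exceeds the 16-byte limit.
#[repr(C)]
struct Point3D {
    x: i64,
    y: i64,
    z: i64,
}

fn origin() -> Point3D {
    Point3D { x: 0, y: 0, z: 0 }
}

fn main() {
    assert_eq!(std::mem::size_of::<Point3D>(), 24);
    let p = origin();
    assert_eq!(p.x + p.y + p.z, 0);
}
```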
There are other tricks other systems use. For example, if I recall correctly, in SBCL, functions set the carry flag on exit if they're returning multiple values. Wouldn't it be nice if we used the carry flag to indicate, e.g., whether a Result contains an error.
"sucks" is a strong word but with respect to return values, you're right. The C calling conventions, everywhere really, support what C supports - returning one argument. Well, not even that (struct returns ... nope).
Kind of "who'd have thought" in C I guess. And then there's the C++ argument "just make it inline then".
On the other hand, memory spills happen. For SPARC, for example, the generous register space (windows) ended up with lots of unused regs in simple functions and a cache-busting huge stack size footprint, definitely so if you ever spilled the register ring. Even with all the mov in x86 (and there is always lots of it, at least in compiled C code) to rearrange data to "where it needed to be", it often ended up faster.
When you only look at the callee code (code generated for a given function signature), it's tempting to say "oh it'll definitely be fastest if this arg is here and that return there". You don't know the callers though. There's no guarantee the argument marshalling will end up "pass through" or the returns are "hot" consumed.

Say, a struct Point { x: i32, y: i32, z: i32 } as arg/return; if the caller does something like mystruct.deepinside.point[i] = func(mystruct.deepinside.point[i]) in a loop, then moving it in/out of regs may be overhead or even prevent vectorisation. But the callee cannot know. Unless... the compiler can see both and inline (back to the C++ excuse).

Yes, for function call chaining javascript/rust style it might be nice/useful "in principle". But in practice only if the compiler has enough caller/callee insight to keep the hot working set "passthrough" (no spills).
The lowest hanging fruit on calling is probably to remove the "functions return one primitive thing" that's ingrained in the C ABIs almost everywhere. For the rest ? A lot of benchmarking and code generation statistics. I'd love to see more of that. Even if it's dry stuff.
It's just that on x86-64, GCC and Clang use up to two registers while MSVC only uses one.
Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions that are defined by the various runtime environments (usually the combination of CPU architecture and operating system). C compilers just must adhere to those CPU+OS calling conventions like any other language that wants to interact directly with the operating system.
IMHO the whole performance angle is a bit overblown though, for 'high frequency functions' the compiler should inline the function body anyway. And for situations where that's not possible (e.g. calling into DLLs), the DLL should expose an API that doesn't require such 'high frequency functions' in the first place.
Tangentially related, there's another "unfortunate" detail of Rust that makes some structs bigger than you want them to be. Imagine a struct Foo that contains eight `Option<u8>` fields, ie each field is either `None` or `Some(u8)`. In C, you could represent this as a struct with eight 1-bit `bool`s and eight `uint8_t`s, for a total size of 9 bytes. In Rust however, the struct will be 16 bytes, ie eight sequences of 1-byte discriminant followed by a `uint8_t`.
Why? The reason is that structs must be able to present borrows of their fields, so given a `&Foo` the compiler must allow the construction of a `&Foo::some_field`, which in this case is an `&Option<u8>`. This `&Option<u8>` must obviously look identical to any other `&Option<u8>` in the program. Thus the underlying `Option<u8>` is forced to have the same layout as any other `Option<u8>` in the program, ie its own personal discriminant bit rounded up to a byte followed by its `u8`. The struct pays this price even if the program never actually constructs a `&Foo::some_field`.
This becomes even worse if you consider Options of larger types, like a struct with eight `Option<u16>` fields. Then each personal discriminant will be rounded up to two bytes, for a total size of 32 bytes with a quarter (or almost half, if you include the unused bits of the discriminants) being wasted interstitial padding. The C equivalent would only be 18 bytes. With `Option<u64>`, the Rust struct would be 128 bytes while the C struct would be 72 bytes.
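The sizes described above can be checked directly; these are what current rustc produces on x86-64 (the default repr's layout is not guaranteed by the language):

```rust
use std::mem::size_of;

// Eight independent Option<u8> fields: each one carries its own
// byte-rounded discriminant, so 2 bytes apiece.
#[allow(dead_code)]
struct EightOptU8 {
    a: Option<u8>, b: Option<u8>, c: Option<u8>, d: Option<u8>,
    e: Option<u8>, f: Option<u8>, g: Option<u8>, h: Option<u8>,
}

// With u16 payloads, alignment rounds each discriminant up to two bytes.
#[allow(dead_code)]
struct EightOptU16 {
    a: Option<u16>, b: Option<u16>, c: Option<u16>, d: Option<u16>,
    e: Option<u16>, f: Option<u16>, g: Option<u16>, h: Option<u16>,
}

fn main() {
    assert_eq!(size_of::<Option<u8>>(), 2);
    assert_eq!(size_of::<EightOptU8>(), 16);
    assert_eq!(size_of::<Option<u16>>(), 4);
    assert_eq!(size_of::<EightOptU16>(), 32);
}
```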
You *can* implement the C equivalent manually of course, with a `u8` for the packed discriminants and eight `MaybeUninit<T>`s, and functions that map from `&Foo` to `Option<&T>`, `&mut Foo` to `Option<&mut T>`, etc, but not to `&Option<T>` or `&mut Option<T>`.
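A minimal sketch of that manual layout (the `PackedFoo` name and accessors are made up for illustration; it hands out `Option<&u8>`, never `&Option<u8>`):

```rust
use std::mem::MaybeUninit;

// One packed discriminant byte plus eight payload slots: 9 bytes total,
// matching the C bitfield version described above.
struct PackedFoo {
    present: u8,                // bit i set => field i holds a value
    vals: [MaybeUninit<u8>; 8], // payloads; uninitialized when absent
}

impl PackedFoo {
    fn new() -> Self {
        PackedFoo { present: 0, vals: [MaybeUninit::uninit(); 8] }
    }

    fn set(&mut self, i: usize, v: u8) {
        self.present |= 1 << i;
        self.vals[i] = MaybeUninit::new(v);
    }

    // Can produce Option<&u8>, but a &Option<u8> is impossible by design.
    fn get(&self, i: usize) -> Option<&u8> {
        if self.present & (1 << i) != 0 {
            // SAFETY: bit i is only set after vals[i] was written in set()
            Some(unsafe { self.vals[i].assume_init_ref() })
        } else {
            None
        }
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<PackedFoo>(), 9);
    let mut foo = PackedFoo::new();
    foo.set(3, 42);
    assert_eq!(foo.get(3), Some(&42));
    assert_eq!(foo.get(4), None);
}
```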
You have to implement the C version manually, so it's not that odd you'd need to do the same for Rust?
You've described, basically, a custom type that is 8 Options<u8>s. If you start caring about performance you'll need to roll your own internal Option handling.
There's no "manually" about it. There's only one way to implement it in C, ie eight booleans and eight uint8_ts as I described. Going from there to the further optimization of adding a `:1` to every `bool` field is a simple optimization. Reimplementing `Option` and the bitpacking of the discriminants is much more effort compared to the baseline implementation of using `Option`.
> You can implement the C equivalent manually of course
But you have to implement the C version manually as well.
It's not really a downside to Rust if it provides a convenient feature that you can choose to use if it fits your goals.
The use case you're describing is relatively rare. If it's an actual performance bottleneck then spending a little extra time to implement it in Rust doesn't seem like a big deal. I have a hard time considering this an "unfortunate detail" to Rust when the presence of the Option<_> type provides so much benefit in typical use cases.
> If a non-polymorphic, non-inline function may have its address taken (as a function pointer), either because it is exported out of the crate or the crate takes a function pointer to it, generate a shim that uses -Zcallconv=legacy and immediately tail-calls the real implementation. This is necessary to preserve function pointer equality.
If the legacy shim tail calls the Rust-calling-convention function, won't that prevent it from fixing any return value differences in the calling convention?
Tangentially related: Is it currently possible to have interop between Go and Rust? I remember seeing someone achieve it with Zig in the middle but can't for the life of me find it. Have some legacy Rust code (what??) that I'm hoping to slowly port to Go piece by piece.
Yes, you can use CGO to call Rust functions using extern "C" FFI. I gave a talk about how we use it for GitHub code search at RustConf 2023 (https://www.youtube.com/watch?v=KYdlqhb267c) and afterwards I talked to some other folks (like 1Password) who are doing similar things.
It's not a lot of fun because moving types across the C interop boundary is tedious, but it is possible and allows code reuse.
If you want to call from Go into Rust, you can declare any Rust function as `extern "C"` and then call it the same way you would call C from Go. Not sure about going the other way.
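The Rust side of that is tiny; a minimal shape (the `rust_add` name is just an example):

```rust
// `extern "C"` pins the ABI to the platform C convention and `#[no_mangle]`
// keeps the symbol name stable, so cgo can declare and call it like any
// C function once this crate is built as a staticlib/cdylib.
#[no_mangle]
pub extern "C" fn rust_add(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    // callable from Rust as well, of course
    assert_eq!(rust_add(2, 3), 5);
}
```

On the Go side you'd declare it in the cgo preamble (`// int rust_add(int a, int b);`) and link against the built library.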
It's usually unwise to mix managed and unmanaged memory since the managed code needs to be able to own the memory it's freeing and moving, whereas the unmanaged code needs to reason about when memory is freed or moved. cgo (and other variants) let you mix FFI calls into unmanaged memory from managed code in Go, but you pay a penalty for it.
In language implementations where GC isn't shared by the different languages calling each other you're always going to have this problem. Mixing managed/unmanaged code is both an old idea and actively researched.
It's almost always a terrible idea to call into managed code from unmanaged code unless you're working directly with an embedded runtime that's been designed for it. And when you do, there's usually a serialization layer in between.
> It's usually unwise to mix managed and unmanaged memory
Broadly stated, you can achieve this by marking a managed object as a GC root whenever it's to be referenced by unmanaged code (so that it won't be freed or moved in that case) and adding finalizers whenever managed objects own or hold refcounted references over unmanaged memory (so that the unmanaged code can reason about these objects being freed). But yes, it's a bit fiddly.
Mixing managed and unmanaged code being an issue is simply not true in programming in general.
It may be an issue in Go or Java, but it just isn't in C# or Swift.
Calling `write` in C# on Unix is as easy as the following snippet and has almost no overhead:
    var text = "Hello, World!\n"u8;
    Interop.Write(1, text, text.Length);

    static unsafe partial class Interop
    {
        [LibraryImport("libc", EntryPoint = "write")]
        public static partial void Write(
            nint fd, ReadOnlySpan<byte> buffer, nint length);
    }
In addition, unmanaged->managed calls are also rarely an issue, both via function pointers and plain C exports if you build a binary with NativeAOT:
    public static class Exports
    {
        [UnmanagedCallersOnly(EntryPoint = "sum")]
        public static nint Sum(nint a, nint b) => a + b;
    }
It is indeed true that more complex scenarios may require some form of bespoke embedding/hosting of the runtime, but that is more of a peculiarity of Go and Java, not an actual technical limitation.
I have to use Rust and Swift quite a bit. I basically just landed on sending a byte array of serialized protobufs back and forth with cookie cutter function calls. If this is your full time job I can see how you might think that is lame, but I really got tired of coming back to the code every few weeks and not remembering how to do anything.
If you want a particularly cursed example, I've recently called Go code from Rust via C in the middle, including passing a Rust closure with state into the Go code as a callback into a Go stdlib function, including panic unwinding from inside the Rust closure: https://github.com/Voultapher/sort-research-rs/commit/df6c91....
You have to go through C bindings, but FFI is very far from being Go's strongest suit (if we don't count Cgo), so if that's what interests you, it might be better to explore a different language.
I just spent a bunch of time on inspect element trying to figure out how the section headings are set at an angle and (at least with Safari tools), I’m stumped. So how did he do this?
related, I thought the minimap was using the CSS element() function [0], but it turns out it's actually just a copy of the article shrunk down real small.
In contrast: "How Swift Achieved Dynamic Linking Where Rust Couldn't " (2019) [1]
On the one hand I'm disappointed that Rust still doesn't have a calling convention for Rust-level semantics. On the other hand the above article demonstrates the tremendous amount of work that's required to get there. Apple was deeply motivated to build this as a requirement to make Swift a viable system language that applications could rely on, but Rust does not have that kind of backing.
It's only fair to point out that Swift's approach has runtime costs. It would be good to have more supported options for this tradeoff in Rust, including but not limited to https://github.com/rust-lang/rfcs/pull/3470
Notably these runtime costs only occur if you’re calling into another library. For calls within a given swift library, you don’t incur the runtime costs: size checks are elided (since size is known), calls can be inlined, generics are monomorphized… the costs only happen when you’re calling into code that the compiler can’t see.
Given that the current Rust compiler does aggressive inlining and then optimizes, is this worth the trouble? If the function being called is tiny, it should be inlined. If it's big, you're probably going to spend some time in it and the call overhead is minor.
Runtime functions (e.g. dyn Trait) can’t be inlined, for one, so this would help there. But also if you can make calls cheaper then you don’t have to be so aggressive with inlining, which can help with code size and compile times.
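A quick illustration of the dyn Trait case (toy trait and names, obviously):

```rust
// A call through `dyn Trait` goes through a vtable: the callee isn't
// known at the call site, so it can't be inlined and always pays the
// full calling-convention cost.
trait Speak {
    fn speak(&self) -> &'static str;
}

struct Dog;

impl Speak for Dog {
    fn speak(&self) -> &'static str {
        "woof"
    }
}

fn greet(s: &dyn Speak) -> &'static str {
    s.speak() // indirect call via the vtable
}

fn main() {
    assert_eq!(greet(&Dog), "woof");
}
```

A cheaper Rust-native convention would shave overhead off exactly these calls, where inlining can never rescue you.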
Probably? A complex function that’s not a good fit for inlining will probably access memory a few times and those accesses are likely to be the bottlenecks for the function. Passing on the stack squeezes that bottleneck tighter — more cache pressure, load/stores, etc. If Rust can pass arguments optimally in a decent ratio of function calls, not only is it avoiding the several clocks of L1 access, it’s hopefully letting the CPU get to those essential memory bottlenecks faster. There are probably several percentage points of win here…? But I am drinking wine and not doing the math, so…
Delphi, and I'm sure others, have had[1] this for ages:
When you declare a procedure or function, you can specify a calling convention using one of the directives register, pascal, cdecl, stdcall, safecall, and winapi.
As in your example, cdecl is for calling C code, while stdcall/winapi on Windows for calling Windows APIs.
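Rust's counterpart to those directives is the ABI string on a function declaration; a small sketch (toy function names):

```rust
// "C" selects the platform's C convention; "system" maps to stdcall on
// 32-bit Windows and is identical to "C" everywhere else - roughly the
// cdecl/stdcall pair from the Delphi directives above.
extern "C" fn via_c_abi(x: i32) -> i32 {
    x + 1
}

extern "system" fn via_system_abi(x: i32) -> i32 {
    x * 2
}

fn main() {
    assert_eq!(via_c_abi(1), 2);
    assert_eq!(via_system_abi(2), 4);
}
```

What Rust lacks is precisely a *named, stable* convention of its own to put in that position.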
Simply throw it in as a Cargo.toml flag and sidestep the worry. Yes, you do sometimes have to debug release code - but there you can use the not-quite-perfect debugging that the author mentions.
Also, why aren't we size-sorting fields already? That seems like an easy optimization, and can be turned off with a repr.
Did you want alignment sorting? In general the problem with things like that is the ideal layout is usually architecture and application specific - if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.
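Worth noting that rustc's default repr is already free to reorder fields, and does; `repr(C)` is the opt-out that pins declaration order (sizes below are what current rustc produces on x86-64):

```rust
use std::mem::size_of;

// Default repr: the compiler may sort fields to minimize padding.
#[allow(dead_code)]
struct Auto {
    a: u8,
    b: u64,
    c: u8,
}

// repr(C): declaration order is kept, padding and all.
#[allow(dead_code)]
#[repr(C)]
struct Fixed {
    a: u8,
    b: u64,
    c: u8,
}

fn main() {
    assert_eq!(size_of::<Auto>(), 16);  // reordered: u64 first, then the u8s
    assert_eq!(size_of::<Fixed>(), 24); // 7 padding bytes after each u8
}
```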
Yep. It will probably improve (to be measured) the 80%. Less memory means less bandwidth usage etc.
> if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.
I did suggest having a repr for situations like yours. Something like #[repr(yeet)]. Optimizing for false sharing etc. is probably well within 5% of code that exists today, and is usually wrapped up in a library that presents a specific data structure.
Very interesting but pretty quickly went over my head. I have a question that is slightly related to SIMD and LLVM.
Can someone explain simply where does MLIR fit into all of this? Does it standardize more advanced operations across programming languages - such as linear algebra and convolutions?
Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).
> Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).
Are people getting paid to repeat this ad nauseum?
This post is rather unrelated. The linalg dialect can be lowered to LLVM IR, SPIR-V, or you could write your own pass to lower it to e.g. your custom chip.
> Can someone explain simply where does MLIR fit into all of this?
It doesn't.
MLIR is a design for a family of intermediate languages (called 'dialects') that allow you to progressively lower high-level languages into low-level code.
True, creative, but usually in a quality-degrading way like here (slanted text is harder to read, also due to the underline being too thick, and takes more space) or like with those poorly legible bg/fg color combinations.
The main thing you want to do when optimizing the calling convention is measure its perf, not ruminate about what you think is good. Code performs well if it runs fast, not if it looks like it will.
Sometimes, what the author calls bad code is actually the fastest thing you can do for totally not obvious reasons. The only way to find out is to measure the performance on some large benchmark.
One reason why sometimes bad looking calling conventions perform well is just that they conserve argument registers, which makes the register allocator’s life a tad easier.
Another reason is that the CPUs of today are optimized on traces of instructions generated by C compilers. If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow.
Another reason is that inlining is so successful, so calls are a kind of unusual boundary on the hot path. It’s fine to have some jank on that boundary if it makes other things simpler.
Not saying that the changes done here are bad, but I am saying that it’s weird to just talk about what looks like weird code without measuring.
(Source: I optimized calling conventions for a living when I worked on JavaScriptCore. I also optimized other things too but calling conventions are quite dear to my heart. It was surprising how often bad-looking pass-on-the-stack code won on big, real code. Weird but true.)
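For what it's worth, the kind of measurement being described can be sketched in a few lines of Rust (the names, the struct, and the 10M iteration count are all made up for illustration — a real comparison would use a large benchmark and inspect the generated assembly): keep the calls alive with `#[inline(never)]` and time both ABIs.

```rust
use std::hint::black_box;
use std::time::Instant;

// Big enough that registers-vs-stack can actually differ between ABIs.
#[repr(C)]
#[derive(Clone, Copy)]
pub struct Big { pub a: u64, pub b: u64, pub c: u64, pub d: u64 }

// #[inline(never)] keeps the calls (and thus the calling convention) alive,
// so we measure call overhead instead of inlined arithmetic.
#[inline(never)]
pub fn rust_abi(x: Big) -> u64 { x.a + x.b + x.c + x.d }

#[inline(never)]
pub extern "C" fn c_abi(x: Big) -> u64 { x.a + x.b + x.c + x.d }

fn main() {
    let x = Big { a: 1, b: 2, c: 3, d: 4 };
    let mut sum = 0u64;

    let t = Instant::now();
    for _ in 0..10_000_000 {
        // black_box stops LLVM from hoisting the pure call out of the loop.
        sum = sum.wrapping_add(rust_abi(black_box(x)));
    }
    println!("Rust ABI: {:?}", t.elapsed());

    let t = Instant::now();
    for _ in 0..10_000_000 {
        sum = sum.wrapping_add(c_abi(black_box(x)));
    }
    println!("C ABI:    {:?} (checksum {})", t.elapsed(), sum);
}
```

Microbenchmarks like this mostly show call overhead in isolation; the parent's point stands that only a large, real benchmark tells you how a convention behaves under register pressure and cache pressure.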
I very much agree with that, especially since - like you said - code that looks like it will perform well doesn't always do so.
That being said I'd like to add that in my opinion performance measurement results should not be the only guiding principle.
You said it yourself: "Another reason is that the CPUs of today are optimized [..]"
The important word is "today". CPUs have evolved and still do, and a calling convention should be designed for the long term.
Sadly, it means that it is beneficial to not deviate too much from what C++ does [1], because it is likely that future processor optimizations will be targeted in that direction.
Apart from that it might be worthwhile to consider general principles that are not likely to change (e.g. conserve argument registers, as you mentioned), to make the calling convention robust and future proof.
[1] It feels a bit strange, when I say that because I think Rust has become a bit too conservative in recent years, when it comes to its weirdness budget (https://steveklabnik.com/writing/the-language-strangeness-bu...). You cannot be better without being different, after all.
The Rust calling convention is actually defined as unstable, so 1.79 is allowed to have a different calling convention than 1.80 and so on. I don't think designing one for the long term is a real concern right now.
6 replies →
> means that it is beneficial to not deviate too much from what C++ does
Or just C.
Reminds me when I looked up SIMD instructions for searching string views. It was more performant to slap a '\0' on the end and use null-terminated string instructions than to use string view search functions.
1 reply →
> and a calling convention should be designed for the long term
...isn't the article just about Rust code calling Rust code? That's a much more flexible situation than calling into operating system functions or into other languages. For calling within the same language, a stable ABI is by far not as important as on the 'ecosystem boundaries', and might actually be harmful (see the related drama in the C++ world).
1 reply →
Yep. Also, whether passing in registers is faster or not depends on the function body. It doesn't make much sense if the first thing the function does is take the address of the parameter and pass it to some opaque function. Then it needs to be spilled onto the stack anyway.
It would be interesting to see calling convention optimizations based on function body. I think that would be safe for static functions in C, as long as their address is not taken.
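The parent's spill scenario can be sketched like this (a hypothetical example; `black_box` stands in for the opaque callee): once the parameter's address escapes, a register argument has to get a real stack slot no matter which convention delivered it.

```rust
use std::hint::black_box;

// `x` arrives in a register under typical conventions, but taking its
// address and letting it escape forces the compiler to give it a stack
// slot anyway -- the register-passing "win" evaporates.
#[inline(never)]
pub fn takes_address(x: u64) -> u64 {
    // black_box plays the role of the opaque function: the compiler must
    // assume the pointer is read through, so &x needs real memory.
    let p: *const u64 = black_box(&x);
    unsafe { *p }
}

fn main() {
    assert_eq!(takes_address(7), 7);
}
```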
Dynamic calling conventions also won't work with dynamic linking
1 reply →
Your experience is not perfectly transferable. JITs have it easy on this because they've already gathered a wealth of information about the actually-executing-on CPU by the time they generate a single line of assembly. Calls appear on the hot path more often in purely statically compiled code because things like the runtime architectural feature set are not known, so you often reach inlining barriers precisely in the code that you would most like to optimize.
LLVM inlines even more than my JIT does.
The JIT has both more and less information.
It has more information about the program globally. There is no linking or “extern” boundary.
But whereas the AOT compiler can often prove that it knows about all of the calls to a function that could ever happen, the JIT only knows about those that happened in the past. This makes it hard (and sometimes impossible) for the JIT to do the call graph analysis style of inlining that llvm does.
One great example of something I wish my jit had but might never be able to practically do, but that llvm eats for breakfast: “if A calls B in one place and nothing else calls B, then inline B no matter how large it is”.
(I also work on ahead of time compilers, though my impact there hasn’t been so big that I brag about it as much.)
1 reply →
The people who write JITs also write a bunch of C++ that gets statically compiled.
And remember that performance can include binary size, not just runtime speed. Current Rust seems to suffer in that regard for small platforms, calling convention could possibly help there wrt Result returns.
The current calling convention is terrible for small platforms, especially when using Result<> in return position. For large enums, the compiler should put the discriminant in a register and the large variants on the stack. As is, you pay a significant code size penalty for idiomatic rust error handling.
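A sketch of the kind of type being described (the alias name is made up, and exact layouts for the default repr are not guaranteed — only the lower bound is): an Ok payload with no spare bit patterns forces a separate discriminant, and at this size the whole value is returned through memory rather than registers.

```rust
use std::mem::size_of;

// Every bit pattern of [u8; 64] is valid, so no niche is available:
// the enum needs a discriminant on top of the 64-byte payload.
type BigResult = Result<[u8; 64], u8>;

fn main() {
    // At least 65 bytes -- far too big for register return, so current
    // codegen passes the whole thing via a hidden out-pointer.
    assert!(size_of::<BigResult>() > size_of::<[u8; 64]>());
    println!("Result<[u8; 64], u8>: {} bytes", size_of::<BigResult>());
}
```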
2 replies →
Also a thing you gotta measure.
Passing a lot of stuff in registers causes a register shuffle at call sites and prologues. Hard to predict if that’s better or worse than stack spills without measuring.
"If you generate code that looks like what the C compiler would do - which passes on the stack surprisingly often, especially if you’re MSVC - then you hit the CPU’s sweet spot somehow."
The FA is mostly about x86 and Intel indeed did an amazing amount of clever engineering over decades to allow your ugly x86 code to run fast on their silicon that you buy.
Still, does your point about the empirical benefit of passing on the stack continue to apply with a transition to register rich ARMV8 CPUs or RISC-V?
ARM follows its own calling convention, which by default uses registers for both argument and return value passing [1], so these lessons likely do not apply.
[1] https://developer.arm.com/documentation/dui0041/c/ARM-Proced...
Yes.
If you flatten big structs into registers to pass them you have a bad time on armv8.
I tried. That was an llvm experiment. Ahead of time compiler for a modified version of C.
If you want fast, then you probably need to have a different calling convention per call.
Reasonable sketch. This is missing the caller/callee-save distinction and makes the usual error of assigning a subset of the input registers to output.
It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.
Changing ABI with optimization setting interacts really badly with separate compilation.
Shuffling arguments around in bin packing fashion does work but introduces a lot of complexity in the compiler, not sure it's worth it relative to left to right first fit. It also makes it difficult for the developer to predict where arguments will end up.
The general plan of having different calling conventions for addresses that escape than for those that don't is sound. Peeling off a prologue that does the impedance matching works well.
Rust probably should be willing to have a different calling convention to C, though I'm not sure it should be a hardcoded one that every function uses. Seems an obvious thing to embed in the type system to me and allowing developer control over calling convention removes one of the performance advantages of assembly.
> This is missing the caller/callee-save distinction and makes the usual error of assigning a subset of the input registers to output.
Out of curiosity, what's so problematic about using some input registers as output registers? On the caller's side, you'd want to vacate the output registers between any two function calls regardless. And it occurs pretty widely in syscall conventions, to my binary-golfing detriment.
Is it for the ease of the callee, so that it can set up the output values while keeping the input values in place? That would suggest trying to avoid overlap (by placing the output registers at the end of the input sequence), but I don't see how it would totally contraindicate any overlap.
You should use all the input registers as output registers, unless your arch is doing some sliding window thing. The x64 proposal linked uses six to pass arguments in and three to return results. So returning six integers means three in registers, three on the stack, with three registers that were free to use containing nothing in particular.
10 replies →
Allowing developer control over calling conventions also means disallowing optimization in the case where Function A calls Function B calls Function C calls Function D etc., but along the way one or more of those functions could have their arguments swapped around to a different convention to reduce overhead. What semantics would preserve such an optimization but allow control? Would it just be illusory?
And in practice assembly has the performance disadvantage of not being subject to most compiler optimizations, often including "introspecting on its operation, determining it is fully redundant, and eliminating it entirely". It's not the 1990s anymore.
In the cases where that kind of optimization is not even possible to consider, though, the only place I'd expect inline assembly to be decisively beaten is using profile-guided optimization. That's the only way to extract more information than "perfect awareness of how the application code works", which the app dev has and the compiler dev does not. The call overhead can be eliminated by simply writing more assembly until you've covered the relevant hot boundaries.
If those functions are external you've lost that optimisation anyway. If they're not, the compiler chooses whether to ignore your annotation or not as usual. As is always the answer, the compiler doesn't get to make observable changes (unless you ask it to, fwrong-math style).
I'd like to specify things like extra live out registers, reduced clobber lists, pass everything on the stack - but on the function declaration or implementation, not having to special case it in the compiler itself.
Sufficiently smart programmers beat ahead of time compilers. Sufficiently smart ahead of time compilers beat programmers. If they're both sufficiently smart you get a common fix point. I claim that holds for a jit too, but note that it's just far more common for a compiler to rewrite the code at runtime than for a programmer to do so.
I'd say that assembly programmers are rather likely to cut out parts of the program that are redundant, and they do so with domain knowledge and guesswork that is difficult to encode in the compiler. Both sides are prone to error, with the classes of error somewhat overlapping.
I think compilers could be a lot better at codegen than they presently are, but the whole "programmers can't beat gcc anymore" idea isn't desperately true even with the current state of the art.
Mostly though I want control over calling conventions in the language instead of in compiler magic because it scales much better than teaching the compiler about properties of known functions. E.g. if I've written memcpy in asm, it shouldn't be stuck with the C caller save list, and avoiding that shouldn't involve a special case branch in the compiler backend.
> It's optimistic about debuggers understanding non-C-like calling conventions which I'd expect to be an abject failure, regardless of what dwarf might be able to encode.
DWARF doesn't encode bespoke calling conventions at all today.
The bin packing will probably make it slower though, especially in the bool case, since it will create dependency chains. For bools on x64, I don't think there's a better way than first having to get them in a register, shift them, and then OR them into the result. The simple way creates a dependency chain of length 64 (which should also incur a 64-cycle penalty) but you might be able to do 6 (more like 12 realistically) cycles. But then again, where do these 64 bools come from? There aren't that many registers, so you will have to reload them from the stack. Maybe the Rust ABI already packs bools in structs this tightly, so it's work that has to be done anyway, but I don't know too much about it.
And then the caller will have to unpack everything again. It might be easier to just teach the compiler to spill values into the result space on the stack (in cases where the IR doesn't already store the result after the computation), which will likely also perform better.
Unpacking bools is cheap - to move any bit into a flag is just a single 'test' instruction, which is as good as it gets if you have multiple bools (other than passing each in a separate flag, which is quite undesirable).
Doing the packing in a tree fashion to reduce latency is trivial, and store→load latency isn't free either depending on the microarchitecture (and at the counts where log2(n) latency becomes significant you'll be at IPC limit anyway). Packing vs store should end up at roughly the same instruction counts too - a store vs an 'or', and the exact same amount of moving between flags and GPRs.
Reaching 64 bools might be a bit crazy, but 4-8 seems reasonably attainable from each of many arguments being an Option<T>, where the packing would reduce needed register/stack slot count by ~2.
Where possible it would of course make sense to pass values in separate registers instead of in one, but when the alternative is spilling to stack, packing is still worthy of consideration.
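The latency difference the two comments are debating, in sketch form (a toy 8-bool version with made-up names; nothing here is from the proposed ABI): the naive fold is a serial chain of ORs, while pairing first cuts the dependency depth to log2(n).

```rust
// Naive fold: each OR depends on the previous result -- a chain of depth 8.
pub fn pack_linear(bits: [bool; 8]) -> u8 {
    let mut r = 0u8;
    for (i, b) in bits.iter().enumerate() {
        r |= (*b as u8) << i;
    }
    r
}

// Tree reduction: combine pairs, then pairs of pairs -- depth 3.
pub fn pack_tree(bits: [bool; 8]) -> u8 {
    let b: [u8; 8] = core::array::from_fn(|i| (bits[i] as u8) << i);
    let (p0, p1, p2, p3) = (b[0] | b[1], b[2] | b[3], b[4] | b[5], b[6] | b[7]);
    (p0 | p1) | (p2 | p3)
}

// Unpacking is cheap either way: one shift-and-mask (or a single `test`
// against the right bit) per bool.
pub fn unpack(packed: u8, i: usize) -> bool {
    packed & (1 << i) != 0
}

fn main() {
    let bits = [true, false, true, false, true, false, true, false];
    assert_eq!(pack_linear(bits), 0b0101_0101);
    assert_eq!(pack_tree(bits), pack_linear(bits));
    assert!(unpack(pack_tree(bits), 2) && !unpack(pack_tree(bits), 1));
}
```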
4 replies →
Also, most modern processors will easily forward the store to the subsequent read and have a bunch of tricks for tracking the stack state. So how much does putting things in registers help anyway?
Forwarding isn't unlimited, though, as I understand it. The CPU has limited-size queues and buffers through which reordering, forwarding, etc. can happen. So I wouldn't be surprised if using registers well takes pressure off of that machinery and ensures that it works as you expect for the data that isn't in registers.
(Looked around randomly to find example data for this) https://chipsandcheese.com/2022/11/08/amds-zen-4-part-2-memo... claims that Zen 4's store queue only holds 64 entries, for example, and a 512-bit register store eats up two. I can imagine how an algorithm could fill that queue up by juggling enough data.
2 replies →
More broadly: processor design has been optimised around C style antics for a long time, trying to optimise the code produced away from that could well inhibit processor tricks in such a way that the result is _slower_ than if you stuck with the "looks terrible but is expected & optimised" status quo
1 reply →
Register allocation decisions routinely result in multi-percent performance changes, so yes, it does.
Also, registers help the MachineInstr-level optimization passes in LLVM, of which there are quite a few.
The C calling convention kind of sucks. True, can't change the C calling convention, but that doesn't make it any less unfortunate.
We should use every available caller-saved register for arguments and return values, but in the traditional SysV ABI, we use only one register (sometimes two) for return values. If you return a struct Point3D { long x, y, z }, you spill to the stack even though we could damned well put Point3D in rax, rdi, and rsi.
There are other tricks other systems use. For example, if I recall correctly, in SBCL, functions set the carry flag on exit if they're returning multiple values. Wouldn't it be nice if we used the carry flag to indicate, e.g., whether a Result contains an error.
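The struct from the comment, for anyone who wants to check the codegen (a minimal sketch; `midpoint` is a made-up function): at 24 bytes it exceeds the SysV two-register return limit, so it comes back via a hidden pointer (`sret`) that the caller passes in `rdi`.

```rust
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Point3D { pub x: i64, pub y: i64, pub z: i64 }

// #[inline(never)] keeps the call in the generated code so the memory
// return (sret) is visible in the disassembly.
#[inline(never)]
pub extern "C" fn midpoint(a: Point3D, b: Point3D) -> Point3D {
    Point3D {
        x: (a.x + b.x) / 2,
        y: (a.y + b.y) / 2,
        z: (a.z + b.z) / 2,
    }
}

fn main() {
    let m = midpoint(Point3D { x: 0, y: 0, z: 0 }, Point3D { x: 2, y: 4, z: 6 });
    assert_eq!(m, Point3D { x: 1, y: 2, z: 3 });
}
```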
"sucks" is a strong word but with respect to return values, you're right. The C calling conventions, everywhere really, support what C supports - returning one argument. Well, not even that (struct returns ... nope). Kind of "who'd have thought" in C I guess. And then there's the C++ argument "just make it inline then".
On the other hand, memory spills happen. For SPARC, for example, the generous register space (register windows) ended up with lots of unused regs in simple functions and a cache-busting, huge stack-size footprint, definitely so if you ever spilled the register ring. Even with all the mov in x86 (and there is always lots of it, at least in compiled C code) to rearrange data to "where it needed to be", it often ended up faster.
When you only look at the callee code (code generated for a given function signature), it's tempting to say "oh it'll definitely be fastest if this arg is here and that return there". You don't know the callers though. There's no guarantee the argument marshalling will end up "pass through" or the returns are "hot" consumed. Say, a struct Point { x: i32, y: i32, z: i32 } as arg/return; if the caller does something like mystruct.deepinside.point[i] = func(mystruct.deepinside.point[i]) in a loop then moving it in/out of regs may be overhead or even prevent vectorisation. But the callee cannot know. Unless... the compiler can see both and inline (back to the C++ excuse). Yes, for function call chaining javascript/rust style it might be nice/useful "in principle". But in practice only if the compiler has enough caller/callee insight to keep the hot working set "passthrough" (no spills).
The lowest-hanging fruit on calling is probably to remove the "functions return one primitive thing" assumption that's ingrained in the C ABIs almost everywhere. For the rest? A lot of benchmarking and code generation statistics. I'd love to see more of that. Even if it's dry stuff.
> Well, not even that (struct returns ... nope).
C compilers actually pack small struct return values into registers:
https://godbolt.org/z/51q5se86s
It's just limited: on x86-64, GCC and Clang use up to two registers, while MSVC only uses one.
Also, IMHO there is no such thing as a "C calling convention", there are many different calling conventions that are defined by the various runtime environments (usually the combination of CPU architecture and operating system). C compilers just must adhere to those CPU+OS calling conventions like any other language that wants to interact directly with the operating system.
IMHO the whole performance angle is a bit overblown though, for 'high frequency functions' the compiler should inline the function body anyway. And for situations where that's not possible (e.g. calling into DLLs), the DLL should expose an API that doesn't require such 'high frequency functions' in the first place.
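The cut-off is visible from Rust too (a sketch; `swap` is a made-up function): under the SysV x86-64 ABI, a `#[repr(C)]` aggregate of at most 16 bytes is returned in the `rax:rdx` register pair, with no stack traffic.

```rust
// 16 bytes: fits the SysV x86-64 two-register return (rax:rdx).
#[repr(C)]
#[derive(Clone, Copy, PartialEq, Debug)]
pub struct Pair { pub a: u64, pub b: u64 }

// #[inline(never)] keeps the call visible so the register return can be
// observed in the disassembly (e.g. on godbolt).
#[inline(never)]
pub extern "C" fn swap(p: Pair) -> Pair {
    Pair { a: p.b, b: p.a }
}

fn main() {
    assert_eq!(swap(Pair { a: 1, b: 2 }), Pair { a: 2, b: 1 });
}
```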
2 replies →
Tangentially related, there's another "unfortunate" detail of Rust that makes some structs bigger than you want them to be. Imagine a struct Foo that contains eight `Option<u8>` fields, ie each field is either `None` or `Some(u8)`. In C, you could represent this as a struct with eight 1-bit `bool`s and eight `uint8_t`s, for a total size of 9 bytes. In Rust however, the struct will be 16 bytes, ie eight sequences of 1-byte discriminant followed by a `uint8_t`.
Why? The reason is that structs must be able to present borrows of their fields, so given a `&Foo` the compiler must allow the construction of a `&Foo::some_field`, which in this case is an `&Option<u8>`. This `&Option<u8>` must obviously look identical to any other `&Option<u8>` in the program. Thus the underlying `Option<u8>` is forced to have the same layout as any other `Option<u8>` in the program, ie its own personal discriminant bit rounded up to a byte followed by its `u8`. The struct pays this price even if the program never actually constructs a `&Foo::some_field`.
This becomes even worse if you consider Options of larger types, like a struct with eight `Option<u16>` fields. Then each personal discriminant will be rounded up to two bytes, for a total size of 32 bytes with a quarter (or almost half, if you include the unused bits of the discriminants) being wasted interstitial padding. The C equivalent would only be 18 bytes. With `Option<u64>`, the Rust struct would be 128 bytes while the C struct would be 72 bytes.
You *can* implement the C equivalent manually of course, with a `u8` for the packed discriminants and eight `MaybeUninit<T>`s, and functions that map from `&Foo` to `Option<&T>`, `&mut Foo` to `Option<&mut T>`, etc, but not to `&Option<T>` or `&mut Option<T>`.
https://play.rust-lang.org/?version=stable&mode=debug&editio...
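For concreteness, a sketch of the manual version described above (`PackedOptions` and its methods are made-up names): one mask byte for the eight discriminants plus eight payload bytes, with accessors that hand out `Option<u8>` by value rather than `&Option<u8>`.

```rust
use std::mem::MaybeUninit;

// 1 mask byte + 8 payload bytes = 9 bytes (align 1, so no padding),
// versus 16 bytes for eight Option<u8> fields.
pub struct PackedOptions {
    mask: u8,                    // bit i set  <=>  slot i is Some
    vals: [MaybeUninit<u8>; 8],
}

impl PackedOptions {
    pub fn new() -> Self {
        PackedOptions { mask: 0, vals: [MaybeUninit::uninit(); 8] }
    }

    pub fn set(&mut self, i: usize, v: u8) {
        self.vals[i] = MaybeUninit::new(v);
        self.mask |= 1 << i;
    }

    // Note the return type: Option<u8> by value, not &Option<u8> --
    // exactly the borrow the packed layout can no longer offer.
    pub fn get(&self, i: usize) -> Option<u8> {
        if self.mask & (1 << i) != 0 {
            // Sound: the mask bit proves this slot was initialized.
            Some(unsafe { self.vals[i].assume_init() })
        } else {
            None
        }
    }
}

fn main() {
    assert_eq!(std::mem::size_of::<PackedOptions>(), 9);
    let mut p = PackedOptions::new();
    p.set(3, 42);
    assert_eq!(p.get(3), Some(42));
    assert_eq!(p.get(4), None);
}
```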
You have to implement the C version manually, so it's not that odd you'd need to do the same for Rust?
You've described, basically, a custom type that is 8 Options<u8>s. If you start caring about performance you'll need to roll your own internal Option handling.
>You have to implement the C version manually
There's no "manually" about it. There's only one way to implement it in C, ie eight booleans and eight uint8_ts as I described. Going from there to the further optimization of adding a `:1` to every `bool` field is a simple optimization. Reimplementing `Option` and the bitpacking of the discriminants is much more effort compared to the baseline implementation of using `Option`.
2 replies →
> You can implement the C equivalent manually of course
But you have to implement the C version manually as well.
It's not really a downside to Rust if it provides a convenient feature that you can choose to use if it fits your goals.
The use case you're describing is relatively rare. If it's an actual performance bottleneck then spending a little extra time to implement it in Rust doesn't seem like a big deal. I have a hard time considering this an "unfortunate detail" to Rust when the presence of the Option<_> type provides so much benefit in typical use cases.
I answered this in the other subthread already.
> If a non-polymorphic, non-inline function may have its address taken (as a function pointer), either because it is exported out of the crate or the crate takes a function pointer to it, generate a shim that uses -Zcallconv=legacy and immediately tail-calls the real implementation. This is necessary to preserve function pointer equality.
If the legacy shim tail calls the Rust-calling-convention function, won't that prevent it from fixing any return value differences in the calling convention?
Yes. People tend to forget about the return half of the calling convention though, so it's an understandable oversight.
Tangentially related: Is it currently possible to have interop between Go and Rust? I remember seeing someone achieving it with Zig in the middle but can't for the life of me find it. Have some legacy Rust code (what??) that I'm hoping to slowly port to Go piece by piece.
Yes, you can use CGO to call Rust functions using extern "C" FFI. I gave a talk about how we use it for GitHub code search at RustConf 2023 (https://www.youtube.com/watch?v=KYdlqhb267c) and afterwards I talked to some other folks (like 1Password) who are doing similar things.
It's not a lot of fun because moving types across the C interop boundary is tedious, but it is possible and allows code reuse.
If you want to call from Go into Rust, you can declare any Rust function as `extern "C"` and then call it the same way you would call C from Go. Not sure about going the other way.
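The Rust side of that bridge is small (a sketch; `rs_add` and the crate setup are made up): build with `crate-type = ["staticlib"]` (or `"cdylib"`) and export the function with the C ABI and an unmangled name.

```rust
// Exported with the C ABI and an unmangled symbol name so cgo can find it.
#[no_mangle]
pub extern "C" fn rs_add(a: i32, b: i32) -> i32 {
    a + b
}

fn main() {
    // Callable from Rust too, which is handy for testing the FFI surface.
    assert_eq!(rs_add(2, 3), 5);
}
```

On the Go side you would then declare `int rs_add(int a, int b);` in the cgo preamble, link the library via `#cgo LDFLAGS`, and call it as `C.rs_add(2, 3)`.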
It's usually unwise to mix managed and unmanaged memory, since the managed code needs to be able to own the memory it's freeing and moving, whereas the unmanaged code needs to reason about when memory is freed or moved. cgo (and other variants) lets you mix FFI calls into unmanaged memory from managed code in Go, but you pay a penalty for it.
In language implementations where GC isn't shared by the different languages calling each other you're always going to have this problem. Mixing managed/unmanaged code is both an old idea and actively researched.
It's almost always a terrible idea to call into managed code from unmanaged code unless you're working directly with an embedded runtime that's been designed for it. And when you do, there's usually a serialization layer in between.
> It's usually unwise to mix managed and unmanaged memory
Broadly stated, you can achieve this by marking a managed object as a GC root whenever it's to be referenced by unmanaged code (so that it won't be freed or moved in that case) and adding finalizers whenever managed objects own or hold refcounted references over unmanaged memory (so that the unmanaged code can reason about these objects being freed). But yes, it's a bit fiddly.
Mixing managed and unmanaged code being an issue is simply not true in programming in general.
It may be an issue in Go or Java, but it just isn't in C# or Swift.
Calling `write` in C# on Unix takes only a few lines of P/Invoke and has almost no overhead.
In addition, unmanaged->managed calls are also rarely an issue, both via function pointers and plain C exports if you build a binary with NativeAOT.
It is indeed true that more complex scenarios may require some form of bespoke embedding/hosting of the runtime, but that is more of a peculiarity of Go and Java, not an actual technical limitation.
15 replies →
I have to use Rust and Swift quite a bit. I basically just landed on sending a byte array of serialized protobufs back and forth with cookie cutter function calls. If this is your full time job I can see how you might think that is lame, but I really got tired of coming back to the code every few weeks and not remembering how to do anything.
If you want a particularity cursed example, I've recently called Go code from Rust via C in the middle, including passing a Rust closure with state into the Go code as callback into a Go stdlib function, including panic unwinding from inside the Rust closure https://github.com/Voultapher/sort-research-rs/commit/df6c91....
You have to go through C bindings, but FFI is very far from being Go's strongest suit (if we don't count Cgo), so if that's what interests you, it might be better to explore a different language.
I just spent a bunch of time on inspect element trying to figure out how the section headings are set at an angle and (at least with Safari tools), I’m stumped. So how did he do this?
The style is on the `.post-title` element: `transform: skewY(-2deg) translate(-1rem, -0.4rem);`
related, I thought the minimap was using the CSS element() function [0], but it turns out it's actually just a copy of the article shrunk down real small.
[0] https://developer.mozilla.org/en-US/docs/Web/CSS/element
h1, h2, h3, h4, h5, h6 {
  transform: skewY(-2deg) translate(-1rem, 0rem);
  transform-origin: top;
  font-style: italic;
  text-decoration-line: underline;
  text-decoration-color: goldenrod;
  text-underline-offset: 4%;
  text-decoration-thickness: .25ex;
}
In contrast: "How Swift Achieved Dynamic Linking Where Rust Couldn't" (2019) [1]
On the one hand I'm disappointed that Rust still doesn't have a calling convention for Rust-level semantics. On the other hand the above article demonstrates the tremendous amount of work that's required to get there. Apple was deeply motivated to build this as a requirement to make Swift a viable system language that applications could rely on, but Rust does not have that kind of backing.
[1] https://news.ycombinator.com/item?id=21488415
It's only fair to point out that Swift's approach has runtime costs. It would be good to have more supported options for this tradeoff in Rust, including but not limited to https://github.com/rust-lang/rfcs/pull/3470
Notably these runtime costs only occur if you’re calling into another library. For calls within a given swift library, you don’t incur the runtime costs: size checks are elided (since size is known), calls can be inlined, generics are monomorphized… the costs only happen when you’re calling into code that the compiler can’t see.
Given that the current Rust compiler does aggressive inlining and then optimizes, is this worth the trouble? If the function being called is tiny, it should be inlined. If it's big, you're probably going to spend some time in it and the call overhead is minor.
Runtime-dispatched functions (e.g. dyn Trait calls) can't be inlined, for one, so this would help there. But also, if you can make calls cheaper, then you don't have to be so aggressive with inlining, which can help with code size and compile times.
Probably? A complex function that’s not a good fit for inlining will probably access memory a few times and those accesses are likely to be the bottlenecks for the function. Passing on the stack squeezes that bottleneck tighter — more cache pressure, load/stores, etc. If Rust can pass arguments optimally in a decent ratio of function calls, not only is it avoiding the several clocks of L1 access, it’s hopefully letting the CPU get to those essential memory bottlenecks faster. There are probably several percentage points of win here…? But I am drinking wine and not doing the math, so…
Can someone explain the “Diana’s silk dress cost $89” mnemonic on x86 reference?
https://csappbook.blogspot.com/2015/08/dianes-silk-dress-cos...
Hah thanks!
Suppose I should have known that one, but never did meaningful assembly programming beyond 6502 and ARM32.
There was an interesting approach to this in an experimental language some time ago.
The first one was mostly for interacting with C code, and the compiler knew how to call each function.
Delphi, and I'm sure others, have had[1] this for ages:
When you declare a procedure or function, you can specify a calling convention using one of the directives register, pascal, cdecl, stdcall, safecall, and winapi.
As in your example, cdecl is for calling C code, while stdcall/winapi on Windows for calling Windows APIs.
[1]: https://docwiki.embarcadero.com/RADStudio/Sydney/en/Procedur...
Since the Turbo Pascal days actually.
3 replies →
C for example does this, albeit in compiler extensions, and with a longer tag than #.
Terrible taste. Why would you hide such an infrequently used feature behind a single character? In this case you should absolutely use a keyword.
Is it similar to Zig’s callconv keyword?
Guess so. I'm unfamiliar with Zig. The point is that it's not an "all or nothing" strategy for a compilation unit.
Debugger writers may not be happy, but maybe lldb supports all conventions supported by llvm.
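Rust already has the per-function half of this, for the stable conventions at least: `extern "<abi>"` is part of individual `fn` items and function-pointer types, not a per-compilation-unit switch. A minimal sketch (names made up):

```rust
// The ABI string goes on the item itself...
pub extern "C" fn via_c_abi(x: u32) -> u32 {
    x.wrapping_add(1)
}

// ...and is part of the function-pointer type, so mixing conventions up
// is a type error rather than a silent miscompile.
pub type CCallback = extern "C" fn(u32) -> u32;

fn main() {
    let cb: CCallback = via_c_abi;
    assert_eq!(cb(41), 42);
}
```

What it doesn't offer is the parent's wish list — extra live-out registers or reduced clobber lists — since the ABI string only selects from a fixed menu of conventions.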
1 reply →
> Debuggers
Simply throw it in as a Cargo.toml flag and sidestep the worry. Yes, you do sometimes have to debug release code - but there you can use the not-quite-perfect debugging that the author mentions.
Also, why aren't we size-sorting fields already? That seems like an easy optimization, and can be turned off with a repr.
> Also, why aren't we size-sorting fields already?
We are for struct/enum fields. https://camlorn.net/posts/April%202017/rust-struct-field-reo...
There's even an unstable flag to help catch incorrect assumptions about struct layout. https://github.com/rust-lang/compiler-team/issues/457
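The reordering is easy to observe (the struct names are made up, and sizes for the default repr are not guaranteed — the `repr(C)` size below assumes a 64-bit target where `u64` has alignment 8):

```rust
use std::mem::size_of;

// Default repr: the compiler is free to sort fields to minimize padding.
#[allow(dead_code)]
struct Reordered { a: u8, b: u64, c: u8 }

// repr(C): declaration order is kept, forcing padding after `a` and `c`.
#[allow(dead_code)]
#[repr(C)]
struct Declared { a: u8, b: u64, c: u8 }

fn main() {
    // 16 vs 24 bytes with current rustc on x86-64 / aarch64.
    println!("default: {}, repr(C): {}",
             size_of::<Reordered>(), size_of::<Declared>());
    assert!(size_of::<Reordered>() <= size_of::<Declared>());
}
```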
Oh wow, awesome! I was somewhat considering this being my first contribution - glad someone already tackled it!
1 reply →
Did you want alignment sorting? In general the problem with things like that is the ideal layout is usually architecture and application specific - if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.
> Did you want alignment sorting?
Yep. It will probably improve (to be measured) the 80% common case. Less memory means less bandwidth usage etc.
> if my struct has padding in it to push elements onto different cache lines, I don't want the struct reordered.
I did suggest having a repr for situations like yours. Something like #[repr(yeet)]. Optimizing for false sharing etc. is probably well within 5% of code that exists today, and is usually wrapped up in a library that presents a specific data structure.
Very interesting but pretty quickly went over my head. I have a question that is slightly related to SIMD and LLVM.
Can someone explain simply where does MLIR fit into all of this? Does it standardize more advanced operations across programming languages - such as linear algebra and convolutions?
Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).
> Side-note: Mojo has been designed by the creator of LLVM and MLIR to prioritize and optimize vector hardware use, as a language that is similar to Python (and somewhat syntax compatible).
Are people getting paid to repeat this ad nauseam?
MLIR includes a "linalg" dialect that contains common operations. You can see those here: https://mlir.llvm.org/docs/Dialects/Linalg/
This post is rather unrelated. The linalg dialect can be lowered to LLVM IR, SPIR-V, or you could write your own pass to lower it to e.g. your custom chip.
> Can someone explain simply where does MLIR fit into all of this?
It doesn't.
MLIR is a design for a family of intermediate languages (called 'dialects') that allow you to progressively lower high-level languages into low-level code.
The ML media cycle is so unhinged that I've seen people simply assume out of hand that MLIR stands for Machine Learning Intermediate Representation.
interesting website - the title text is slanted.
Sometimes people who dig deep into the technical details end up being creative with those details.
True, creative, but usually in a quality-degrading way, like here (slanted text is harder to read, partly because the underline is too thick, and it takes more space) or like those poorly legible bg/fg color combinations.
https://www.contrastrebellion.com
good
Meta: the minimap is quite interesting, it's 'just' a smaller copy of all the content.
Clever! Should probably have aria-hidden though.