Comment by stabbles

1 year ago

Maybe the example is too simple, but it does not require `__attribute__((musttail))` for good code gen.

Also if the error handling function is unlikely, you wouldn't care too much about how fast it is to call it?

To me it seems like a structure of the form

   <read data>

   if (unlikely(malformed))
     return error();

   <prep more stuff>

   switch (data_type) {
     case x:
       return handle_x();
     case y:
       return handle_y();
   }

generates a nice jump table quite reliably.
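
A minimal runnable sketch of that shape, for concreteness. The names `parse_one`, `parse_error`, `TYPE_X` and `TYPE_Y` are made up for illustration, and with only two cases a compiler will likely emit compares rather than an actual jump table:

```c
#include <stddef.h>

/* Hypothetical record types; the names are illustrative only. */
enum data_type { TYPE_X = 0, TYPE_Y = 1 };

#if defined(__GNUC__)
#  define UNLIKELY(c) __builtin_expect(!!(c), 0)  /* GCC/Clang builtin */
#else
#  define UNLIKELY(c) (c)
#endif

static int parse_error(void) { return -1; }
static int handle_x(void)    { return 10; }
static int handle_y(void)    { return 20; }

/* Reads one type byte and dispatches on it. */
int parse_one(const unsigned char *buf, size_t len) {
    if (UNLIKELY(len == 0))          /* <read data> plus the malformed check */
        return parse_error();

    switch ((enum data_type)buf[0]) {
    case TYPE_X: return handle_x();
    case TYPE_Y: return handle_y();
    }
    return parse_error();            /* unknown type byte */
}
```

Here each handler returns to `parse_one`, which returns to its caller; the handlers do not chain into each other.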

Obviously compilers have been doing tail call elimination forever, but for this trick "generates [tail calls] quite reliably" is not enough: you have to GUARANTEE it (or fail compilation), otherwise this structure does not work (it will blow the stack immediately). That's the point of [[musttail]]: tail call elimination is required, and the compiler has no other choice.
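
To make "this structure" concrete, here is a rough sketch of handlers that chain into each other through tail calls. The toy opcodes, the `MUSTTAIL` macro name and `run` are assumptions for illustration, not code from the article. Without a guaranteed tail call, every executed opcode would add a stack frame:

```c
#include <stddef.h>

/* MUSTTAIL is a hypothetical macro name. Where the attribute is unknown it
   expands to nothing, and the tail call is then only likely, not guaranteed. */
#if defined(__has_attribute)
#  if __has_attribute(musttail)
#    define MUSTTAIL __attribute__((musttail))
#  endif
#endif
#ifndef MUSTTAIL
#  define MUSTTAIL
#endif

enum { OP_HALT = 0, OP_INC = 1, OP_DBL = 2, OP_COUNT = 3 };

/* All handlers share one signature, which musttail requires. */
typedef long handler_fn(const unsigned char *pc, long acc);
static handler_fn op_halt, op_inc, op_dbl, op_bad;

static handler_fn *const table[OP_COUNT] = { op_halt, op_inc, op_dbl };

/* Looks up the handler for the current opcode and jumps to it. */
static long dispatch(const unsigned char *pc, long acc) {
    handler_fn *next = *pc < OP_COUNT ? table[*pc] : op_bad;
    MUSTTAIL return next(pc, acc);
}

static long op_halt(const unsigned char *pc, long acc) { (void)pc; return acc; }
static long op_bad(const unsigned char *pc, long acc) { (void)pc; (void)acc; return -1; }
static long op_inc(const unsigned char *pc, long acc) {
    MUSTTAIL return dispatch(pc + 1, acc + 1);
}
static long op_dbl(const unsigned char *pc, long acc) {
    MUSTTAIL return dispatch(pc + 1, acc * 2);
}

/* Entry point: runs a program that ends in OP_HALT. */
long run(const unsigned char *program) { return dispatch(program, 0); }
```

With the program `{OP_INC, OP_INC, OP_DBL, OP_HALT}`, `run` returns 4.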

  • "you have to GUARANTEE it (or fail compilation)"

    I've often pondered the utility of similar flags for other optimizations. This is perhaps the largest one, but there are other situations in certain code where I want to know that my optimization has failed.

    A more complicated example is, I've sometimes wanted an assertion that a given function is inlined. We know from hard, repeated experience over decades that letting users annotate functions as inline directly doesn't end up working very well, but I've often wondered about creating an assertion that would fail the compile if it isn't. (Ideally with at least a hint as to why the compiler failed it out, but that's easier said than done.) Obviously you don't want to go slapping this in some major open source library that's going to be compiled by half-a-dozen compilers on dozens of operating systems, but for my own code in my own situation it can be an optimization that is the difference between success and failure and it'd be nice to flag a failure.

    (Bear in mind I am not proposing to blindly take any particular action if it triggers. The point is to bring it up to human attention and not have it buried deep in invisible, yet potentially consequential, compiler decisions. The human deciding "I guess this optimization isn't happening" and removing the annotation would be one valid decision, for instance.)

    • I agree, this would be useful. Another one I would like is for auto-vectorization: a way to mark a loop with an attribute so that, if it fails to auto-vectorize, the compiler prints the auto-vectorization report for that loop, explaining why it failed. It's such a brittle optimization, but it's crucial for a tiny number of extremely hot loops, and you would want to know if it broke due to some code change or compiler upgrade. Also, the auto-vectorization report is just a pain to use normally.
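
      Clang gets partway there already. A minimal sketch, assuming Clang for the pragma (the `saxpy` function is only an illustration): with `#pragma clang loop vectorize(enable)`, Clang warns under `-Wpass-failed` when it could not perform the requested vectorization, and `-Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize` prints the remarks. On GCC, `-fopt-info-vec-missed` reports missed loops.

      ```c
      #include <stddef.h>

      /* y[i] += a * x[i]; `restrict` promises that the arrays do not overlap. */
      void saxpy(size_t n, float a, const float *restrict x, float *restrict y) {
      #if defined(__clang__)
      #pragma clang loop vectorize(enable)
      #endif
          for (size_t i = 0; i < n; i++)
              y[i] += a * x[i];
      }
      ```

      It is still a warning rather than a hard error unless you promote it with `-Werror=pass-failed`.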

    • > I've sometimes wanted an assertion that a given function is inlined.

      Try `__attribute__((error("not inlined")))` or `warning` on the callee.

  • That means that any such code is not portable across compilers anymore. It is effectively written in a non-standard C dialect, because it requires a language extension to work correctly.

    • The typical way to deal with this is to put the __attribute__ into a C macro which expands to nothing on compilers which don't understand GCC/Clang's __attribute__ keyword. The code without the attribute will still compile and most likely also apply the tail call optimization, you just don't get an error if the compiler can't apply the optimization.

      Also TBF, hardly any real-world C code is strictly standard compliant. Many C compilers just agree on a common syntax that includes both the C standard and some popular non-standard extensions.

      PS: C++ compilers actually ignore unknown attributes, since the `[[attribute]]` syntax was standardized in C++11. In GCC and Clang you'll get a warning in the standard warning set, but not in MSVC.

      PPS: C23 also standardized the `[[attribute]]` syntax and added a way to check for supported attributes (`__has_c_attribute`):

      https://en.cppreference.com/w/c/language/attributes
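
      A sketch of such a macro, assuming the vendor spelling `clang::musttail` and the GNU spelling `musttail`; the macro name `MUSTTAIL` and the function `sum_to` are made up for illustration:

      ```c
      /* Prefer the C23 [[...]] spelling when the compiler is in C23 mode. */
      #if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L && defined(__has_c_attribute)
      #  if __has_c_attribute(clang::musttail)
      #    define MUSTTAIL [[clang::musttail]]
      #  endif
      #endif
      /* Otherwise try the GNU __attribute__ spelling. */
      #if !defined(MUSTTAIL) && defined(__has_attribute)
      #  if __has_attribute(musttail)
      #    define MUSTTAIL __attribute__((musttail))
      #  endif
      #endif
      /* Unknown everywhere: expand to nothing and lose the guarantee. */
      #ifndef MUSTTAIL
      #  define MUSTTAIL
      #endif

      /* Adds n + (n-1) + ... + 1 to acc using a tail call. */
      long sum_to(long n, long acc) {
          if (n == 0)
              return acc;
          MUSTTAIL return sum_to(n - 1, acc + n);
      }
      ```

      In the fallback case the code still compiles, but a deep recursion is only safe if the optimizer happens to turn the call into a jump.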

    • Yes, that is correct. You cannot do this trick in standard C, C++ or Rust, it requires some version of [[musttail]]. Strong argument for adding it to the C standard, IMHO.

    • Fwiw, many C projects are written in a non-standard C dialect, including the Linux kernel.

    • The article is pretty clear about this. When it comes to fast lexing and parsing, it is typical for projects to make portability tradeoffs in favor of performance. For example, simdjson is full of assembly.

    • There's no such thing as "standard C" that you can actually write, due to UB and implementation defined behaviour. There's just (C, compiler version, platform) that defines (if only through the compiler's source code) what will actually happen in any given situation.

As the disassembly in the post demonstrates, the problem with the fallback path (which is not necessarily the error path) is not how fast the call to it is, it's that the mere existence of that call can force the compiler to create a stack frame and spill registers into it for the whole function, including the fast path.

OK, maybe “force” is not the right word—nobody says the compiler has to have a single stack frame structure for all possible execution paths of a function. Nobody even says it has to use the standard ABI for a no-linkage (static or anonymous-namespace) function (that doesn’t have its address taken). But the reality is, all compilers I’ve seen do, including Clang, so we want a way to tell them to not worry about the ABI and avoid wasting time on preserving registers across the call.
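
A rough sketch of the two shapes, with hypothetical names throughout (`slow_varint`, `slow_add`, `add_call`, `add_tail`). Whether the first variant really pays for a frame on the fast path depends on the compiler and flags, so compare the disassembly yourself:

```c
#include <stddef.h>

#if defined(__GNUC__)
#  define NOINLINE __attribute__((noinline))
#  define UNLIKELY(c) __builtin_expect(!!(c), 0)
#else
#  define NOINLINE
#  define UNLIKELY(c) (c)
#endif

/* Hypothetical slow path: decodes a two-byte varint. */
NOINLINE static long slow_varint(const unsigned char *p) {
    return (long)(p[0] & 0x7f) | ((long)p[1] << 7);
}

/* Same slow path, but it also finishes the caller's work. */
NOINLINE static long slow_add(const unsigned char *p, long sum) {
    return sum + slow_varint(p);
}

/* Variant A: `sum` is live across the fallback call, so it has to be kept
   in a callee-saved register or spilled around the call. */
long add_call(const unsigned char *p, long sum) {
    long v = *p;
    if (UNLIKELY(v & 0x80))
        v = slow_varint(p);
    return sum + v;
}

/* Variant B: the fallback is a tail call, so nothing is live after it and
   the function can get by without a frame of its own. */
long add_tail(const unsigned char *p, long sum) {
    if (UNLIKELY(*p & 0x80))
        return slow_add(p, sum);
    return sum + *p;
}
```

Both variants compute the same result; the difference is only in what has to survive the call.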

Re your nice jump table, sure it does. But if you try running the result under, say, perf report, and your test bytecode doesn’t describe a short loop, you’ll see one of two things: either you had a branch mispredict on each dispatch; or the compiler went “looks like you’re trying to write an interpreter” and moved the indirect jump to the end of each case (I’ve seen Clang do this). And either way the register allocation in the resulting code probably sucks.
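
For reference, "the indirect jump at the end of each case" looks roughly like this when written by hand. This is a sketch that assumes the GNU "labels as values" extension (GCC and Clang), a made-up toy bytecode, and no validation of opcodes:

```c
/* Opcodes for a hypothetical toy bytecode. */
enum { OP_HALT, OP_INC, OP_DBL };

/* Assumes every byte is a valid opcode and the program ends in OP_HALT. */
long run_threaded(const unsigned char *pc) {
    /* Indexed by opcode; &&label is the GNU address-of-label operator. */
    static void *const targets[] = { &&do_halt, &&do_inc, &&do_dbl };
    long acc = 0;

#define DISPATCH() goto *targets[*pc++]   /* one indirect jump per handler */
    DISPATCH();
do_inc:
    acc += 1;
    DISPATCH();
do_dbl:
    acc *= 2;
    DISPATCH();
do_halt:
    return acc;
#undef DISPATCH
}
```

Each handler gets its own indirect jump, so the branch predictor sees one jump site per opcode instead of a single shared one.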

  • > so we want a way to tell them to not worry about the ABI and avoid wasting time on preserving registers across the call

    that's what -fvisibility=internal already does, no?

    • That’s what static could do (if the function’s address is not taken, or given sufficiently powerful dataflow analysis), but C and C++ compilers don’t take advantage of that. Getting that out of -fvisibility=hidden -flto would also be possible, but requires even more nonexistent compiler smarts. (From a quick web search, I can't figure out what internal visibility brings over hidden.)

      (Granted, it’s not like this is completely impossible—I seem to remember GHC and MLton can invent custom calling conventions for Haskell and SML respectively. But the popular C or C++ compilers can’t.)
