Parsing Protobuf at 2+GB/S: How I Learned to Love Tail Calls in C

5 years ago (blog.reverberate.org)

A very interesting trick! In my opinion, the big takeaway here is that if you are willing to write your C code in tail-call style, you can get a lot of control over which variables are stored in machine registers. In the x86-64 System V ABI, the first 6 function parameters are guaranteed to be passed via registers.

Obviously, this is architecture-specific. Other architectures may use a different calling convention that ruins this kind of manual register allocation. I'd imagine that it would be bad for performance on 32-bit systems, where function arguments are passed via the stack.
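
To make this concrete, here is a minimal sketch of the style (my example, not the article's actual code): all the hot state is threaded through the first few parameters, and every transfer of control is a guaranteed tail call, so everything stays in argument registers.

    #include <stdio.h>

    #define MUSTTAIL __attribute__((musttail))  /* Clang's guaranteed tail call */

    /* All state lives in the parameters; in the x86-64 System V ABI the
       first 6 integer parameters arrive in registers (rdi, rsi, ...). */
    static long step_b(long n, long acc);

    static long step_a(long n, long acc) {
      if (n == 0) return acc;
      MUSTTAIL return step_b(n - 1, acc + n);  /* a jmp, not a call */
    }

    static long step_b(long n, long acc) {
      if (n == 0) return acc;
      MUSTTAIL return step_a(n - 1, acc + n);
    }

    int main(void) {
      printf("%ld\n", step_a(10, 0));  /* 55, computed in constant stack space */
      return 0;
    }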

--------

> I think it’s likely that all of the major language interpreters written in C (Python, Ruby, PHP, Lua, etc.) could get significant performance benefits by adopting this technique.

I know that at least Lua and Python use "computed gotos" in their inner loops, which also helps the compiler generate better code. The architecture-dependent nature of the tail-call trick could be a problem here. Some of these interpreters also need to work well on 32-bit systems.
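For reference, the computed-goto pattern looks roughly like this (a toy sketch; real interpreters index a 256-entry table by the opcode byte). The `&&label` syntax is the GCC/Clang extension in question:

    #include <stdio.h>

    enum { OP_INC, OP_DEC, OP_HALT };

    static int run(const unsigned char *pc) {
      /* &&label takes the address of a label: a GCC/Clang extension. */
      static void *dispatch[] = {&&op_inc, &&op_dec, &&op_halt};
      int acc = 0;
      goto *dispatch[*pc];  /* each handler jumps directly to the next one */
    op_inc: acc++; goto *dispatch[*++pc];
    op_dec: acc--; goto *dispatch[*++pc];
    op_halt: return acc;
    }

    int main(void) {
      const unsigned char prog[] = {OP_INC, OP_INC, OP_DEC, OP_HALT};
      printf("%d\n", run(prog));  /* prints 1 */
      return 0;
    }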

  • Yep, that's exactly the right takeaway. :) I have verified this on ARM64 also, but architectures that pass parameters on the stack will make this technique not really worth it.

    Re: PHP, Ruby, etc. yes, computed gotos help some, but I think tail calls would help much more. That was our experience at least. I wrote more about this here: https://gcc.gnu.org/pipermail/gcc/2021-April/235891.html

    Yes, there are portability issues with the tail call approach, so there would need to be a fallback on non-x64/ARM64 platforms. This would add complexity. But it's exciting to think that it could unlock significant performance improvements.

    • If you like tail calls, look into CPS (continuation-passing style). Many forms of (pure) code can be rewritten that way.

      Everyone who writes Haskell quickly learns to write their recursive functions so that they are (mutually) tail recursive.
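
      A minimal C sketch of the simplest such rewrite (accumulator-passing style, a close cousin of full CPS): the first version has work pending after the recursive call; the second does not, so its call can become a jump.

          /* Not a tail call: the multiply happens after the call
             returns, so the stack frame has to survive. */
          long fact(long n) { return n <= 1 ? 1 : n * fact(n - 1); }

          /* Tail recursive: the pending work moved into the accumulator. */
          long fact_acc(long n, long acc) {
            return n <= 1 ? acc : fact_acc(n - 1, n * acc);
          }
          /* fact_acc(5, 1) == 120 */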

      2 replies →

    • On the matter of portability, I wonder if it would be possible to use some macro magic and/or a code generator to convert the tail-call version back into a more traditional while/switch loop, for the stack-based architectures.

      While the tail call version is more architecture dependent, it's nevertheless more portable than assembly language. It's still C.
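
      A rough sketch of what that might look like (everything here is hypothetical, not from any real code generator): the handler bodies are written once, and a macro decides whether advancing to the next opcode is a guaranteed tail call or a plain loop iteration.

          #include <stdio.h>

          enum { OP_INC, OP_HALT };

          #if USE_TAIL_CALLS
          typedef int fn(const unsigned char *pc, int acc);
          static fn op_inc, op_halt;
          static fn *const table[] = {op_inc, op_halt};
          #define NEXT() __attribute__((musttail)) return table[*pc](pc, acc)
          static int op_inc(const unsigned char *pc, int acc)  { acc++; pc++; NEXT(); }
          static int op_halt(const unsigned char *pc, int acc) { (void)pc; return acc; }
          static int run(const unsigned char *pc) { return table[*pc](pc, 0); }
          #else
          /* Fallback for stack-based targets: the same opcodes as a switch loop. */
          static int run(const unsigned char *pc) {
            int acc = 0;
            for (;;) switch (*pc++) {
              case OP_INC:  acc++; break;
              case OP_HALT: return acc;
            }
          }
          #endif

          int main(void) {
            const unsigned char prog[] = {OP_INC, OP_INC, OP_HALT};
            printf("%d\n", run(prog));  /* prints 2 */
            return 0;
          }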

      2 replies →

  • Regarding PHP specifically, the interpreter's opcodes are all described in a meta language, and the actual PHP interpreter can be code-generated in 4 different styles depending on what is fastest for the target C compiler.

    See ZEND_VM_KIND_CALL / SWITCH / GOTO / HYBRID in php-src/Zend/zend_vm_gen.php

    • Even more wild is the JavaScriptCore bytecode VM, which is written in a Ruby DSL that is compiled either to C or directly to ASM.

      I'm jealous we don't have something like this for CRuby!

  • > the first 6 function parameters are guaranteed to be passed via registers.

    This assumes that your function is being called via the regular calling convention. By the as-if rule there is nothing guaranteeing that an internal call is using the regular calling convention or that it hasn't been inlined.

      Are there cases where a compiler would reasonably use a different internal calling convention that passes fewer arguments via registers?

      Your point is still valid even if there are none (no reason doesn't imply a guarantee), to be clear. I'm just curious, because I can't really see any reason for a compiler to do so.

      I can imagine some theoretical cases where compiler optimizations lead to additional arguments being passed to a function [version].

      2 replies →

    • We use the "noinline" attribute to get control over inlining, which gives a reasonable assurance that we'll get a normal standard ABI call with six parameters in registers on x64/ARM64.
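
      In sketch form (hypothetical names, not upb's actual code):

          /* noinline forces a real, standard-ABI call at the call site, so
             the hot handler's six register-passed parameters stay put. */
          __attribute__((noinline))
          static const char *fallback(const char *ptr, unsigned tag) {
            return ptr + tag;  /* stand-in for the rarely-taken work */
          }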

      1 reply →

  • Modern compilers can compile tail calls of unknown functions (i.e. function pointers) to jumps; so instead of using "computed gotos" (which IIRC is a GCC extension), one can use ANSI C and get more efficient code (because of the control over registers).
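
    E.g. (a sketch): at -O2, both GCC and Clang will typically lower this indirect tail call to a single indirect jmp rather than a call:

        typedef int handler(int x);

        int dispatch(handler *h, int x) {
          return h(x);  /* tail position: typically becomes an indirect jmp */
        }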

    • > which IIRC is a GCC extension.

      Correct, and it's always been a gaping hole in the standard, in my opinion. Computed GOTO would be one of the more useful things that could be added to C.

  • I'm not a C programmer at all, so please forgive me if this question doesn't make sense. If I remember correctly, there is a `register` keyword in C that you can use to hint to the compiler that a given variable should be stored in a register. I don't think I've ever seen this keyword in use, and I'm wondering why. It seems it would be useful for this same use case, without requiring tail calls and with the advantage of being fully portable.

    • `register` was used as a hint, and things like LLVM have Mem2Reg optimization passes that do a much better job at this sort of promotion.

      IIRC register was mainly used in cases where hardware was involved, much like the volatile keyword. However, unlike volatile, I don't believe register is even considered by compilers these days.

  • Wouldn't you still be able to get a lesser improvement from the tail call being able to overwrite the stack frame and jumping instead of calling?

Here's a related Zig proposal: https://github.com/ziglang/zig/issues/8220

Relevant section pasted here:

> Other Possible Solution: Tail Calls

> Tail calls solve this problem. Each switch prong would return foo() (tail call) and foo() at the end of its business would inline call a function which would do the switch and then tail call the next prong.

> This is reasonable in the sense that it is doable right now; however there are some problems:

   * As far as I understand, tail calls don't work on some architectures.

     * (what are these? does anybody know?)

   * I'm also concerned about trying to debug when doing dispatch with tail calls.

   * It forces you to organize your logic into functions. That's another jump that
     maybe you did not want in your hot path.

See also https://dino-lang.github.io/learn/dino.pdf, section 3.1 "Byte Code Dispatch".

Scheme has had tail calls as a basic idiom (some call it Proper Implementation of Tail Calls; others call it Tail Call Optimization) forever, and I keep them in mind any time I'm implementing anything nontrivial in Scheme.

There's no special syntax -- there's just the notion of tail positions, from which tail calls can occur. For example, both arms of an `if` form are tail position, if the `if` itself is. And if you introduce a sequencing block in one of those arms, the last position in the block is tail position. (A specialized IDE like DrRacket can indicate tail positions, and DrRacket will even hover arrows, tracing the tail positions back to the top of the code.)

When implementing a state machine, for example, a state transition is simply a function application (call) to the new state (optionally with arguments), where the called function represents a state. It's satisfying when you realize how elegant something that used to be messy is, and you can imagine it being very fast (even if your Scheme implementation isn't quite up to getting as fast as a C or Rust compiler that can recognize and special-case tail calls.)

(For the state machine example, compare to more conventional approaches in C, of "state ID" variables, `switch` statements on those, and record-keeping for additional state. Or doing it in data, with the state ID being an index into arrays, again with any additional recordkeeping. Or lots of function calls when you didn't really need the time overhead and stack usage of function calls with full returns.)
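
The same shape can now be written in C with guaranteed tail calls; a small sketch of mine (names hypothetical): a two-state machine that accepts strings containing an even number of '1' characters.

    #include <stdbool.h>
    #include <stdio.h>

    #define MUSTTAIL __attribute__((musttail))

    static bool odd_ones(const char *s);

    /* Each state is a function; each transition is a tail call, so the
       machine runs in constant stack space however long the input is. */
    static bool even_ones(const char *s) {
      if (*s == '\0') return true;   /* accepting state */
      if (*s == '1') { MUSTTAIL return odd_ones(s + 1); }
      MUSTTAIL return even_ones(s + 1);
    }

    static bool odd_ones(const char *s) {
      if (*s == '\0') return false;
      if (*s == '1') { MUSTTAIL return even_ones(s + 1); }
      MUSTTAIL return odd_ones(s + 1);
    }

    int main(void) {
      printf("%d %d\n", even_ones("10101"), even_ones("1001"));  /* 0 1 */
      return 0;
    }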

  • Recursive descent parsers are, in a way, state machines as well. Writing something like a CSV or JSON parser in scheme is so much nicer than having to do it the old fashioned way.

Protobufs are very important for Google. A significant percentage of all compute cycles is used on parsing protobufs. I am surprised that the parsing is not done using handwritten assembly if it's possible to improve performance so much.

  • Protobuf's abysmal performance, questionable integration into the C++ type system, append-only expandability, and annoying naming conventions and default values are why I usually try and steer away from it.

    As a lingua franca between interpreted languages it's about par for the course but you'd think the fast language should be the fast path (ie: zero parsing/marshalling overhead in Rust/C/C++, no allocations) as you're usually not writing in these languages for fun but because you need the thing to be fast.

    It's also the kind of choice that comes back to bite you years into a project if you started with something like Python and then need to rewrite a component in a systems language to make it faster. Now you not only have to rewrite your component but change the serialization format too.

    Unfortunately Protobuf gets a ton of mindshare because nobody ever got fired for using a Google library. IMO it's just not that good and you're inheriting a good chunk of Google's technical debt when adopting it.

    • Zero parse wire formats definitely have benefits, but they also have downsides such as significantly larger payloads, more constrained APIs, and typically more constraints on how the schema can evolve. They also have a wire size proportional to the size of the schema (declared fields) rather than proportional to the size of the data (present fields), which makes them unsuitable for some of the cases where protobuf is used.

      With the techniques described in this article, protobuf parsing speed is reasonably competitive, though if your yardstick is zero-parse, it will never match up.

      8 replies →

    • We jumped from protobuf -> arrow in the very beginning of arrow (e.g., wrote on the main lang impls), and haven't looked back :)

      if you're figuring out serialization from scratch nowadays, for most apps, I'd def start by evaluating arrow. A lot of the benefits of protobuf, and then some

      7 replies →

    • Protobuf itself as a format isn't that bad; it's the default implementations that are bad: slow compile times, code bloat, and clunky APIs/conventions. Nanopb is a much better implementation and allows you to control code generation better too. Protobuf makes sense for large data, but for small data, fixed-length serialization with compression applied on top would probably be better.

    • It's obviously possible to do protobuf with zero parsing/marshalling if you stick to fixed length messages and 4/8 byte fields. Not saying that's a good idea, since there are simpler binary encodings out there when you need that type of performance.

      2 replies →

    • FWIW, the python protobuf library defaults to using the C++ implementation with bindings. So even if this is a blog post about implementing protobuf in C, it can also help implementations in other languages.

      But yes, once you want real high performance, protobuf will disappoint you when you benchmark and find it responsible for all the CPU use. What are the options to reduce parsing overhead? flatbuffers? xdr?

      1 reply →

  • Handwritten ASM for perf is almost never worth it in modern times. C compiled with GCC/Clang will almost always be just as fast or faster. You might use some inline ASM to use a specific instruction if the compiler doesn't support generating it yet (like AVX512 or AES), but even for that there's probably an intrinsic available. You can still inspect the output to make sure it's not doing anything stupid.

    Plus it's C so it's infinitely more maintainable and way more portable.

    • The x86 intrinsics are so hard to read, because of terrible Wintel Hungarian naming conventions, that I think it’s considerably clearer to write your SIMD in assembly. It’s usually easy enough to follow asm if there aren’t complicated memory accesses anyway. The major issue is not having good enough debug info.

      1 reply →

    • But this seems to be an edge case where you have to rely on functional programming and experimental compiler flags to get the machine code that you want.

      Portability is typically not a big issue, because you can have a fallback C++ implementation.

  • Yet Microsoft was able to make a .NET implementation faster than Google's current C++ one.

    Proof that they don't care enough about protobuf parsing performance.

    https://devblogs.microsoft.com/aspnet/grpc-performance-impro...

  • Given that fact I'm wondering if google ever researched custom chips or instruction sets for marshalling pbs, like the TPUs they worked on for ML.

    • Problem is, once you parse the protobuf, you have to immediately do other computations on it in the same process. No one needs to parse protobufs all day long the way you run an ML model or compute hashes for crypto.

      1 reply →

  • I'm going to guess most of the time is being spent elsewhere in the systems they are looking at and it is rather rare they have a situation where the parser is dominating. Protobuf is already a winner compared to the JSON mess we're in.

  • So important that they haven't bothered to create a protobuf generator for Kotlin, the primary development language for their own mobile operating system.

One thing I've been thinking about in C++ land is just how much the idiomatic usage of RAII actually prevents the compiler from doing its own tail call optimization. Any object instantiated in automatic storage with a non-trivial destructor basically guarantees the compiler _can't_ emit a tail call. It's rather unfortunate, but perhaps worth it if the tradeoff is well understood. Example: https://godbolt.org/z/9WcYnc8YT

  • You can still use this trick in C++ if you ensure that the return statements are outside the scopes that have the RAII objects. It's awkward but it's better than nothing. https://godbolt.org/z/cnbjaK85T

        void foo(int x) {
          {
            MyObj obj; 
            // ...
          }
          return bar(x); // tail call
        }

  • I am unable to understand this comment. You're saying that you can't generate a tail call by returning from the middle of a function which needs to clean things up. RAII is merely syntactic sugar to write this control flow and make it mandatory.

    Perhaps it's easier to think of tail-call semantics as simply implementing iteration: it's another way of expressing a for(...) loop. And if you used RAII in the body of your for loop you would expect the same thing.

    • I can understand the comment: even if the cleanup can be done before the call, allowing the call to be done as a tail call, the compiler will not move up the cleanup code for you, and having the cleanup code last is the default state. With local variables defined inside for loops, they are always destroyed before the next iteration.

    • Absolutely, RAII is an abstraction (and a useful one), but it has a cost in that it prevents a form of useful optimization because cleanup is required at the destruction of the stack frame. You'd expect the same in C if you explicitly had to call a cleanup function on return from a call.

      What C++ does with RAII is make this tradeoff non-obvious. std::unique_ptr is a great example to show this: colloquially a std::unique_ptr is "just a pointer", but it isn't in this case, because its non-trivial destructor prevents TCO.

  • These tail-call functions are part of a program’s inner loop. It seems like we shouldn’t expect allocation within an inner loop to be fast? Other than local variables, that is.

    In an interpreter, it seems like either you’re allocating outside the interpreter loop (the lifetime is that of the interpreter) or it’s within a particular (slow) instruction, or it’s part of the language being interpreted and the lifetime can’t be handled by RAII. There will be fast instructions that don’t allocate and slower ones that do.

    Interpreters are a bit weird in that there are lots of instructions that do almost nothing so the overhead of the loop is significant, combined with having lots of complication inside the loop and little ability to predict what instruction comes next. This is unlike many loops that can be unrolled.

  • I suppose the compiler could reorder function calls if it can prove there is no change in behavior? If so, then it could hoist dtors above the call and emit a jump. I doubt any compilers do this.

    • I would hope musttail does this if it “must” be a tail.

      Actually, I need it to do this for something at my day job, guess I’ll look it up…

It does make code harder to debug when something goes wrong. Since tail calls create no stack frames, a call sequence f->g->h where g tail calls h will show up as a stack of (f, h) in the debugger. The fix is easy: just make MUSTTAIL expand to nothing in debug builds. But it's something to keep in mind, and it means your debug-mode code will have different memory use characteristics.
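
Concretely, something like this (a sketch; note the caveat raised downthread that code which relies on the tail call for stack safety can't simply turn it off):

    #ifdef NDEBUG
    #define MUSTTAIL __attribute__((musttail))  /* guaranteed tail call */
    #else
    #define MUSTTAIL  /* debug: keep the frames so the debugger sees f->g->h */
    #endif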

  • That is true, but there is always https://rr-project.org for easy reverse debugging if you're having trouble figuring out where you came from.

    If the alternative is to drop to assembly, a C-based approach seems quite easy to debug. You can just add printf() statements! Previously when I had been using assembly language or a JIT, I had to resort to techniques like this: https://blog.reverberate.org/2013/06/printf-debugging-in-ass...

    • Of course RR is too expensive to be deployed in production builds. So if you are getting core dumps from "production" you won't have this information. So while RR helps it doesn't completely mitigate the tradeoff.

      2 replies →

    • Perhaps a "ring logger" approach could be useful. Append the more useful bits of what would normally go into the stack frame but without the allocation overhead. Seems like a few memcpy's and modulo operations wouldn't hurt the code gen too much.
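
      Sketching the shape of that idea (all names hypothetical):

          #define RING_SLOTS 64  /* power of two, so wrapping is a mask */

          typedef struct { void *handler; const char *ptr; } trace_slot;
          static trace_slot ring[RING_SLOTS];
          static unsigned ring_head;

          /* Called at the top of each handler: records where we've been,
             with no allocation and no stack growth. */
          static inline void trace(void *handler, const char *ptr) {
            ring[ring_head++ & (RING_SLOTS - 1)] = (trace_slot){handler, ptr};
          }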

  • Only if the alternative is non-tail function calls. Often the alternative to tail calls is a while loop, which also doesn't leave a stack trace.

    • As so often, the moral is that pushing your problems in-band doesn't make them go away.

  • Programmers have traditionally relied on the stack as an (expensive yet incomplete) execution logging system, but the easy fix for that is to use an actual log structure instead of abusing the stack.

  • Seems like not much of a fix if your code depends on it. In Scheme such behavior would break standard compliance (and lots of programs). And since presumably you'd use this attribute precisely when your code depended on it, disabling it may not be feasible.

    Fortunately the patterns generated by such code are no more difficult to debug than simple loops in other languages, so the lack of "history" of your computation seems no more concerning with tail calls than it is with loops (or at least that's how I always looked at it). And people don't seem to be avoiding loops in programming merely for the reason of loops not keeping track of their history for debugging purposes.

  • > The fix is easy, just make MUSTTAIL expand to nothing in debug builds.

    That wouldn’t work. The final call is a recursive call to dispatch; if that is not a tail call, it will blow the stack instantly.

About the jne issue: LLVM does it intentionally.

Check line 1521

https://llvm.org/doxygen/BranchFolding_8cpp_source.html

  • Wow thanks for the pointer! That seems unfortunate, I wonder if there is a way to evaluate whether the extra jump is actually worth it, and whether this optimization could be allowed.

    • At the very least, can check when the target isn't a basic block and thus it's a clear win. Will fix your case.

      I'm dubious about the whole thing though. Seems like it may date from when branching "down" vs. "up" mattered to branch prediction.

      2 replies →

  • Why is that a problem? I'd figure that a short jump is basically free since the target is likely to be in the instruction cache? Is it an issue of readahead/speculation through multiple jumps?

You'd get equivalent performance with "goto", but not depend on implicit compiler optimizations to save you from stack overflow.

I like the idea of tail calls, but I don't like it being an implicit "maybe" optimization of the compiler. Make it part of the standard, make the syntax explicit, and then I'm all in.

"Applying this technique to protobuf parsing has yielded amazing results: we have managed to demonstrate protobuf parsing at over 2GB/s, more than double the previous state of the art."

I was somewhat surprised to see that was state of the art for protobufs. Simdjson boasts faster throughput, without the benefit of the length encoded headers that are in protobufs. I looked for examples of protobufs using SIMD, but could only find examples of speeding up varint encode/decode.

  • Article author here. I dumped my test protobuf payload (7506 bytes) to JSON and got 19283 bytes.

    So parsing this particular payload at 2GB/s would be equivalent to parsing the JSON version at 5.1GB/s.

    SIMD doesn't end up helping protobuf parsing too much. Due to the varint-heavy nature of the format, it's hard to do very much SIMD parallelism. Instead our approach focuses on going for instruction-level parallelism, by trying to remove instruction dependencies as much as possible. With this design we get ~3.5 instructions per cycle in microbenchmarks (which represents a best case scenario).

  • Making a flagrantly wasteful data format and then using its bloated extent as the numerator in your benchmark is not exactly a fair comparison. If a protobuf has a packed, repeated field that looks like \x0a\x02\x7f\x7f and JSON has instead { "myFieldNameIsBob": [ 127, 127 ] }, the JSON parser has to be ~9x faster (4 bytes vs. 36) just to stay even.

    • That's true, would be interesting to see an "encoded entities per second" comparison. Or maybe a comparison with mostly stringy data where the size is probably comparable.

      3 replies →

  • I don't know if this is a fair comparison, as 2GB/s of protobuf will parse a lot more information than 2GB/s of JSON will, since protobuf is a much more space-efficient way to encode your data.

  • Isn't there also a significant difference in what the input is being parsed to? My expectation is for a protobuf library to parse messages to structs, with names resolved and giving constant time field access. Simdjson parses a json object to an iterator, with field access being linear time and requiring string comparisons rather than just indexing to a known memory offset.

    I.e. it seems like simdjson trades off performance at access time for making the parsing faster. Whether that tradeoff is good depends on the access pattern.

    • But the same could be true for protobuf. Decode fields only when you need them, and 'parse' just to find the field boundaries and cardinality. Did stuff like that for internal protobuf-like tool and with precomputed message profiles you can get amazing perf. Just get the last or first bit of most bytes (vgather if not on AMD) and you can do some magic.

  • You cannot compare them directly. Decoding JSON is essentially compressing it, and in order to compare you would need to look at the resulting data structures and how long producing them takes. Strings are comparable, but an integer is bigger as text than as binary, by a factor of roughly 8/log2(10) ≈ 2.4, I think.

    But yeah, I had the same first thought too.

  • What benefit would length encoded headers provide other than to reduce payload size? With JSON you just have to scan for whitespace, whereas with protobuf you actually have to decode the length field.

I’ve opened an issue in LLVM bugzilla concerning jump not being folded with address computation on x86 with a proposed fix. Would love if it gets some attention. https://bugs.llvm.org/show_bug.cgi?id=50042

Also been working on a C language extension to enable guaranteed tail calls along with explicit control over registers used for argument passing. Provided that callee-save registers are used for arguments, calling fallback functions incurs no overhead.

https://github.com/rapidlua/barebone-c

  • Barebone C looks really interesting! I wonder if the same gains could be achieved in a more architecture-independent way by simply introducing a new calling convention where parameters are passed in the registers that are normally callee-save.

    This would let the fast path functions use all registers without needing to preserve any of them. It would also let them call fallback functions that use a normal calling convention without needing to spill the parameters to the stack.

    • Thank you for the feedback! A new calling convention could probably nail it for many use cases. Sometimes you want something very special though. E.g. LuaJIT pre-decodes instruction prior to dispatch to squeeze some work into branch misprediction delay. There are limited callee save registers available hence it is better to use volatile registers for instruction arguments. It should be possible to reuse the implementation on different architectures and only tweak the register assignment.

Very cool! OT but is there a list of protobuf benchmarks somewhere across languages/implementations? How does it compare to JSON or just dumping raw bytes?

even if you set aside the claimed gains over handwritten imperative loops, being able to tell the compiler to fail unless tail call conversion succeeds will be huge in terms of preventing performance or stack depth bugs from surfacing when someone accidentally makes a change that makes a tail call no longer convertible.

if you ask me, it should be part of the language. the whole "just make sure you shouldn't need a stack and the compiler will probably take care of it" approach has bothered me for years.

  • > the whole "just make sure you shouldn't need a stack and the compiler will probably take care of it" approach has bothered me for years.

    Good point. I love Rust’s safety guarantees, but you’re right - it doesn’t take into account stack like everyone else

    • ... to the down-voters, I meant Rust doesn't take into account stack calls just like most languages

Interesting but it seems like the essential function of the tail call in this example is to create a strong hint to the compiler which values should stay in registers. Isn’t there a better way to do that than relying on a seemingly unrelated optimization?

Looking at the LLVM docs for attributes I get the impression that tail call optimization (TCO) only happens when the caller and callee have basically the same kind of arguments. You can see how that's real simple to implement: emit a `jmp` and done. For non-compatible tail calls the compiler would have to emit instructions to fix up the current call frame, and I guess the code for that doesn't exist in LLVM.

I wonder if that's typical of TCO implementations.
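
An illustration of that constraint (my example, not from the LLVM docs): the first tail call below has the same argument shape as its caller and trivially lowers to a jmp; the second would need a seventh argument written to stack space the caller doesn't own, which is the frame fixup compilers tend not to attempt.

    __attribute__((noinline)) static long g(long a, long b) { return a + b; }

    __attribute__((noinline)) static long h(long a, long b, long c, long d,
                                            long e, long f, long x7) {
      return a + b + c + d + e + f + x7;
    }

    /* Same shape as the caller: typically lowers to "jmp g". */
    long easy(long a, long b) { return g(a, b); }

    /* Seven arguments: the 7th must go on the stack, so this stays a call. */
    long hard(long a, long b) { return h(a, b, 1, 2, 3, 4, 5); }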

I’ve never thought about tail calls in depth before, so a quick question about them: the point I'm getting here is to minimise the pushing/popping of registers to the stack from `call` by using a `jmp` instead. That’s all good... but what stack frame does the jumped-to procedure use? I’m assuming a new stack frame is still created, just without the caller’s pushed registers?

  • There's always some stack frame sitting there; typically the one that called the main loop routine in the first place. At some point a RET instruction will be executed, and control will return to that stack frame's return point. The idea of tail calls is to avoid creating a new stack frame when it doesn't add any value, but stack frames still exist--they have to exist when the last thing that happens in a routine is ordinary instruction processing rather than a CALL.

This article suggests that a big reason to use this approach is to separate hot and cold code.

I assume that's for good use of the CPU's instruction and microcode caches.

Yet in a protobuf parser, I'm surprised there is enough code to fill said caches, even if you put the hot and cold code together. Protobuf just isn't that complicated!

Am I wrong?

  • > I assume that's for good use of the CPU's instruction and microcode caches.

    I don't think that is the reason. These are microbenchmark results, where realistically all the code will be hot in caches anyway.

    The problem is that a compiler optimizes an entire function as a whole. If you have slow paths in the same function as fast paths, it can cause the fast paths to get worse code, even if the slow paths are never executed!

    You might hope that using __builtin_expect(), aka LIKELY()/UNLIKELY() macros on the if statements would help. They do help somewhat, but not as much as just putting the slow paths in separate functions entirely.
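
    A sketch of the difference (hypothetical names): the first pattern keeps the cold code inside the hot function, where it still competes for registers; the second exiles it behind a call, which is what the fallback functions described above do.

        #define UNLIKELY(x) __builtin_expect(!!(x), 0)

        /* Cold path in its own function: its register pressure and spills
           no longer affect code generation for the hot path. */
        __attribute__((noinline))
        static const char *two_byte_case(const char *p, long *out) {
          *out = (p[0] & 0x7f) | ((long)(p[1] & 0x7f) << 7);
          return p + 2;
        }

        static const char *parse_small(const char *p, long *out) {
          if (UNLIKELY(p[0] & 0x80)) return two_byte_case(p, out);
          *out = p[0];  /* the hot path stays tiny */
          return p + 1;
        }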

  • In particular, the author is talking about CPU registers being spilled to memory, and the need for setting up or tearing down stack frames. Those things can only be eliminated by the compiler for extremely simple functions. The error-handling often isn't.

    • If you really care about performance there’s no particular reason the whole stack has to be set up at function beginning/end. Compilers are not flexible enough about this, or other things like how they won’t store values in the frame pointer register on i386.

      I believe the optimization to do this is called “shrink-wrapping”.

This is a really great result, but I'm curious why this isn't a very common and standard compiler optimization, at least as an option you can enable? It seems like the conditions where it can be applied are pretty easy for a compiler to identify.

  • Tail calls are a very common optimization, both Clang and GCC have been performing this optimization successfully for a while. What is new is getting a guarantee that applies to all build modes, including non-optimized builds.

    • If you're interested in this optimization for performance reasons, why would you want an otherwise non-optimized build? It seems that the only important case is the optimized build... where for some reason you're not getting this optimization without explicitly asking for it.

      So the question remains... why is the compiler optimization missing this chance to optimize this tail call without it being explicitly marked for optimization?

      13 replies →

I find the parallel between parsing and interpretation fascinating, and it's something I had never thought about before. Is this general knowledge? Does anyone have insight into whether this parallel runs deeper than just implementation similarities?

  • Decompression is another task equivalent to interpreting bytecode, and this similarity is well-known.

Kotlin also has compile-time guaranteed tail calls, via the tailrec function modifier.

  • tailrec only handles "tail" recursion, i.e. tail calls to itself. This is far weaker than general tail "call" optimization required by the technique in the article.

It's funny, wasm3 was just on the front page with a tail-call based interpreter model, and now this. Now I wanna do some stuff with tail calls :)

  • Write some code in a functional language... You will have a hard time avoiding tail calls.

Just here to say that the use of those #defines for huge parameter lists makes me sad. I realize that's a common pattern, but if your call list is that big, how about a struct pointer?

  • That would defeat the entire purpose: they need to be six parameters to ensure that they are all in registers.

    However I was considering whether they could be a struct with six members passed by value. If the ABI would pass that in six registers we could get rid of the #defines, which I agree would be nicer.

Aren’t protobufs supposed to be faster than JSON?

I mean congrats but it doesn’t seem that impressive given JSON has been parsing at that speed for years.

One thing that appealed to us about protobufs is their alleged change-tolerance (versioning).

Years later, this remains only alleged... not realized.

Is this really higher performance than the existing protobuf support in Rust+serde? That uses macro programming to generate code at compile time based on a high-level description of your data format, so it can be quite fast and will certainly be a lot safer than cowboy-coding raw C.

  • Every single time there is a link about C/C++, I do a quick "Ctrl+F Rust", and every single time there is some rant about "but Rust is safer than C".

    Every. Single. Time.

    This article is about a new Clang extension:

    > An exciting feature just landed in the main branch of the Clang compiler. Using the [[clang::musttail]] or __attribute__((musttail)) statement attributes, you can now get guaranteed tail calls in C, C++, and Objective-C.

    (literally the first line of the article)

    What does Rust have to do with this? Nothing.

    Does the author suggest that you should only use C for high performance? No.

    • > Every single time there is a link about C/C++, I do a quick "Ctrl+F Rust", and every single time there is some rant about "but Rust is safer than C".

      If you had the same reflex with other languages when reading other HN comment threads, you'd realize that every time there's a thread about language A, there's a ton of comments about language B or C. Rust threads are full of people talking about C++ or C; why is that supposed to be more OK? Go threads are full of comments about Java, and Python threads are full of Go or Julia (depending on what kind of Python it's about). This isn't specific to Rust in any way.

      Yes, the GP isn't the most useful comment ever; Rust is hyped and there are overly enthusiastic people. But there are also people who seem to be overly defensive about it (doing a text search to count Rust occurrences, every time, really?), and I'm not sure the latter is less childish than the former.

      2 replies →

    • > This article is about a new Clang extension:

      > An exciting feature just landed in the main branch of the Clang compiler. Using the [[clang::musttail]] or __attribute__((musttail)) statement attributes, you can now get guaranteed tail calls in C, C++, and Objective-C.

      > What does Rust have to do with this? Nothing.

      Rust has a planned keyword for this exact feature, namely 'become'. So the author would be able to use it in Rust just as well, as soon as support for it lands in an upstream LLVM release.

      Regardless, writing raw C as the article suggests in order to parse a quasi-standard high-level format is cowboy coding. It's a nice hack to be sure, but it's not the kind of code that should be anywhere close to a real production workload. Instead, this feature should be implemented as part of some safe parser generator. Not necessarily written in Rust but something that's at least as safe.

      5 replies →