Comment by ma2rten

5 years ago

Protobufs are very important for Google. A significant percentage of all compute cycles is used on parsing protobufs. I am surprised that the parsing is not done using handwritten assembly if it's possible to improve performance so much.

Protobuf's abysmal performance, questionable integration into the C++ type system, append-only expandability, and annoying naming conventions and default values are why I usually try and steer away from it.

As a lingua franca between interpreted languages it's about par for the course, but you'd think the fast languages would be the fast path (i.e. zero parsing/marshalling overhead in Rust/C/C++, no allocations), as you're usually not writing in these languages for fun but because you need the thing to be fast.

It's also the kind of choice that comes back to bite you years into a project if you started with something like Python and then need to rewrite a component in a systems language to make it faster. Now you not only have to rewrite your component but change the serialization format too.

Unfortunately Protobuf gets a ton of mindshare because nobody ever got fired for using a Google library. IMO it's just not that good and you're inheriting a good chunk of Google's technical debt when adopting it.

  • Zero-parse wire formats definitely have benefits, but they also have downsides such as significantly larger payloads, more constrained APIs, and typically more constraints on how the schema can evolve. They also have a wire size proportional to the size of the schema (declared fields) rather than to the size of the data (present fields), which makes them unsuitable for some of the cases where protobuf is used.

    With the techniques described in this article, protobuf parsing speed is reasonably competitive, though if your yardstick is zero-parse, it will never match up.
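
    To make the "proportional to present fields" point concrete, here's a hand-rolled sketch (mine, not the article's code, and the function name is made up) of how one set field goes on the wire as a tag/varint pair; fields that aren't set simply emit nothing:

      #include <stddef.h>
      #include <stdint.h>

      /* Illustrative only: encode one varint field the way protobuf does.
       * Assumes the field number is <= 15 so the tag fits in one byte, and
       * wire type 0 (varint). Absent fields are never written at all, which
       * is why payload size tracks present data, not the declared schema. */
      static size_t encode_varint_field(uint8_t *out, uint32_t field_number,
                                        uint64_t value) {
          size_t n = 0;
          out[n++] = (uint8_t)(field_number << 3); /* tag = (field << 3) | wiretype 0 */
          while (value >= 0x80) {
              out[n++] = (uint8_t)(value | 0x80);  /* low 7 bits + continuation bit */
              value >>= 7;
          }
          out[n++] = (uint8_t)value;
          return n; /* e.g. field 1 = 150 encodes as 0x08 0x96 0x01, three bytes */
      }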

    • Situations where wire/disk bandwidth are constrained are usually better served by compressing the entire stream rather than trying to integrate some run encoding into the message format itself.

      You only need to pay for decompression once to load the message into RAM, rather than being forced to either make a copy or pay for decoding all throughout the program whenever fields are accessed. And if the link is bandwidth constrained, then the added latency of decompression is probably negligible.

      The separation of concerns between compression format and encoding also allows specifically tuned compression algorithms to be used, for example by switching among zstd's many compression levels. Separating the compression from the encoding also lets you compress/decompress on another processor core for higher throughput.

      Meanwhile you can also do a one-shot decompression or skip compression entirely: for replay/analysis, when talking over a low-latency, high-bandwidth link/IPC, or when serializing to/from an already compressed filesystem like btrfs+zstd/lzo.

      It's just more flexible this way with negligible drawbacks.
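
      A minimal sketch of that whole-stream approach with zstd's one-shot API (buffer sizes and the compression level here are illustrative; link with -lzstd):

        #include <stdio.h>
        #include <zstd.h>

        int main(void) {
            /* Stand-in for an already-serialized protobuf message. */
            const char payload[] = "serialized-protobuf-bytes...";
            const size_t payload_len = sizeof(payload);

            /* Compress the whole stream in one shot; pick the level (1..22)
             * to trade CPU for bandwidth, independently of the encoding. */
            char compressed[256];
            size_t clen = ZSTD_compress(compressed, sizeof(compressed),
                                        payload, payload_len, 3);
            if (ZSTD_isError(clen)) return 1;

            /* Decompress once on receipt; after that, field access costs
             * nothing extra. */
            char roundtrip[256];
            size_t dlen = ZSTD_decompress(roundtrip, sizeof(roundtrip),
                                          compressed, clen);
            if (ZSTD_isError(dlen)) return 1;

            printf("%zu bytes -> %zu compressed -> %zu back\n",
                   payload_len, clen, dlen);
            return 0;
        }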

  • We jumped from protobuf -> arrow in the very beginning of arrow (we were writing against the main language implementations), and haven't looked back :)

    If you're figuring out serialization from scratch nowadays, for most apps I'd definitely start by evaluating Arrow. A lot of the benefits of protobuf, and then some.

  • Protobuf itself as a format isn't that bad; it's the default implementations that are bad: slow compile times, code bloat, and clunky APIs/conventions. Nanopb is a much better implementation and lets you control code generation better too. Protobuf makes sense for large data, but for small data, fixed-length serialization with compression applied on top would probably be better.
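
    A minimal nanopb encode looks roughly like this (the message and .proto names are hypothetical, generated by nanopb's code generator):

      #include <pb_encode.h>
      #include "simple.pb.h" /* hypothetical, generated from simple.proto by nanopb */

      /* Encode SimpleMessage{value = 42} into a caller-provided buffer: no
       * heap allocation, a plain C struct, and small generated code. */
      bool encode_example(uint8_t *buffer, size_t buffer_size, size_t *written) {
          SimpleMessage msg = SimpleMessage_init_zero;
          msg.value = 42;

          pb_ostream_t stream = pb_ostream_from_buffer(buffer, buffer_size);
          if (!pb_encode(&stream, SimpleMessage_fields, &msg)) return false;

          *written = stream.bytes_written;
          return true;
      }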

  • FWIW, the Python protobuf library defaults to using the C++ implementation with bindings. So even if this is a blog post about implementing protobuf in C, it can also help implementations in other languages.

    But yes, once you want really high performance, protobuf will disappoint you when you benchmark and find it responsible for all the CPU use. What are the options to reduce parsing overhead? FlatBuffers? XDR?

Handwritten ASM for perf is almost never worth it in modern times. C compiled with GCC/Clang will almost always be just as fast or faster. You might use some inline ASM to use a specific instruction if the compiler doesn't support generating it yet (like AVX512 or AES), but even for that there's probably an intrinsic available. You can still inspect the output to make sure it's not doing anything stupid.

Plus it's C so it's infinitely more maintainable and way more portable.
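
For the AES case mentioned above there's indeed an intrinsic, so no handwritten asm is needed. A minimal sketch (illustrative only; compile with -maes):

  #include <wmmintrin.h> /* AES-NI intrinsics */

  /* One AES encryption round via the intrinsic. GCC/Clang emit a single
   * aesenc instruction here, and the surrounding code stays portable C
   * that the compiler can still schedule and inline. */
  static inline __m128i aes_round(__m128i state, __m128i round_key) {
      return _mm_aesenc_si128(state, round_key);
  }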

  • The x86 intrinsics are so hard to read, thanks to terrible Wintel Hungarian naming conventions, that I think it's actually clearer to write your SIMD in assembly. It's usually easy enough to follow asm if there aren't complicated memory accesses anyway. The major issue is not having good enough debug info.

    • I honestly don't think I've seen native Windows code in over 20 years at this point. Obviously there's a ton of C++ out there; it's just basically as far away from me as possible.

  • But this seems to be an edge case where you have to rely on functional programming and experimental compiler flags to get the machine code that you want.

    Portability is typically not a big issue, because you can have a fallback C++ implementation.
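
    If "functional programming and experimental compiler flags" refers to the guaranteed-tail-call dispatch the article leans on, the shape is roughly this (my sketch, not the article's actual code; needs a clang recent enough to honor the musttail attribute):

      typedef struct { int dummy; /* parsed fields would live here */ } msg_t;

      static const char *parse_done(const char *ptr, const char *end, msg_t *msg) {
          (void)end; (void)msg;
          return ptr;
      }

      /* Every handler ends in a guaranteed tail call, so the parser's hot
       * state stays in registers instead of bouncing through stack frames. */
      static const char *parse_field(const char *ptr, const char *end, msg_t *msg) {
          if (ptr >= end) {
              __attribute__((musttail)) return parse_done(ptr, end, msg);
          }
          /* ...decode one tag/value pair here; placeholder consumes a byte... */
          ptr++;
          __attribute__((musttail)) return parse_field(ptr, end, msg);
      }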

Yet Microsoft was able to make a .NET implementation faster than Google's current C++ one.

Proof that they don't care enough about protobuf parsing performance.

https://devblogs.microsoft.com/aspnet/grpc-performance-impro...

In what kind of scenarios do they use Protobufs? I can think of messaging systems, streaming, RPC, that sort of thing?

Given that fact, I'm wondering if Google ever researched custom chips or instruction sets for marshalling pbs, like the TPUs they worked on for ML.

  • The problem is that once you parse the protobuf, you have to immediately do other computations on it in the same process. No one needs to parse protobufs all day long the way you'd run an ML model or compute hashes for crypto.

    • That doesn't seem to preclude hardware assistance. For example they have also explored hardware acceleration for the tcmalloc fast path allocation by adding circuits to general purpose CPUs. Arguably, Intel BMI descends from a request that Google made years ago to speed up protobuf decoding (and other workloads) on x86.
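
      To make the BMI angle concrete, here's a sketch (not Google's code) of decoding a protobuf varint with BMI2's PEXT, which gathers the 7 payload bits of every byte in one instruction (compile with -mbmi2):

        #include <immintrin.h> /* _pext_u64 */
        #include <stdint.h>
        #include <string.h>

        /* Decode a base-128 varint of up to 8 bytes. Assumes at least 8
         * readable bytes at p (real parsers guard the buffer end) and a
         * little-endian host. Returns bytes consumed, or -1 for varints
         * longer than 8 bytes. Illustrative only. */
        static inline int decode_varint64(const char *p, uint64_t *out) {
            uint64_t word;
            memcpy(&word, p, 8);

            /* A byte with its top (continuation) bit clear ends the varint. */
            uint64_t stops = ~word & 0x8080808080808080ull;
            if (stops == 0) return -1;

            int len = (__builtin_ctzll(stops) >> 3) + 1;
            uint64_t mask = (len == 8) ? ~0ull : ((1ull << (8 * len)) - 1);

            /* PEXT packs the low 7 bits of each byte into one value. */
            *out = _pext_u64(word & mask, 0x7f7f7f7f7f7f7f7full);
            return len;
        }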

I'm going to guess most of the time is being spent elsewhere in the systems they are looking at, and it's rather rare that they have a situation where the parser is dominating. Protobuf is already a winner compared to the JSON mess we're in.

So important that they haven't bothered to create a protobuf generator for Kotlin, the primary development language for their own mobile operating system.