Comment by ma2rten

5 years ago

Protobufs are very important for Google. A significant percentage of all compute cycles is used on parsing protobufs. I am surprised that the parsing is not done using handwritten assembly if it's possible to improve performance so much.

Protobuf's abysmal performance, questionable integration into the C++ type system, append-only expandability, and annoying naming conventions and default values are why I usually try and steer away from it.

As a lingua franca between interpreted languages it's about par for the course, but you'd think the fast languages would be the fast path (i.e. zero parsing/marshalling overhead in Rust/C/C++, no allocations), as you're usually not writing in these languages for fun but because you need the thing to be fast.

It's also the kind of choice that comes back to bite you years into a project if you started with something like Python and then need to rewrite a component in a systems language to make it faster. Now you not only have to rewrite your component but change the serialization format too.

Unfortunately Protobuf gets a ton of mindshare because nobody ever got fired for using a Google library. IMO it's just not that good and you're inheriting a good chunk of Google's technical debt when adopting it.

  • Zero-parse wire formats definitely have benefits, but they also have downsides such as significantly larger payloads, more constrained APIs, and typically more constraints on how the schema can evolve. They also have a wire size proportional to the size of the schema (declared fields) rather than to the size of the data (present fields), which makes them unsuitable for some of the cases where protobuf is used.

    With the techniques described in this article, protobuf parsing speed is reasonably competitive, though if your yardstick is zero-parse, it will never match up.
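
    To make the "proportional to present fields" point concrete, here's a hand-rolled sketch (mine, not the article's code, and the function name is made up) of how one set field goes on the wire as a tag/varint pair; fields that aren't set simply emit nothing:

      #include <stddef.h>
      #include <stdint.h>

      /* Illustrative only: encode one varint field the way protobuf does.
       * Assumes the field number is <= 15 so the tag fits in one byte, and
       * wire type 0 (varint). Absent fields are never written at all, which
       * is why payload size tracks present data, not the declared schema. */
      static size_t encode_varint_field(uint8_t *out, uint32_t field_number,
                                        uint64_t value) {
          size_t n = 0;
          out[n++] = (uint8_t)(field_number << 3); /* tag = (field << 3) | wiretype 0 */
          while (value >= 0x80) {
              out[n++] = (uint8_t)(value | 0x80);  /* low 7 bits + continuation bit */
              value >>= 7;
          }
          out[n++] = (uint8_t)value;
          return n; /* e.g. field 1 = 150 encodes as 0x08 0x96 0x01, three bytes */
      }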

    • Situations where wire/disk bandwidth are constrained are usually better served by compressing the entire stream rather than trying to integrate some run encoding into the message format itself.

      You only need to pay for decompression once to load the message into RAM, rather than being forced to either make a copy or pay for decoding all throughout the program whenever fields are accessed. And if the link is bandwidth constrained, then the added latency of decompression is probably negligible.

      The separation of concerns between compression format and encoding also allows specifically tuned compression algorithms to be used, for example by switching among zstd's many compression levels. Separating the compression from the encoding also lets you compress/decompress on another processor core for higher throughput.

      Meanwhile you can also do a one-shot decompression or skip compression entirely: for replay/analysis, when talking over a low-latency, high-bandwidth link/IPC, or when serializing to/from an already compressed filesystem like btrfs+zstd/lzo.

      It's just more flexible this way with negligible drawbacks.
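
      A minimal sketch of that whole-stream approach with zstd's one-shot API (buffer sizes and the compression level here are illustrative; link with -lzstd):

        #include <stdio.h>
        #include <zstd.h>

        int main(void) {
            /* Stand-in for an already-serialized protobuf message. */
            const char payload[] = "serialized-protobuf-bytes...";
            const size_t payload_len = sizeof(payload);

            /* Compress the whole stream in one shot; pick the level (1..22)
             * to trade CPU for bandwidth, independently of the encoding. */
            char compressed[256];
            size_t clen = ZSTD_compress(compressed, sizeof(compressed),
                                        payload, payload_len, 3);
            if (ZSTD_isError(clen)) return 1;

            /* Decompress once on receipt; after that, field access costs
             * nothing extra. */
            char roundtrip[256];
            size_t dlen = ZSTD_decompress(roundtrip, sizeof(roundtrip),
                                          compressed, clen);
            if (ZSTD_isError(dlen)) return 1;

            printf("%zu bytes -> %zu compressed -> %zu back\n",
                   payload_len, clen, dlen);
            return 0;
        }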

  • We jumped from protobuf -> arrow in the very beginning of arrow (we were writing against the main language implementations), and haven't looked back :)

    If you're figuring out serialization from scratch nowadays, for most apps I'd definitely start by evaluating Arrow. A lot of the benefits of protobuf, and then some.

  • Protobuf itself as a format isn't that bad; it's the default implementations that are bad: slow compile times, code bloat, and clunky APIs/conventions. Nanopb is a much better implementation and lets you control code generation better too. Protobuf makes sense for large data, but for small data, fixed-length serialization with compression applied on top would probably be better.
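
    A minimal nanopb encode looks roughly like this (the message and .proto names are hypothetical, generated by nanopb's code generator):

      #include <pb_encode.h>
      #include "simple.pb.h" /* hypothetical, generated from simple.proto by nanopb */

      /* Encode SimpleMessage{value = 42} into a caller-provided buffer: no
       * heap allocation, a plain C struct, and small generated code. */
      bool encode_example(uint8_t *buffer, size_t buffer_size, size_t *written) {
          SimpleMessage msg = SimpleMessage_init_zero;
          msg.value = 42;

          pb_ostream_t stream = pb_ostream_from_buffer(buffer, buffer_size);
          if (!pb_encode(&stream, SimpleMessage_fields, &msg)) return false;

          *written = stream.bytes_written;
          return true;
      }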

  • FWIW, the Python protobuf library defaults to using the C++ implementation with bindings. So even if this is a blog post about implementing protobuf in C, it can also help implementations in other languages.

    But yes, once you want really high performance, protobuf will disappoint you when you benchmark and find it responsible for all the CPU use. What are the options to reduce parsing overhead? FlatBuffers? XDR?

Handwritten ASM for perf is almost never worth it in modern times. C compiled with GCC/Clang will almost always be just as fast or faster. You might use some inline ASM to use a specific instruction if the compiler doesn't support generating it yet (like AVX512 or AES), but even for that there's probably an intrinsic available. You can still inspect the output to make sure it's not doing anything stupid.

Plus it's C so it's infinitely more maintainable and way more portable.
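
For the AES case mentioned above there's indeed an intrinsic, so no handwritten asm is needed. A minimal sketch (illustrative only; compile with -maes):

  #include <wmmintrin.h> /* AES-NI intrinsics */

  /* One AES encryption round via the intrinsic. GCC/Clang emit a single
   * aesenc instruction here, and the surrounding code stays portable C
   * that the compiler can still schedule and inline. */
  static inline __m128i aes_round(__m128i state, __m128i round_key) {
      return _mm_aesenc_si128(state, round_key);
  }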

  • The x86 intrinsics are so hard to read, thanks to terrible Wintel Hungarian naming conventions, that I think it's actually clearer to write your SIMD in assembly. It's usually easy enough to follow asm if there aren't complicated memory accesses anyway. The major issue is not having good enough debug info.

    • I honestly don't think I've seen native Windows code in over 20 years at this point. Obviously there's a ton of C++ out there; it's just basically as far away from me as possible.

  • But this seems to be an edge case where you have to rely on functional programming and experimental compiler flags to get the machine code that you want.

    Portability is typically not a big issue, because you can have a fallback C++ implementation.
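
    If "functional programming and experimental compiler flags" refers to the guaranteed-tail-call dispatch the article leans on, the shape is roughly this (my sketch, not the article's actual code; needs a clang recent enough to honor the musttail attribute):

      typedef struct { int dummy; /* parsed fields would live here */ } msg_t;

      static const char *parse_done(const char *ptr, const char *end, msg_t *msg) {
          (void)end; (void)msg;
          return ptr;
      }

      /* Every handler ends in a guaranteed tail call, so the parser's hot
       * state stays in registers instead of bouncing through stack frames. */
      static const char *parse_field(const char *ptr, const char *end, msg_t *msg) {
          if (ptr >= end) {
              __attribute__((musttail)) return parse_done(ptr, end, msg);
          }
          /* ...decode one tag/value pair here; placeholder consumes a byte... */
          ptr++;
          __attribute__((musttail)) return parse_field(ptr, end, msg);
      }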

Yet Microsoft was able to make a .NET implementation faster than Google's current C++ one.

Proof that they don't care enough about protobuf parsing performance.

https://devblogs.microsoft.com/aspnet/grpc-performance-impro...

In what kind of scenarios do they use Protobufs? I can think of messaging systems, streaming, RPC, that sort of thing?

Given that fact, I'm wondering if Google ever researched custom chips or instruction sets for marshalling pbs, like the TPUs they worked on for ML.

  • The problem is that once you parse the protobuf, you have to immediately do other computations on it in the same process. No one needs to parse protobufs all day long the way you'd run an ML model or compute hashes for crypto.

    • That doesn't seem to preclude hardware assistance. For example they have also explored hardware acceleration for the tcmalloc fast path allocation by adding circuits to general purpose CPUs. Arguably, Intel BMI descends from a request that Google made years ago to speed up protobuf decoding (and other workloads) on x86.
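
      To make the BMI angle concrete, here's a sketch (not Google's code) of decoding a protobuf varint with BMI2's PEXT, which gathers the 7 payload bits of every byte in one instruction (compile with -mbmi2):

        #include <immintrin.h> /* _pext_u64 */
        #include <stdint.h>
        #include <string.h>

        /* Decode a base-128 varint of up to 8 bytes. Assumes at least 8
         * readable bytes at p (real parsers guard the buffer end) and a
         * little-endian host. Returns bytes consumed, or -1 for varints
         * longer than 8 bytes. Illustrative only. */
        static inline int decode_varint64(const char *p, uint64_t *out) {
            uint64_t word;
            memcpy(&word, p, 8);

            /* A byte with its top (continuation) bit clear ends the varint. */
            uint64_t stops = ~word & 0x8080808080808080ull;
            if (stops == 0) return -1;

            int len = (__builtin_ctzll(stops) >> 3) + 1;
            uint64_t mask = (len == 8) ? ~0ull : ((1ull << (8 * len)) - 1);

            /* PEXT packs the low 7 bits of each byte into one value. */
            *out = _pext_u64(word & mask, 0x7f7f7f7f7f7f7f7full);
            return len;
        }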

I'm going to guess most of the time is being spent elsewhere in the systems they are looking at, and it's rather rare that they have a situation where the parser is dominating. Protobuf is already a winner compared to the JSON mess we're in.

So important that they haven't bothered to create a protobuf generator for Kotlin, the primary development language for their own mobile operating system.