Comment by haberman
5 years ago
Zero-parse wire formats definitely have benefits, but they also have downsides: significantly larger payloads, more constrained APIs, and typically tighter limits on how the schema can evolve. Their wire size is also proportional to the size of the schema (declared fields) rather than to the size of the data (present fields), which makes them unsuitable for some of the cases where protobuf is used.
With the techniques described in this article, protobuf parsing speed is reasonably competitive, though if your yardstick is zero-parse, it will never match up.
Situations where wire/disk bandwidth is constrained are usually better served by compressing the entire stream rather than trying to integrate some run-length encoding into the message format itself.
You only pay for decompression once, to load the message into RAM, rather than being forced either to make a copy or to pay for decoding throughout the program whenever fields are accessed. And if the link is bandwidth-constrained, the added latency of decompression is probably negligible.
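A minimal sketch of that compress-once pattern, using Python's stdlib zlib as a stand-in for zstd (the payload bytes here are made up for illustration — any serialized message works the same way):

```python
import zlib

# Hypothetical serialized message (stand-in for any encoded payload).
message = b"\x08\x96\x01\x12\x07example" * 100

# Sender: compress the whole encoded stream once before it hits the wire.
wire = zlib.compress(message, level=6)

# Receiver: pay for decompression exactly once to get the message into RAM;
# every later field access reads plain bytes with no per-access decoding cost.
in_ram = zlib.decompress(wire)
assert in_ram == message

# Repetitive content compresses well, so the wire copy is much smaller.
print(len(message), "->", len(wire))
```

Because compression wraps the already-encoded bytes, nothing about the message format itself has to change.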
The separation of concerns between compression format and encoding also allows specifically tuned compression algorithms to be used, for example by switching between zstd's many compression levels. Separating compression from encoding also lets you compress/decompress on another processor core for higher throughput.
Meanwhile you can also do a one-shot decompression, or skip compression of a stream entirely: for replay/analysis; when talking over a low-latency, high-bandwidth link/IPC; or when serializing to/from an already-compressed filesystem like btrfs+zstd/lzo.
It's just more flexible this way with negligible drawbacks.
Recently I've been looking at CapnProto, a fixed offset/size field encoding that allows zero-copy, zero-allocation decoding, and arena allocation during message construction.
One nice design choice it has is to make default values zero on the wire by xor'ing all integral fields with the field default value.
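A toy illustration of that xor-with-default trick (the field names and defaults below are invented, and this mimics the idea rather than CapnProto's actual wire layout): because x ^ x == 0, any field left at its schema default serializes as zero, and xor is its own inverse, so the same operation decodes.

```python
# Hypothetical schema defaults for two integer fields.
DEFAULTS = {"timeout_ms": 5000, "retries": 3}

def encode(values: dict) -> dict:
    # Fields equal to their default become 0 on the wire.
    return {k: values[k] ^ DEFAULTS[k] for k in DEFAULTS}

def decode(wire: dict) -> dict:
    # xor is self-inverse, so encoding and decoding are the same operation.
    return {k: wire[k] ^ DEFAULTS[k] for k in DEFAULTS}

msg = {"timeout_ms": 5000, "retries": 5}   # timeout left at its default
wire = encode(msg)
assert wire["timeout_ms"] == 0             # default value is zero on the wire
assert decode(wire) == msg                 # round-trips losslessly
```

Those zero bytes are exactly what the packed encoding below can then squeeze out.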
This composes well with another nice feature it has: an optional run-length-style packed encoding that compresses those zero bytes away. Overall, not quite msgpack efficiency, but still very good.
One even more awesome feature is you can unpack the packed encoding without access to the original schema.
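A much-simplified sketch of the idea (this is not CapnProto's actual packed format — just the underlying principle: runs of zero bytes collapse to a count, and because the byte stream is self-describing, unpacking needs no schema):

```python
def pack(data: bytes) -> bytes:
    """Collapse each run of zero bytes to a (0x00, run_length) pair."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == 0:
            run = 1
            while i + run < len(data) and data[i + run] == 0 and run < 255:
                run += 1
            out += bytes([0, run])   # zero marker followed by run length
            i += run
        else:
            out.append(data[i])      # non-zero bytes pass through verbatim
            i += 1
    return bytes(out)

def unpack(packed: bytes) -> bytes:
    """Invert pack() -- note that no schema is consulted."""
    out = bytearray()
    i = 0
    while i < len(packed):
        if packed[i] == 0:
            out += bytes(packed[i + 1])  # expand the run of zeros
            i += 2
        else:
            out.append(packed[i])
            i += 1
    return bytes(out)

msg = b"\x01\x00\x00\x00\x00\x02\x03\x00\x04"
assert unpack(pack(msg)) == msg      # lossless round trip, schema-free
assert len(pack(msg)) < len(msg)     # zero runs shrink the payload
```

Combined with the xor-with-default encoding, messages full of default-valued fields shrink dramatically on the wire.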
Overall I think it's a well designed and balanced feature set.
I’ve been using CapnProto, and while I like it, it certainly has a small community, and support can suffer because of that. I haven’t tried it, but I’ve heard good things about flatbuffers, and would definitely give it a second look if I were making the decision again.
A few years back, I actually compared the three serializations; ironically, for the data I used, raw structs came out on top in every benchmark: https://cloudef.pw/protobug.png