Comment by lmeyerov
5 years ago
We jumped from protobuf -> Arrow in the very beginning of Arrow (e.g., we wrote against the main language impls), and haven't looked back :)
If you're figuring out serialization from scratch nowadays, for most apps I'd def start by evaluating Arrow: a lot of the benefits of protobuf, and then some.
I've played with Arrow a bunch of times and have yet to figure out what it's intended for precisely. (Not joking)
Interchange format for tabular data.
Think pandas in python, but language agnostic.
Yep; fast+safe+well-tooled common IO format for tabular data, especially columnar, for going across runtimes/languages/processes/servers with zero copy. It supports a variety of common types, including nested (ex: lists, records, ...), modes like streaming, and schemas that can be inferred+checked, so it covers a lot of classic protobuf messaging use cases.
When messages look more like data, even better, as columnar representations are often better for compression & compute. So it's pretty common now for data frameworks, and increasingly, DB connectors. Ex: Our client https://github.com/graphistry/pygraphistry takes a bunch of formats, but we convert to Arrow before sending to our server, we zero-copy it between server-side GPU processes, and stream it to WebGL buffers in the client, all without touching it (beyond our own data cleaning).
The main reason I'd stick to protobuf is legacy, especially awkward data schemas / highly sparse data, and I'm guessing, Google probably has internal protobuf tooling that's faster+easier+better tooled than the OSS stuff.