Comment by infinite8s

5 years ago

Are you leveraging arrow-flight to stream between server-side to client side? Or some custom protocol embedding the arrow data as a binary blob?

I'm probably way out of date, but arrow-flight seems primarily innovative for scaling reads across multiple machines, such as in dask envs

For server->client, Arrow's record batch representation is useful for streaming binary blobs without having to touch them, but for the surrounding protocol, we care much more about HTTP/browser APIs and how the protocol works than about the environments and problems Flight was built for.

So we built some tuned layers for pushing Arrow record batches through GPU -> CPU -> network -> browser. The NodeJS ecosystem is surprisingly good for streaming buffers, so it's a few libs from there, plus careful protocol use to ensure zero-copy on the browser side and some tricks around compression. Flight, at least at the time, was more in the way than a help for all of that.

All Arrow sponsor money goes to Ursa Labs, so we never ended up having the resources to OSS that part, just the core Arrow JS libs. Our commercial side is starting to launch certain cloud layers that may make it more aligned to revisit & OSS those pieces, but for now it's still low on the priority list relative to what our users care about. Our protocol here is about to get upgraded again, but not fundamentally wrt the network/client parts.
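To make the shape of that path concrete, here's a minimal sketch of the pattern (not our production protocol; it assumes the apache-arrow JS package, a made-up /batches endpoint, and toy data): the server writes the Arrow IPC stream format straight into an HTTP response, and the browser reader consumes batches incrementally, viewing the received bytes in place rather than copying each batch.

```ts
// server.ts - stream Arrow record batches over plain HTTP from Node
import * as http from 'http';
import { tableFromArrays, RecordBatchStreamWriter } from 'apache-arrow';

// toy table standing in for the real GPU -> CPU output
const table = tableFromArrays({
  id: Int32Array.from({ length: 1000 }, (_, i) => i),
  value: Float64Array.from({ length: 1000 }, () => Math.random()),
});

http.createServer((_req, res) => {
  res.writeHead(200, { 'Content-Type': 'application/vnd.apache.arrow.stream' });
  // the IPC stream writer pipes record batches into the response as-is,
  // with no per-batch re-encoding on the way out
  RecordBatchStreamWriter.writeAll(table).toNodeStream().pipe(res);
}).listen(8080);
```

```ts
// client.ts (browser) - consume the stream batch by batch
import { RecordBatchReader } from 'apache-arrow';

const reader = await RecordBatchReader.from(await fetch('/batches'));
for await (const batch of reader) {
  console.log(batch.numRows); // each batch is a view over the wire bytes
}
```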

Interestingly, for the server case... despite using dask etc. increasingly on the server, we don't explicitly use arrow-flight, and I don't believe our next moves here will either. I recorded a talk for Nvidia GTC about our early experiments on the push to 90+ GB/s per server, and while parquet/arrow is good here, Flight still seems more of a distraction: https://pavilion.io/nvidia (with more resources, I think it can be aligned)

  • OK thanks for the insight. I haven't been tracking arrow-flight recently, but it does seem more geared to an RPC use case than one-way streaming.

    I've been playing with pushing arrow batches using (db -> server-side turbodbc -> network -> browser -> webworker using transferable support), leveraging a custom msgpack protocol over websockets, since most databases don't support connectivity from a browser. Not dealing with big data by any means (max a million records or so), and so far it's been fairly performant.
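    The transferable part looks roughly like this (a sketch of just that step, skipping the msgpack/websocket framing; the /batches endpoint and file names are made up):

    ```ts
    // main.ts (module) - move the raw bytes into the worker instead of cloning
    const worker = new Worker('worker.js', { type: 'module' });
    const buf = await (await fetch('/batches')).arrayBuffer();
    // listing buf in the transfer list detaches it on this thread and hands
    // ownership to the worker - a pointer move rather than a byte copy
    worker.postMessage(buf, [buf]);
    ```

    ```ts
    // worker.ts (served as worker.js) - parse Arrow off the main thread
    import { tableFromIPC } from 'apache-arrow';
    self.onmessage = (e: MessageEvent<ArrayBuffer>) => {
      const table = tableFromIPC(new Uint8Array(e.data));
      console.log(table.numRows, 'rows parsed in worker');
    };
    ```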

    • yep that's pretty close (and cool on turbodbc, we normally don't get control of that part!)

      for saturating the network, afaict, it gets more into tuning around browser support for parallel streams, native compression, etc. we've actually been backing off browser opts b/c of unreliability across users, focusing instead on what we can do on the server.
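      for flavor, the native-compression path is roughly this (a sketch, assuming a hypothetical gzip'd /batches.arrow.gz endpoint and the apache-arrow JS reader; DecompressionStream is exactly the kind of browser API whose support varies across users):

      ```ts
      // browser-side: inflate a gzip'd Arrow IPC stream before parsing
      import { RecordBatchReader } from 'apache-arrow';

      const res = await fetch('/batches.arrow.gz');
      const bytes = res.body!.pipeThrough(new DecompressionStream('gzip'));
      for await (const batch of await RecordBatchReader.from(bytes)) {
        console.log(batch.numRows); // batches arrive as the stream inflates
      }
      ```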