
Comment by CMCDragonkai

4 years ago

Can this be done with Dask?

I'm not sure. I thought it was nothing short of a miracle that it could be done at all. I tried, hard, to make it work in TensorFlow. But there was also no way to communicate directly from TPU to TPU; I had to serialize to GCE buckets as a middle step, which added massive complexity to the code.

The Ray solution here is so simple. It was a joy to use. But I don't know anything about Dask.
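
For anyone wondering what "simple" means concretely, here's a minimal sketch of the general Ray actor pattern I'm talking about (not the actual swarm training code; the Worker class, array size, and learning rate are made up for illustration):

    import numpy as np
    import ray

    ray.init()

    @ray.remote
    class Worker:
        def __init__(self):
            self.params = np.zeros(1024, dtype=np.float32)

        def compute_grads(self):
            # Stand-in for a real forward/backward pass.
            return np.random.randn(1024).astype(np.float32)

        def apply_grads(self, grads):
            self.params -= 0.01 * grads

    # Each actor lives in its own process (or on its own machine); Ray's
    # object store handles shipping the arrays between them.
    workers = [Worker.remote() for _ in range(4)]
    grads = ray.get([w.compute_grads.remote() for w in workers])
    avg = np.mean(grads, axis=0)
    ray.get([w.apply_grads.remote(avg) for w in workers])

That array-shipping part is exactly what took all the GCE-bucket plumbing to fake in my TensorFlow attempt.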

By the way, if anyone wants to see how the loss graphs turned out: https://twitter.com/theshawwn/status/1406171487988498433

(I wish I'd uploaded the logs to tensorboard.dev for posterity. But you can see quite clearly in the screenshots all the information you'd want to see anyway, with apologies to blind engineers. Oh lord, is there a single blind engineer working in ML? Suddenly it's an appealing idea to try to make tensorboard accessible... I wonder how blind people could interpret graphs. Averages, probably.)

I don't know about TPUs, but in GPU land, yeah, you can do fast GPU<>GPU transfers without much code, incl. for dask kernels. More typically, though, the code is automatically optimized well enough that manual tuning here isn't needed, and at least for us, we end up spending our optimization time elsewhere.

I don't remember what's typical for direct GPU<>GPU, but in the cases we see, the times we have done it have been through a special pinned-memory mode with a staging area. That used to be hard, but nowadays the rapids.ai ecosystem (cupy / rmm / etc.) gives you nice Python wrappers.
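
Roughly the shape of that pinned-memory staging path, as a hedged sketch (the array size is arbitrary, and whether you need the staging hop at all depends on your topology):

    import numpy as np
    import cupy as cp

    n = 1 << 20

    # Page-locked (pinned) host buffer to stage through.
    pinned = cp.cuda.alloc_pinned_memory(n * np.dtype(np.float32).itemsize)
    staging = np.frombuffer(pinned, dtype=np.float32, count=n)

    with cp.cuda.Device(0):
        src = cp.arange(n, dtype=cp.float32)
        src.get(out=staging)       # device 0 -> pinned host memory

    with cp.cuda.Device(1):
        dst = cp.asarray(staging)  # pinned host memory -> device 1

If the box supports peer access, you can skip the host hop and copy device-to-device directly; the pinned route above is just the more portable one.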

Dask is part of that ecosystem ("dask-cudf"), but helps more w/ automation around bigger-than-memory paging, multi-GPU dispatch, and multi-node dispatch. Underneath, it does some nice things for you, like setting CPU<>GPU affinities. However, doing custom peer-to-peer / NUMA stuff quickly gets you back to cupy/rmm, and thankfully, that's Python and integrates nicely with pandas/arrow/etc. :)
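
For flavor, a hedged sketch of the dask-cudf side (the file path and column names are hypothetical):

    from dask_cuda import LocalCUDACluster
    from dask.distributed import Client
    import dask_cudf

    # One worker per visible GPU; dask-cuda also handles the CPU<>GPU
    # affinities mentioned above.
    cluster = LocalCUDACluster()
    client = Client(cluster)

    ddf = dask_cudf.read_parquet("data/*.parquet")      # hypothetical path
    out = ddf.groupby("key")["value"].mean().compute()  # hypothetical columns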

EDIT: There's a wild world of fancy NVLink GPU<>GPU interconnects in more expensive multi-GPU boxes. We mostly deal with end-to-end I/O issues, like network/SSD array -> PCI cards -> GPUs being the bottleneck, such as during bigger-than-memory and oneshot/cold use, so I can't speak to the p2p bits as much.

I've not trained a model using Dask, but I've used it for distributed-computing exploration over a local network for some data science workloads. I found Dask much more stable than Ray and Modin for multi-architecture distributed computing, i.e. nodes with different CPU architectures (ARMv8, x86_64).

My goal was to explore how far distributed computing could go on local low-power compute nodes, where mixed architectures are common; it's not meant to be compared with professional work like the GP has detailed on homogeneous architectures.
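
If anyone wants to reproduce that kind of setup, a minimal sketch (the scheduler address is hypothetical, and the worker nodes can be any mix of ARMv8 and x86_64 as long as the Python/Dask versions match across the cluster):

    # On one node:            $ dask-scheduler
    # On each worker node:    $ dask-worker tcp://192.168.1.10:8786
    from dask.distributed import Client
    import dask.array as da

    client = Client("tcp://192.168.1.10:8786")   # hypothetical scheduler address

    x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
    print(x.mean().compute())   # work fans out across the mixed-arch workers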

But in case you'd like to indulge in similar masochistic activities, I have a couple of gists, like installing Ray on ARM [1] and Apache Arrow on ARM [2].

[1] https://gist.github.com/heavyinfo/aa0bf2feb02aedb3b38eef203b...

[2] https://gist.github.com/heavyinfo/04e1326bb9bed9cecb19c2d603...