Comment by lmeyerov

4 years ago

I don't know about TPUs, but in GPU land, yeah, you can do fast GPU<>GPU transfers without much code, incl. for dask kernels. More typically, the code is already automatically optimized well enough that manual tuning here isn't worth it, and at least for us, we end up spending our optimization time elsewhere.

I don't remember what's typical for direct GPU<>GPU, but for many of the cases we see, the occasions we've done it have been through a special pinned-memory mode via a staging area. That used to be hard, but nowadays the rapids.ai ecosystem (cupy / rmm / etc.) gives you nice Python wrappers.
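For a feel of what that looks like, here's a rough sketch of the pinned-memory staging path in CuPy (array size and device ids are made up, and it assumes a box with two visible GPUs):

```python
import numpy as np
import cupy as cp

# Stage the host data in pinned (page-locked) memory: copies from a
# pinned buffer can be async and are faster than from pageable memory.
src = np.random.rand(1 << 20).astype(np.float32)  # size is arbitrary
pinned = cp.cuda.alloc_pinned_memory(src.nbytes)
staged = np.frombuffer(pinned, src.dtype, src.size).reshape(src.shape)
staged[...] = src

with cp.cuda.Device(0):
    a = cp.asarray(staged)   # pinned host -> GPU 0

with cp.cuda.Device(1):
    b = cp.asarray(a)        # GPU 0 -> GPU 1 (peer copy when available,
                             # otherwise CuPy routes it through the host)

cp.cuda.Device(1).synchronize()
```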

Dask is part of that ecosystem ("dask-cudf"), but helps more w/ automation around bigger-than-memory paging, multi-GPU dispatch, and multi-node dispatch. Underneath, it does some nice things for you, like setting CPU<>GPU affinities. However, doing custom peer-to-peer / NUMA stuff quickly gets you back to cupy/rmm, and thankfully, that's Python and integrates nicely with pandas/arrow/etc. :)
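A hypothetical dask-cudf setup (file path and column names are invented), where the cluster handles the per-GPU workers and affinities for you:

```python
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
import dask_cudf

# One worker per visible GPU; dask-cuda also sets sensible CPU<>GPU
# affinities per worker.
cluster = LocalCUDACluster()
client = Client(cluster)

# Partitions live across the GPUs and can spill toward host memory
# when the data is bigger than GPU memory.
df = dask_cudf.read_csv("transactions-*.csv")  # hypothetical files
out = df.groupby("account")["amount"].mean().compute()
```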

EDIT: There's a wild world of fancy NVLink GPU<>GPU interconnects in more expensive multi-GPU boxes. We mostly deal with more end-to-end I/O bottlenecks like network/SSD array -> PCIe cards -> GPUs, such as during bigger-than-memory and oneshot/cold use, so I can't speak to the p2p bits as much.