Comment by bleepblap
12 days ago
I think you might be swapping RDMA with RoCE - RDMA can happen entirely within a single node. For example between an NVME and a GPU.
Within a single node it's just called DMA. RDMA is DMA over a network and RoCE is RDMA over Ethernet.
Sorry, but it certainly isn't:
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html
The "R" in RDMA means there are multiple DMA controllers that can "transparently" share address spaces. You can certainly share address spaces across nodes with RoCE or InfiniBand, but that's a layer on top.
I don't know why that NVIDIA document is wrong, but the established term for doing DMA from, e.g., an NVMe SSD to a GPU within a single system, without the CPU initiating the transfer, is peer-to-peer DMA. RDMA is when your data leaves the local machine's PCIe fabric.
I'm going to agree to disagree with Nvidia here.