← Back to context

Comment by eqvinox

6 hours ago

InfiniBand is its entire own networking standard, if you have Mellanox NICs you can switch them into IB mode and... short version, it's not Ethernet anymore. It's not even the same speeds/baud rates (e.g. there is a FDR rate at 14.0625Gbaud.) (NB: InfiniBand is indeed not RoCE, that E is Ethernet. InfiniBand had RDMA way before RoCE became a thing; probably why its APIs are being used for it.)

It sounds like you're really just doing the IB verbs (which is kinda really RDMA verbs). I don't think any kind of "bridging" (other than IP routing) is really possible (you'd need a chip that understands both TB and IB and can somehow translate RDMA requests between the two.)

Ah right, yes- I think we're talking about the same thing- this driver just chooses to pretend to be a RoCE v2 device (instead of e.g MLX Nic in IB mode), but nothing would change if it did I think. Or at least thats what the libibverbs abstraction promises.

There's no IB OR Ethernet underneath- I could have implemented this properly as it's own distinct transport kind, but seemed easier just to pretend to be something that is already known.

The 'the chip that understands both TB and IB and translate RDMA requests between the two' in this instance is your CPU, so orders-of-magnitude worse latency than an ASIC, but still better than anything on top of IP/Ethernet. I think there's also potential to do device-initiiated RDMA, where e.g GPU itself can write to some mailbox and have message appear across the abstracted transport in another GPUs mailbox. Even if the CPU is involved in shuffling pointers across mailboxes it doesn't necessarily mean it'll be a bottleneck