Comment by grw_

8 hours ago

I actually didn't know there was more to InfiniBand than verbs (at least at this abstraction level, above PHY), so the answer is probably 'not much more'. The device imitates a RoCE v2 device, and the higher-level abstractions I used on top were GPU-ish libraries like NCCL and JACCL.

Good question about 'bridging into actual InfiniBand'; I don't know the answer there either. My naive understanding is that, since this is host-initiated RDMA (it's still the host CPU invoking into DMA buffers, though they may be device-memory mapped), it should actually work fine, at least between two machines. I'm curious enough to try: I have a couple of machines with both Thunderbolt and RoCE-capable NICs, so the experiment is to see whether we can use this across diverse transports simultaneously. I think this is what it does already (since the macOS FA57 vs Linux native paths are already 'different transports'), but do say if you have a better scenario to demonstrate what 'bridging into actual InfiniBand' would look like!

2 comments

eqvinox 6 hours ago

InfiniBand is its own entire networking standard. If you have Mellanox NICs you can switch them into IB mode and... short version, it's not Ethernet anymore. It's not even the same speeds/baud rates (e.g. there is an FDR rate at 14.0625 Gbaud). (NB: InfiniBand is indeed not RoCE; that E is Ethernet. InfiniBand had RDMA way before RoCE became a thing, which is probably why its APIs are being used for it.)
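To put numbers on the baud-rate point, here is a small sketch that converts per-lane signalling rates into usable data rates. The rates and line encodings in the table are the commonly quoted figures for these link types; the table name `LINKS` and the function `data_rate_gbps` are just illustrative names, not part of any library.

```python
from fractions import Fraction

# Per-lane signalling rate (Gbaud) and line-encoding efficiency.
# Exact fractions avoid floating-point noise in the arithmetic.
LINKS = {
    "IB QDR": (Fraction(10), Fraction(8, 10)),             # 8b/10b encoding
    "IB FDR": (Fraction("14.0625"), Fraction(64, 66)),     # 64b/66b encoding
    "IB EDR": (Fraction("25.78125"), Fraction(64, 66)),    # 64b/66b encoding
    "10GBASE-R": (Fraction("10.3125"), Fraction(64, 66)),  # Ethernet, one lane
}

def data_rate_gbps(name, lanes=4):
    """Usable data rate in Gbit/s for `lanes` lanes of the named link."""
    baud, efficiency = LINKS[name]
    return float(baud * efficiency * lanes)

if __name__ == "__main__":
    for name in ("IB QDR", "IB FDR", "IB EDR"):
        print(f"{name} 4x: {data_rate_gbps(name):.2f} Gbit/s")
    print(f"10GBASE-R 1x: {data_rate_gbps('10GBASE-R', lanes=1):.2f} Gbit/s")
```

So a 4x FDR link signals at 56.25 Gbaud in aggregate and carries roughly 54.5 Gbit/s of data, which does not line up with any Ethernet rate.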

It sounds like you're really just doing the IB verbs (which are kinda really just RDMA verbs). I don't think any kind of "bridging" (other than IP routing) is really possible; you'd need a chip that understands both TB and IB and can somehow translate RDMA requests between the two.

grw_ 4 hours ago

Ah right, yes, I think we're talking about the same thing: this driver just chooses to pretend to be a RoCE v2 device (instead of e.g. an MLX NIC in IB mode), but I think nothing would change if it pretended to be the latter. Or at least that's what the libibverbs abstraction promises.
There's no IB or Ethernet underneath. I could have implemented this properly as its own distinct transport kind, but it seemed easier to just pretend to be something that is already known.
The 'chip that understands both TB and IB and can translate RDMA requests between the two' in this instance is your CPU, so the latency is orders of magnitude worse than an ASIC, but still better than anything on top of IP/Ethernet. I think there's also potential to do device-initiated RDMA, where e.g. a GPU itself can write to some mailbox and have the message appear across the abstracted transport in another GPU's mailbox. Even if the CPU is involved in shuffling pointers across mailboxes, it doesn't necessarily mean it'll be a bottleneck.
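To make the mailbox idea concrete, here is a toy model of it, not the driver's actual design. Everything in it (`Device`, `device_post_write`, `host_relay`, the descriptor layout) is hypothetical, and a plain byte copy stands in for whatever the real transport would do.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Device:
    """Hypothetical stand-in for a GPU: some memory plus two mailboxes."""
    memory: bytearray
    outbox: deque = field(default_factory=deque)  # descriptors posted by the device
    inbox: deque = field(default_factory=deque)   # completions delivered by the host

def device_post_write(dev, src_off, length, dst_off):
    """Device-initiated step: the device only enqueues a small descriptor."""
    dev.outbox.append((src_off, length, dst_off))

def host_relay(src, dst):
    """Host CPU step: drain src's outbox and land each payload in dst's memory.

    The slice copy below is a placeholder for the real transfer; the host's
    per-message work is only reading a descriptor and posting a completion.
    """
    moved = 0
    while src.outbox:
        src_off, length, dst_off = src.outbox.popleft()
        dst.memory[dst_off:dst_off + length] = src.memory[src_off:src_off + length]
        dst.inbox.append((dst_off, length))  # tell the peer where data landed
        moved += 1
    return moved

if __name__ == "__main__":
    gpu_a = Device(bytearray(b"hello---"))
    gpu_b = Device(bytearray(8))
    device_post_write(gpu_a, src_off=0, length=5, dst_off=3)
    host_relay(gpu_a, gpu_b)
    print(gpu_b.memory, list(gpu_b.inbox))
```

The point of the sketch is that the host handles one small descriptor per message regardless of payload size, which is why it need not become the bottleneck.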