Comment by scottlamb

11 hours ago

> now you've re-introduced out-of-order delivery which complicates re-assembly of large packets, retries, handling loss etc.

Still confused though. For a standard TCP/IP networking stack, that support is all there anyway, as it's not meant for point-to-point links, and out-of-order delivery is a thing that happens on the Internet. I haven't tried thunderbolt-net, but it says it implements Apple's ThunderboltIP, so I'd expect it's IP-based networking on top, and so it'd all work? Is it that out-of-order delivery is far more common than usual, and this path is so much slower (by impairing LRO/GRO) that it's not worth aggregating at all?

I'd understand if each pair is logically represented as a separate networking device, and then you have to set up link aggregation on top of that. (And iirc at least with some forms of aggregation a particular flow is bound to one link, so you'd have to have a bunch of streams to actually get bandwidth benefits.) So caveats for sure but I'd expect something to be possible. But does it just not support using both pairs at all?
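
To illustrate the flow-binding caveat, here is a minimal sketch of per-flow link selection. The `pick_link` helper and its SHA-256 hash are hypothetical stand-ins for illustration only, not what thunderbolt-net or any real bonding driver does.

```python
import hashlib

def pick_link(src_ip, dst_ip, src_port, dst_port, n_links):
    """Map a flow's 4-tuple to a link index (hypothetical hash policy)."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    # Same 4-tuple always gives the same digest, hence the same link.
    return int.from_bytes(digest[:4], "big") % n_links

# One flow: every packet carries the same 4-tuple, so it never leaves its link.
single_flow_links = {
    pick_link("10.0.0.1", "10.0.0.2", 50000, 5201, 2) for _ in range(1000)
}

# Many flows (different source ports) spread across the available links.
many_flow_links = {
    pick_link("10.0.0.1", "10.0.0.2", 50000 + i, 5201, 2) for i in range(64)
}

print("links used by one flow:", len(single_flow_links))
print("links used by 64 flows:", len(many_flow_links))
```

Under a scheme like this a single stream is capped at one link's bandwidth, but packets within a flow stay in order.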

Even using one pair, I still don't understand why you'd only get about 10G rather than 20G. I do see chapter 4 of the (your?) article talks about the single DMA ring maybe imposing the 10 Gbps limit, but I don't have any good intuition for why. I don't know, say, how large the rings are, what latencies to expect on their operations, or what packet sizes are supported, any of which might help me understand.

grw_ 9 hours ago

Yeah, thunderbolt-net is IP on top and it does work as you say, with a few caveats:

- On a single cable with two rails available, thunderbolt-net grabs one and uses that. Without patching the kernel, there's no way to make it present a second interface using the remaining pair.

- If you had a second cable between the machines (for 4 total rails), thunderbolt-net would still only grab one rail, because the abstraction across which it makes the links sees an identical peer at the end of both links and so falls into the same trap as above. There is no LRO/GRO anyway (or it's buggy, I forget) on the Linux version.

- Why you only get 10G rather than 20G on a single pair: actually, this might be specific to the Strix Halo SoC I was testing on. On a different (still AMD) chipset and an Apple TB5 Mac I did see closer to 22G in one direction, but still 8G in the other. The Strix Halo NHI seems to be 'stripped down' (as expected, for mobile) in ways I don't really understand.

- Intuition on why: I can't point you to the line number, but I think it has to do with a fixed 4 KB page size when communicating with the NHI that ends up becoming a bottleneck. Perhaps 16 KB pages on aarch64 Apple help here?
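
For a rough sense of scale on the page-size guess, the sketch below works out how many buffers per second the ring would have to turn over at a given line rate. It assumes each ring entry carries at most one page-sized buffer and ignores all framing overhead; both are simplifying assumptions, not something confirmed from the driver source, and `buffers_per_second` is a made-up helper.

```python
def buffers_per_second(line_rate_gbps, buffer_bytes):
    """Buffers the ring must process per second to sustain the line rate.

    Assumes one page-sized buffer per ring entry and no framing overhead.
    """
    bytes_per_second = line_rate_gbps * 1e9 / 8
    return bytes_per_second / buffer_bytes

for rate_gbps in (10, 20):
    for page_bytes in (4096, 16384):
        rate = buffers_per_second(rate_gbps, page_bytes)
        budget_us = 1e6 / rate  # time available per buffer, in microseconds
        print(f"{rate_gbps} Gbps, {page_bytes} B buffers: "
              f"{rate:,.0f} buffers/s, {budget_us:.2f} us each")
```

Under those assumptions, 16 KB buffers need a quarter as many ring operations as 4 KB buffers for the same throughput, so a fixed per-buffer cost would bite four times harder with the smaller page size.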