Comment by exabrial

7 months ago

Alright, I have a dumb question...

How come with a LAG group on Ethernet I can get "more total bandwidth", but any single TCP flow is limited to the max speed of one of the LAG components (gigabit, let's say)? Yet these guys are somehow combining multiple fibers into an overall faster stream. What gives? Even round-robin mode on LAG groups doesn't do that.

What are they doing differently and why can't we do that?

I do not know exactly what is being done here, but I can say that I am aware of two techniques for sending bit streams over parallel wires while keeping the bits in order:

1. The lengths of all wires are meticulously matched so that signals arrive at the same time. The hardware then simply assumes that the bits coming off each wire are in sequence by wire order. This is done in computers for high-speed interfaces such as memory or graphics. If you have ever seen squiggly traces on a PCB going to a high-speed device, they were laid out to make the lengths exactly the same so the signals arrive at the same time across each. This is how data transfers from dual-channel DDR4 RAM, where 64 bits are received simultaneously, occur without reordering bits.

2. The lengths of the wires are not matched and may differ up to some tolerance. Deskew buffers are then used to emulate matched lengths. In the case of twisted-pair Ethernet, the wire pairs are not equal in length because the twist rates vary to avoid interference between pairs with the same twist rate. As a result, the Ethernet PHY must implement a deskew buffer to compensate for the mismatched lengths and present the illusion of matched wire lengths. This is part of the Ethernet standard and likely applies to Ethernet over fiber too. The IEEE has a PDF discussing this for 800 Gb/s Ethernet:

https://www.ieee802.org/3/df/public/23_01/0130/ran_3df_03_23...
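A toy sketch of the deskew idea (my own illustration, not the actual PHY logic): each lane carries the same ordered stream but arrives with a different delay, and the receiver buffers each lane until a known alignment marker shows up, then reads across lanes in lane order.

```python
# Toy deskew-buffer sketch (illustration only, not the real Ethernet PHY).
# Each "lane" delivers symbols of one ordered stream, but with a different
# arrival delay (skew). The receiver buffers each lane until a known
# alignment marker appears, then interleaves symbols across lanes in order.

MARKER = "SYNC"

def deskew(lanes):
    """lanes: list of per-lane symbol lists, each containing MARKER."""
    aligned = []
    for lane in lanes:
        i = lane.index(MARKER)          # find the alignment marker
        aligned.append(lane[i + 1:])    # discard everything up through it
    # Interleave: one symbol per lane per cycle, in lane order.
    out = []
    for symbols in zip(*aligned):
        out.extend(symbols)
    return out

# Lane 1 arrives later than lane 0: more junk before its marker.
lane0 = ["x", MARKER, "a0", "b0", "c0"]
lane1 = ["x", "x", "x", MARKER, "a1", "b1", "c1"]
print(deskew([lane0, lane1]))  # ['a0', 'a1', 'b0', 'b1', 'c0', 'c1']
```

The real PHY uses fixed alignment-marker patterns per lane and hardware FIFOs, but the principle is the same: buffer until everyone is aligned, then treat the lanes as one wide, ordered stream.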

LAG was never intended to preserve the sequence in which data is sent, so no effort was made to enable that in the standard.

That said, you would get a better answer from an electrical engineer, especially one that builds networking components.

  • I just noticed a typo in this. I should have written 128 bits, not 64 bits. Data transfers in dual channel DDR4 are 128-bits at a time.

> What are they doing differently and why can't we do that?

You're (incorrectly) assuming they're doing Ethernet/IP in that test setup. They aren't (this is implied by the results section discussing various FEC, which is below even Ethernet framing), so it's just a petabit of raw layer 1 bandwidth.

  • It's also important to note that many optical links don't use Ethernet as their protocol either (SDH/SONET are the common ones), although this is changing more and more.

You don't really want to, but if you configure all of the LAG participants on the path to do round-robin or similar balancing rather than hashing based on addresses, a single flow can exceed an individual link's capacity. You'll also be pretty likely to get out-of-order data, and TCP receivers will exercise their reassembly buffers, which will kill performance; you'll rapidly wish you hadn't done all that configuration work. If you do need more than one link's worth of throughput, you'll almost always do better by running multiple flows, but you may still need to configure your network so it hashes in a way that gives you diverse paths between two hosts. Defaults might not give you diversity even on different flows.
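To make the hashing point concrete, here is a simplified sketch (my own illustration; real switches hash vendor-specific header fields in hardware) of why hash-based balancing pins a single flow to one member link while round-robin spreads packets across all of them:

```python
# Simplified LAG member-link selection (illustration only; real switches
# do this in hardware on vendor-specific fields). A flow's 5-tuple always
# hashes to the same member link, so one TCP flow never exceeds one link's
# speed. Round-robin uses every link, at the cost of packet ordering.
import zlib
from itertools import count

LINKS = 4

def hash_pick(flow):
    """flow: (src_ip, dst_ip, proto, src_port, dst_port)."""
    return zlib.crc32(repr(flow).encode()) % LINKS

def rr_pick(_counter=count()):
    return next(_counter) % LINKS

flow_a = ("10.0.0.1", "10.0.0.2", "tcp", 40000, 443)

# Every packet of flow_a lands on the same link:
print({hash_pick(flow_a) for _ in range(1000)})  # a single-element set
# Round-robin touches all links, so consecutive packets diverge:
print({rr_pick() for _ in range(1000)})          # {0, 1, 2, 3}
```

Hashing trades peak single-flow throughput for in-order delivery per flow, which is exactly the trade-off described above.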

  • The out-of-order data is the key bit.

    How do these guys get the data in order when we don't?

    • Consider that a QSFP28 module uses four 25 Gbps lanes to carry a single 100 Gbps flow. So electronics that can easily do what you are asking do exist. I think it is just the economics of doing it for the various ports on a switch, the lack of a standard, etc.


    • > How do these guys get the data in order and we dont?

      LAGs stripe traffic across links at the packet level, whereas QSFP/OSFP lanes do so at the bit level.

      Different sized packets on different LAG links will take different amounts of time to transmit. So when striping bits, you effectively have a single ordered queue, whereas when striping packets across links, there are multiple independent queues.
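That difference can be shown with a toy model (my own sketch, assuming two equal-speed links and serialization time proportional to packet size): with round-robin packet striping, a small packet on one link finishes before a large packet still serializing on the other, so arrival order diverges from send order.

```python
# Toy model of packet-level striping across equal-speed links.
# Serialization time is proportional to packet size, and each link is an
# independent queue, so variable packet sizes reorder arrivals.

def arrival_order(packet_sizes, links=2):
    busy_until = [0.0] * links           # per-link clock
    arrivals = []
    for seq, size in enumerate(packet_sizes):
        link = seq % links               # round-robin striping
        busy_until[link] += size         # serialization delay on that link
        arrivals.append((busy_until[link], seq))
    return [seq for _, seq in sorted(arrivals)]

# Packet 0 is a full-size frame; the small packets on the other link
# overtake it, so arrivals come back out of send order.
print(arrival_order([1500, 64, 64, 64]))  # [1, 3, 0, 2]
```

Bit-level lane striping has no such problem because every lane advances in lockstep on the same single queue, which is the point made above.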

Because your switch is mapping a 4-tuple to a particular link and these people aren't, is my guess.

They're not combining anything: they're sending 19 copies of one signal down 19 strands (with some offsets so they interfere in awkward ways), applying some signal processing to correct the interference, which they say makes it realistic, and declaring that they've calculated the total capacity of the medium.

What you do with it at the higher layers is entirely up to you.

But Ethernet could totally do that, by essentially demuxing the parts of an individual packet, sending them in parallel across a bunch of links, and remuxing them at the other end. I'm not aware of anyone having bothered to implement it.

I assume this is just a PHY-level test and no real switches or traffic were involved.