Comment by riobard
17 days ago
Can someone explain how UDP GSO/GRO works in detail? Since UDP packets can arrive out of order, how is a single large QUIC packet split into multiple smaller UDP packets without any sequence number in the header, and how does the receiving side know the order in which to merge the UDP packets?
Author here.
QUIC does not depend on UDP datagrams being delivered in order. Reordering happens at the QUIC layer. Thus, when receiving, the kernel passes a batch (i.e. a segmented super datagram) of potentially out-of-order datagrams to the QUIC layer, and QUIC reorders them.
Maybe https://blog.cloudflare.com/accelerating-udp-packet-transmis... brings some clarity.
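To illustrate why ordering can be left to the QUIC layer: QUIC stream data carries its own byte offset, so the receiver can put it back in order no matter which datagram arrives first. Below is a minimal sketch of that idea. The class and method names (`StreamReassembler`, `on_frame`) are hypothetical and not taken from any QUIC implementation, and the sketch ignores overlapping frames, flow control, and loss recovery.

```python
class StreamReassembler:
    """Toy reassembler: buffers out-of-order chunks keyed by stream offset.

    Hypothetical illustration only; real QUIC stacks also handle
    overlapping ranges, flow control, and retransmission.
    """

    def __init__(self) -> None:
        self._pending: dict[int, bytes] = {}  # offset -> data not yet delivered
        self._next_offset = 0                 # next byte the application expects

    def on_frame(self, offset: int, data: bytes) -> bytes:
        """Accept one chunk and return whatever is now contiguous."""
        if offset < self._next_offset:
            # Already delivered (simplification: treat as a duplicate).
            return b""
        self._pending[offset] = data
        out = bytearray()
        # Drain every chunk that lines up with the expected offset.
        while self._next_offset in self._pending:
            chunk = self._pending.pop(self._next_offset)
            out += chunk
            self._next_offset += len(chunk)
        return bytes(out)


r = StreamReassembler()
print(r.on_frame(4, b"efgh"))  # arrives early, nothing deliverable yet
print(r.on_frame(0, b"abcd"))  # fills the gap, both chunks are released
```

The transport below (UDP, with or without GSO/GRO) never needs to know the order; it only has to keep datagram boundaries intact.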
Thanks! The Cloudflare blog article explained GSO pretty well: the application must send a contiguous data buffer with a fixed segment size (except for the tail of the buffer) for GSO to split into smaller packets. But how does GRO work on the receiving side?
For example, GSO might split a 3.5KB data buffer into 4 UDP datagrams: U1, U2, U3, and U4, with U1/U2/U3 being 1KB each and U4 being 512B. When U1 through U4 arrive on the receiving host, how does GRO deal with the different permutations of the ordering of the four packets (assuming no loss) and pass them to the QUIC layer? If U1/U2/U3/U4 come in the original sending order, GRO can batch them nicely. But what if they come in the order U1/U4/U3/U2? How does GRO deal with the fact that U4 is shorter?
In that case GRO will deliver two separate batches: one of U1 and U4, and a second one of U3 and U2. `quinn-udp` in particular also uses recvmmsg and is thus able to receive up to 32 different permutations of src, dst and segment length with a single syscall (assuming the application provides enough buffers).
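The batching described above can be approximated with a small model. This is a simplified sketch, not the kernel's actual logic: it assumes all datagrams belong to the same flow, that a batch keeps growing while arriving datagrams match the size of its first datagram, that a shorter datagram joins the batch as its tail and closes it, and that a larger datagram forces a new batch. The function name `gro_coalesce` is hypothetical, and real GRO applies further conditions (batch size limits, flush timing, and so on).

```python
def gro_coalesce(arrivals: list[tuple[str, int]]) -> list[list[str]]:
    """Group (name, size) datagrams into batches in arrival order.

    Simplified model (assumption, not kernel code): one flow, the first
    datagram of a batch fixes the segment size, a shorter datagram may
    only appear as the last one in a batch.
    """
    batches: list[list[str]] = []
    current: list[tuple[str, int]] = []
    for name, size in arrivals:
        if current and size > current[0][1]:
            # Larger than the batch's segment size: cannot join, flush first.
            batches.append([n for n, _ in current])
            current = []
        current.append((name, size))
        if size < current[0][1]:
            # Shorter tail datagram closes the batch.
            batches.append([n for n, _ in current])
            current = []
    if current:
        batches.append([n for n, _ in current])
    return batches


in_order = [("U1", 1024), ("U2", 1024), ("U3", 1024), ("U4", 512)]
shuffled = [("U1", 1024), ("U4", 512), ("U3", 1024), ("U2", 1024)]
print(gro_coalesce(in_order))  # [['U1', 'U2', 'U3', 'U4']]
print(gro_coalesce(shuffled))  # [['U1', 'U4'], ['U3', 'U2']]
```

Under these assumptions the in-order case yields one batch, while U1/U4/U3/U2 yields the two batches mentioned above, because the short U4 can only be the last datagram of a batch.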
I think that, as an application, you never really see a coalesced UDP datagram when receiving packets with GRO active.
It’s more like the kernel puts multiple datagrams into a single structure and passes that around between layers, maintaining the boundaries between them in that structure (sk_buff data fragments?).
Not an expert, but I tried looking at how this works and stumbled upon [0].
[0]: https://lwn.net/Articles/768995/
You definitely see the coalesced datagram as an application. That is kind of the whole point: you pass a big buffer to the syscall and segment it in user space, which minimizes the syscall overhead per packet.
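A rough sketch of that user-space segmentation step, assuming the application has already received one coalesced buffer together with the segment size reported by the kernel. The function name `split_coalesced` is hypothetical, and the actual receive call and control-message parsing are left out.

```python
from typing import Iterator


def split_coalesced(buffer: bytes, segment_size: int) -> Iterator[memoryview]:
    """Yield the individual datagrams contained in one coalesced buffer.

    Every datagram is `segment_size` bytes long except possibly the last,
    which may be shorter. Slices are memoryviews, so no payload is copied.
    """
    if segment_size <= 0:
        raise ValueError("segment_size must be positive")
    view = memoryview(buffer)
    for start in range(0, len(view), segment_size):
        yield view[start:start + segment_size]


# A batch of one 1024-byte datagram followed by a 512-byte tail.
batch = bytes(1024 + 512)
print([len(d) for d in split_coalesced(batch, 1024)])  # [1024, 512]
```

Each slice can then be handed to the QUIC layer as if it had been received on its own, while the kernel was only entered once for the whole batch.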