
Comment by kibwen

2 days ago

I'm confused, I thought the revolution of the past decade or so was in moving network stacks to userspace for better performance.

Most QUIC stacks are built on top of in-kernel UDP. You get significant performance benefits if you can avoid shuttling your traffic between kernel and userspace, and the context switches that involves.

You can work that angle by moving networking into user space... setting up the NIC queues so that user space can access them directly, without needing to context switch into the kernel.

Or you can work the angle by moving networking into kernel space ... things like sendfile, which lets a TCP application instruct the kernel to send a file to the peer without needing to copy the content into userspace, back into kernel space, and finally into device memory. If you have in-kernel TLS (kTLS) with sendfile, you still skip the copy to userspace; if the NIC does the TLS, the kernel no longer has to encrypt, so the CPU never touches the payload; and if the NIC does the TLS and the disk can DMA straight into the NIC's buffers, the data doesn't even need to hit main memory. Etc
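
To make that concrete, here is a minimal sketch of the sendfile-plus-kTLS path on Linux. It assumes a connected TCP socket and TLS 1.2 AES-128-GCM key material already negotiated by a userspace handshake (e.g. via OpenSSL); the function name and error handling are mine, for illustration only.

```c
/* Sketch only: serve a file over TLS without the payload ever entering
 * userspace. Assumes Linux kTLS (CONFIG_TLS), a connected TCP socket, and
 * TLS 1.2 AES-128-GCM keys already extracted from a userspace handshake
 * and filled into `keys` by the caller. */
#include <linux/tls.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>

#ifndef SOL_TLS
#define SOL_TLS 282              /* not always exposed by libc headers */
#endif
#ifndef TCP_ULP
#define TCP_ULP 31
#endif

int send_file_ktls(int sock, int file_fd,
                   const struct tls12_crypto_info_aes_gcm_128 *keys)
{
    /* Attach the TLS upper-layer protocol and install the TX keys;
     * from here on the kernel builds and encrypts the TLS records. */
    if (setsockopt(sock, IPPROTO_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
        return -1;
    if (setsockopt(sock, SOL_TLS, TLS_TX, keys, sizeof(*keys)) < 0)
        return -1;

    struct stat st;
    if (fstat(file_fd, &st) < 0)
        return -1;

    /* sendfile(): page cache -> socket, no round trip through userspace.
     * With NIC TLS offload the encryption moves onto the card as well. */
    off_t off = 0;
    while (off < st.st_size)
        if (sendfile(sock, file_fd, &off, st.st_size - off) <= 0)
            return -1;
    return 0;
}
```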

But most QUIC stacks don't benefit from either side of that. They're reading and writing packets via syscalls, and they're doing all the packetization in user space. No chance to sendfile and skip a context switch and skip a copy. Batching I/O via io_uring or similar helps with the context switches, but probably doesn't prevent the copies.
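
To illustrate what that batching looks like from userspace (not a claim about any particular stack): sendmmsg() hands the kernel a whole batch of already-serialized datagrams in one syscall, which amortizes the user/kernel transition but still copies every payload into kernel socket buffers. The helper name is made up.

```c
/* Sketch: push a batch of already-built QUIC/UDP packets through one syscall.
 * This amortizes the user/kernel transition, but the kernel still copies each
 * payload into its own buffers, and all packetization happened in userspace. */
#define _GNU_SOURCE                 /* sendmmsg() is a Linux extension */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32

/* `pkts` holds up to BATCH serialized datagrams; fd is a connected UDP socket. */
int send_batch(int fd, struct iovec *pkts, unsigned n)
{
    struct mmsghdr msgs[BATCH];
    if (n > BATCH)
        n = BATCH;
    memset(msgs, 0, sizeof(msgs));
    for (unsigned i = 0; i < n; i++) {
        msgs[i].msg_hdr.msg_iov = &pkts[i];    /* one iovec per datagram */
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    /* One kernel crossing for up to 32 datagrams instead of 32 sendto() calls. */
    return sendmmsg(fd, msgs, n, 0);
}
```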

  • Yeah, there are also a lot of offloads the kernel can already do for UDP (e.g. UDP segmentation offload, generic receive offload, checksum offload; the segmentation one is sketched below), and offloading QUIC entirely would be a natural extension of that.

    It just offers people choice for the right solution at the right moment.
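
For reference, this is roughly how the segmentation offload mentioned above (UDP GSO, Linux 4.18+) looks from userspace. A sketch only, not tied to any particular QUIC implementation; the function name is illustrative.

```c
/* Sketch: UDP generic segmentation offload. Userspace hands the kernel one
 * large buffer plus a segment size; the kernel (or the NIC, if it supports
 * hardware USO) splits it into individual datagrams, so the per-packet
 * syscall and per-packet stack traversal costs largely disappear. */
#include <netinet/udp.h>
#include <sys/socket.h>

#ifndef UDP_SEGMENT
#define UDP_SEGMENT 103            /* older libc headers may lack this */
#endif

int enable_udp_gso(int udp_fd, int segment_size)   /* e.g. 1200 for QUIC */
{
    /* Every following send() of a large buffer is chopped into
     * segment_size-byte UDP datagrams below the socket layer. */
    return setsockopt(udp_fd, SOL_UDP, UDP_SEGMENT,
                      &segment_size, sizeof(segment_size));
}
```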

You are right but it's confusing because there are two different approaches. I guess you could say both approaches improve performance by eliminating context switches and system calls.

1. Kernel bypass, combined with DMA and techniques like dedicating a CPU to packet processing, improves performance.

2. What I think of as "removing userspace from the data plane" improves performance for things like sendfile and kTLS.

To your point, QUIC in the kernel seems to have neither advantage.

  • So... RDMA?

    • No, the first technique is just the basic way NICs already operate, DMA, except that userspace is given direct access to the zero-copy buffers. The OS still sets that mapping up.

      RDMA goes directly bus-to-bus, bypassing all the software.

The constant mode switching for hardware access is slow. TCP/IP remains in the kernel on both Windows and Linux.

You still need to get your bytes into a NIC buffer. Either you do something like DMA, where you get a privileged region you can write into and the NIC reads from, or you cross the syscall barrier and have the kernel write the bytes into the NIC's buffer for you. Crossing the syscall barrier adds a huge performance penalty due to the switch in address space and privilege rings, so userspace networking only makes sense if you don't have to deal with the privilege changes, i.e. if you have DMA.

  • That is only a problem if you do one or more syscalls per packet, which is an utterly bone-headed design.

    The copy itself is going at 200-400 Gbps, so writing out a standard 1,500-byte (12,000-bit) packet takes 30-60 ns (in steady state, with caches being prefetched). Of course you get slaughtered if you stupidly do a syscall (~100 ns hardware overhead) per packet, since that is something like 300% overhead. You just batch, say, 32 packets so the write time is ~1,000-2,000 ns, and your overhead drops from ~300% to ~10%.

    At 1 Gbps of throughput, that is ~80,000 packets per second, or one packet per ~12.5 µs. So waiting to fill a 32-packet batch only adds an additional ~400 µs to your end-to-end latency in return for ~4x efficiency (assuming that was your bottleneck, which it is not for these implementations, as they are nowhere near the actual limits). If you go up to 10 Gbps, that is only ~40 µs of added latency, and at 100 Gbps you are only looking at ~4 µs of added latency for a literal 4x efficiency improvement.

The way that's done is that userspace gets the network data directly, without (I believe) involving syscalls. It's not something you'd do for end-user software; only the likes of MOFAANG need it.

In theory the likes of io_uring would bring these benefits across the board, but we haven't seen that delivered (yet, I remain optimistic).
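
For what it's worth, here is a minimal liburing sketch of what that could look like: sends are queued into a shared submission ring and handed to the kernel in at most one syscall per batch (or none at all with SQPOLL). Purely illustrative, not how any shipping QUIC stack does it; the helper name is made up and a connected UDP socket is assumed.

```c
/* Sketch: batch UDP sends through io_uring's shared submission ring.
 * Assumes a connected UDP socket and an already-initialized ring, e.g.
 *   struct io_uring ring;
 *   io_uring_queue_init(256, &ring, 0);   // or with IORING_SETUP_SQPOLL
 * Completions are reaped later via io_uring_peek_cqe()/io_uring_cqe_seen(). */
#include <liburing.h>
#include <sys/uio.h>

int queue_and_submit(struct io_uring *ring, int udp_fd,
                     struct iovec *pkts, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        if (!sqe)
            break;                              /* ring full: send what fits */
        io_uring_prep_send(sqe, udp_fd, pkts[i].iov_base, pkts[i].iov_len, 0);
    }
    /* At most one io_uring_enter() for the whole batch; with SQPOLL the
     * kernel polls the ring itself and this usually submits without a syscall. */
    return io_uring_submit(ring);
}
```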

Performance comes from dedicating core(s) to polling, not from userspace.

Networking is much faster in the kernel. Even faster on an ASIC.

Network stacks were moved to userspace because Google wanted to replace TCP itself (and upgrade TLS), but they only cared about the browser, so they just put the stack in the browser, and problem solved.