Comment by aliceryhl
1 day ago
> IIRC Alice from the tokio team also suggested there hasn't been much interest in pushing through these difficulties more recently, as the current performance is "good enough".
Well, I think there is interest, but mostly for file IO.
For file IO, the situation is pretty simple. We already have to implement it using spawn_blocking, and spawn_blocking has exactly the same buffer-ownership challenges as io_uring does, so translating file IO to io_uring is not that tricky.
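As a rough sketch of that pattern (a hypothetical helper, not tokio's actual code, but roughly the shape its fs wrappers are built on): the closure takes ownership of the path, the blocking read runs on a worker thread, and the owned buffer comes back on completion. io_uring needs the same kind of ownership handoff, which is why file IO maps over fairly cleanly.

```rust
use std::path::PathBuf;

// Hypothetical helper: ownership of `path` moves into the blocking task, and
// the Vec<u8> produced there is handed back when the task completes. Nothing
// borrows from the caller's stack across the await point, which is exactly
// the property a buffer-based io_uring backend would need as well.
async fn read_file(path: PathBuf) -> std::io::Result<Vec<u8>> {
    tokio::task::spawn_blocking(move || std::fs::read(path))
        .await
        .expect("blocking read task panicked")
}
```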
On the other hand, I don't think tokio::net's existing APIs will support io_uring. Or at least they won't support the buffer-based io_uring APIs; there is no reason they can't register for readiness through io_uring.
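For reference, this is the readiness-style shape in question, using tokio's existing readable()/try_read() API. What drives the wakeup (epoll today, io_uring poll hypothetically) sits behind this interface, so the public API wouldn't have to change:

```rust
use tokio::net::TcpStream;

// Readiness-based read: wait until the socket reports readable, then do a
// non-blocking read into a caller-owned buffer. No buffer is handed to the
// kernel ahead of time, which is what makes this compatible with a
// poll-driven backend.
async fn read_some(stream: &TcpStream) -> std::io::Result<Vec<u8>> {
    let mut buf = vec![0u8; 4096];
    loop {
        stream.readable().await?;
        match stream.try_read(&mut buf) {
            Ok(n) => {
                buf.truncate(n);
                return Ok(buf);
            }
            // Readiness can be spurious; just wait again.
            Err(e) if e.kind() == std::io::ErrorKind::WouldBlock => continue,
            Err(e) => return Err(e),
        }
    }
}
```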
This covers probably 90% of the usefulness of io_uring for non-niche applications. Its original purpose was doing buffered async file IO without the pile of caveats that made earlier approaches effectively useless. The biggest speedup I’ve found with it is stat'ing large sets of files in the VFS cache. It can literally be 50x faster at that, since you can do 1000 files with a single system call and the data you need from the kernel is all in memory.
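A rough sketch of that batched-stat idea using the io-uring crate; treat the builder details as approximate (from memory, not checked against a specific crate version). The point is that all N statx requests go into the submission queue and get submitted with a single syscall:

```rust
use std::ffi::CString;
use io_uring::{opcode, types, IoUring};

// Batch statx(2): one submission queue, one submit syscall, N results.
// Method names follow my recollection of the io-uring crate and may need
// small adjustments; the batching structure is the interesting part.
fn stat_many(paths: &[CString]) -> std::io::Result<Vec<libc::statx>> {
    let mut ring = IoUring::new(paths.len() as u32)?;
    let mut bufs: Vec<libc::statx> = vec![unsafe { std::mem::zeroed() }; paths.len()];

    for (i, (path, buf)) in paths.iter().zip(bufs.iter_mut()).enumerate() {
        let sqe = opcode::Statx::new(
            types::Fd(libc::AT_FDCWD),
            path.as_ptr(),
            (buf as *mut libc::statx).cast(),
        )
        .mask(libc::STATX_BASIC_STATS)
        .build()
        .user_data(i as u64);
        unsafe { ring.submission().push(&sqe).expect("submission queue full") };
    }

    // One system call covers the whole batch.
    ring.submit_and_wait(paths.len())?;
    for cqe in ring.completion() {
        if cqe.result() < 0 {
            return Err(std::io::Error::from_raw_os_error(-cqe.result()));
        }
    }
    Ok(bufs)
}
```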
High-throughput network use cases that don’t need/want AF_XDP or DPDK can get most of the speedup with sendmmsg/recvmmsg and segmentation offload.
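A hedged sketch of the sendmmsg side of that, using the libc crate directly (hypothetical helper; a connect()ed UDP socket is assumed, so no per-message destination is filled in): N datagrams go out for the cost of one syscall.

```rust
use std::os::unix::io::AsRawFd;

// Send a batch of datagrams with a single sendmmsg(2) call.
// Partial sends (sent < packets.len()) are left to the caller to retry.
fn send_batch(sock: &std::net::UdpSocket, packets: &[Vec<u8>]) -> std::io::Result<usize> {
    // One iovec per datagram, pointing into the caller's buffers.
    let mut iovecs: Vec<libc::iovec> = packets
        .iter()
        .map(|p| libc::iovec {
            iov_base: p.as_ptr() as *mut libc::c_void,
            iov_len: p.len(),
        })
        .collect();

    // One mmsghdr per datagram, each wrapping a single iovec.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut m: libc::mmsghdr = unsafe { std::mem::zeroed() };
            m.msg_hdr.msg_iov = iov as *mut libc::iovec;
            m.msg_hdr.msg_iovlen = 1;
            m
        })
        .collect();

    let sent = unsafe {
        libc::sendmmsg(sock.as_raw_fd(), msgs.as_mut_ptr(), msgs.len() as u32, 0)
    };
    if sent < 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(sent as usize)
}
```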
For TCP streams, syscall overhead isn't really a big issue: you can easily transfer large chunks of data in each write(). If you have TCP segmentation offload available, you'll have no serious trouble pushing 100 Gbit/s. Also, if you're sending static content, don't forget sendfile().
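On the sendfile() point, a minimal sketch via the libc crate (hypothetical helper): the kernel moves file pages straight to the socket, so there's no userspace read/write copy loop.

```rust
use std::os::unix::io::AsRawFd;

// Stream a whole file to a TCP socket with sendfile(2).
// The kernel advances `offset` itself; we just loop until everything is sent.
fn send_whole_file(sock: &std::net::TcpStream, file: &std::fs::File) -> std::io::Result<()> {
    let len = file.metadata()?.len() as usize;
    let mut offset: libc::off_t = 0;
    while (offset as usize) < len {
        let n = unsafe {
            libc::sendfile(
                sock.as_raw_fd(),
                file.as_raw_fd(),
                &mut offset,
                len - offset as usize,
            )
        };
        if n < 0 {
            return Err(std::io::Error::last_os_error());
        }
        if n == 0 {
            break; // nothing more the kernel will send
        }
    }
    Ok(())
}
```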
UDP is a whole other kettle of fish; it gets very complicated to go above 10 Gbit/s or so. This is a big part of why QUIC really struggles to scale well on fat pipes [1]. sendmmsg/recvmmsg + UDP GRO/GSO will probably get you to ~30 Gbit/s, but beyond that it's a real headache. The issue is that UDP is not stream-oriented, so you end up making a ton of little writes, and the kernel networking stack as of today does a pretty bad job with these workloads.
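For context on the GSO piece, a rough sketch of turning on UDP GSO with setsockopt via the libc crate (hypothetical helper; the UDP_SEGMENT constant is copied from linux/udp.h since std doesn't expose it). After this, one large send() is split into segment-sized datagrams by the kernel/NIC, which amortizes the per-packet cost described above.

```rust
use std::os::unix::io::AsRawFd;

const UDP_SEGMENT: libc::c_int = 103; // from <linux/udp.h>

// Enable UDP GSO: subsequent sends larger than `segment_size` are segmented
// by the kernel (or offloaded to the NIC) instead of by userspace.
fn enable_udp_gso(sock: &std::net::UdpSocket, segment_size: u16) -> std::io::Result<()> {
    let val: libc::c_int = segment_size as libc::c_int;
    let rc = unsafe {
        libc::setsockopt(
            sock.as_raw_fd(),
            libc::SOL_UDP, // == IPPROTO_UDP == 17 on Linux
            UDP_SEGMENT,
            &val as *const _ as *const libc::c_void,
            std::mem::size_of_val(&val) as libc::socklen_t,
        )
    };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok(())
}
```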
FWIW, even the fastest QUIC implementations top out below 10 Gbit/s today [2].
I had a good fight writing a ~20 Gbit/s userspace UDP VPN recently, and ended up having to bypass the kernel's networking stack using AF_XDP [3].
I'm available for hire btw, if you've got an interesting networking project feel free to reach out.
1. https://arxiv.org/abs/2310.09423
2. https://microsoft.github.io/msquic/
3. https://github.com/apoxy-dev/icx/blob/main/tunnel/tunnel.go
Yeah, all agreed. The only addendum I’d add is for cases where you can’t use large buffers because you don’t have the data yet (e.g. realtime data streams or very short request/reply cycles). These end up with the same problems, but they can't be solved by TCP or UDP segmentation offloads. This is where reduced syscall overhead (or, better still, kernel bypass) really shines for networking.
I have a hard time believing that Google is serving YouTube over QUIC/HTTP3 at 10 Gbit/s, or even 30 Gbit/s.