Perhaps someone who knows what they're talking about should update the Wikipedia page on io_uring [1]. Someone with a casual interest in Linux internals will probably get a poor impression of io_uring security, which appears to stem largely from Google using an old kernel in Android [2].
[1] https://en.wikipedia.org/wiki/Io_uring [2] https://github.com/axboe/liburing/discussions/1047
It still does not hook into seccomp, so it needs to be blocked by anything doing syscall filtering. It's blocked by Docker/Podman, and it may also be disabled with a hardened kconfig or SELinux.
If it ever integrates with LSMs, then it may be time to give it another look.
I suppose Landlock works with io_uring, doesn't it?
Also worth checking out libxev[1] by Mitchell Hashimoto. It's a Zig-based event loop (similar to libuv) inspired by Tigerbeetle's implementation.
[1] https://github.com/mitchellh/libxev
Also, the Zig 0.16.0 preview nightly builds include a new Io library[0]. I have not used libxev or Tigerbeetle's event loop, but I believe the standard Zig implementation is likely largely influenced by those two.
[0] https://ziglang.org/documentation/master/std/#std.Io, or https://ziglang.org/documentation/0.16.0/std/#std.Io after the release
I'm curious: how do you know it was inspired by Tigerbeetle's implementation?
They look very similar so that makes sense, just curious on the order of events.
Also, I tried using libxev for a project of mine and found it really broke the Zig way of doing things. All the callbacks needed to return disarm/rearm instead of error unions, so I had to catch every single error instead of being able to use try.
I could have reworked it further to make try work, but found the entire thing very verbose and difficult to use, with six params for all the callback functions.
Thankfully my use case was such that poll() was more than sufficient, and that is part of Zig's posix namespace, so that was what I went with.
I love NT's IO completion ports. I think kqueue is very similar, right? Honestly, I've been able to get by using Boost.Asio for cross-platform needs, but I've always wanted to see if there are better solutions. I think libuv is similar, since it is what Node is based on, but I'm not entirely sure what the underlying tech is for non-Windows platforms.
kqueue is similar to epoll: it's readiness-based, not completion-based like IOCP and io_uring. IOCP is nice in theory, but the API and the different ways everything has to be fed to it leave a lot to be desired... Windows also has its own version of io_uring, but it's a bit abandoned and only works for disk I/O, which is a shame, because it could have been a nice new clean I/O API for Windows.
I don't know how it compares, but for sockets Windows also has Registered I/O: https://learn.microsoft.com/en-us/previous-versions/windows/...
> the api and the different ways everything has to be fed for it leaves a lot to be desired
I think Microsoft fixed that in Windows Vista by providing higher-level APIs on top of IOCP. See the CreateThreadpoolIo, CloseThreadpoolIo, StartThreadpoolIo, and WaitForThreadpoolIoCallbacks WinAPI functions.
I've been enjoying the Rust compio library lately, which abstracts over io_uring on Linux, IOCP and friends on Windows, and falls back to kqueue on macOS and presumably FreeBSD.
It’s wonderful being able to write straightforward code that works fast on every platform with no code changes.
I guess the strength of Rust (and Zig for now) is that the community has a chance to explore lots of different ways to solve these problems. And the corresponding weakness is that everyone uses different libraries, so it's a fragmented ecosystem full of libraries that may or may not work together properly.
> Hey, maybe we’ll split this out so you can use it too. It’s written in Zig so we can easily expose a C API.
This never happened, did it?
Suppose libxev is the alternative.
Using dispatch looks like Redux. I guess it's the same paradigm, just a different layer.
Awesome post, please make a Zig library!
There was a brief fascination with user mode TCP over DPDK (or similar). What happened with that? Can you get similar performance with QUIC? Does io_uring make it all a moot point?
I've only done a little prototyping with it, but io_uring addresses the same issue as DPDK in a totally different way. If you want high performance, you want to avoid context switches between userland and the kernel. DPDK brings the NIC buffers into userland and bypasses the kernel; things like sendfile and kTLS let the kernel do most of the work and bypass userland; and io_uring lets you do the same syscalls you're doing now, but (a) in a batched format and (b) in a continuous form via a submission queue. I think it's easier to reach for io_uring than DPDK, but it might not get you as far as DPDK; you're still communicating between kernel and userland, but it's better than normal syscalls.
> Can you get similar performance with QUIC?
I don't know that I've seen benchmarks, but I'd be surprised if you can get similar performance with QUIC. TCP has decades of optimization that you can lean on, UDP for bulk transfer really doesn't. For a lot of applications, server performance from QUIC vs TCP+TLS isn't a big deal, because you'll spend much more server performance on computing what to send than on sending it... For static file serving, I'd be surprised if QUIC is actually competitive, but it still might not be a big deal if your server is overpowered and can hit the NIC limits with either.
It is fairly straightforward to implement QUIC transport at ~100 Gb/s per core without encryption which is comparable or better than TCP. With encryption, every protocol will bottleneck on the encryption and only get a mere 40-50 Gb/s per core unless you have dedicated crypto offload hardware.
However, the highest performance public QUIC implementation benchmarks only get ~10 Gb/s per core. It is unclear to me if this is due to slow QUIC implementations or poor UDP stacks with inadequate buffering and processing.
At least to me, one of the most compelling parts of QUIC is that you establish a connection with TLS without needing extra round trips compared to TCP, where there are separate handshakes for the connection and then the TLS initialization. Even if it was no faster than TCP from that point forward, that seems like enough to make the protocol worthwhile in today's world, where TLS is basically the rule with relatively few exceptions rather than an occasional use case.
It's also something I just find fascinating, because it's one of the few practical cases where the compositional approach seems to have an insurmountable disadvantage compared to making a single thing more complex. Maybe there are a lot more of these that just aren't obvious to me, because the "larger" thing is already so well established that I wouldn't consider breaking it into smaller pieces, given the inherent advantage of having them combined. Even then, it still seems surprising that TCP, the gold standard for so long arguably because of how well it worked with things that came after it, eventually ran into a change in expectations that it can't adapt to as well as something intentionally scoped larger to include one of those compositional layers.
If someone with leverage (probably Apple) was willing to put in the effort to push it, we could have TCP Fast Open, and you wouldn't need an extra round trip for TCP+TLS. But also note, TLS 1.3 (and TLS 1.2 False Start) only adds one round trip on top of TCP; going down from 2 round trips to 1 is nice, but sometimes the QUIC sales sheets claim 3 to 1; if you can deploy QUIC, you can deploy two-handshake TCP+TLS.
Apple put in effort to get MPTCP accepted in cellular networks (where they have direct leverage) and having it out there (used by Siri) puts pressure on other networks too. If they did the same thing for Fast Open (SYN with data), it could be big.
Unfortunately, I'm not sure anyone other than Apple is capable of doing it. Nobody else really has leverage against enough carriers to demand they make new TCP patterns work; and not many organizations would want to try adding something to SYNs that might fail. (Also, MPTCP allows session movement, so a new TLS handshake isn't required)
That is because providing a reliable stream over a stateful connection is actually about a half-dozen layers of abstraction.
TCP couples them all in a large monolithic, tangled mess. QUIC, despite being a little more complex, has the layers much less coupled even though it is still a monolithic blob.
A better network protocol design would be actually fully decoupling the layers then building something like QUIC as a composition of those layers. This is high performance and lets you flexibly handle basically the entire gamut of network protocols currently in use.
> You can switch a file descriptor into non-blocking mode so the call won’t block while data you requested is not available. But system calls are still expensive, incurring context switches and cache misses. In fact, networks and disks have become so fast that these costs can start to approach the cost of doing the I/O itself. For the duration of time a file descriptor is unable to read or write, you don’t want to waste time continuously retrying read or write system calls.
O_NONBLOCK basically doesn't do anything for file-based file-descriptions - a file is always considered "ready" for I/O.
Is that true for all file abstractions? What happens with NFS?
Think about it: what does it mean for a file to be ready? Sockets and pipes are a stream abstraction: to be ready means there is data to read or space to write.
But for files, data is always available to read (unless the file is empty) or write (unless the disk is full). Even if you somehow interpret readiness as the backing pages being loaded in the page cache, files are random access, so which pages (i.e., which specific offset and length) you are interested in can't be expressed via a simple fd-based poll-like API (Linux tried to make splice work for this use case, but it didn't work out).
I think you’re correct. Your file descriptor may represent an end of a pipe, which in turn is backed by a buffer of limited size. Ruby’s I/O API specifically warns that reading lop-sidedly from e.g. stdout and stderr without `select`ing is dangerous [0].
I’ve experienced deadlocks in well-known programs, because developers who were unaware of this issue did a synchronous round-robin loop over stdout and stderr. [1]
[0]: https://docs.ruby-lang.org/en/master/Open3.html#method-c-pop...
[1]: https://github.com/Homebrew/homebrew-cask/pull/21665
When I am already using things like io_uring, I don't need any I/O abstraction.
BTW, most applications are totally fine with the UNIX file APIs.
Some people would rather have an abstraction over io_uring and kqueue rather than choosing a single API that works everywhere they want to run, choosing to only run on the OS that provides the API they prefer, or writing their loop (and anything else) for all the APIs they want to support.
But I agree with you; I'd rather use the thing without excess abstraction, and the standard apis work well enough for most applications. Some things do make sense to do the work to increase performance though.
In the real world, unless you are writing a very specialized system intended to run only on Linux 6.0 and newer, it just is not realistic, and you will need some sort of abstraction layer that additionally supports at the very least poll, to be portable across all POSIX and POSIX-like systems. Then if you want your thing to also run on Windows, IOCP rides in too...
I used 6.0 because 5.8-5.9 is roughly when io_uring became interesting to use for most use cases with zero copies, prepared buffers and other goodies, and 6.0 is roughly when people finally started being able to craft benchmarks where io_uring implementations beat epoll.