← Back to context

Comment by topspin

3 days ago

In your high level "You might not want to use it if" points, you mention Docker but not why, and that's odd. I happen to know why: io_uring syscalls are blocked by default in Docker, because io_uring is a large surface area for attacks, and this has proven to be a real problem in practice. Others won't know this, however. They also won't know that io_uring is similarly blocked in widely used cloud sandboxes, Android, and elsewhere. Seems like a fine place to point this stuff out: anyone considering io_uring would want to know about these issues.

Very good point! You’re absolutely right: The fact that io_uring is blocked by default in Docker and other sandboxes due to security concerns is important context, and we should have mentioned it explicitly there. We'll update the post, and happy to incorporate any other caveats you think are worth calling out.

Is this something likely to ever change?

  • I believe it's possible, but that it's a hard problem requiring great effort. I believe this is a opportunity to apply formal methods ah la seL4, that nothing less will be sufficient, and that the value of io_uring is great enough to justify it. That will take a lot of talent and hours.

    I admire io_uring. I appreciate the fact that it exists and continues despite the security problems; evidence that security "concerns" don't (yet) have a veto over all things Linux. The design isn't novel. High performance hardware (NICs, HBAs, codecs, etc.) have used similar techniques for a long time. Io_uring only brings this to user space and generalizes it. I imagine an OS and hardware that fully inculcate the pattern, obviating the need for context switches, interrupts, blocking and other conventional approaches we've slouched into since the inception of computing.

    • Alternatively, it requires cloud providers and such losing business if they refuse to support the latest features.

      The "surface area" argument against io_uring can apply to literally any innovation. Over on LWN, there's an article on path traversal difficulties that mentions people how, because openat2(2) is often banned as inconvenient to whitelist using seccomp, eople have to work around path traversal bugs using fiddly, manual, and slow element-by-element path traversal in user space.

      Ridiculous security theater. A new system call had a vulnerability in 2010 and so we're never able to take practical advantage of new kernel features ever?

      (It doesn't help that gvisor refuses to acknowledge the modern world.)

      Great example of descending into a shitty equilibrium because the great costs of a bad policy are diffuse but the slight benefits are concentrated.

      The only effective lever is commercial pressure. All the formal methods in the world won't help when the incentive structure reinforces technical obstinacy.

  • It already did with the io_uring worker rewrite in 5.12 (2021) which made it much safer.

    https://github.com/axboe/liburing/discussions/1047

    • I can't agree with this. There is ample evidence of serious flaws since 2021. I hate that. I wish it weren't true. But an objective analysis of the record demands that view.

      Here is a fun one from September (CVE-2025-39816): "io_uring/kbuf: always use READ_ONCE() to read ring provided buffer lengths."

      That is an attackers wet dream right there: bump the length and exfiltrate sensitive data. And it wasn't just some short lived "Linus's branch" work no one actually ran: it existed for a time in, for example, Ubuntu 24.04 LTS (circa 2024 release date.) I just cherry picked that one from among many.

  • It’s manageable with eBPF instead of seccomp so one has to adapt to that. Should be doable.

    • Maybe not so doable. The whole point of io_uring is to reduce syscalls. So you end up just three. io_uring_setup, io_uring_register, io_uring_enter

      There is now a memory buffer that the user space and the kernel is reading, and with that buffer you can _always_ do any syscall that io_uring supports. And things like strace, eBPF, and seccomp cannot see the actual syscalls that are being called in that memory buffer.

      And, having something like seccomp or eBPF inspect the stream might slow it down enough to eat the performance gain.

      4 replies →