Comment by vjerancrnjak

7 months ago

I think it's the Wikipedia article.

https://en.wikipedia.org/wiki/Io_uring

Very easy to just quote that without any io_uring experience.

> In June 2023, Google's security team reported that 60% of the exploits submitted to their bug bounty program in 2022 were exploits of the Linux kernel's io_uring vulnerabilities. As a result, io_uring was disabled for apps in Android, and disabled entirely in ChromeOS as well as Google servers. Docker also consequently disabled io_uring from their default seccomp profile.

I don't work at Google anymore and don't have any special insight into the internal adoption of io_uring, but I think it stands to reason that Google would benefit tremendously from rolling out a higher-performing way to do IO across their fleet. I mean, having myself done some lowish-level performance/optimization work and knowing that the impact of these kinds of changes is measurable and the scale is almost fleetwide, I wouldn't be surprised if the benefits - after major internal libraries/tools are also updated to use io_uring - are O(Really Big Money).

Having talked to members of their prodkernel team about other subjects, I also think they are competent enough to know the difference between "not ready" and "acceptably flawed". And believe me, the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.

Not everybody has the same threat model and security stance as Google and that's ok. But personally I would take their internal adoption of io_uring very seriously as a measure of whether it's safe for me to adopt it, especially if I'm running untrusted or third party software (including certain kinds of libraries).

  • > the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.

    Switching to io_uring is not just moving from one API to another. It necessitates a serious rethinking of your concurrency model. I guess for big, established codebases this is a very substantial undertaking, security considerations notwithstanding.

    • On the library/internal workload side the impact would certainly not be something that fully lands overnight, but Google has a very centralized tech stack and special tooling for fleetwide code migrations. I have no insight into the particulars but I would guess there is a Pareto-like distribution of easy upgrades+big wins and a long tail of marginal/thorny upgrades.

      Google is big enough and invests enough in infrastructure projects that they staff projects like making their own internal concurrency primitives (side note, factors like this can improve/reduce or simplify/complexify migrations substantially): https://www.phoronix.com/news/Google-Fibers-Toward-Open

  • Disabling it on Android and ChromeOS does not mean they don't use it internally. Android and ChromeOS are end-user devices; optimizing those platforms doesn't earn Google any money.

  • them disabling it is only about Android/Chrome

    not about their servers

    I wouldn't be surprised if they do have servers with it enabled when very useful.

    and Android's Linux kernels lag behind upstream in their version

Without going into the weeds, there has to be some vendor support, and that vendor is obviously not Google. How to convince people: get it into RHEL.

  • io_uring is available from RHEL 9.3 onward. The catch is that it's disabled by default and needs to be enabled at runtime via the "kernel.io_uring_disabled" sysctl.

  • If that's the case, it's not indicated by the quote. The quote lays all the blame on io_uring. Is that incorrect?
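For anyone who wants to check the "kernel.io_uring_disabled" sysctl mentioned above on their own machine, here is a minimal sketch. The value meanings (0, 1, 2) are my reading of the upstream kernel sysctl documentation, and a distro build such as RHEL's may differ; the function names are made up for illustration.

```python
from pathlib import Path

# Assumption: meanings as described in the upstream kernel sysctl docs;
# a vendor kernel may define or default these differently.
MEANINGS = {
    0: "enabled for all processes",
    1: "disabled for unprivileged processes outside io_uring_group",
    2: "disabled for all processes",
}

SYSCTL_PATH = Path("/proc/sys/kernel/io_uring_disabled")


def describe_io_uring_disabled(raw: str) -> str:
    """Translate the raw sysctl text (e.g. "2\n") into a description."""
    value = int(raw.strip())
    return MEANINGS.get(value, f"unrecognized value {value}")


def read_io_uring_status(path: Path = SYSCTL_PATH) -> str:
    """Read the sysctl file; report clearly if this kernel does not expose it."""
    try:
        return describe_io_uring_disabled(path.read_text())
    except FileNotFoundError:
        return "sysctl not present on this kernel"


if __name__ == "__main__":
    print(read_io_uring_status())
```

Changing the value at runtime would be done with the usual sysctl tooling as root; this sketch only reads it.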

Yes, but what this isn't telling you is that Android has a long history of running hopelessly outdated kernels, and it is very common that Linux-kernel-related Android CVEs in newish features have already been fixed upstream by generic improvements to that feature's code.

I like how someone helpfully added

> Although initial async offload design in io_uring could be problematic, later kernels changed the thread model. After such improvements, there were no known inherent problems with it and its development is very careful with new features. Considering that a performant async framework with a user facing API is complex, it was to be expected that issues would be found initially. After initial issues have been addressed, it is not any less secure than anything else in the kernel and io_uring acceptance quickly grew in production. Some of its criticism are also based on wrong or outdated assumptions.[14]

...but the only citation is a link to this GH thread, which doesn't support the claims made.