Comment by weitendorf
7 months ago
I don't work at Google anymore and don't have any special insight into the internal adoption of io_uring, but I think it stands to reason that Google would benefit tremendously from rolling out a higher-performing way to do IO across their fleet. I mean, having myself done some lowish-level performance/optimization work and knowing that the impact of these kinds of changes is measurable and the scale is almost fleetwide, I wouldn't be surprised if the benefits - after major internal libraries/tools are also updated to use io_uring - are O(Really Big Money).
Having talked to members of their prodkernel team about other subjects, I also think they are competent enough to know the difference between "not ready" and "acceptably flawed". And believe me, the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.
Not everybody has the same threat model and security stance as Google and that's ok. But personally I would take their internal adoption of io_uring very seriously as a measure of whether it's safe for me to adopt it, especially if I'm running untrusted or third party software (including certain kinds of libraries).
> the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.
Switching to io_uring is not just moving from one API to another. It necessitates a serious rethinking of your concurrency model. I guess for big, established codebases this is a very substantial undertaking, security considerations notwithstanding.
On the library/internal workload side the impact would certainly not be something that fully lands overnight, but Google has a very centralized tech stack and special tooling for fleetwide code migrations. I have no insight to the particulars but I would guess there is a Pareto-like distribution of easy upgrades+big wins and a long-tail of marginal/thorny upgrades.
Google is big enough and invests enough in infrastructure projects that they staff projects like making their own internal concurrency primitives (side note: factors like this can substantially simplify or complicate migrations): https://www.phoronix.com/news/Google-Fibers-Toward-Open
Eh let's not be dramatic, if you're already using async runtimes of some sort it's not that much of an upset to switch.
No, actually there are other considerations you don't have with the classic "poll for readiness then read/write" model, such as holding on to buffers until a completion is received, managing registered files and buffer rings, and multishot ops. It's really a different model, especially if you employ traditional thread based concurrency.
Disabling it on Android and ChromeOS does not mean they don't use it internally. Android and ChromeOS are end-user devices; optimizing those platforms doesn't earn Google any money.
Can you find anywhere that states that they are using it internally? They have publicly stated at various points that they do not, such as at https://security.googleblog.com/2023/06/learnings-from-kctf-... and I have not seen anything yet stating that they are now using it. Also, you might want to reread my comment, because I wasn't talking about Android/ChromeOS; it was exclusively about their "fleet", by which I meant "servers".
By the way, here is a good + recent example of the types of CVEs that io_uring runs into that Google finds and discloses/fixes: https://project-zero.issues.chromium.org/issues/417522668. Here's another: https://project-zero.issues.chromium.org/issues/388499293
Given that io_uring mostly seems to be the project of one guy at Meta, and has a regular stream of new and exciting use-after-free/out-of-bounds vulnerabilities, I think it makes sense for security-inclined users to disable it, or at least only use it once soaked/stabilized.
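For anyone wanting to do that, a sketch of the knob involved (assumes kernel 6.6+, which added the `kernel.io_uring_disabled` sysctl; on older kernels a seccomp filter blocking the io_uring syscalls is the usual fallback):

```shell
# Disable io_uring system-wide for the running kernel.
# 0 = enabled, 1 = restricted to privileged processes, 2 = fully disabled.
sudo sysctl -w kernel.io_uring_disabled=2

# Persist the setting across reboots.
echo "kernel.io_uring_disabled = 2" | sudo tee /etc/sysctl.d/99-io-uring.conf
```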
GP: > > as well as Google servers
I guess I can't read. Thanks for the correction.
Them disabling it is only about Android/Chrome, not about their servers.
I wouldn't be surprised if they do have servers with it enabled when very useful.
And Android's Linux kernels lag behind in version.
No, it was about servers, and I worked there on similar stuff/with the same people involved in the serverside ("fleetwide") rollout. Public post describing the decision to disable it internally: https://security.googleblog.com/2023/06/learnings-from-kctf-...
I'd love to see a post explaining a decision to consider it stable or that mentions that they've rolled it out on their fleet