Comment by weitendorf

7 months ago

I don't work at Google anymore and don't have any special insight into the internal adoption of io_uring, but I think it stands to reason that Google would benefit tremendously from rolling out a higher-performing way to do IO across their fleet. Having done some lowish-level performance/optimization work myself, and knowing that the impact of these kinds of changes is measurable and the scale is almost fleetwide, I wouldn't be surprised if the benefits - once major internal libraries/tools are also updated to use io_uring - are O(Really Big Money).

Having talked to members of their prodkernel team about other subjects, I also think they are competent enough to know the difference between "not ready" and "acceptably flawed". And believe me, the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.

Not everybody has the same threat model and security stance as Google, and that's ok. But personally I would take their internal adoption of io_uring very seriously as a signal of whether it's safe for me to adopt it, especially if I'm running untrusted or third-party software (including certain kinds of libraries).

> the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.

Switching to io_uring is not just moving from one API to another; it necessitates a serious rethinking of your concurrency model. I would guess that for big, established codebases this is a very substantial undertaking, security considerations notwithstanding.

  • On the library/internal workload side the impact would certainly not land fully overnight, but Google has a very centralized tech stack and special tooling for fleetwide code migrations. I have no insight into the particulars, but I would guess there is a Pareto-like distribution: easy upgrades with big wins, and a long tail of marginal/thorny upgrades.

    Google is big enough and invests enough in infrastructure projects that they staff things like building their own internal concurrency primitives (side note: factors like this can substantially simplify or complicate migrations): https://www.phoronix.com/news/Google-Fibers-Toward-Open

  • Eh, let's not be dramatic: if you're already using an async runtime of some sort, it's not that much of an upheaval to switch.

    • No, actually there are considerations you don't have with the classic "poll for readiness, then read/write" model, such as holding on to buffers until a completion is received, managing registered files and buffer rings, and multishot ops. It's really a different model, especially if you currently employ traditional thread-based concurrency.

Disabling it on Android and ChromeOS does not mean they don't use it internally. Android and ChromeOS are end-user devices; optimizing those platforms doesn't directly earn Google any money.

Them disabling it is only about Android/ChromeOS, not about their servers.

I wouldn't be surprised if they do have servers with it enabled where it's very useful.

And Android's Linux kernels lag behind in version anyway.