Comment by weitendorf
7 months ago
I don't work at Google anymore and don't have any special insight into the internal adoption of io_uring, but I think it stands to reason that Google would benefit tremendously from rolling out a higher-performing way to do IO across their fleet. I mean, having myself done some lowish-level performance/optimization work and knowing that the impact of these kinds of changes is measurable and the scale is almost fleetwide, I wouldn't be surprised if the benefits - after major internal libraries/tools are also updated to use io_uring - are O(Really Big Money).
Having talked to members of their prodkernel team about other subjects, I also think they are competent enough to know the difference between "not ready" and "acceptably flawed". And believe me, the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.
Not everybody has the same threat model and security stance as Google and that's ok. But personally I would take their internal adoption of io_uring very seriously as a measure of whether it's safe for me to adopt it, especially if I'm running untrusted or third party software (including certain kinds of libraries).
> the incentives are such that O(Really Big Money) optimization projects get staffed unless there is something making them infeasible.
Switching to io_uring is not just moving from one API to another. It necessitates a serious rethinking of your concurrency model. I guess for big, established codebases this is a very substantial undertaking, security considerations notwithstanding.
On the library/internal workload side the impact would certainly not be something that fully lands overnight, but Google has a very centralized tech stack and special tooling for fleetwide code migrations. I have no insight to the particulars but I would guess there is a Pareto-like distribution of easy upgrades+big wins and a long-tail of marginal/thorny upgrades.
Google is big enough and invests enough in infrastructure projects that they staff projects like making their own internal concurrency primitives (side note: factors like this can substantially simplify or complicate migrations): https://www.phoronix.com/news/Google-Fibers-Toward-Open
Eh let's not be dramatic, if you're already using async runtimes of some sort it's not that much of an upset to switch.
No, actually there are other considerations you don't have with the classic "poll for readiness then read/write" model, such as holding on to buffers until a completion is received, managing registered files and buffer rings, and multishot ops. It's really a different model, especially if you employ traditional thread based concurrency.
Disabling it on Android and ChromeOS does not mean they don't use it internally. Android and ChromeOS are end-user devices; optimizing those platforms doesn't earn Google any money.
Can you find anywhere that states that they are using it internally? They have publicly stated at various points that they do not, such as at https://security.googleblog.com/2023/06/learnings-from-kctf-... and I have not seen anything yet stating that they are now using it. Also, you might want to reread my comment, because I wasn't talking about Android/ChromeOS; it was exclusively about their "fleet", by which I meant "servers".
By the way, here is a good + recent example of the types of CVEs that io_uring runs into that Google finds and discloses/fixes: https://project-zero.issues.chromium.org/issues/417522668. Here's another: https://project-zero.issues.chromium.org/issues/388499293
Given that io_uring mostly seems to be the project of one guy at Meta, and has a regular stream of new and exciting use-after-free/out-of-bounds vulnerabilities, I think it makes sense for security-inclined users to disable it, or at least only use it once soaked/stabilized.
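For anyone wanting to do that, a sketch of the knob involved (assumes kernel 6.6+, which added the `kernel.io_uring_disabled` sysctl; on older kernels a seccomp filter blocking the io_uring syscalls is the usual fallback):

```shell
# Disable io_uring system-wide for the running kernel.
# 0 = enabled, 1 = restricted to privileged processes, 2 = fully disabled.
sudo sysctl -w kernel.io_uring_disabled=2

# Persist the setting across reboots.
echo "kernel.io_uring_disabled = 2" | sudo tee /etc/sysctl.d/99-io-uring.conf
```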
GP: > > as well as Google servers
I guess I can't read. Thanks for the correction.
Them disabling it is only about Android/Chrome, not about their servers.
I wouldn't be surprised if they do have servers with it enabled when very useful.
And Android's Linux kernels lag behind in version.
No, it was about servers, and I worked there on similar stuff/with the same people involved in the serverside ("fleetwide") rollout. Public post describing the decision to disable it internally: https://security.googleblog.com/2023/06/learnings-from-kctf-...
I'd love to see a post explaining a decision to consider it stable or that mentions that they've rolled it out on their fleet