Comment by vlovich123

10 days ago

> Regular code was already async. When you need to wait for an async operation, the thread sleeps until ready and the kernel abstracts it away

Not really. I’ve observed that async code is often written in a way that doesn’t maximize how much concurrency can be expressed (e.g. instead of “here are N I/O operations, run them all concurrently” it’s “for each operation x, await process(x)”). In a threaded world, however, this concurrency problem gets worse, because you have no way to optimize toward such concurrency: threads are inherently and inescapably too heavyweight to express concurrency in an efficient way.
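The difference is easy to see in miniature. A sketch in Python’s asyncio (the `process` operation is hypothetical, standing in for any I/O call): awaiting each operation inside a loop serializes the waits, while expressing all N operations up front lets them overlap:

```python
import asyncio
import time

async def process(x):
    # Hypothetical I/O operation: simulate a 0.1s wait.
    await asyncio.sleep(0.1)
    return x * 2

async def sequential(xs):
    # "for operation x, await process(x)": waits stack up, ~N * 0.1s total.
    return [await process(x) for x in xs]

async def concurrent(xs):
    # "here are N I/O operations, run them all concurrently": ~0.1s total.
    return await asyncio.gather(*(process(x) for x in xs))

t0 = time.perf_counter()
seq = asyncio.run(sequential(range(5)))
t_seq = time.perf_counter() - t0

t0 = time.perf_counter()
con = asyncio.run(concurrent(range(5)))
t_con = time.perf_counter() - t0

assert seq == con == [0, 2, 4, 6, 8]
assert t_con < t_seq  # the concurrent form finishes in a fraction of the time
```

Both forms compute the same results; only the shape of the code tells the runtime how much can overlap.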

This is not a new lesson: work-stealing executors have long been known to offer significantly lower latency, with more consistent P99, than traditional threads. This has been known since forever; it’s why Apple developed GCD in the early 00s. Threads simply don’t give the kernel scheduler any richer information about the workload, and kernel threads are an insanely heavyweight mechanism for achieving fine-grained concurrency, even worse when that concurrency is I/O-bound or a mixed workload instead of pure compute that’s embarrassingly easy to parallelize.
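For illustration only, here is a toy work-stealing sketch (nothing like GCD or any production executor, which add far more machinery): each worker owns a deque, pops tasks from its own end, and when idle steals from the opposite end of a busy peer, so work loaded onto one worker spreads across all of them:

```python
import collections
import random
import threading

class Worker:
    def __init__(self, all_workers, results, lock):
        self.deque = collections.deque()  # this worker's own task queue
        self.all = all_workers
        self.results = results
        self.lock = lock

    def pop_or_steal(self):
        with self.lock:
            if self.deque:
                return self.deque.pop()  # pop from our own end (LIFO)
            victims = [w for w in self.all if w is not self and w.deque]
            if victims:
                # Steal from the opposite end of a random busy peer (FIFO).
                return random.choice(victims).deque.popleft()
        return None  # nothing left anywhere: this worker retires

    def run(self):
        while True:
            task = self.pop_or_steal()
            if task is None:
                return
            value = task()  # execute outside the lock
            with self.lock:
                self.results.append(value)

results, lock = [], threading.Lock()
workers = []
for _ in range(4):
    workers.append(Worker(workers, results, lock))

# Load ALL tasks onto worker 0; the other three can only make progress
# by stealing.
for n in range(100):
    workers[0].deque.append(lambda n=n: n * n)

threads = [threading.Thread(target=w.run) for w in workers]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert sorted(results) == [n * n for n in range(100)]
```

A single global lock keeps the sketch short; real executors use per-deque lock-free operations precisely so that stealing stays cheap.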

Do all programs need this level of performance? No, probably not. But it is significantly easier to reach a higher performance bar, and in practice to achieve a level of latency and throughput that traditional approaches can’t match with the same level of effort.

You can tell async is directionally kind of correct in that io_uring is the kernel’s approach to high-performance I/O, and it looks nothing like traditional threading and syscalls; its completion model looks a lot closer to async concurrency (although, granted, exploiting it fully is much harder in an async world, because async/await is an insufficient number of colors to express how async tasks interrelate).

> work stealing executors have long been known to offer significantly lower latency with more consistent P99 than traditional threads. This has been known since forever - in the early 00s

Well, we know how to make "traditional threads" fast, with lower latency and more consistent P99 since forever^2, in the early 90s. [1]

Sure, we can't convince that Finnish guy this is worthwhile to include in THE kernel, despite similar ideas having been running in Google datacenters for I don't know how many years, 15+? But nothing stops us from doing it in userspace, just as you said, with a work-stealing executor. And no, no coloring.

Stack is all you need. Just make your "coroutines" stackful. Done. All those attempts to be "zero-cost" and dramatically change the programming model to avoid a stack introduced much more overhead than a stack plus a piece of decent context-switch code.
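To make the scheduling idea concrete: CPython's stdlib has no stackful coroutines, so in this sketch generators stand in for them, with each `yield` acting as the "context switch" point (a true stackful design like Go's or greenlet's saves the whole call stack, so switches can happen deep inside callees with no per-function annotation, i.e. no coloring). The scheduler itself is the same either way, a simple round-robin run queue:

```python
from collections import deque

def scheduler(tasks):
    # Round-robin userspace scheduler: run each task until it "switches
    # out" (yields), then re-queue it; drop it when it finishes.
    ready = deque(tasks)
    trace = []
    while ready:
        task = ready.popleft()
        try:
            trace.append(next(task))  # run until the next switch point
            ready.append(task)        # still alive: back of the queue
        except StopIteration:
            pass                      # task finished, don't re-queue
    return trace

def worker(name, steps):
    # A toy task that does `steps` units of work, switching out between them.
    for i in range(steps):
        yield f"{name}:{i}"

trace = scheduler([worker("a", 2), worker("b", 2)])
assert trace == ["a:0", "b:0", "a:1", "b:1"]  # the two tasks interleave
```

The interleaving is the whole trick; whether the saved state is a generator frame or a full machine stack is what separates this toy from the real thing.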

> You can tell async is directionally kind of correct in that io_uring is the kernel’s approach

lol, it is very hard to model anything proactor like io_uring with async Rust due to its defects.

[1] https://dl.acm.org/doi/10.1145/121132.121151

  • Stackful coroutines give up a fair amount of efficiency in a number of places to make that workable. That’s fine if you’re willing to use a lot more RAM, and Go and Java make that tradeoff, but it’s not suitable for something like Rust; that’s why Rust’s and C++’s async implementations are rather similar in many ways. Stackful coroutines also play havoc with FFI, imposing a huge cost penalty across the board even for code that doesn’t care about coroutines. These aren’t theoretical tradeoffs to just hand-wave away as “doesn’t matter” - they literally matter for how Rust is positioned. No one is stopping you from using Go or the JVM if that’s the ecosystem you like better.

    > lol, it is very hard to model anything proactor like io_uring with async Rust due to its defects.

    Not really. People latched onto async cancellation issues as intractable because of one paper, but I’m not convinced it’s unsolvable, whether via runtimes that consider the issue more fundamentally or via the language adding async drop, which would let the existing runtimes solve the problem wholesale.

    The point I’m making is that I/O and hardware is fundamentally non-blocking and we will always pay a huge abstraction penalty to try to pretend we have a synchronous programming model.

  • There are always trade-offs and there is never one best way to do something.

    Stack-based coroutines are one way to do it. A relevant trade-off here is overhead: requiring a runtime narrows the potential use-cases this can serve (e.g. embedded real-time stuff).

    If you don’t care about supporting such use cases you can of course just create a copy of goroutines and be pretty happy with the result.

I am not saying threads are the model for all programming problems. For example a dependency graph like an excel spreadsheet can be analyzed and parallelized.

But as you observed, async/await fails to express concurrency any better. It’s still a thread, just a worse implementation.

  • That’s incorrect. Even when expressed suboptimally, it still tends to result in overall higher throughput and consistently lower latency (work stealing executors specifically). And when you’re in this world, you can always do an optimization pass to better express the concurrency. If you’ve not written it async to start with, then you’re boned and have no easy escape hatch to optimize with.

> threads are inherently and inescapably too heavyweight to express concurrency in an efficient way

Your premise is wrong. There are many counterexamples to this.