Comment by tliltocatl

1 day ago

That's because Unix API used to assume fork() is extremely cheap. Threads were ugly performance hack second-class citizens - still are in some ways. This was indeed true on PDP-11 (just copy a <64KB disk file!), but as address spaces grew, it became prohibitively expensive to copy page tables, so programmers turned to multithreading. At then multicore CPUs became the norm, and multithreading on multicore CPUs meant any kind of copy-on-write required TLB shootdown, making fork() even more expensive. VMS (and its clone known as Windows NT) did it right from the start - processes are just resource containers, units execution are threads and all IO is async. But being technically superior doesn't outweighs the disadvantage of being proprietary.

It's also a pretty bold scheduler benchmark to be handling tens of thousands of processes or 1:1 thread wakeups, especially the further back in time you go considering fairness issues. And then that's running at the wrong latency granularity for fast I/O completion events across that many nodes so it's going to run like a screen door on a submarine without a lot of rethinking things.

Evented I/O works out pretty well in practice for the I and D cache, especially if you can affine and allocate things as the article states, and do similar natural alignments inside the kernel (i.e. RSS/consistent hashing).