Comment by kev009
1 day ago
It wasn't just CGI; every HTTP session was commonly a forked copy of the entire server in the CERN and Apache lineage! Apache gradually had better answers, but its API and common addons made it a bit difficult to transition, so web servers like nginx took off, built closer to the architecture in the article, with event-driven I/O from the beginning.
And there's nothing wrong with that for application workers. On *nix systems fork() is very fast: you can fork "the entire server" and the kernel will only COW your memory. As nginx etc. showed, you can get better raw file-serving performance with other models, but it's still a legitimate technique for application logic, where business logic will drown out any process overhead.
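For concreteness, a minimal sketch of that fork-per-connection shape in C. This is nobody's production code: the port and handle_request are made up, and error handling is mostly omitted.

    /* Fork-per-connection sketch: each accepted connection gets a COW
       copy of the whole server process. Port and handler are placeholders. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <signal.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void handle_request(int client) {
        /* Application/business logic would go here; its cost usually
           dwarfs the fork() overhead being discussed. */
        const char *resp = "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
        write(client, resp, strlen(resp));
    }

    int main(void) {
        signal(SIGCHLD, SIG_IGN);            /* auto-reap exited children */
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 128);

        for (;;) {
            int cfd = accept(lfd, NULL, NULL);
            if (cfd < 0) continue;
            if (fork() == 0) {               /* child is a COW copy of the server */
                close(lfd);                  /* child only needs its connection */
                handle_request(cfd);
                close(cfd);
                _exit(0);
            }
            close(cfd);                      /* parent goes back to accepting */
        }
    }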
Forking for anything other than calling exec is still a horrible idea (with special exceptions like shells). Forking is a very unsafe operation (you can easily share locks and files with the child process unless both your code and every library you use are very careful - for example, it's easy to get into malloc deadlocks with forked processes), and its performance depends a lot on how you actually use it.
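A sketch of the lock-inheritance hazard behind those malloc deadlocks, with a plain pthread mutex standing in for an allocator's internal lock (whether a real libc's malloc fails this way depends on whether it guards its own locks across fork; arbitrary library locks usually get no such treatment):

    /* A background thread holds a lock at the moment of fork(); the child
       inherits the locked mutex but not the thread that would unlock it,
       so the child's first attempt to take it blocks forever. Illustrative only. */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        pthread_mutex_lock(&lock);       /* hold the lock "forever" */
        for (;;) pause();
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        sleep(1);                        /* let the worker take the lock */

        pid_t pid = fork();              /* only the calling thread exists in the child */
        if (pid == 0) {
            pthread_mutex_lock(&lock);   /* deadlocks: no one will ever unlock it */
            puts("unreachable");
            _exit(0);
        }
        waitpid(pid, NULL, 0);           /* never returns in this sketch */
        return 0;
    }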
I think it's not quite that bad (and I know that this has been litigated to death all over the programmer internet).
If you are forking from a language/ecosystem that is extremely thread-friendly (e.g. Go, Java, Erlang), fork is riskier. This is because such runtimes make it very likely that some thread is doing something fork-unsafe at the moment of fork().
If you are forking from a language/ecosystem that is thread-unfriendly, fork is less risky. That isn't to say "it's always safe/low risk to run fork() in e.g. Python, Ruby, Perl", but in those contexts it's easier to prove/test invariants like "there are no threads running/so-and-so lock is not held at the point in my program when I fork", at which point the risks of fork(2) are much reduced.
To be clear, "reduced" is not the same as "gone"! You still have to reason about explicitly taken locks in the forking thread, file descriptors, signal handlers, and unexpected memory growth due to CoW/GC interactions. But that's a lot more tractable than the Java situation of "it's tricky to predict how many Java threads are active when I want to fork, and even trickier to know if there are any JNI/FFI-library-created raw pthreads running, the GC might be threaded, and checking for each of those things is still racy with my call to fork(2)".
You still have to make sure that the fork-safety invariants are true. But the effort to do that is extremely different depending on the language platform.
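For the explicitly-taken-locks part, one hedged C sketch of how such an invariant can be maintained: pthread_atfork(3) handlers that take your own locks before fork() and release them in both parent and child. app_lock is a made-up example, and this does nothing for locks inside libraries you don't control.

    /* Bracket every fork() so app_lock can never be inherited mid-held:
       the prepare handler runs in the forking thread, which is the only
       thread that survives into the child, so the child-side unlock is legal. */
    #include <pthread.h>

    static pthread_mutex_t app_lock = PTHREAD_MUTEX_INITIALIZER;

    static void before_fork(void)       { pthread_mutex_lock(&app_lock); }
    static void after_fork_parent(void) { pthread_mutex_unlock(&app_lock); }
    static void after_fork_child(void)  { pthread_mutex_unlock(&app_lock); }

    void init_fork_safety(void) {
        /* Call once at startup; every later fork() is bracketed. */
        pthread_atfork(before_fork, after_fork_parent, after_fork_child);
    }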
Rust/C/C++ don't cleanly fit into either of those two (already mushy/subjective) categorizations, though. Whether forking is feasible in a given Rust/C/C++ codebase depends on what the code does and requires a tricky set of judgement calls and at-a-distance knowledge going forward to make sure that the codebase doesn't become fork-unsafe in harmful ways.
So long as you have something like nginx in front of your server. Otherwise your whole site can be taken down by a slowloris attack over a 33.6k modem.
To nitpick: at least as of Apache HTTPD 1.3 ages ago, it wasn't forking for every request. It had a pool of already-forked worker processes, each handling one connection at a time but an unlimited number of connections sequentially, and it could spawn or kill worker processes depending on load.
The same model is possible in Apache httpd 2.x with the "prefork" mpm.
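A rough sketch of that pre-forked worker-pool shape, assuming a single shared listening socket that each worker blocks on. Worker count, port, and handle_connection are placeholders; real httpd adds a scoreboard and spawns/kills workers based on load.

    /* Pre-forked pool: parent forks N workers up front, each worker
       loops on accept() over the inherited listener, one connection at a time. */
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NUM_WORKERS 8

    static void handle_connection(int fd) {
        const char *resp = "HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok";
        write(fd, resp, strlen(resp));
    }

    static void worker_loop(int lfd) {
        for (;;) {                          /* sequential, unbounded connections */
            int cfd = accept(lfd, NULL, NULL);
            if (cfd < 0) continue;
            handle_connection(cfd);
            close(cfd);
        }
    }

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(lfd, (struct sockaddr *)&addr, sizeof addr);
        listen(lfd, 128);

        for (int i = 0; i < NUM_WORKERS; i++) {
            if (fork() == 0) {              /* child inherits the listener */
                worker_loop(lfd);
                _exit(0);
            }
        }
        for (;;) wait(NULL);                /* parent just supervises/reaps */
    }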
I don't see anything in my comment that implied _when_ the forking happened so it's not really a nit :)
That's because the Unix API used to assume fork() is extremely cheap. Threads were second-class citizens, an ugly performance hack - they still are in some ways. The assumption was indeed true on the PDP-11 (just copy a <64KB disk file!), but as address spaces grew, it became prohibitively expensive to copy page tables, so programmers turned to multithreading. Then multicore CPUs became the norm, and multithreading on multicore CPUs meant any kind of copy-on-write required TLB shootdowns, making fork() even more expensive. VMS (and its clone known as Windows NT) did it right from the start - processes are just resource containers, the units of execution are threads, and all I/O is async. But being technically superior doesn't outweigh the disadvantage of being proprietary.
It's also a pretty bold scheduler benchmark to be handling tens of thousands of processes or 1:1 thread wakeups, especially the further back in time you go, considering fairness issues. And that's running at the wrong latency granularity for fast I/O completion events across that many nodes, so it's going to run like a screen door on a submarine without a lot of rethinking things.
Evented I/O works out pretty well in practice for the I and D cache, especially if you can affine and allocate things as the article states, and do similar natural alignments inside the kernel (i.e. RSS/consistent hashing).
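A hedged sketch of that affine, evented shape: one event loop per core, each with its own SO_REUSEPORT listener and pinned to a CPU so connection state tends to stay warm in that core's caches. Linux-specific (epoll, SO_REUSEPORT); the core count and port are made up, and lining this up with NIC RSS steering is left to the deployment.

    /* One pinned event loop per core; the kernel spreads connections
       across the per-core SO_REUSEPORT listeners. */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sched.h>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static int make_listener(void) {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof one);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(fd, (struct sockaddr *)&addr, sizeof addr);
        listen(fd, 1024);
        return fd;
    }

    static void event_loop(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        sched_setaffinity(0, sizeof set, &set);   /* pin this worker to one core */

        int lfd = make_listener();                /* per-core listener via SO_REUSEPORT */
        int ep = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
        epoll_ctl(ep, EPOLL_CTL_ADD, lfd, &ev);

        struct epoll_event events[64];
        for (;;) {
            int n = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < n; i++) {
                if (events[i].data.fd == lfd) {
                    int cfd = accept(lfd, NULL, NULL);
                    struct epoll_event cev = { .events = EPOLLIN, .data.fd = cfd };
                    epoll_ctl(ep, EPOLL_CTL_ADD, cfd, &cev);
                } else {
                    char buf[4096];
                    ssize_t r = read(events[i].data.fd, buf, sizeof buf);
                    if (r <= 0) close(events[i].data.fd);
                    /* else: parse/respond without ever blocking this loop */
                }
            }
        }
    }

    int main(void) {
        for (int cpu = 1; cpu < 4; cpu++)          /* workers on cores 1..3 */
            if (fork() == 0) { event_loop(cpu); _exit(0); }
        event_loop(0);                             /* parent runs on core 0 */
        return 0;
    }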