Comment by bmcahren
1 day ago
This was a good read and great work. Can't wait to see the performance tests.
Your write-up connected some dots from when I was 11, when I was trying to set up a database/backend and kept finding cgi-bin examples online. I realize now those were spinning up a new process with each request: https://en.wikipedia.org/wiki/Common_Gateway_Interface
I remember when sendfile became available for my large gaming forum with dozens of TB of demo downloads. That alone was huge for concurrency.
I thought I had sworn off this type of engineering, but between this, the Netflix case of the extra 40ms, and the GTA 5 70% load-time reduction, maybe there is a lot more impactful work to be done.
https://netflixtechblog.com/life-of-a-netflix-partner-engine...
https://nee.lv/2021/02/28/How-I-cut-GTA-Online-loading-times...
It wasn't just CGI; in the CERN and Apache lineage, every HTTP session was commonly a forked copy of the entire server! Apache gradually had better answers, but its API and the common addons built on it made the transition difficult, so webservers like nginx took off, built from the beginning on the event-driven I/O architecture described in the article.
And there's nothing wrong with that for application workers. On *nix systems fork() is very fast: you can fork "the entire server" and the kernel will only copy-on-write (COW) your memory. As nginx etc. showed, you can get better raw file-serving performance with other models, but it's still a legitimate technique for application logic, where business logic will drown out any process overhead.
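A rough sketch of that pre-fork pattern using plain POSIX calls (illustrative structure only, not any particular server's code; error handling omitted):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>

    static void handle(int conn) { /* application logic goes here */ close(conn); }

    int main(void) {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);
        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 128);

        for (int i = 0; i < 8; i++) {      /* fork the worker pool once, up front */
            if (fork() == 0) {             /* child shares the listening socket;  */
                for (;;) {                 /* COW means almost nothing is copied  */
                    int conn = accept(lfd, NULL, NULL);
                    if (conn >= 0)
                        handle(conn);      /* one connection at a time, per worker */
                }
            }
        }
        for (;;)
            pause();                       /* parent just supervises */
    }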
Forking for anything other than calling exec is still a horrible idea (with special exceptions like shells). fork() is a very unsafe operation: unless both your code and every library you use are very careful, you can easily share locks and open files with the child process (for example, it's easy to hit malloc deadlocks when forking a multithreaded process), and its performance depends a lot on how you actually use it.
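A minimal sketch of that malloc-deadlock hazard, assuming a libc whose allocator is not made fork-safe (glibc notably registers atfork handlers for its own malloc, so this exact program may never hang there):

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static void *churn(void *arg) {
        (void)arg;
        for (;;)
            free(malloc(64));       /* hammers the allocator's internal lock */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, churn, NULL);
        for (int i = 0; i < 1000; i++) {
            pid_t pid = fork();
            if (pid == 0) {
                /* Only the forking thread survives in the child. If fork()
                   snapshotted the heap while churn() held the allocator lock,
                   this malloc() blocks forever. */
                free(malloc(64));
                _exit(0);
            }
            waitpid(pid, NULL, 0);
        }
        puts("no deadlock this run (it's a race)");
        return 0;
    }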
So long as you have something like nginx in front of your server. Otherwise your whole site can be taken down by a slowloris attack over a 33.6k modem.
To nitpick: at least as of Apache HTTPD 1.3, ages ago, it wasn't forking for every request. It had a pool of already-forked worker processes, each handling one connection at a time but able to handle an unlimited number of connections sequentially, and it could spawn or kill worker processes depending on load.
The same model is possible in Apache httpd 2.x with the "prefork" mpm.
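For illustration, the knobs for that model look something like this (real httpd 2.4 prefork directives; the numbers are placeholders, not tuning advice):

    <IfModule mpm_prefork_module>
        StartServers             5
        MinSpareServers          5
        MaxSpareServers         10
        MaxRequestWorkers      150
        MaxConnectionsPerChild   0
    </IfModule>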
I don't see anything in my comment that implied _when_ the forking happened so it's not really a nit :)
That's because the Unix API used to assume fork() is extremely cheap. Threads were second-class citizens, an ugly performance hack (and still are, in some ways). This was indeed true on the PDP-11 (just copy a process image of less than 64 KB!), but as address spaces grew it became prohibitively expensive to copy page tables, so programmers turned to multithreading. Then multicore CPUs became the norm, and multithreading on multicore CPUs meant any kind of copy-on-write required TLB shootdowns, making fork() even more expensive. VMS (and its clone known as Windows NT) did it right from the start: processes are just resource containers, the units of execution are threads, and all I/O is async. But being technically superior doesn't outweigh the disadvantage of being proprietary.
It's also a pretty bold scheduler benchmark to handle tens of thousands of processes or 1:1 thread wakeups, especially the further back in time you go, considering fairness issues. And then you're running at the wrong latency granularity for fast I/O completion events across that many nodes, so it's going to run like a screen door on a submarine without a lot of rethinking.
Evented I/O works out pretty well in practice for the I and D cache, especially if you can affine and allocate things as the article states, and do similar natural alignments inside the kernel (i.e. RSS/consistent hashing).
I'm sceptical of the efficiency gains with sendfile; they seem marginal at best, even in the late 90s when it was at the height of its popularity.
Then you don't understand the memory and protection model of a modern system very well.
sendfile effectively turns your user-space file server into a control plane and moves the data plane to where the data is, eliminating copies between address spaces. This can be made congruent with I/O completions (i.e. Ethernet+IP and block) and made asynchronous, so the entire thing is pumping data between completion events. Watch the Netflix video the author links in the post.
There is an inverted approach where you move all of this into a single user address space, i.e. DPDK, but it's the same overall concept, just a different "who".
> seems marginal at best
Depends on the workload.
Normally you would go read() -> write() so:
1. Disk -> page cache (DMA)
2. Kernel -> user copy (read)
3. User -> kernel copy (write)
4. Kernel -> NIC (DMA)
sendfile():
1. Disk -> page cache (DMA)
2. Kernel -> NIC (DMA)
No user-space copies; the kernel wires those pages straight to the socket.
So it basically eliminates 1-2 memory copies, along with the associated cache pollution and memory-bandwidth overhead. If you are running high-QPS web services where syscall and copy overheads dominate, for example CDNs or static file serving, the gains can be really big. Based on my observations this can mean double-digit reductions in CPU usage and up to ~2x higher throughput.
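A minimal sketch of the two paths in C (real POSIX/Linux calls; error handling trimmed):

    #include <sys/sendfile.h>
    #include <unistd.h>

    /* Classic path: every byte crosses the user/kernel boundary twice. */
    ssize_t copy_loop(int file_fd, int sock_fd, char *buf, size_t buflen) {
        ssize_t n;
        while ((n = read(file_fd, buf, buflen)) > 0)    /* kernel -> user copy */
            if (write(sock_fd, buf, n) < 0)             /* user -> kernel copy */
                return -1;
        return n;
    }

    /* sendfile path: the kernel hands page-cache pages to the socket directly. */
    ssize_t zero_copy(int file_fd, int sock_fd, off_t *off, size_t count) {
        return sendfile(sock_fd, file_fd, off, count);  /* no user-space copies */
    }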
I understand the optimisation; I'm just sceptical that it's all that useful in practice. It seems like it would only kick in for pathological cases where kernel round-trip time really dominates; my gut reckons most applications just don't benefit. Caddy got sendfile support in the last few years, and benchmarking with it on and off usually shows no discernible difference [1].
Which makes me sceptical of the argument for kTLS that's made in the article: what benefit does offloading your crypto to the kernel provide (while possibly making it more brittle)? I've seen the author of haproxy say that the performance gains he's measured are only marginal, though he did point out one genuinely useful property: you can strace your process and see plaintext instead of ciphertext, which is nice.
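For context, the usual argument for pairing kTLS with sendfile() is that once the kernel holds the session keys, sendfile() can serve TLS traffic without bouncing through user space at all. A rough sketch of enabling TX-side kTLS on Linux via the <linux/tls.h> interface (extracting the key material from your TLS library after the handshake is assumed and not shown):

    #include <linux/tls.h>
    #include <netinet/tcp.h>
    #include <string.h>
    #include <sys/socket.h>

    int enable_ktls_tx(int fd, const unsigned char *key, const unsigned char *iv,
                       const unsigned char *salt, const unsigned char *rec_seq) {
        /* Attach the kernel TLS upper-layer protocol to the TCP socket. */
        if (setsockopt(fd, SOL_TCP, TCP_ULP, "tls", sizeof("tls")) < 0)
            return -1;

        struct tls12_crypto_info_aes_gcm_128 ci;
        memset(&ci, 0, sizeof(ci));
        ci.info.version = TLS_1_2_VERSION;
        ci.info.cipher_type = TLS_CIPHER_AES_GCM_128;
        memcpy(ci.key, key, TLS_CIPHER_AES_GCM_128_KEY_SIZE);
        memcpy(ci.iv, iv, TLS_CIPHER_AES_GCM_128_IV_SIZE);
        memcpy(ci.salt, salt, TLS_CIPHER_AES_GCM_128_SALT_SIZE);
        memcpy(ci.rec_seq, rec_seq, TLS_CIPHER_AES_GCM_128_REC_SEQ_SIZE);

        /* From here on, write()/sendfile() on fd emit encrypted TLS records. */
        return setsockopt(fd, SOL_TLS, TLS_TX, &ci, sizeof(ci));
    }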
[1]: https://blog.tjll.net/reverse-proxy-hot-dog-eating-contest-c...