Comment by butterisgood
1 day ago
Where do people get the idea that one thread per core is correct on a system that deals with time slices?
In my experience “oversubscribing” threads to cores (more threads than cores) provides a wall-clock time benefit.
I think one thread per core would work better without preemptive scheduling.
But then we aren’t talking about Unix.
Isolating a core and then pinning a single thread is the way to go to get both low latency and high throughput, sacrificing efficiency.
This works fine on Linux and is a common approach in trading systems, where it's acceptable to burn a bunch of cores on this type of stuff. The cores are mostly busy-spinning and doing nothing, so it's very inefficient in terms of actual work, but great for latency and throughput when you need it.
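For concreteness, here is a minimal sketch of the pinning half of this, assuming Linux/glibc. Core 3 is an arbitrary choice; in a real deployment it would be a core isolated from the scheduler at boot (e.g. `isolcpus=3` on the kernel command line).

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Busy-spin worker: a real trading-style setup would poll a NIC queue
   or a lock-free ring here instead of doing nothing. */
static void *spin_worker(void *arg) {
    (void)arg;
    volatile unsigned long iterations = 0;
    for (;;)
        iterations++;  /* never yields, never sleeps */
    return NULL;
}

int main(void) {
    pthread_t t;
    cpu_set_t set;
    int rc = pthread_create(&t, NULL, spin_worker, NULL);
    if (rc != 0) {
        fprintf(stderr, "pthread_create: %s\n", strerror(rc));
        return EXIT_FAILURE;
    }

    /* Pin the worker to core 3 (hypothetical choice; ideally a core
       the scheduler can't otherwise use, e.g. isolated via isolcpus). */
    CPU_ZERO(&set);
    CPU_SET(3, &set);
    rc = pthread_setaffinity_np(t, sizeof(set), &set);
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        return EXIT_FAILURE;
    }

    pthread_join(t, NULL);  /* never returns: the worker spins forever */
    return 0;
}
```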
I just wish people who give this one-thread-per-core advice would expand their reasoning or show their work.
It's not blanket good advice for all things.
It is definitely not good advice for all things. For workloads at either end of the CPU/IO spectrum (e.g. almost all waiting on IO, or almost all doing CPU work) it can be a huge win: you get very good L1 cache utilization, you aren't context-switching, and you don't need to handle thread synchronization in your code because no state is shared between threads.
For workloads that are a mix of IO and non-trivial CPU work, it can still work but is much, much harder to get right.
Check out Scylla and its underlying framework Seastar. They expand their reasoning and show the work.
A mistake people make with thread-per-core (TPC) architecture is thinking you can pick and choose the parts you find convenient, when in reality it is much closer to "all or nothing". It may be worse to half-ass a TPC implementation than to not use TPC at all. However, TPC is more efficient in just about all contexts if you do it correctly.
Most developers are unfamiliar with the design idioms for TPC, e.g. how to properly balance and shed load between cores (a sketch of one such idiom follows).
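As one example of those idioms, here is my own sketch of shard routing (not how Seastar specifically does it): partition the state so each core owns one shard outright, and route every request to its owning core by hashing the key, so the hot path never takes a lock. The `shard_for`/`handle` names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Sketch of the core TPC idiom: state is partitioned so each core owns
   one shard outright and never locks it.  In a real system each shard
   would live on its own pinned thread and receive work through a
   single-producer/single-consumer ring. */

#define NSHARDS 4            /* one per core in a real deployment */
#define BUCKETS 16

struct shard {
    uint64_t hits[BUCKETS];  /* private to the owning core: no locks */
};

static struct shard shards[NSHARDS];

/* Route a key to the core that owns it.  Every thread computes the
   same answer, so ownership is deterministic and stateless. */
static unsigned shard_for(uint64_t key) {
    return (unsigned)(key % NSHARDS);  /* or a proper hash */
}

static void handle(uint64_t key) {
    struct shard *s = &shards[shard_for(key)];
    s->hits[key % BUCKETS]++;  /* only the owning core touches this */
}

int main(void) {
    for (uint64_t key = 0; key < 1000; key++)
        handle(key);
    for (unsigned i = 0; i < NSHARDS; i++)
        printf("shard %u: %llu hits in bucket 0\n",
               i, (unsigned long long)shards[i].hits[0]);
    return 0;
}
```

The point is that ownership is deterministic, so no shard's memory is ever touched by two cores; balancing then becomes a question of choosing a hash that spreads keys evenly, and shedding can be done per shard, e.g. when its input queue fills.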
One thread per core if you're CPU-bound and not IO-bound.
In this very specific case, it seems as though the vast majority of the webserver's work is asynchronous and event-based, so the webserver itself is never blocked waiting on input or output: once data is ready you dump it somewhere the kernel can get to it and move on to the next request, if there is one.
I think this gets this specific project close to the platonic ideal of a one-thread-per-core workload if indeed you're never waiting on I/O or any syscalls, but I feel as though it should come with extreme caveats of "this is almost never how the real world works so don't go artificially limiting your application to `nproc` threads without actually testing real-world use cases first".
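To make the "dump it somewhere the kernel can get to it and move on" pattern concrete, here is a minimal single-loop sketch using Linux epoll; the port and the plain echo behavior are placeholders. One loop like this would run per pinned thread, with SO_REUSEPORT so each loop accepts independently.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int lfd = socket(AF_INET, SOCK_STREAM | SOCK_NONBLOCK, 0);
    if (lfd < 0) {
        perror("socket");
        return EXIT_FAILURE;
    }
    int one = 1;
    setsockopt(lfd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one));

    struct sockaddr_in addr = {0};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);  /* placeholder port */
    if (bind(lfd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(lfd, SOMAXCONN) < 0) {
        perror("bind/listen");
        return EXIT_FAILURE;
    }

    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = lfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, lfd, &ev);

    struct epoll_event events[64];
    char buf[4096];
    for (;;) {
        /* The only place this thread blocks: waiting for readiness,
           never in the middle of a read or write. */
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == lfd) {  /* new connection is ready to accept */
                int cfd = accept4(lfd, NULL, NULL, SOCK_NONBLOCK);
                if (cfd >= 0) {
                    struct epoll_event cev = { .events = EPOLLIN,
                                               .data.fd = cfd };
                    epoll_ctl(epfd, EPOLL_CTL_ADD, cfd, &cev);
                }
            } else {          /* data is ready right now */
                ssize_t r = read(fd, buf, sizeof(buf));
                if (r > 0)
                    (void)write(fd, buf, (size_t)r);  /* hand it back to
                                                         the kernel; move on */
                else if (r == 0 || errno != EAGAIN)
                    close(fd);  /* close also drops the fd from epoll */
            }
        }
    }
}
```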
But your CPU availability is time-sliced... so why isn't "more than one thread per core" equivalent to "more CPU"? (My point is, sometimes it is...)
https://github.com/rminnich/9front/tree/ron_nix
That tree has Ron Minnich's port of "Nix" (not NixOS as you may know it) to 9front.
The entire point of this is to keep the kernel from preempting and switching out work on CPU cores that should be dedicated to an application ("application cores").
One could imagine this arrangement plus io_uring would be awfully nice.
In the case of io_uring, one user thread per core is not a bad rule of thumb given that the kernel side is using a pool of worker threads.
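A minimal sketch of what that one user thread looks like, assuming liburing is installed (link with -luring); the file path is an arbitrary example. The thread only queues work and reaps completions; any blocking happens on the kernel side.

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    struct io_uring ring;
    int rc = io_uring_queue_init(8, &ring, 0);
    if (rc < 0) {
        fprintf(stderr, "io_uring_queue_init: %s\n", strerror(-rc));
        return EXIT_FAILURE;
    }

    int fd = open("/etc/hostname", O_RDONLY);  /* arbitrary example file */
    if (fd < 0) {
        perror("open");
        return EXIT_FAILURE;
    }

    /* Queue a read: the kernel's worker pool (or its async paths) does
       the waiting, not this thread. */
    char buf[256];
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* Reap the completion.  A real per-core loop would keep many
       operations in flight and only block when the queue runs dry. */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    if (cqe->res >= 0)
        printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```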