Comment by 8fingerlouie

2 years ago

I'm not well enough versed in the Linux kernel to comment on the post, but as a funny anecdote, 15 years ago, we were developing a system that relied heavily on using as many cores as possible simultaneously.

Performance was critical, and we had beefy developer machines (for the time), all with 8 cores. Development was done in C++, and as the project progressed the system performed very well, and more than exceeded our performance goals.

Fast forward a couple of years and it became time to put the thing into production on a massive 128 core Windows server. Much to our surprise the performance completely tanked when scaled to all 128 cores.

After much debugging, it turned out that the system spent most of its time context switching instead of doing actual work. Because we used atomic operations (compare & swap) for the message queue functionality, every update invalidated the cache line holding that piece of heap memory in every other core's cache, so each time a task passed a message to something else, all the other cores working with that memory had to refetch it from RAM. This was not (as big) a problem on the developer machines, as there were fewer cores and each task had more work queued up for it, but with 16 times as many cores to work with, the system simply ran out of tasks to do faster.

The "cure" was to limit the system to run on 8 cores just like the developer machines, and AFAIK it still runs in that way all these years later.

It's amazing how much "optimization" you can achieve on modern systems by simply fencing processes to run within a cache region, or within a NUMA node. Even my homelab server, with one P-core cluster and two E-core clusters, benefits massively from some simple cpusets that keep each process running within a single cluster, mitigating cross-cluster context switching. Each P-core has its own L2 cache, but the E-core clusters share theirs. Unfortunately all three clusters share the LLC, so there's only so much you can do.