Comment by Salgat

8 hours ago

I'm talking about data being close together in the cache. If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be scheduled onto cores that share the same L2 cache. And no matter what, you're trusting software developers either way, whether it be at the app level, the language/runtime level, or the operating system level.

NUMA aware threading is somewhat rare but it does exist.

It's just reaching into the high art of high-performance programming that fewer and fewer programmers know about. I'm not an HPC expert myself; I just like to study this stuff on the side as a hobby.

So NUMA-awareness is when your code knows that &variable1 is located in one physical location, while &variable2 is somewhere else.

This is possible because NUMA-aware allocators (numa_alloc_onnode in Linux's libnuma, VirtualAllocExNuma on Windows) take a node parameter that places the allocation on a particular NUMA node.
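
To make that concrete, here's a minimal sketch with Linux's libnuma (link with -lnuma); the 64 MB size and node 0 are just placeholders:

    #include <numa.h>   // libnuma; link with -lnuma
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return 1;
        }
        // Ask for 64 MB physically backed by NUMA node 0.
        size_t bytes = 64 * 1024 * 1024;
        void *buf = numa_alloc_onnode(bytes, 0);
        if (buf == NULL) return 1;

        /* ... hand buf to threads pinned near node 0 ... */

        numa_free(buf, bytes);
        return 0;
    }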

Now that you know certain variables are tied to particular physical locations, you can also tie threads to those same NUMA locations with affinity. And with a bit of effort, you can ensure that the threads in one work pool share the same NUMA zones.
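
Sketching what that looks like (node 0 hard-coded, and the worker body made up purely for illustration): each thread in the pool pins itself to the node its data was allocated on.

    #include <numa.h>
    #include <pthread.h>
    #include <stddef.h>

    // Worker for a small pool whose threads all stay on NUMA node 0.
    static void *worker(void *arg) {
        numa_run_on_node(0);        // restrict this thread to node 0's cores
        char *buf = (char *)arg;    // buffer from numa_alloc_onnode(..., 0)
        for (size_t i = 0; i < 64u * 1024 * 1024; i += 4096)
            buf[i]++;               // page traffic stays node-local
        return NULL;
    }

    // Usage:
    //   pthread_t t[4];
    //   for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, buf);
    //   for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);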

---------

Now code-awareness of shared caches is less common. But following the same model of "abstracted work pools with thread affinity + NUMA awareness of data", programmers have been able to keep groups of threads working together on Zen 1 cores that share the same L3 cache.

Shared L2 cache on E-core clusters is new, but it's not a new concept in general (i.e. the same mechanisms and abstractions we used for thread affinity on Zen cores sharing an L3 cache, or for NUMA awareness on multi-socket CPUs, would all still work for L2 cache).
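
On Linux you can discover which logical CPUs share a given cache from sysfs and pin threads accordingly. A rough sketch (the sysfs index level and the CPU ids depend entirely on the machine; the ones below are placeholders):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    // Which CPUs share a cache is listed in e.g.
    //   /sys/devices/system/cpu/cpu0/cache/index2/shared_cpu_list
    // (index2 is usually L2, index3 usually L3). Pin the calling
    // thread to such a set of CPUs:
    static int pin_to_cpus(const int *cpus, int n) {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int i = 0; i < n; i++)
            CPU_SET(cpus[i], &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

    // e.g. if CPUs 0-3 share an L2 cluster:
    //   int cluster[] = {0, 1, 2, 3};
    //   pin_to_cpus(cluster, 4);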

I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.
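
For what it's worth, oneTBB does expose NUMA-constrained arenas these days (as I understand it, the library needs hwloc support to actually see the topology, otherwise it reports a single node). Roughly along these lines, one arena per node:

    #include <oneapi/tbb/info.h>
    #include <oneapi/tbb/task_arena.h>
    #include <oneapi/tbb/task_group.h>
    #include <cstddef>
    #include <vector>

    int main() {
        // One arena constrained to each NUMA node the library can see.
        std::vector<tbb::numa_node_id> nodes = tbb::info::numa_nodes();
        std::vector<tbb::task_arena>   arenas(nodes.size());
        std::vector<tbb::task_group>   groups(nodes.size());

        for (std::size_t i = 0; i < nodes.size(); ++i)
            arenas[i].initialize(tbb::task_arena::constraints(nodes[i]));

        for (std::size_t i = 0; i < nodes.size(); ++i)
            arenas[i].execute([&, i] {
                groups[i].run([] { /* node-local work goes here */ });
            });

        for (std::size_t i = 0; i < nodes.size(); ++i)
            arenas[i].execute([&, i] { groups[i].wait(); });
        return 0;
    }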

  • > I don't know if the libraries support that. But I bet Intel's library (TBB) and their programmers are working on keeping their abstractions clean and efficient.

    Intel can declare in ACPI a set of nodes, the distances between nodes, and then Linux/libnuma/etc pick it up.

    So, e.g., in AMD's SLIT tables the local node is 10, nodes within the same partition are 11, nodes within the same socket are 12, and distant sockets are >= 20.

    There are fancier, more detailed tables (e.g. HMAT), and some code out there uses them, but that's kind of beyond the scope of libnuma.
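
    Those SLIT distances are visible from userspace, too; a quick libnuma sketch (the values will differ by platform):

        #include <numa.h>
        #include <stdio.h>

        int main(void) {
            if (numa_available() < 0) return 1;
            int max = numa_max_node();
            // numa_distance() reports the ACPI SLIT values:
            // 10 = local, larger = farther away.
            for (int a = 0; a <= max; a++) {
                for (int b = 0; b <= max; b++)
                    printf("%3d ", numa_distance(a, b));
                printf("\n");
            }
            return 0;
        }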

> you're trusting software developers either way, whether it be at the app level, the language/runtime level, or the operating system level.

I trust systems to do better based on observed behavior rather than a software engineer's guess of how it will be scheduled. Who knows if, in a given use case, the program is a "small" part of the system or a "large" part that should get preferential placement and scheduling.

> If a threadpool manager is hinted that 4 threads are going to share a lot of memory, they can be scheduled onto cores that share the same L2 cache.

And so this is kind of a weird thing: we know we're going to be performance-critical and we need things to be forced adjacent... but we don't know the exact details of the hardware we're running on. (Else, just numa_bind and be done...)
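
(For reference, the "numa_bind and be done" route is about this much code with libnuma; the node string "0" is a placeholder for whichever node you actually want:)

    #include <numa.h>

    int main(void) {
        if (numa_available() < 0) return 1;
        // Bind this process (CPUs and memory) to NUMA node 0.
        struct bitmask *nodes = numa_parse_nodestring("0");
        if (nodes == NULL) return 1;
        numa_bind(nodes);
        numa_bitmask_free(nodes);

        /* ... everything from here on runs and allocates on node 0 ... */
        return 0;
    }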

  • The beauty is that you don't care what hardware you run on: all you're annotating are very useful but generic properties, such as which threads are sharing a lot of memory, or perhaps that a thread should have highest performance priority so that internally it stays on P-cores instead of the more scalable E-cores. Very simple, optional hints.

    • > should have highest performance priority so that internally it stays on P-cores

      Everything will decide that it wants P-cores; it's not punished for battery or energy impact, and it wants to win out over other applications so that users have a better experience with it.

      And even if not made in bad faith, it doesn't know what else is running on the system.

      Also these decisions tend to be unduly influenced by microbenchmarks and then don't apply to the real system.

      > which threads are sharing a lot of memory

      But if they're not super active, should the scheduler really change what it's doing? And doesn't the size of that L2 matter? The sharing doesn't help if, e.g., the data is going to get evicted before there's any benefit from it.

      In the end, if you don't know pretty specific details of the environment you'll run on (what the hardware is like, what the load is like, what the data set size is like, and what else will be running on the machine), it is probably better to leave this decision to the scheduler.

      If you do know all those things, and it's worth tuning this stuff in depth, odds are you're doing HPC and you already know what the machine is like.