With 2 IO dies, aren't there effectively 2 meta NUMA nodes with 4 leaf nodes each? Or am I off base there?
The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.
I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?
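Not an answer, but here's a minimal sketch (my own, assuming Linux, C11 atomics, and pthreads; build with something like `gcc -O2 -pthread atomics.c`) of the kind of micro-benchmark that makes the question concrete: pin it to a single CCD and then spread it across dies with `taskset`/`numactl` and compare the throughput of a contended fetch-add.

```c
/* Hammers one shared atomic counter from N threads and reports throughput,
 * so you can compare "all threads on one die" vs. "spread across dies". */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 10000000UL

static atomic_ulong counter;

static void *worker(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

int main(int argc, char **argv) {
    int nthreads = argc > 1 ? atoi(argv[1]) : 4;  /* thread count from argv */
    pthread_t *tids = malloc(sizeof(pthread_t) * nthreads);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads, %lu increments in %.3f s (%.1f Mops/s)\n",
           nthreads, (unsigned long)nthreads * ITERS, secs,
           (unsigned long)nthreads * ITERS / secs / 1e6);
    free(tids);
    return 0;
}
```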
If you query the NUMA layout tree, you have two sibling hw threads per core, then a cluster of 8 or 12 actual cores per die (up to 4 or 8 dies per socket), then the individual sockets (up to 2 sockets per machine).
Before 8 cores per die (introduced in Zen 3 and retained in Zen 4, 5, and 6), i.e. in the Zen 1/1+ and Zen 2 series, this would have been two sets of four cores instead of one set of eight (and a split L3 instead of a unified one). I can't remember whether the split-CCX parts had their own NUMA layer in the tree, or if the CCXs were just iterated in pairs.
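If anyone wants to see what that tree looks like on their own machine, `lstopo` or `numactl --hardware` will print it; here's a minimal sketch of my own using hwloc (assuming hwloc 2.x, built with `gcc topo.c -o topo -lhwloc`) that just counts the levels described above:

```c
/* Counts the levels of the topology tree: packages (sockets), NUMA nodes,
 * L3 caches (one per CCD/CCX), cores, and hardware threads (PUs). */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    printf("packages:   %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
    printf("NUMA nodes: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));
    printf("L3 caches:  %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE));
    printf("cores:      %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
    printf("hw threads: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    hwloc_topology_destroy(topo);
    return 0;
}
```

Note that on EPYC the number of NUMA nodes actually exposed also depends on BIOS settings (NPS1/2/4, "L3 as NUMA"), so the answer can differ between otherwise identical machines.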
There should be plenty of existing programming models that can be reused, since HPC used single-image multi-hop NUMA systems heavily before Beowulf clusters took over.
Even today, I think very large enterprise systems (where a single kernel runs on a single system that spans multiple racks) are built like this, too.
Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but
* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…
* I’m wondering if explicit communication is better from one die to another in this sort of system (a minimal hybrid sketch follows this list).
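For concreteness, a minimal hybrid MPI + OpenMP sketch of my own (compile with something like `mpicc -fopenmp hybrid.c`; rank placement flags are launcher-specific, e.g. Open MPI's `--map-by numa`), where the idea would be one MPI rank per die and OpenMP threads within it:

```c
/* One MPI rank per die/NUMA node, OpenMP threads inside each rank.
 * Rank and thread counts come from the launcher and OMP_NUM_THREADS. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for a threading level that allows OpenMP threads inside a rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        #pragma omp critical
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Launched with something like `OMP_NUM_THREADS=8 mpirun -np 8 --map-by numa ./hybrid`, each rank's threads stay within one die and anything that crosses the IO die becomes an explicit MPI message rather than an implicit remote memory access.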