
Comment by bee_rider

3 days ago

Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but

* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…

* I’m wondering if explicit communication is better from one die to another in this sort of system.

With 2 IO dies, aren't there effectively 2 meta NUMA nodes with 4 leaf nodes each? Or am I off base there?

The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.

I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?
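
A minimal sketch of that split, assuming one MPI rank per NUMA node (die) with OpenMP threads inside it; the launch line uses Open MPI's --map-by/--bind-to spellings, which vary by MPI implementation:

```c
/* Sketch: one MPI rank per NUMA node (die), OpenMP threads within it.
 * Hypothetical launch (Open MPI flag spellings; adjust for your MPI):
 *   mpirun -np 8 --map-by numa --bind-to numa ./hybrid
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;

    /* FUNNELED: only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        /* Shared-memory traffic stays inside this rank's die; anything
         * that has to cross dies or sockets goes through MPI explicitly. */
        local += omp_get_thread_num() + 1;
    }

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks, sum = %g\n", nranks, global);

    MPI_Finalize();
    return 0;
}
```

With ranks bound to NUMA domains like this, the "2 meta nodes with 4 leaves each" question mostly becomes a question of how many ranks you launch per socket.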

  • If you query the NUMA layout tree, you have two sibling hw threads per core, then a cluster of 8 or 12 actual cores per die (up to 4 or 8 dies per socket), then the individual sockets (up to 2 sockets per machine).

    Before 8 cores per die (introduced in Zen 3 and retained in Zen 4, 5, and 6), on the Zen 1/1+ and Zen 2 parts this would have been two sets of four cores instead of one set of eight (and a split L3 instead of a unified one). I can't remember whether the split CCXs got their own NUMA layer in the tree or were just iterated in pairs.
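
    If you want to see that tree on a given machine, hwloc will show it; a rough sketch (assumes hwloc 2.x, where NUMA nodes hang off the tree as memory children, so they show up in the counts rather than in the parent chain; lstopo from the same package draws the whole thing):

    ```c
    /* Walk up from one hardware thread (PU) and print each enclosing level:
     * typically PU -> core -> L3/CCX -> package -> machine, with extra
     * levels (dies, groups) on some systems. */
    #include <hwloc.h>
    #include <stdio.h>

    int main(void) {
        hwloc_topology_t topo;
        hwloc_topology_init(&topo);
        hwloc_topology_load(topo);

        hwloc_obj_t obj = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, 0);
        char type[64];
        while (obj) {
            hwloc_obj_type_snprintf(type, sizeof type, obj, 0);
            printf("%s L#%u (depth %d)\n", type, obj->logical_index, obj->depth);
            obj = obj->parent;
        }

        printf("PUs: %d, cores: %d, NUMA nodes: %d, packages: %d\n",
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU),
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE),
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE),
               hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));

        hwloc_topology_destroy(topo);
        return 0;
    }
    ```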

    • What I find myself wondering about is the performance impact of cross-thread communication in this scenario. With the nested domains, it seems like crossing each distinct boundary should carry a different (and increasingly severe) penalty. Yet the languages we write in and the programming models we employ don't seem particularly well suited to expressing how we want our code to adapt to such constraints, at least not in a generalized manner.

      I realize that HPC code can be customized to the specific device it will run on, but more widely deployed software is going to want to abstract these increasingly complex relationships.
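
      One way to put numbers on those boundary crossings is a cache-line ping-pong between two pinned threads; a Linux-only sketch (the CPU numbers are placeholders you'd pick from lstopo so the pair shares a core, a CCX, a die, or nothing at all):

      ```c
      /* Build: cc -O2 -pthread pingpong.c
       * Two threads bounce a flag back and forth; the average round trip
       * roughly tracks the cost of the boundary between CPU_A and CPU_B. */
      #define _GNU_SOURCE
      #include <pthread.h>
      #include <sched.h>
      #include <stdatomic.h>
      #include <stdio.h>
      #include <time.h>

      #define CPU_A 0        /* placeholder: pick from lstopo / numactl -H */
      #define CPU_B 8
      #define ROUNDS 1000000

      static _Atomic int flag = 0;   /* the contended cache line */

      static void pin(int cpu) {
          cpu_set_t set;
          CPU_ZERO(&set);
          CPU_SET(cpu, &set);
          pthread_setaffinity_np(pthread_self(), sizeof set, &set);
      }

      static void *pong(void *arg) {
          (void)arg;
          pin(CPU_B);
          for (int i = 0; i < ROUNDS; i++) {
              while (atomic_load_explicit(&flag, memory_order_acquire) != 1) ;
              atomic_store_explicit(&flag, 0, memory_order_release);
          }
          return NULL;
      }

      int main(void) {
          pin(CPU_A);
          pthread_t t;
          pthread_create(&t, NULL, pong, NULL);

          struct timespec t0, t1;
          clock_gettime(CLOCK_MONOTONIC, &t0);
          for (int i = 0; i < ROUNDS; i++) {
              atomic_store_explicit(&flag, 1, memory_order_release);
              while (atomic_load_explicit(&flag, memory_order_acquire) != 0) ;
          }
          clock_gettime(CLOCK_MONOTONIC, &t1);
          pthread_join(t, NULL);

          double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
          printf("avg round trip: %.1f ns\n", ns / ROUNDS);
          return 0;
      }
      ```

      Running the same binary with different CPU pairs makes the SMT, CCX, die, and socket penalties directly comparable.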


  • There should be plenty of existing programming models that can be reused, because HPC used single-image, multi-hop NUMA systems a lot before Beowulf clusters took over.

    Even today, I think very large enterprise systems (where a single kernel image runs on one machine that spans multiple racks) are built like this, too.