Comment by markhahn

4 days ago

MPI is fine, but have you heard of threads?

Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but

* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…

* I’m wondering if explicit communication is better from one die to another in this sort of system.

  • With 2 IO dies aren't there effectively 2 meta NUMA nodes with 4 leaf nodes each? Or am I off base there?

    The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.

    I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?

    • If you query the NUMA layout tree, you have two sibling hw threads per core, then a cluster of 8 or 12 actual cores per die (up to 4 or 8 dies per socket), then the individual sockets (up to 2 sockets per machine).

      Before the move to 8 cores per die (introduced in Zen 3 and retained in 4, 5 and 6), the Zen 1/+ and 2 series would have had two sets of four cores per die instead of one set of eight (and a split L3 instead of a unified one). I can't remember whether the split CCX got its own NUMA layer in the tree, or whether the cores were just iterated in pairs.

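      The tree described above can be sketched as plain bookkeeping. This is a hedged model, not a query of real hardware: it assumes the 2-socket / 8-die / 8-core / SMT-2 configuration from the comment above, and real SKUs vary (on an actual Linux box, `lstopo` from hwloc or `numactl --hardware` show the real tree).

      ```python
      # Sketch of the NUMA/topology tree described above, as nested dicts.
      # Assumes a hypothetical 2-socket machine with 8 dies per socket,
      # 8 cores per die, and 2 SMT threads per core; real SKUs vary.

      SOCKETS, DIES_PER_SOCKET, CORES_PER_DIE, THREADS_PER_CORE = 2, 8, 8, 2

      def build_topology():
          """Return {socket: {die: {core: [hw_thread_ids]}}} with global IDs."""
          tid = 0
          topo = {}
          for s in range(SOCKETS):
              topo[s] = {}
              for d in range(DIES_PER_SOCKET):
                  topo[s][d] = {}
                  for c in range(CORES_PER_DIE):
                      topo[s][d][c] = [tid, tid + 1]  # sibling hw threads
                      tid += THREADS_PER_CORE
          return topo

      topo = build_topology()
      total_threads = sum(
          len(threads)
          for dies in topo.values()
          for cores in dies.values()
          for threads in cores.values()
      )
      print(total_threads)  # 2 * 8 * 8 * 2 = 256
      ```

      Even this toy version makes the point upthread: that is a four-level hierarchy before you've scheduled a single thread.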

    • There should be plenty of existing programming models that can be reused, because HPC relied heavily on single-system-image, multi-hop NUMA machines before Beowulf clusters took over.

      Even today, I think very large enterprise systems (where a single kernel runs on a single system that spans multiple racks) are built like this, too.
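      The conventional decomposition mentioned at the top of the thread (MPI-style ranks across dies, OpenMP-style threads within a die) can be sketched as a pure mapping. This is an illustration under assumed figures (2 sockets, 8 dies, 8 cores per die), with no actual MPI or OpenMP involved; `rank_of` and `threads_of` are hypothetical helper names.

      ```python
      # Sketch of the hybrid decomposition discussed upthread: one rank per
      # die, one worker thread per core within that rank. Pure bookkeeping;
      # the 2-socket / 8-die / 8-core figures are assumptions.

      SOCKETS, DIES_PER_SOCKET, CORES_PER_DIE = 2, 8, 8

      def rank_of(socket, die):
          """Rank ID for a die: ranks numbered die-major within each socket."""
          return socket * DIES_PER_SOCKET + die

      def threads_of(rank):
          """Global core IDs owned by a rank (its thread pool)."""
          first = rank * CORES_PER_DIE
          return list(range(first, first + CORES_PER_DIE))

      ranks = SOCKETS * DIES_PER_SOCKET        # 16 ranks, one per die
      print(ranks, threads_of(rank_of(1, 0)))  # rank 8 owns cores 64..71
      ```

      Pinning each rank's threads to its own die keeps all intra-rank sharing inside one L3/IO-die domain, which is exactly the explicit-communication boundary the earlier comment was asking about.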