With 2 IO dies, aren't there effectively 2 meta NUMA nodes with 4 leaf nodes each? Or am I off base there?
The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.
I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?
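Not an answer, but here's a minimal sketch (my own, assuming Linux, C11 atomics, and pthreads; build with something like `gcc -O2 -pthread atomics.c`) of the kind of micro-benchmark that makes the question concrete: pin it to a single CCD and then spread it across dies with `taskset`/`numactl` and compare the throughput of a contended fetch-add.

```c
/* Hammers one shared atomic counter from N threads and reports throughput,
 * so you can compare "all threads on one die" vs. "spread across dies". */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define ITERS 10000000UL

static atomic_ulong counter;

static void *worker(void *arg) {
    (void)arg;
    for (unsigned long i = 0; i < ITERS; i++)
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
    return NULL;
}

int main(int argc, char **argv) {
    int nthreads = argc > 1 ? atoi(argv[1]) : 4;  /* thread count from argv */
    pthread_t *tids = malloc(sizeof(pthread_t) * nthreads);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < nthreads; i++)
        pthread_create(&tids[i], NULL, worker, NULL);
    for (int i = 0; i < nthreads; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%d threads, %lu increments in %.3f s (%.1f Mops/s)\n",
           nthreads, (unsigned long)nthreads * ITERS, secs,
           (unsigned long)nthreads * ITERS / secs / 1e6);
    free(tids);
    return 0;
}
```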
If you query the NUMA layout tree, you have two sibling hw threads per core, then a cluster of 8 or 12 actual cores per die (up to 4 or 8 dies per socket), then the individual sockets (up to 2 sockets per machine).
Before 8 cores per die (introduced in Zen 3 and retained in Zen 4, 5, and 6), i.e. in the Zen 1/1+ and Zen 2 series, this would have been two sets of four cores instead of one set of eight (and a split L3 instead of a unified one). I can't remember whether the split-CCX parts had their own NUMA layer in the tree, or if the CCXs were just iterated in pairs.
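If anyone wants to see what that tree looks like on their own machine, `lstopo` or `numactl --hardware` will print it; here's a minimal sketch of my own using hwloc (assuming hwloc 2.x, built with `gcc topo.c -o topo -lhwloc`) that just counts the levels described above:

```c
/* Counts the levels of the topology tree: packages (sockets), NUMA nodes,
 * L3 caches (one per CCD/CCX), cores, and hardware threads (PUs). */
#include <hwloc.h>
#include <stdio.h>

int main(void) {
    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    printf("packages:   %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PACKAGE));
    printf("NUMA nodes: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE));
    printf("L3 caches:  %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_L3CACHE));
    printf("cores:      %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE));
    printf("hw threads: %d\n", hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU));

    hwloc_topology_destroy(topo);
    return 0;
}
```

Note that on EPYC the number of NUMA nodes actually exposed also depends on BIOS settings (NPS1/2/4, "L3 as NUMA"), so the answer can differ between otherwise identical machines.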
There should be plenty of existing programming models that can be reused, since HPC used single-image multi-hop NUMA systems heavily before Beowulf clusters took over.
Even today, I think very large enterprise systems (where a single kernel runs on a single system that spans multiple racks) are built like this, too.
Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but
* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…
* I’m wondering if explicit communication is better from one die to another in this sort of system (a minimal hybrid sketch follows this list).
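For concreteness, a minimal hybrid MPI + OpenMP sketch of my own (compile with something like `mpicc -fopenmp hybrid.c`; rank placement flags are launcher-specific, e.g. Open MPI's `--map-by numa`), where the idea would be one MPI rank per die and OpenMP threads within it:

```c
/* One MPI rank per die/NUMA node, OpenMP threads inside each rank.
 * Rank and thread counts come from the launcher and OMP_NUM_THREADS. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int provided, rank, nranks;
    /* Ask for a threading level that allows OpenMP threads inside a rank. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    #pragma omp parallel
    {
        #pragma omp critical
        printf("rank %d/%d, thread %d/%d\n",
               rank, nranks, omp_get_thread_num(), omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}
```

Launched with something like `OMP_NUM_THREADS=8 mpirun -np 8 --map-by numa ./hybrid`, each rank's threads stay within one die and anything that crosses the IO die becomes an explicit MPI message rather than an implicit remote memory access.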