Comment by bee_rider

2 months ago

How do people use these things? Map MPI ranks to dies, instead of compute nodes?


bee_rider

Yeah, there's an option to configure one NUMA node per CCD that can speed up some apps.

Gemma.cpp has nested thread pools, one per chiplet, and one across all chiplets. With such core counts it is quite important to minimize any kind of sharing, even RMW atomics.
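
To make the "one pool per chiplet, one across chiplets" idea concrete, here is a minimal sketch of a two-level reduction. It is not Gemma.cpp's code: `hierarchical_sum`, `PaddedSum`, and the 64-byte line size are assumptions made for illustration, and thread pinning (which a real version would need so each group actually stays on its chiplet) is left out because it is platform-specific. The point is that workers only write private, cache-line-padded slots, so no atomic read-modify-write is ever shared.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Padded to a cache line so two workers never write to the same line.
// 64 bytes is an assumption (the common x86 line size).
struct alignas(64) PaddedSum {
    std::uint64_t value = 0;
};

// Hypothetical helper: sums `data` using `chiplets` leader threads, each of
// which runs `workers_per_chiplet` worker threads. Pinning is omitted.
std::uint64_t hierarchical_sum(const std::vector<std::uint32_t>& data,
                               std::size_t chiplets,
                               std::size_t workers_per_chiplet) {
    assert(chiplets > 0 && workers_per_chiplet > 0);
    const std::size_t total_workers = chiplets * workers_per_chiplet;
    std::vector<PaddedSum> per_chiplet(chiplets);
    std::vector<std::thread> leaders;
    leaders.reserve(chiplets);

    for (std::size_t c = 0; c < chiplets; ++c) {
        // Outer level: one leader per chiplet.
        leaders.emplace_back([&, c] {
            std::vector<PaddedSum> partial(workers_per_chiplet);
            std::vector<std::thread> workers;
            workers.reserve(workers_per_chiplet);
            for (std::size_t w = 0; w < workers_per_chiplet; ++w) {
                // Inner level: each worker sums its own contiguous slice.
                workers.emplace_back([&, w] {
                    const std::size_t id = c * workers_per_chiplet + w;
                    const std::size_t begin = data.size() * id / total_workers;
                    const std::size_t end = data.size() * (id + 1) / total_workers;
                    std::uint64_t local = 0;  // private accumulator
                    for (std::size_t i = begin; i < end; ++i) local += data[i];
                    partial[w].value = local;  // single write to a private slot
                });
            }
            for (auto& t : workers) t.join();
            // Combine once per chiplet, after the workers are done.
            std::uint64_t sum = 0;
            for (const auto& p : partial) sum += p.value;
            per_chiplet[c].value = sum;
        });
    }
    for (auto& t : leaders) t.join();

    // Combine once across chiplets.
    std::uint64_t total = 0;
    for (const auto& p : per_chiplet) total += p.value;
    return total;
}
```

Calling `hierarchical_sum(data, 8, 8)` would run 8 groups of 8 workers. Spawning threads per call keeps the sketch short; a real pool would keep them alive and reuse them.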

markhahn 2 months ago

MPI is fine, but have you heard of threads?

bee_rider 2 months ago
Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but:
* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…
* I’m wondering if explicit communication is better from one die to another in this sort of system.
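
As a rough sketch of what "one rank per die" would mean in practice, the arithmetic looks something like this. Everything here is hypothetical: `Placement` and `place_rank` are made-up names, and the code assumes ranks are filled node by node and that core ids are contiguous within a die, which real machines do not guarantee. Normally the MPI launcher's binding options or a topology library would work this out instead.

```cpp
#include <cassert>

// Where one MPI rank would live under a one-rank-per-die layout.
struct Placement {
    int node;        // compute node index
    int die;         // die (CCD) index within that node
    int first_core;  // first core id owned by the rank
    int last_core;   // last core id owned by the rank (inclusive)
};

// Hypothetical helper. Assumes ranks fill a node's dies before moving to the
// next node, and that each die owns a contiguous block of core ids.
Placement place_rank(int rank, int dies_per_node, int cores_per_die) {
    assert(rank >= 0 && dies_per_node > 0 && cores_per_die > 0);
    Placement p;
    p.node = rank / dies_per_node;
    p.die = rank % dies_per_node;
    p.first_core = p.die * cores_per_die;
    p.last_core = p.first_core + cores_per_die - 1;
    return p;
}
```

Each rank would then run its threads only on `first_core` through `last_core`, and anything crossing a die boundary becomes an explicit message rather than shared memory.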
- fc417fc802 2 months ago
  
  With 2 IO dies aren't there effectively 2 meta NUMA nodes with 4 leaf nodes each? Or am I off base there?
  The above doesn't even consider the possibility of multi-CPU systems. I suspect the existing programming models are quickly going to become insufficient for modeling these systems.
  I also find myself wondering how atomic instruction performance will fare on these. GPU ISA and memory model on CPU when?
  