Comment by Neywiny

4 days ago

32 cores on a die, 256 on a package. Still stunning, though.

How do people use these things? Map MPI ranks to dies, instead of compute nodes?

  • Gemma.cpp has nested thread pools, one per chiplet, and one across all chiplets. With such core counts it is quite important to minimize any kind of sharing, even RMW atomics.

  • Yeah, there's an option to configure one NUMA node per CCD that can speed up some apps.

  • MPI is fine, but have you heard of threads?

    • Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but

      * It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…

      * I’m wondering if explicit communication is better from one die to another in this sort of system.
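
For anyone trying to picture the nested-pool idea from the Gemma.cpp reply, here is a rough Python sketch of the shape: one outer pool with a task per group (standing in for a chiplet) and one inner pool per group. This is not Gemma.cpp's actual code. The names `chunk`, `hierarchical_sum`, `num_groups`, and `workers_per_group` are made up for illustration, and a real implementation would also pin threads to cores, which this does not do. The point is only that each worker returns a private partial result and the combining happens once per level, so no shared counter or atomic is touched in the hot loop.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk(items, parts):
    """Split items into `parts` contiguous slices of near-equal size."""
    size, extra = divmod(len(items), parts)
    out, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        out.append(items[start:end])
        start = end
    return out


def hierarchical_sum(data, num_groups=4, workers_per_group=8):
    """Two-level reduction: outer pool across groups, inner pool per group.

    Hypothetical helper: a "group" stands in for a chiplet/CCD.
    """

    def group_sum(group_data):
        # Inner pool: each worker sums only its own slice and returns a
        # private partial result, so workers share no mutable state.
        with ThreadPoolExecutor(max_workers=workers_per_group) as inner:
            partials = inner.map(sum, chunk(group_data, workers_per_group))
            # Combine once per group, after the workers are done.
            return sum(partials)

    # Outer pool: one task per group; group results are combined once at the end.
    with ThreadPoolExecutor(max_workers=num_groups) as outer:
        return sum(outer.map(group_sum, chunk(data, num_groups)))
```

Calling `hierarchical_sum(list(range(1000)))` gives the same answer as a plain `sum`; the only thing that changes is where the partial results get combined.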
