Neywiny · 5 days ago
32 cores on a die, 256 on a package. Still stunning though.
bee_rider · 5 days ago
How do people use these things? Map MPI ranks to dies instead of compute nodes?
janwas · 1 day ago
Gemma.cpp has nested thread pools: one per chiplet, and one across all chiplets. With such core counts it is quite important to minimize any kind of sharing, even read-modify-write (RMW) atomics.
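Not Gemma.cpp's actual code, but a minimal C++ sketch of the underlying idea: rather than every core doing an atomic fetch_add on one shared counter, each worker writes only to its own cache-line-padded slot, and a single pass reduces the partials at the end.

    // Sketch: per-worker accumulators instead of a shared RMW atomic.
    #include <cstdint>
    #include <thread>
    #include <vector>

    int main() {
      const unsigned num_workers = std::thread::hardware_concurrency();
      std::vector<double> data(1 << 24, 1.0);

      // One partial sum per worker, padded to its own cache line so
      // workers never contend on the same line.
      struct alignas(64) Partial { double sum = 0.0; };
      std::vector<Partial> partials(num_workers);

      std::vector<std::thread> workers;
      for (unsigned w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
          const size_t begin = data.size() * w / num_workers;
          const size_t end = data.size() * (w + 1) / num_workers;
          double local = 0.0;
          for (size_t i = begin; i < end; ++i) local += data[i];
          partials[w].sum = local;  // single plain store, no RMW
        });
      }
      for (auto& t : workers) t.join();

      double total = 0.0;
      for (const auto& p : partials) total += p.sum;  // reduce once, on one thread
      return total == static_cast<double>(data.size()) ? 0 : 1;
    }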
wmf · 5 days ago
Yeah, there's an option to configure one NUMA node per CCD that can speed up some apps.
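To see what that firmware setting exposes to software, here is a small sketch using plain libnuma calls (link with -lnuma). Assuming the one-node-per-CCD option is enabled, each reported node should cover one CCD's cores; whether it actually maps 1:1 to CCDs depends on the platform.

    // Sketch: list which CPUs belong to which NUMA node via libnuma.
    #include <numa.h>
    #include <cstdio>

    int main() {
      if (numa_available() < 0) {
        std::puts("no NUMA support");
        return 1;
      }
      const int max_cpu = numa_num_configured_cpus();
      const int max_node = numa_max_node();
      for (int node = 0; node <= max_node; ++node) {
        std::printf("node %d:", node);
        for (int cpu = 0; cpu < max_cpu; ++cpu) {
          if (numa_node_of_cpu(cpu) == node) std::printf(" %d", cpu);
        }
        std::printf("\n");
      }
      return 0;
    }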
markhahn · 4 days ago
MPI is fine, but have you heard of threads?
bee_rider · 4 days ago
Sure, the conventional way of doing things is OpenMP on a node and MPI across nodes, but:
* It just seems like a lot of threads to wrangle without some hierarchy. Nested OpenMP is also possible…
* I’m wondering if explicit communication is better from one die to another in this sort of system; a rough hybrid sketch follows below.
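For what it's worth, a hedged sketch of that hybrid layout: one MPI rank per die/NUMA node, OpenMP threads within it, and cross-die traffic going explicitly through MPI. The mpirun flags shown are Open MPI-style and only illustrative; exact mapping options vary by runtime and topology.

    // Hypothetical launch, one rank per NUMA node (i.e. per CCD if the
    // firmware exposes one node per CCD):
    //   mpirun -np 8 --map-by numa --bind-to numa ./hybrid
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>
    #include <vector>

    int main(int argc, char** argv) {
      int provided = 0;
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

      int rank = 0, nranks = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nranks);

      // Each rank sums its own slice with the threads of "its" die...
      const long n_per_rank = 1 << 20;
      std::vector<double> slice(n_per_rank, 1.0);
      double local = 0.0;
    #pragma omp parallel for reduction(+ : local)
      for (long i = 0; i < n_per_rank; ++i) local += slice[i];

      // ...and communication between dies is explicit, via MPI.
      double total = 0.0;
      MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
        std::printf("%d ranks x %d threads, total = %.0f\n", nranks,
                    omp_get_max_threads(), total);

      MPI_Finalize();
      return 0;
    }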