Comment by ElectricalUnion
2 months ago
Sad reality is that the MI300X isn't a monolithic die, so the chiplets have internal bandwidth limitations (ofc less severe than going over PCIe/NVLink).
In AMD's own parlance, the "Modular Chiplet Platform" presents itself either in single-I-don't-care-about-speed-or-latency "Single Partition X-celerator" mode or in multiple-I-actually-totally-do-care-about-speed-and-latency-NUMA-like "Core Partitioned X-celerator" mode.
So you kinda still need to care what-loads-where.
I have never heard of a GPU where a deep understanding of how memory is managed was not critical to getting the best performance.