Comment by krasin
2 days ago
I refer to the RDNA4 instruction set manual ([1]), page 90, Table 41. WMMA Instructions.
They support FP8/BF8 with F32 accumulate and also IU4 with I32 accumulate. The max matrix size is 16x16. For comparison, NVIDIA Blackwell GB200 supports matrices up to 256x32 for FP8 and 256x96 for NVFP4.
This matters for overall throughput: a bigger matrix unit is cheaper to feed in terms of memory bandwidth, since the number of MACs per cycle grows as O(n^2) with the edge length n of a systolic array, while the number of operands streamed in and out per cycle grows only as O(n).
1. https://www.amd.com/content/dam/amd/en/documents/radeon-tech...
2. https://semianalysis.com/2025/06/23/nvidia-tensor-core-evolu...
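The scaling argument can be sketched numerically (a back-of-the-envelope model of my own, not anything from the ISA manual):

```python
# Illustration of the O(n^2) compute vs O(n) I/O scaling claimed above:
# an n x n systolic array performs n^2 multiply-accumulates per cycle,
# but only streams O(n) operands per cycle across its edges, so the
# compute-to-bandwidth ratio grows linearly with n.

def per_cycle_stats(n: int) -> tuple[int, int, float]:
    macs = n * n        # one MAC per processing element per cycle
    operands = 2 * n    # a column of A plus a row of B entering per cycle
    return macs, operands, macs / operands

for n in (16, 32, 256):
    macs, operands, ratio = per_cycle_stats(n)
    print(f"n={n:>3}: {macs:>5} MACs/cycle, {operands:>3} operands/cycle, ratio {ratio:.0f}")
```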
It's misleading to compare a desktop GPU against a data center GPU on these metrics. Blackwell data center tensor cores are different from Blackwell consumer tensor cores, and the same holds on the AMD side.
Also, the size of the native / atomic matrix fragment size isn't relevant for memory bandwidth because you can always build larger matrices out of multiple fragments in the register file. A single matrix fragment is read from memory once and used in multiple matmul instructions, which has the same effect on memory bandwidth as using a single larger matmul instruction.
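To make the reuse argument concrete, here's a sketch (my own illustration; the fragment and block sizes are arbitrary examples) counting element loads for a register block built out of small fragments versus one hypothetical large instruction:

```python
# Build a B x B register-blocked matmul out of F x F fragments. Each
# fragment is loaded from memory once and then reused across multiple
# fragment-matmul instructions, so the total elements moved match what
# a single native B x B matmul instruction would read.

F = 16  # native fragment edge (e.g. a 16x16 WMMA fragment)
B = 64  # register block edge, tiled as a (B/F) x (B/F) grid of fragments

def loads_per_block_matmul(block: int, frag: int) -> tuple[int, int]:
    """Elements loaded and fragment-matmuls issued to multiply one
    block x block pair of A and B tiles, loading each fragment once."""
    t = block // frag
    a_frags = t * t              # each A fragment loaded once
    b_frags = t * t              # each B fragment loaded once
    matmuls = t * t * t          # each fragment reused in t instructions
    return (a_frags + b_frags) * frag * frag, matmuls

elements_moved, n_instructions = loads_per_block_matmul(B, F)
# One native 64x64 instruction would also read 2 * 64 * 64 input elements:
assert elements_moved == 2 * B * B
print(elements_moved, n_instructions)
```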