Comment by atq2119
2 days ago
It's misleading to compare a desktop GPU against a data center GPU on these metrics. Blackwell data center tenor cores are different from Blackwell consumer tensor cores, and same for the AMD side.
Also, the size of the native / atomic matrix fragment size isn't relevant for memory bandwidth because you can always build larger matrices out of multiple fragments in the register file. A single matrix fragment is read from memory once and used in multiple matmul instructions, which has the same effect on memory bandwidth as using a single larger matmul instruction.
No comments yet
Contribute on Hacker News ↗