Comment by atq2119

2 days ago

It's misleading to compare a desktop GPU against a data center GPU on these metrics. Blackwell data center tensor cores are different from Blackwell consumer tensor cores, and the same goes for the AMD side.

Also, the size of the native / atomic matrix fragment isn't relevant for memory bandwidth, because you can always build larger matrices out of multiple fragments in the register file. A single matrix fragment is read from memory once and then used in multiple matmul instructions, which has the same effect on memory bandwidth as using a single larger matmul instruction.
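
To make the fragment argument concrete, here is a rough counting sketch in Python. It is a toy model, not a description of any real GPU: `count_traffic`, the square `tile` held in registers, and the sizes used are all made up for illustration. It only counts how many matrix elements get loaded and how many matmul instructions get issued when each register tile of C is built from `frag` x `frag` fragments.

```python
def count_traffic(M, N, K, tile, frag):
    """Toy model of C = A @ B with a (tile x tile) block of C held in registers.

    Hypothetical model: for every K-step of width `frag`, the A and B
    fragments covering the block are loaded once, then reused by every
    fragment-sized matmul instruction inside the block.
    Returns (elements loaded, matmul instructions issued).
    """
    assert tile % frag == 0 and K % frag == 0
    assert M % tile == 0 and N % tile == 0
    frags_per_side = tile // frag
    loaded = 0
    instructions = 0
    for _ in range(M // tile):          # block rows of C
        for _ in range(N // tile):      # block columns of C
            for _ in range(K // frag):  # steps along the reduction dimension
                # frags_per_side A fragments plus frags_per_side B fragments
                loaded += 2 * frags_per_side * frag * frag
                # every A fragment is paired with every B fragment
                instructions += frags_per_side * frags_per_side
    return loaded, instructions


# Same 64x64 register tile, built from 16x16 fragments vs one 64x64 fragment.
small_frag = count_traffic(256, 256, 256, tile=64, frag=16)
large_frag = count_traffic(256, 256, 256, tile=64, frag=64)
# No reuse across fragments: the register tile is a single 16x16 fragment.
no_reuse = count_traffic(256, 256, 256, tile=16, frag=16)

print(small_frag, large_frag, no_reuse)
```

In this model the two 64x64 cases load the same number of elements and differ only in instruction count, while shrinking the register tile to a single fragment is what increases the traffic. So the quantity that matters for bandwidth is how much you keep in registers, not the native fragment size.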