fooker · 2 months ago
That's exactly what Nvidia is doing with tensor cores.

  bjourne · 2 months ago
  Except the native width of Tensor Cores is about 8-32 (depending on scalar type), whereas the width of TPUs is up to 256. The difference in scale is massive.

    neilmovva · 2 months ago
    I think Hopper's native matmul tile is 64x64, and Blackwell is 128x128. See this blog for a reference on Blackwell:
    https://hazyresearch.stanford.edu/blog/2025-03-15-tk-blackwe...

    fooker · 2 months ago
    If it turns out to be useful, can't Nvidia just tweak a parameter in their Verilog and declare victory? If not, what's fundamentally difficult about doing 32 vs 256 here?

  saagarjha · 2 months ago
  Nobody cares about width; they care about TFLOPs.
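To make saagarjha's point concrete: for an N×N multiply-accumulate array, per-cycle throughput grows quadratically with width, but the TFLOPs figure also depends on clock speed and how many such units the chip has, so width alone tells you little. A rough back-of-the-envelope sketch (all widths and clocks below are illustrative placeholders, not actual TPU or Tensor Core specs):

```python
def macs_per_cycle(width):
    # An N x N systolic-style MAC array performs width**2
    # multiply-accumulates per cycle.
    return width ** 2

def peak_tflops(width, clock_ghz, num_units=1, flops_per_mac=2):
    # Peak TFLOP/s = units * MACs/cycle * FLOPs/MAC * cycles/second.
    return num_units * macs_per_cycle(width) * flops_per_mac * clock_ghz * 1e9 / 1e12

# A single 256-wide array does 64x the work per cycle of a 32-wide one...
print(macs_per_cycle(256) // macs_per_cycle(32))  # 64

# ...but 64 narrow units at the same clock reach the same peak throughput.
print(peak_tflops(32, clock_ghz=1.0, num_units=64) == peak_tflops(256, clock_ghz=1.0))  # True
```

This is why comparing native tile widths across architectures is misleading on its own: a GPU with many small tensor cores and a TPU with a few huge systolic arrays can land at similar aggregate TFLOPs.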