Comment by qcnguy

11 hours ago

But for research you often don't have to max out the hardware right away.

And the question is what do programs that max out Ironwood look like vs TPU programs written 5 years ago?

Sure, but you do have to do it pretty quickly. Let’s pick an H100. You’ve probably heard that just writing scalar code leaves 90+% of the FLOPS idle. But even past that, if you’re using the tensor cores with the wrong instructions, you’re basically capped at 300-400 TFLOPS of the 1000 the hardware supports. If you’re using the new instructions but using them poorly, you’re probably not going to hit even 500 TFLOPS. That’s just barely better than the previous generation you paid a bunch of money to replace.
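
To put those numbers side by side, here is a rough sketch of the utilization arithmetic. It only uses the approximate figures quoted above (about 1000 TFLOPS peak, 300-400 and 500 TFLOPS achieved); the `utilization` helper and the scenario labels are made up for illustration and are not measurements.

```python
# Approximate peak from the comment above, not a measured or datasheet value.
H100_PEAK_TFLOPS = 1000.0


def utilization(achieved_tflops: float, peak_tflops: float = H100_PEAK_TFLOPS) -> float:
    """Return the fraction of peak throughput that is actually achieved."""
    if peak_tflops <= 0:
        raise ValueError("peak_tflops must be positive")
    return achieved_tflops / peak_tflops


# Hypothetical scenario labels; the TFLOPS values are the rough ones from the comment.
scenarios = {
    "tensor cores, wrong instructions (low end)": 300.0,
    "tensor cores, wrong instructions (high end)": 400.0,
    "new instructions, used poorly": 500.0,
}

for name, tflops in scenarios.items():
    print(f"{name}: {utilization(tflops):.0%} of peak")
```

So even the "new instructions, used poorly" case leaves about half of the chip idle, which is the point about needing to tune fairly early.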