Comment by dontreact

3 years ago

TPUs are hard to use outside of Google (I have tried both inside and outside Google). I think the situation is improving, but the efficiency from using a large pod is really remarkable. What topology did you train your models on? Within Google it's common to train across a whole pod or even across multiple pods; 8x16x16 is the largest topology currently.
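For what it's worth, here's a minimal JAX sketch of what a pod topology looks like from the framework side, assuming a TPU slice is attached. The 2-D (data, model) mesh split is an assumption for illustration, not how any particular Google run is configured:

```python
import jax
import numpy as np
from jax.sharding import Mesh

# On a TPU pod slice, jax.devices() enumerates every chip across all hosts.
devices = jax.devices()
print(f"{len(devices)} chips visible, platform={devices[0].platform}")

# Illustrative only: fold the chips into a 2-D logical mesh for
# data/model parallelism. The (data, model) split and the reshape
# factors are assumptions for this example.
n = len(devices)
assert n % 2 == 0, "example assumes an even chip count"
mesh = Mesh(np.array(devices).reshape(n // 2, 2), ("data", "model"))
print(mesh.shape)  # e.g. {'data': n // 2, 'model': 2}
```

The point is that the whole slice shows up as one flat device list, which is part of why training across an entire pod can feel like using a single machine.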

Also, if Tesla actually published numbers on an MLPerf benchmark, I would be more inclined to believe claims about 36x better efficiency.

https://mlcommons.org/en/training-normal-20/

The fastest times I'm seeing here for image classification and for object detection (not the same as Tesla's workload, but probably the closest proxy among the tasks benchmarked) are for TPUs.

To know who has better training technology, I don't think you should be using a cost-efficiency metric; it seems to me the best measure is who can train networks the fastest. Cost metrics are easy to game, especially if you are the one making the chips (of course, making their own chips is cheaper for them than buying Nvidia chips once the capital investment is made). To measure who is ahead in technology, I think you have to look at who can train models the fastest, and right now, as far as I can tell, TPUs are unbeaten there. (Although practically speaking it's hard to pull off these large-topology runs externally, and there are other caveats with MLPerf related to how the training setups are optimized, but nonetheless it's a better signal than what Elon says in a presentation :) )