Slacker News

Comment by babl-yc

3 hours ago

I don't know if you can generally say that "LLM training is faster on TPUs than on GPUs." It varies with the LLM architecture, the TPU cluster size, the GPU cluster size, and so on.

They are both designed for massively parallel operations. TPUs are just a bit more specialized for matrix multiply-adds, while GPUs are more general-purpose.
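To make that concrete, here is a minimal JAX timing sketch (the 4096x4096 bfloat16 matmul and the jax.numpy calls are my own illustrative assumptions, not something from the comment). The same jitted kernel runs unchanged on a TPU or a GPU backend, so the only honest comparison is to measure it on the hardware you actually have:

    import time
    import jax
    import jax.numpy as jnp

    # The same matmul-heavy kernel compiles for whatever backend JAX finds
    # (TPU, GPU, or CPU). Matrix size and dtype are arbitrary assumptions
    # chosen only for illustration.
    @jax.jit
    def step(a, b):
        return jnp.dot(a, b)

    key = jax.random.PRNGKey(0)
    a = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)
    b = jax.random.normal(key, (4096, 4096), dtype=jnp.bfloat16)

    step(a, b).block_until_ready()  # warm up / compile once before timing
    t0 = time.perf_counter()
    for _ in range(10):
        out = step(a, b)
    out.block_until_ready()
    print(jax.devices()[0].platform, "avg seconds:", (time.perf_counter() - t0) / 10)

Which backend wins a benchmark like this can flip with batch shape, precision, and interconnect, which is the point above.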

No comments yet
