Comment by amitk2405

14 hours ago

The part that surprised me is how much TPUs gain from the systolic array design. It basically cuts down the constant memory shuffling that GPUs have to do, so more of the chip’s time is spent actually computing.

The downside is the same thing that makes them fast: they’re very specialized. If your code already fits the TPU stack (JAX/TensorFlow), you get great performance per dollar. If not, the ecosystem gap and fear of lock-in make GPUs the safer default.