Comment by jsnell

4 months ago

> Nobody who is doing this is willing to come clean with hard numbers but there are data points, for example from Meta and (very unofficially) Google.

The Meta link does not support the point. It's actually implying a MTBF of over 5 years at 90% utilization, even if you assume there's no bathtub curve. Pretty sure that lines up with the depreciation period.
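For a rough sense of what an MTBF figure like that means, here is a minimal Python sketch. It assumes a constant failure rate (no bathtub curve), so lifetimes are exponentially distributed. The 5-year MTBF is the figure mentioned above; the function names and the 5-year horizon are only illustrative and are not taken from the Meta data.

```python
import math


def annual_failure_rate(mtbf_years: float) -> float:
    """Probability that a unit fails within one year, assuming a constant hazard rate."""
    return 1.0 - math.exp(-1.0 / mtbf_years)


def survival_fraction(mtbf_years: float, years: float) -> float:
    """Fraction of units still working after `years`, assuming a constant hazard rate."""
    return math.exp(-years / mtbf_years)


if __name__ == "__main__":
    mtbf = 5.0  # years, the lower bound discussed above
    print(f"annual failure rate: {annual_failure_rate(mtbf):.1%}")      # 18.1%
    print(f"surviving after 5 years: {survival_fraction(mtbf, 5.0):.1%}")  # 36.8%
```

Under this simple model, an MTBF equal to the horizon leaves about 37% of units alive at the end of it, which is why it matters how the MTBF compares with the depreciation period.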

The Google link is even worse. It links to https://www.tomshardware.com/pc-components/gpus/datacenter-g...

That article makes a big claim but does not link to any source. It vaguely describes the source, but nobody who was actually in that role would describe themselves as the "GenAI principal architect at Alphabet". Like, those are not the words they would use. It would also be pointless to try to stay anonymous if that really were your title.

It looks like the ultimate source of the quote is this Twitter screenshot of an unnamed article (whose text can't be found with search engines): https://x.com/techfund1/status/1849031571421983140

That is not merely an unofficial source. That is made-up trash that the blog author just lapped up despite its obviously unreliable nature, since it confirmed his beliefs.

Besides, if the claim about GPU wear-and-tear were true, it would show up consistently in GPUs sourced from cryptomining (which was generally done in makeshift compute centers with terrible cooling and other environmental factors), and it just doesn't.

> It's actually implying a MTBF of over 5 years [...] Pretty sure that lines up with the depreciation period.

You're assuming it's normal for the MTBF to line up with the depreciation schedule. But the MTBF of data center hardware is usually quite a bit longer than the depreciation schedule, right? If I recall correctly, for servers it's typically double or triple, roughly. Maybe less for GPUs, I'm not directly familiar, but a quick web search suggests these periods shouldn't line up for GPUs either.

On top of that, Google isn't using NVIDIA GPUs, they have their own TPU.

  • Google is using nVidia GPUs. More than that, I'd expect Google to still be something like 90% on nVidia GPUs. You can't really check of course. Maybe I'm an idiot and it's 50%.

    But you can see how that works: go to colab.research.google.com. Type in some code ... "!nvidia-smi" for instance. Click on the down arrow next to "connect", and select change runtime type. 3 out of 5 GPU options are nVidia GPUs.

    Frankly, unless you rewrite your models you don't really have a choice but using nVidia GPUs, thanks to, ironically, Facebook (authors of pytorch). There is pytorch/XLA automatic translation to TPU but it doesn't work for "big" models. And as a point of advice: you want stuff to work on TPUs? Do what Googlers do: use Jax ( https://github.com/jax-ml/jax ), oh, and look at the commit logs of that repository to get your mind blown btw.

    In other words, Google rents out nVidia GPUs to their cloud customers (with the hardware physically present in Google datacenters).

    • > Frankly, unless you rewrite your models you don't really have a choice but using nVidia GPUs, thanks to, ironically, Facebook (authors of pytorch). There is pytorch/XLA automatic translation to TPU but it doesn't work for "big" models. And as a point of advice: you want stuff to work on TPUs?

      I don't understand what you mean. Most models aren't anywhere near big in terms of code complexity: once you have efficient primitives to build on (hardware-accelerated matmul, backprop, flash attention, etc.), these models are in sub-thousand-LoC territory, and you can even vibe-convert from one environment to another.

      It's kind of a shock to realize how simple the logic behind LLMs is.

      I still agree with you, Google is most likely still using Nvidia chips in addition to TPUs.
