
Comment by hdjrudni

10 hours ago

They can run these things at 100% utilization for 3 years straight? And not burn them out? That's impressive.

Not really. GPUs are stateless, so the bound on their lifetime, regardless of how much you use them, is the lifetime of the shittiest capacitor on the board (essentially). Modulo a design or manufacturing defect, I’d expect a usable lifetime of at least 10 years, well beyond the manufacturer’s desire to support the drivers for it (i.e. the software should “fail” first).

  • The silicon itself does wear out. Dopant migration or something, I'm not an expert. Three years is probably too low but they do die. GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.

    • > GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.

      The scale there is a little bit different. If you're training an LLM with 10,000 tightly-coupled GPUs where one failure could kill the entire job, then your mean time to failure drops by that factor of 10,000. What is a trivial risk in a single-GPU home setup would become a daily occurrence at that scale.
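
To put rough numbers on the scaling argument above, here is a back-of-the-envelope sketch. It assumes failures are independent with a constant rate, and the 10-year per-GPU mean time to failure is an illustrative assumption borrowed from the lifetime estimate earlier in the thread, not a measured figure. The function and variable names are made up for this example.

```python
HOURS_PER_YEAR = 365 * 24

def job_mttf_hours(per_gpu_mttf_hours: float, num_gpus: int) -> float:
    """Mean time until the first failure among num_gpus devices.

    Assumes independent failures at a constant rate, so the failure
    rates add up and the combined MTTF is the per-device MTTF
    divided by the device count.
    """
    return per_gpu_mttf_hours / num_gpus

# Assumption for illustration only: each GPU averages 10 years to failure.
per_gpu_mttf_hours = 10 * HOURS_PER_YEAR

single_gpu = job_mttf_hours(per_gpu_mttf_hours, 1)
cluster = job_mttf_hours(per_gpu_mttf_hours, 10_000)
failures_per_day = 24 / cluster

print(f"1 GPU:       one failure every {single_gpu / HOURS_PER_YEAR:.0f} years")
print(f"10,000 GPUs: one failure every {cluster:.2f} hours")
print(f"             about {failures_per_day:.1f} failures per day")
```

Under those assumptions a 10,000-GPU job sees a failure every several hours, which is why checkpointing and automatic restart matter at that scale even when each individual card is reliable.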