Comment by mike_hearn
7 hours ago
The silicon itself does wear out. Dopant migration or something, I'm not an expert. Three years is probably too low but they do die. GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.
> GPUs dying during training runs was a major engineering problem that had to be tackled to build LLMs.
The scale there is a little bit different. If you're training an LLM with 10,000 tightly-coupled GPUs where one failure could kill the entire job, then the job's mean time to failure drops by roughly that factor of 10,000 (assuming failures are independent). What is a trivial risk in a single-GPU home setup becomes a daily occurrence at that scale.
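To put rough numbers on that, here is a back-of-the-envelope sketch. It assumes failures are independent with a constant rate (an exponential model), and the 25-year per-GPU figure is made up purely for illustration, not a measured value. The function names are mine as well.

```python
import math

HOURS_PER_YEAR = 24 * 365


def job_mttf_hours(per_gpu_mttf_hours: float, n_gpus: int) -> float:
    """Mean time to first failure for a job that dies when any one GPU fails.

    Assumes independent failures with a constant rate, so the per-GPU
    failure rates simply add up across the cluster.
    """
    return per_gpu_mttf_hours / n_gpus


def p_failure_within(hours: float, per_gpu_mttf_hours: float, n_gpus: int) -> float:
    """Probability of at least one failure among n_gpus within `hours`."""
    combined_rate = n_gpus / per_gpu_mttf_hours  # failures per hour, whole cluster
    return 1.0 - math.exp(-combined_rate * hours)


# Hypothetical figure for illustration only: one failure per 25 GPU-years.
per_gpu_mttf = 25 * HOURS_PER_YEAR

single_gpu_mttf = job_mttf_hours(per_gpu_mttf, 1)
cluster_mttf = job_mttf_hours(per_gpu_mttf, 10_000)

print(f"single GPU: {single_gpu_mttf / HOURS_PER_YEAR:.0f} years between failures")
print(f"10,000 GPUs: {cluster_mttf:.1f} hours between failures")
print(f"P(failure in 24h), single GPU: {p_failure_within(24, per_gpu_mttf, 1):.6f}")
print(f"P(failure in 24h), 10,000 GPUs: {p_failure_within(24, per_gpu_mttf, 10_000):.2f}")
```

With that made-up input the cluster sees a failure a bit less than once a day on average, while the single GPU would run for decades. Real hardware has correlated failures and non-constant failure rates, so treat this as an order-of-magnitude argument only.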