Comment by sailingparrot
12 hours ago
> How much maintenance do you need?
A lot. I've been responsible for training runs on up to 10K GPUs, and things fail all the time. By "all the time" I don't mean every few weeks, I mean daily. From disk failures, to GPUs overheating, to InfiniBand optical connectors not being correctly fastened and disconnecting randomly, we have to send people to manually fix/debug things in the datacenter all the time.
If one GPU fails, you essentially lose the entire node (so 8 GPUs). If your strategy is to just turn off whatever fails forever and not deal with it, it's gonna get very expensive very fast.
And that's in an environment where temperature is very well controlled and where you don't have to put your entire cluster through 4 Gs and insane vibrations during takeoff.
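To put rough numbers on the "turn it off forever" strategy, here is a back-of-the-envelope sketch. The 8 GPUs per node and the 10K cluster size come from the comment above; the rate of one node-killing failure per day and the function name `gpus_lost` are assumptions for illustration only, not measured figures.

```python
GPUS_PER_NODE = 8  # from the comment: one failed GPU takes out its whole node


def gpus_lost(days, failures_per_day=1.0, total_gpus=10_000):
    """Upper bound on GPUs lost if failed nodes are never repaired.

    Assumption: every failure hits a different, still-healthy node,
    and failures_per_day is a made-up illustrative rate.
    """
    total_nodes = total_gpus // GPUS_PER_NODE
    # Cannot lose more nodes than the cluster has.
    dead_nodes = min(int(days * failures_per_day), total_nodes)
    return dead_nodes * GPUS_PER_NODE


if __name__ == "__main__":
    for days in (30, 90, 365):
        lost = gpus_lost(days)
        print(f"{days:>3} days: {lost} GPUs lost ({lost / 10_000:.1%} of the cluster)")
```

Under those assumed numbers, a year of daily failures with no repairs would take out 2,920 GPUs, roughly 29% of the cluster.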