Comment by nurettin
1 month ago
> we were hit with this on a 256 gpu b200 cluster -- at day 66 all our jobs started randomly failing
ouch