Comment by wincy
4 hours ago
Crazy. If I understand correctly, something involving B200s and NVLink causes nvidia-smi and other jobs to start failing and timing out after 66 days and 12 hours of uptime, and once you restart the cluster everything works again.
They suspect jobs will still work if you only use 1 B200, but one person power cycled and so wasn’t able to test it. Hopefully they won’t have to wait another 66 days for further troubleshooting.
Maybe some 32-bit counter somewhere in the NVLink path overflows?
66 days + 12 hours are 5,745,600,000,000,000 ns. The log2 of this is 52.351...
JavaScript and some other languages store every number as a 64-bit float, so integers are only exact up to 53 bits; past that they silently lose precision.
Curious.
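For anyone who wants to check the arithmetic, here is a rough Python sketch. The variable and function names are my own, and it only redoes the unit conversion; it says nothing about what the driver actually does. It converts the uptime to nanoseconds, takes the log2, and shows how many days a nanosecond count needs to reach 2^52 and 2^53 (the latter being the limit for exact integers in a 64-bit float).

```python
import math

NS_PER_SECOND = 1_000_000_000
SECONDS_PER_DAY = 86_400

# 66 days + 12 hours of uptime, expressed in seconds and then nanoseconds.
uptime_seconds = 66 * SECONDS_PER_DAY + 12 * 3600
uptime_ns = uptime_seconds * NS_PER_SECOND

# How many bits are needed to hold that nanosecond count.
bits = math.log2(uptime_ns)


def days_to_reach(power: int) -> float:
    """Days of uptime before a nanosecond count reaches 2**power."""
    return 2**power / NS_PER_SECOND / SECONDS_PER_DAY


print(uptime_ns)                    # 5745600000000000
print(round(bits, 3))               # a little over 52 bits
print(round(days_to_reach(52), 3))  # days until a ns count hits 2**52
print(round(days_to_reach(53), 3))  # days until a ns count hits 2**53
```

Neither power of two lands exactly on 66.5 days, so the 52-bit figure looks more like a near miss than a match.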
It's 32 bits of milliseconds, right? Hm, no, that would overflow much sooner (49.7 days).
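The 49.7-day figure is easy to reproduce. A small Python sketch, with hypothetical helper names of my own, computes when an unsigned 32-bit counter wraps at a given tick length, and also what tick length a 32-bit counter would need in order to wrap at exactly 66 days 12 hours. That second number is purely an illustration of the arithmetic, not a claim about how any NVIDIA component counts time.

```python
SECONDS_PER_DAY = 86_400


def overflow_days(bits: int, tick_seconds: float) -> float:
    """Days until an unsigned counter of `bits` width wraps at the given tick length."""
    return (2**bits) * tick_seconds / SECONDS_PER_DAY


# 32-bit counter incremented once per millisecond.
ms_days = overflow_days(32, 1e-3)

# Tick length (in ms) a 32-bit counter would need to wrap at 66 days 12 hours.
target_seconds = 66 * SECONDS_PER_DAY + 12 * 3600
implied_tick_ms = target_seconds / 2**32 * 1000

print(round(ms_days, 2))          # 49.71
print(round(implied_tick_ms, 3))  # roughly 1.34 ms per tick
```

So a plain millisecond counter wraps too early, and a 32-bit counter would need an odd tick of about 1.34 ms to hit the observed uptime.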
Isn't a 32-bit counter 49 days? Assuming that one was counting milliseconds, at least.
Only remember that because that's the limit for Windows 95…
100ns intervals. My favorite part of that story is how long after Windows 95 was released before anybody discovered the bug.