Comment by wincy

4 hours ago

Crazy. If I understand correctly, something involving B200s and NVLink causes nvidia-smi and other jobs to start failing or timing out after 66 days and 12 hours of uptime, and restarting the cluster makes everything work again.

They suspect jobs will still work if you only use one B200, but one person had already power cycled, so they weren't able to test it. Hopefully they won't have to wait another 66 days for further troubleshooting.

Maybe some 32-bit counter somewhere in the NVLink path overflows?

  • 66 days + 12 hours is 5,745,600,000,000,000 ns. The log2 of this is 52.351...

    JavaScript and some other languages store every number as a 64-bit float, so integers are only exact up to 53 bits (2^53); beyond that they lose precision.

    Curious.

  • Isn't a 32-bit counter 49 days? Assuming it was counting milliseconds, at least.

    Only remember that because that's the limit for Windows 95…
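
For anyone who wants to check the arithmetic in the replies above, here is a rough sketch. It only reproduces the numbers quoted in the thread. The helper `wrap_days` and the "implied tick" calculation are my own illustration and say nothing about how the driver actually counts time.

```python
import math

SECONDS_PER_DAY = 86_400

# Uptime reported in the thread: 66 days + 12 hours.
uptime_s = 66 * SECONDS_PER_DAY + 12 * 3_600
uptime_ns = uptime_s * 1_000_000_000


def wrap_days(bits: int, tick_seconds: float) -> float:
    """Days until an unsigned `bits`-bit counter wraps at one tick per `tick_seconds`.

    Hypothetical helper for illustration only.
    """
    return (2 ** bits) * tick_seconds / SECONDS_PER_DAY


# Assumption: if a 32-bit counter really wrapped at the reported uptime,
# this is the tick period (in milliseconds) it would need to have.
implied_tick_ms = uptime_s / 2 ** 32 * 1_000

print("uptime in ns:            ", uptime_ns)
print("log2(uptime in ns):      ", math.log2(uptime_ns))
print("32-bit ms counter wraps: ", wrap_days(32, 0.001), "days")
print("implied 32-bit tick (ms):", implied_tick_ms)
print("below 2**53 (float-safe):", uptime_ns < 2 ** 53)
```

A 32-bit millisecond counter wraps at about 49.7 days, so it does not match 66.5 days on its own. A 32-bit counter would need a tick of roughly 1.34 ms to wrap there, and the nanosecond count is still under 2^53, so neither simple explanation lines up exactly.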