Comment by wincy

1 month ago

Crazy. So if I understand correctly, something with B200s and NVLink causes nvidia-smi and other jobs to start failing or timing out after 66 days and 12 hours of uptime, and once you restart the cluster everything works again.

They suspect jobs will still work if you only use 1 B200, but one person power-cycled before they could test that. Hopefully they won’t have to wait another 66 days for further troubleshooting.

12 comments


layla5alive 1 month ago

Maybe some 32-bit counter used somewhere in NVLink overflows?

themafia 1 month ago
66 days + 12 hours is 5,745,600,000,000,000 ns. The log2 of this is 52.351...
JavaScript and some other languages store numbers as float64, which has a 52-bit stored mantissa, so integers are only exact up to 53 bits and anything fractional is lost well before that.
Curious.
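The arithmetic above can be checked with a short Python sketch. The variable names are illustrative only, and nothing here is meant to reflect how the actual driver stores time; it just reproduces the nanosecond count and its log2.

```python
import math

NS_PER_SECOND = 1_000_000_000

# 66 days + 12 hours of uptime, expressed in seconds and then nanoseconds.
uptime_seconds = 66 * 86_400 + 12 * 3_600
uptime_ns = uptime_seconds * NS_PER_SECOND

# Number of bits needed to represent that nanosecond count.
bits = math.log2(uptime_ns)

print(f"{uptime_ns:,} ns")        # nanoseconds of uptime
print(f"log2 = {bits:.3f}")       # a bit over 52

# The value sits between 2**52 and 2**53, where float64 can still hold
# whole integers but no longer has any fractional bits left.
print(2**52 < uptime_ns < 2**53)
```

Whether a float64 timestamp is really involved is speculation in this thread; the sketch only confirms that the numbers line up.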
- loegta3 1 month ago
  
  Bingo! Someone decided to store timestamps in float64, which has a 52-bit mantissa, and the time functions break when they lose precision.
- loeg 1 month ago
  
  It's 32 bits of milliseconds, right? Hm, no, that would overflow much sooner (49.7 days).
  
mook 1 month ago
Isn't a 32-bit counter 49 days? Assuming that one was counting milliseconds, at least.
Only remember that because that's the limit for Windows 95…
- repiret 1 month ago
  
  100ns intervals. My favorite part of that story is how long after Windows 95 was released before anybody discovered the bug.
  
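For reference, the 49.7-day figure quoted in the thread for a 32-bit millisecond counter can be reproduced with a minimal Python sketch. The names are illustrative, and the millisecond tick is the assumption stated in the comments above, not a confirmed detail of any driver.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000

# An unsigned 32-bit counter wraps after 2**32 ticks.
wrap_ticks = 2**32

# Assuming one tick per millisecond, convert the wrap point to days.
wrap_days = wrap_ticks / MS_PER_DAY

print(f"{wrap_days:.2f} days")

# Compare against the 66 days + 12 hours reported for the B200 issue.
reported_days = 66.5
print(wrap_days < reported_days)
```

Under that assumption the counter wraps well before 66.5 days, which is why a plain 32-bit millisecond counter does not match the reported uptime.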