
Comment by newswasboring

5 years ago

> Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well known problem with installations in that area, having caused multi-month delays for some of the larger supercomputer installations in the area. But we'd already corrected for that.

Wow, I sense a more interesting story in here. Care to reveal how it was first found out and how common it actually is?

In a nutshell, cosmic rays causing bit-flips really is a thing, and it's more of a thing at higher altitude because there is less atmosphere overhead to shield you. It's rarely a problem at sea level. At higher altitude you really need to use ECC memory, and do some sort of scrubbing (in Linux this is configured through the Error Detection And Correction subsystem, or EDAC) to correct single-bit errors before they accumulate and some word somewhere becomes uncorrectable.
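To make "correct single-bit errors, but a second flip in the same word is uncorrectable" concrete, here is a toy sketch of a SECDED (single-error-correct, double-error-detect) code. It uses an extended Hamming(8,4) code on 4-bit values purely for illustration; real ECC DIMMs use wider codes (commonly 8 check bits per 64 data bits), and the function names `encode` and `decode` are my own, not any library's API.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Index 0 holds the overall parity bit; indices 1..7 are a Hamming(7,4)
    codeword with parity bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes total parity even
    return code


def decode(code):
    """Return (status, nibble), status in {'ok', 'corrected', 'uncorrectable'}."""
    code = list(code)
    # The syndrome is the XOR of the positions of all set bits; it is 0 for
    # a valid codeword and equals the flipped position after a single flip.
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos
    parity_broken = sum(code) % 2 == 1

    if syndrome == 0 and not parity_broken:
        status = "ok"
    elif parity_broken:
        # Odd number of flips: assume exactly one and repair it.
        # syndrome == 0 here means the overall parity bit itself flipped.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Parity looks fine but the syndrome is nonzero: two bits flipped.
        return "uncorrectable", None

    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble


word = encode(0b1011)
word[6] ^= 1                 # one cosmic-ray flip
print(decode(word))          # ('corrected', 11)
word[2] ^= 1                 # a second flip before anyone scrubbed
print(decode(word))          # ('uncorrectable', None)
```

Scrubbing amounts to periodically running the "corrected" path over all of memory and writing the repaired word back, so the first flip is gone before a second one can land in the same word.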

The incident that brought this home to a lot of people was at either NCAR or UCAR, both near Boulder. Whichever it was, they were installing a new system - tens of thousands of nodes - and had not been careful about the EDAC settings. Therefore, EDAC wasn't running often enough, and wasn't catching those single-bit errors. Therefore^2, uncorrectable errors were bringing down nodes constantly. According to rumor, this caused a huge delay and almost torched the entire project. It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).
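For a rough feel of why the scrub interval matters, here is a back-of-the-envelope sketch. It assumes flips hit each ECC word independently as a Poisson process, and that any two flips in the same word between scrubs are uncorrectable. The flip rate and memory size below are made-up illustrative numbers, not measurements from that installation or anywhere else.

```python
import math


def p_double_flip(rate_per_word_per_hour, scrub_interval_hours):
    """Probability that one word collects two or more flips between scrubs.

    For a Poisson process with mean x = rate * interval this is
    1 - exp(-x) * (1 + x), which is about x**2 / 2 when x is small.
    """
    x = rate_per_word_per_hour * scrub_interval_hours
    if x < 1e-4:
        # Series expansion; the direct formula loses all precision here.
        return x * x / 2 - x ** 3 / 3
    return 1.0 - math.exp(-x) * (1.0 + x)


def uncorrectable_per_day(n_words, rate_per_word_per_hour, scrub_interval_hours):
    """Expected number of words going uncorrectable per day across the machine."""
    scrubs_per_day = 24.0 / scrub_interval_hours
    p = p_double_flip(rate_per_word_per_hour, scrub_interval_hours)
    return n_words * p * scrubs_per_day


# Hypothetical machine: 1e15 ECC words in total, 1e-9 flips per word per hour.
N_WORDS = 1e15
RATE = 1e-9
for hours in (1, 24, 24 * 7, 24 * 30):
    rate = uncorrectable_per_day(N_WORDS, RATE, hours)
    print(f"scrub every {hours:4d} h: {rate:.3g} uncorrectable words/day")
```

Under these assumptions the uncorrectable rate grows roughly linearly with the scrub interval, so scrubbing monthly instead of daily costs you about 30x more dead words, and across tens of thousands of nodes that is the difference between an occasional crash and nodes going down constantly.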

  • > It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).

    Yeah, part of the nightmare of cosmic-ray bitflips (or any random bitflips, I suppose) is precisely that they don't look like anything. A server randomly locks up. A packet has a bad checksum (and is silently resent). A process gets into an unexpected state. That buggy batch job fails 1% more frequently than it used to. Nothing ever points to memory errors, except that there is no pattern.