
Comment by notacoward

5 years ago

Here's the craziest one that actually happened to me.

The company I worked for had installed what's best described as a mini-supercomputer (though we avoided the term) at a site in Boulder. We started getting reports of failures on the internal communication links between the compute nodes ... only at high load, late in the day. Since I was responsible for the software that managed those links, I got sent out. Two days in a row, after trying everything we could to reproduce or debug the problem, I got paged minutes after I'd left (and couldn't get back in) and was told that it had failed again.

Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well known problem with installations in that area, having caused multi-month delays for some of the larger supercomputer installations in the area. But we'd already corrected for that. It wasn't the problem.

What it ultimately turned out to be was airflow and cooling. The air's thinner up there, so it carries less heat. But it wasn't the processors or links that were getting too hot. It was the power supply. When a power supply gets warmer it gets less efficient. Earlier in the day or with shorter runs as we tried different things this wasn't enough to cause a problem. With it being warmer later in the day, continuous load for longer periods was enough to cause slight brown-outs, and those were making our links flaky. And of course it would always restart just fine because it had cooled down a bit.

The fix ended up being one line in a fan-controller config.

I had a loaner machine (RS/6000 minicomputer) that would have unrecoverable ECC errors when the cover was on. The tech would come and try to diagnose it, but with the cover off, everything would work fine. He'd swap the memory anyway and put the cover back on. Within a few hours the memory bank would be failing again. Turned out the machine had been a loaner in a lab where it had acquired some alpha-emitting goo on the inside of the side panel. The lab had just run it with the side panel off to solve the problem, never noticing the goo, never mentioning it to IBM when they packed it up to ship.

It's a long story, but the gist is that after multiple board swaps, once we realized we'd isolated the panel as the fault, I noticed the goo and on a hunch checked it with a scintillator, deducing it was alpha when cardboard blocked it. Turns out the ultra-precious-metal IBM heat sink on the board had an open path that effectively channeled the alpha particles into one of those multi-chip carrier thingies, which featured exposed chips.

As for why I had a scintillator lounging in my desk at a portfolio management company, don't ask. Let's just note the iconic IT anti-hero of that era was the Bastard Operator From Hell, and leave it at that.

This isn't a strange bug story or anything, but you just reminded me of when I was also helping someone set up a, as you called it, mini-supercomputer. It was to do quantum simulations. We were setting it up and the researcher who was going to use it made the root user name skynet. Now I know that joke has probably been played out at campuses around the world, but it just seems unnecessary to tempt the fates like that.

> Our original theory was that it had to do with cosmic rays causing bit-flips. This was a well known problem with installations in that area, having caused multi-month delays for some of the larger supercomputer installations in the area. But we'd already corrected for that.

Wow, I sense a more interesting story in here. Care to reveal how it was first found out and how common it actually is?

  • In a nutshell, cosmic rays causing bit-flips really is a thing, and it's more of a thing at higher altitude because of less atmosphere. It's rarely a problem at sea level. At higher altitude you really need to use ECC memory, and do some sort of scrubbing (in Linux this is handled through the Error Detection And Correction, or EDAC, subsystem) to correct single-bit errors before they accumulate and some word somewhere becomes uncorrectable.

    The incident that brought this home to a lot of people was at either NCAR or UCAR, both near Boulder. Whichever it was, they were installing a new system - tens of thousands of nodes - and had not been careful about the EDAC settings. Therefore, EDAC wasn't running often enough, and wasn't catching those single-bit errors. Therefore^2, uncorrectable errors were bringing down nodes constantly. According to rumor, this caused a huge delay and almost torched the entire project. It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).

    • > It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).

      Yeah, part of the nightmare of cosmic-ray bitflips (or any random bitflips, I suppose) is precisely that they don't look like anything. A server randomly locks up. A packet has a bad checksum (and is silently resent). A process gets into an unexpected state. That buggy batch job fails 1% more frequently than it used to. Nothing ever points to memory errors, except that there is no pattern.
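
To make the scrubbing point above concrete, here's a toy simulation of why scrub frequency matters. It's only a sketch under loud assumptions, not a model of real hardware or of Linux EDAC: it assumes a SECDED-style ECC where one flipped bit per word is correctable and two are not, it ignores the case where a second flip lands on the same bit, and the function name and every parameter are made up for illustration.

```python
import random


def simulate_uncorrectable(num_words, flips_per_step, steps, scrub_every, seed=0):
    """Count words that accumulate 2+ bit flips before a scrub reaches them.

    Hypothetical toy model: `scrub_every` is the number of steps between
    scrub passes; 0 means scrubbing never runs.
    """
    rng = random.Random(seed)
    flipped = [0] * num_words  # flipped-bit count per word since its last repair
    uncorrectable = 0

    for step in range(1, steps + 1):
        # Random bit flips land on random words.
        for _ in range(flips_per_step):
            word = rng.randrange(num_words)
            flipped[word] += 1
            if flipped[word] == 2:
                # Second flip before a scrub: ECC can no longer repair this word.
                uncorrectable += 1

        # A scrub pass repairs every word that has exactly one flipped bit.
        if scrub_every and step % scrub_every == 0:
            flipped = [0 if count == 1 else count for count in flipped]

    return uncorrectable


if __name__ == "__main__":
    for interval in (0, 500, 50, 1):
        label = "never" if interval == 0 else f"every {interval} steps"
        print(label, simulate_uncorrectable(100_000, 1, 5_000, interval))
```

With one flip per step and a scrub after every step, no word can ever collect a second flip, so the count is zero. Stretch the interval (or never scrub) and single-bit errors sit around long enough for a second hit to land on the same word, which is the failure mode described for the Boulder install.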