Comment by yjftsjthsd-h

5 years ago

> It's easy to say in retrospect that they should have checked the EDAC settings first, but as it happened they probably only got to that after multiple rounds of blaming the vendor for flaky hardware (which would generally be the more likely cause especially when you're on the bleeding edge).

Yeah, part of the nightmare of cosmic-ray bitflips (or any random bitflips, I suppose) is precisely that they don't look like anything. A server randomly locks up. A packet has a bad checksum (and is silently resent). A process gets into an unexpected state. That buggy batch job fails 1% more frequently than it used to. Nothing ever points to memory errors, except that there is no pattern.
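
For what it's worth, about the only place these do leave a trace is the EDAC counters, and only if you have ECC memory and the driver is loaded. Here is a rough sketch of polling them, assuming a Linux box that exposes the usual `/sys/devices/system/edac/mc/mc*/ce_count` and `ue_count` files. The function name `read_edac_counts` and its `root` parameter are mine, purely for illustration:

```python
from pathlib import Path


def read_edac_counts(root="/sys/devices/system/edac/mc"):
    """Return {controller: {"ce": int|None, "ue": int|None}} per memory controller.

    Assumes the Linux EDAC sysfs layout (mc0, mc1, ... each holding ce_count
    and ue_count). A counter that is missing or unreadable is reported as None.
    """
    counts = {}
    for mc in sorted(Path(root).glob("mc[0-9]*")):
        entry = {}
        for key, name in (("ce", "ce_count"), ("ue", "ue_count")):
            try:
                entry[key] = int((mc / name).read_text().strip())
            except (OSError, ValueError):
                entry[key] = None
        counts[mc.name] = entry
    return counts


if __name__ == "__main__":
    counts = read_edac_counts()
    if not counts:
        # Could mean no ECC, or simply that the EDAC driver is not loaded.
        print("no EDAC memory controllers found")
    for mc, c in counts.items():
        print(f"{mc}: corrected={c['ce']} uncorrected={c['ue']}")
```

A corrected count that keeps climbing is at least a pattern you can point at. An empty result tells you nothing either way, which is sort of the original problem.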