Comment by tokyobreakfast

1 day ago

>even a cosmic ray flipping the "do not upload" bit in memory

Stats on this very likely scenario?

> IBM estimated in 1996 that one error per month per 256 MiB of RAM was expected for a desktop computer.

From the Wikipedia article on "Soft error", if anyone wants to extrapolate.

  • That makes it vanishingly unlikely. On a 16GB RAM computer with that rate, you can expect 64 random bit flips per month.

    Spread across the roughly 137 billion bits in that memory, you could expect one specific bit to flip about once every two hundred million years.

    Assuming there are about 2 billion Windows computers in use, that’s about 10 computers a year that experience this bit flip.

    • > 10 computers a year experience this bit flip

      That's wildly more than I would have naively expected to experience a specific bit-flip. Wow!

    • I have personally seen a computer with 'system33' and 'system34' folders. Also, you would never actually know it happened because... it's not ECC. And with ECC memory, we replace a RAM stick every two to three months explicitly because its ECC error count is too high.

  • Rounding that to 1 error per 30 days per 256M, for 16G of RAM that would translate to 1 error roughly every half a day. I do not believe that at all, having done memory testing runs for much longer on much larger amounts of RAM. I've seen the error counters on servers with ECC RAM, which remain at 0 for many months; when they start increasing, it's because something is failing and needs to be replaced. In my experience, RAM failures are much rarer than HDD and SSD failures.
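
The extrapolation in the replies above can be reproduced with a few lines of arithmetic. This is only a sketch: it takes the IBM figure at face value and assumes flips are independent and spread uniformly over every bit of RAM, which the thread itself disputes. The function names and the fleet size constant are illustrative, not from any library.

```python
# Inputs quoted in the thread; all of them are rough assumptions.
ERRORS_PER_MONTH_PER_256_MIB = 1       # IBM's 1996 estimate
RAM_BYTES = 16 * 2**30                 # a 16 GiB machine
FLEET_SIZE = 2_000_000_000             # "about 2 billion Windows computers"


def flips_per_month(ram_bytes):
    """Expected bit flips per month for the whole machine."""
    return ERRORS_PER_MONTH_PER_256_MIB * ram_bytes / (256 * 2**20)


def days_between_flips(ram_bytes):
    """Average gap between any two flips, using a 30-day month."""
    return 30 / flips_per_month(ram_bytes)


def years_per_hit_on_one_bit(ram_bytes):
    """Average wait until one specific bit flips, assuming uniform flips."""
    flips_per_year = flips_per_month(ram_bytes) * 12
    total_bits = ram_bytes * 8
    return total_bits / flips_per_year


def fleet_hits_per_year(ram_bytes, fleet_size):
    """Expected machines per year that see that specific bit flip."""
    return fleet_size / years_per_hit_on_one_bit(ram_bytes)


if __name__ == "__main__":
    print(flips_per_month(RAM_BYTES))
    print(days_between_flips(RAM_BYTES))
    print(years_per_hit_on_one_bit(RAM_BYTES))
    print(fleet_hits_per_year(RAM_BYTES, FLEET_SIZE))
```

Under these assumptions the numbers land where the thread says: 64 flips a month, about half a day between flips, a specific bit hit on the order of once per two hundred million years, and roughly ten machines a year across the fleet.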

Given enough computers, anything will happen. Apparently enough bit flips happen in domains (or their DNS resolution) that registering domains one bit away from the most popular ones (e.g. something like gnogle.com for google.com) might be worth it for bad actors. There was a story a few years ago, but I can't find it right now; perhaps someone will link it.
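
To see what "one bit away" means, here is a small sketch that enumerates names differing from a domain by a single flipped bit in its first label. The helper name `one_bit_variants` is made up for illustration, and the validity check is a simplification (lowercase letters, digits, and hyphens only), not a full DNS validator.

```python
import string
from typing import List

# Characters allowed in a hostname label. DNS is case-insensitive,
# so a flip that only changes case yields the same domain and is skipped.
VALID = set(string.ascii_lowercase + string.digits + "-")


def one_bit_variants(domain: str) -> List[str]:
    """Return names that differ from `domain` by one flipped bit in the first label."""
    label, dot, rest = domain.partition(".")
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            flipped = chr(ord(ch) ^ (1 << bit))
            if flipped not in VALID:
                continue
            candidate = label[:i] + flipped + label[i + 1:]
            # A label may not begin or end with a hyphen.
            if candidate.startswith("-") or candidate.endswith("-"):
                continue
            variants.add(candidate + dot + rest)
    return sorted(variants)


if __name__ == "__main__":
    for name in one_bit_variants("google.com"):
        print(name)
```

The list for google.com includes gnogle.com, because 'o' (0x6F) and 'n' (0x6E) differ only in their lowest bit.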

At Google, "more than 8% of DIMM memory modules were affected by errors per year" [0]

More on the topic: Single-event upset[1]

[0] https://en.wikipedia.org/wiki/ECC_memory

[1] https://en.wikipedia.org/wiki/Single-event_upset

  • At the time, Google was cheaply buying RAM that had failed manufacturer QA, mounting it on DIMMs itself, and trying to self-certify them.

  • > At Google, "more than 8% of DIMM memory modules were affected by errors per year"

    That's all errors including permanent hardware failure, not just transient bit flips or from cosmic rays.

    • You are right. Apologies for spreading false information.

      "We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode." [0]

      "Memory errors can be caused by electrical or magnetic interference (e.g. due to cosmic rays), can be due to problems with the hardware (e.g. a bit being permanently damaged), or can be the result of corruption along the data path between the memories and the processing elements. Memory errors can be classified into soft errors, which randomly corrupt bits but do not leave physical damage; and hard errors, which corrupt bits in a repeatable manner because of a physical defect."

      "Conclusion 7: Error rates are unlikely to be dominated by soft errors.

      We observe that CE [correctable errors] rates are highly correlated with system utilization, even when isolating utilization effects from the effects of temperature. In systems that do not use memory scrubbers this observation might simply reflect a higher detection rate of errors. In systems with memory scrubbers, this observations leads us to the conclusion that a significant fraction of errors is likely due to mechanism other than soft errors, such as hard errors or errors induced on the datapath. The reason is that in systems with memory scrubbers the reported rate of soft errors should not depend on utilization levels in the system. Each soft error will eventually be detected (either when the bit is accessed by an application or by the scrubber), corrected and reported. Another observation that supports Conclusion 7 is the strong correlation between errors in the same DIMM. Events that cause soft errors, such as cosmic radiation, are expected to happen randomly over time and not in correlation.

      Conclusion 7 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors and to make up about 2% of all errors."

      [0] https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

It's "HN-likely" which translates to "almost never" in reality.

  • Happens all the time, in reality (even on the darkside). When the atmosphere fails (again, happening all the time), error correction usually handles the errant bits.

  • Especially since HN readers are more likely to be using ECC memory

  • If cosmic ray bit flips were so rare, then ECC RAM wouldn't be a thing.

    • ECC protects against more events than cosmic rays. Those other events, such as magnetic or electrical interference or chip issues, are much more likely.
