Comment by tokyobreakfast

1 month ago

>even a cosmic ray flipping the "do not upload" bit in memory

Stats on this very likely scenario?

strbean 1 month ago

> IBM estimated in 1996 that one error per month per 256 MiB of RAM was expected for a desktop computer.

From the Wikipedia article on "Soft error", if anyone wants to extrapolate.

d1sxeyes 1 month ago
That makes it vanishingly unlikely. On a computer with 16 GB of RAM, that rate works out to about 64 random bit flips per month.
Spread over roughly 137 billion bits, any one specific bit would flip about once every two hundred million years.
Assuming there are about 2 billion Windows computers in use, that’s about 10 computers a year that experience this bit flip.
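For anyone who wants to check the arithmetic, here is a minimal sketch. It assumes the 1996 IBM figure (one error per month per 256 MiB) applies uniformly to every bit, treats "16GB" as 16 GiB, and takes the 2 billion machine count from the comment above; the function and variable names are made up for illustration.

```python
MIB = 2**20
GIB = 2**30

def flips_per_month(ram_bytes, errors_per_month_per_256mib=1.0):
    """Expected bit flips per month anywhere in RAM, at the assumed IBM rate."""
    return errors_per_month_per_256mib * ram_bytes / (256 * MIB)

def years_between_flips_of_one_bit(ram_bytes):
    """Expected years between flips of one specific bit, assuming uniform errors."""
    total_bits = ram_bytes * 8
    per_bit_monthly_rate = flips_per_month(ram_bytes) / total_bits
    return 1 / per_bit_monthly_rate / 12

ram = 16 * GIB
monthly = flips_per_month(ram)                 # 64 flips per month in total
years = years_between_flips_of_one_bit(ram)    # roughly 1.8e8 years
machines = 2_000_000_000                       # assumed installed base
affected_per_year = machines / years           # roughly 11 machines per year
hours_between_any_flip = 30 * 24 / monthly     # 11.25 hours, i.e. about half a day

print(f"{monthly:.0f} flips/month, one specific bit every {years:.3g} years")
print(f"{affected_per_year:.1f} machines/year, any flip every {hours_between_any_flip} hours")
```

Under these assumptions the per-bit rate does not depend on how much RAM is installed: it is always one flip per 2^31 months, because 256 MiB is 2^31 bits.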
- eszed 1 month ago
  
  > 10 computers a year experience this bit flip
  That's wildly more than I would have naively expected to experience a specific bit-flip. Wow!
  
- justsomehnguy 1 month ago
  
  I have personally seen a computer with 'system33' and 'system34' folders. Also, you would never actually know it happened because... it's not ECC. And with ECC memory we replace a RAM stick every two to three months, explicitly because its ECC error count is too high.
  
userbinator 1 month ago

Rounding that to 1 error per 30 days per 256M, for 16G of RAM that would translate to 1 error roughly every half a day. I do not believe that at all, having done memory testing runs for much longer on much larger amounts of RAM. I've seen the error counters on servers with ECC RAM, which remain at 0 for many months; and when they start increasing, it's because something is failing and needs to be replaced. In my experience RAM failures are much rarer than HDD and SSD failures.

drysine 1 month ago

At Google "more than 8% of DIMM memory modules were affected by errors per year" [0]

More on the topic: Single-event upset[1]

[0] https://en.wikipedia.org/wiki/ECC_memory

[1] https://en.wikipedia.org/wiki/Single-event_upset

monocasa 1 month ago

At the time, Google was buying RAM that had failed manufacturer QA on the cheap, mounting it on DIMMs themselves, and trying to self-certify it.
Aloisius 1 month ago
> At Google "more than 8% of DIMM memory modules were affected by errors per year"
That's all errors, including permanent hardware failures, not just transient bit flips such as those caused by cosmic rays.
- drysine 1 month ago
  
  You are right. Apologies for spreading false information(
  "We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode." [0]
  "Memory errors can be caused by electrical or magnetic interference (e.g. due to cosmic rays), can be due to problems with the hardware (e.g. a bit being permanently damaged), or can be the result of corruption along the data path between the memories and the processing elements. Memory errors can be classified into soft errors, which randomly corrupt bits but do not leave physical damage; and hard errors, which corrupt bits in a repeatable manner because of a physical defect."
  "Conclusion 7: Error rates are unlikely to be dominated by soft errors.
  We observe that CE [correctable errors] rates are highly correlated with system utilization, even when isolating utilization effects from the effects of temperature. In systems that do not use memory scrubbers this observation might simply reflect a higher detection rate of errors. In systems with memory scrubbers, this observations leads us to the conclusion that a significant fraction of errors is likely due to mechanism other than soft errors, such as hard errors or errors induced on the datapath. The reason is that in systems with memory scrubbers the reported rate of soft errors should not depend on utilization levels in the system. Each soft error will eventually be detected (either when the bit is accessed by an application or by the scrubber), corrected and reported. Another observation that supports Conclusion 7 is the strong correlation between errors in the same DIMM. Events that cause soft errors, such as cosmic radiation, are expected to happen randomly over time and not in correlation.
  Conclusion 7 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors and to make up about 2% of all errors."
  [0] https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

homebrewer 1 month ago

Given enough computers, anything will happen. Apparently enough bit flips happen in domains (or their DNS resolution) that registering domains one bit away from the most popular ones (e.g. something like gnogle.com for google.com) might be worth it for bad actors. There was a story a few years ago, but I can't find it right now; perhaps someone will link it.
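To make the "one bit away" idea concrete, here is a rough sketch that lists the names reachable from a domain by flipping a single bit in one character. The helper name `bitsquat_variants` is invented for this example, it only looks at the first label of a lowercase ASCII name, and it does not check finer hostname rules such as leading or trailing hyphens.

```python
import string

# Characters allowed inside a hostname label: letters, digits, hyphen.
ALLOWED = set(string.ascii_lowercase + string.digits + "-")

def bitsquat_variants(domain):
    """Return names that differ from `domain` by one flipped bit in its first label."""
    label, dot, rest = domain.partition(".")
    variants = set()
    for i, ch in enumerate(label):
        for bit in range(8):
            # Flip one bit of the character; DNS is case-insensitive, so fold case.
            flipped = chr(ord(ch) ^ (1 << bit)).lower()
            if flipped in ALLOWED and flipped != ch:
                variants.add(label[:i] + flipped + label[i + 1:] + dot + rest)
    return sorted(variants)

for name in bitsquat_variants("google.com"):
    print(name)
```

The list for google.com includes gnogle.com, since 'o' (0x6F) and 'n' (0x6E) differ only in the lowest bit.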

pixl97 1 month ago
https://www.youtube.com/watch?v=aT7mnSstKGs
Was in DEFCON19.
- homebrewer 1 month ago
  
  Great, thanks. Here's a discussion on this site:
  https://news.ycombinator.com/item?id=4800489
lanyard-textile 1 month ago
A very old game speedrun -- from an era when speedruns weren't really a "thing" like they are today -- apparently greatly benefited from a hardware bit flip, and it was only recently discovered.
Can't find an explanatory video though :(
- direwolf20 1 month ago
  
  The Tick Tock Clock upwarp in Super Mario 64. All evidence that exists of it happening is a video recording. The most similar recording was generated by flipping a single bit in Mario's Y position, compared to other possibilities that were tested, such as warping Mario up to the closest ceiling directly above him.
  
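As a rough illustration of how much one flipped bit can move a coordinate, the sketch below flips individual bits of a 32-bit float. It assumes the position is stored as an IEEE 754 single-precision value; the starting height of -4000.0 is a made-up number, not the value from the actual recording, and `flip_float32_bit` is a hypothetical helper.

```python
import struct

def flip_float32_bit(value, bit):
    """Flip bit `bit` (0 = least significant) in the float32 encoding of `value`."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

y = -4000.0  # hypothetical height, chosen only for illustration

print(flip_float32_bit(y, 0))   # lowest mantissa bit: a change far below one unit
print(flip_float32_bit(y, 23))  # exponent bit: -8000.0
print(flip_float32_bit(y, 24))  # exponent bit: -1000.0
print(flip_float32_bit(y, 31))  # sign bit: 4000.0
```

A flip in the mantissa barely moves the value, while a flip in the exponent or sign field relocates it by a large factor, which is why a single bit can look like a sudden warp.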

halfmatthalfcat 1 month ago

It's "HN-likely" which translates to "almost never" in reality.

Supermancho 1 month ago

Happens all the time, in reality (even on the darkside). When the atmosphere fails (again, happening all the time), error correction usually handles the errant bits.
patja 1 month ago

Especially since HN readers are more likely to be using ECC memory
smegger001 1 month ago
If cosmic ray bit flips were so rare, then ECC RAM wouldn't be a thing.
- Sayrus 1 month ago
  
  ECC protects against more than cosmic rays. Other causes, such as magnetic or electrical interference and chip faults, are much more likely.
  