Comment by drysine
17 hours ago
You are right. Apologies for spreading false information.
"We provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode." [0]
"Memory errors can be caused by electrical or magnetic interference (e.g. due to cosmic rays), can be due to problems with the hardware (e.g. a bit being permanently damaged), or can be the result of corruption along the data path between the memories and the processing elements. Memory errors can be classified into soft errors, which randomly corrupt bits but do not leave physical damage; and hard errors, which corrupt bits in a repeatable manner because of a physical defect."
"Conclusion 7: Error rates are unlikely to be dominated by soft errors.
We observe that CE [correctable errors] rates are highly correlated with system utilization, even when isolating utilization effects from the effects of temperature. In systems that do not use memory scrubbers this observation might simply reflect a higher detection rate of errors. In systems with memory scrubbers, this observations leads us to the conclusion that a significant fraction of errors is likely due to mechanism other than soft errors, such as hard errors or errors induced on the datapath. The reason is that in systems with memory scrubbers the reported rate of soft errors should not depend on utilization levels in the system. Each soft error will eventually be detected (either when the bit is accessed by an application or by the scrubber), corrected and reported. Another observation that supports Conclusion 7 is the strong correlation between errors in the same DIMM. Events that cause soft errors, such as cosmic radiation, are expected to happen randomly over time and not in correlation.
Conclusion 7 is an interesting observation, since much previous work has assumed that soft errors are the dominating error mode in DRAM. Some earlier work estimates hard errors to be orders of magnitude less common than soft errors and to make up about 2% of all errors."
[0] https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
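To make the scrubber argument in Conclusion 7 concrete, here is a rough toy simulation. It is my own sketch, not anything from the paper: the function name, the parameters, and the model itself (uniformly timed soft errors, exponential wait until the application touches the affected word, a scrubber that sweeps every `scrub_period` hours) are all assumptions for illustration. It counts how many soft errors get reported within a fixed time window at a given access rate.

```python
import random


def reported_soft_errors(n_errors, horizon, access_rate, scrub_period=None, seed=0):
    """Count soft errors detected within `horizon` hours (toy model, hypothetical).

    access_rate  -- mean application accesses per hour to the corrupted word
    scrub_period -- hours between scrubber passes, or None for no scrubber
    """
    rng = random.Random(seed)
    reported = 0
    for _ in range(n_errors):
        # Soft errors strike at random, uncorrelated times.
        t_error = rng.uniform(0, horizon)
        # The application finds the error the next time it reads the word.
        t_access = t_error + rng.expovariate(access_rate)
        # The scrubber finds it on its next periodic pass, if there is one.
        if scrub_period is not None:
            t_scrub = (t_error // scrub_period + 1) * scrub_period
        else:
            t_scrub = float("inf")
        # Whichever happens first detects (and reports) the error.
        if min(t_access, t_scrub) <= horizon:
            reported += 1
    return reported


if __name__ == "__main__":
    for scrub in (None, 10):
        for rate in (0.0001, 0.1):  # low vs. high utilization
            n = reported_soft_errors(2000, 1000, rate, scrub_period=scrub)
            print(f"scrub_period={scrub}, access_rate={rate}: {n} reported")
```

In this model, without a scrubber the reported count rises with utilization simply because more errors get noticed. With a scrubber, every soft error is caught by the next pass, so the count stays the same at any access rate. That is why a utilization-correlated error rate on scrubbed systems points to something other than soft errors.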