Comment by Agingcoder

3 months ago

No it doesn’t :-)

I’ve had plenty of servers with faulty ecc dimms that didn’t trigger , and would only show faults when actual memory testing. I had a hard time convincing some of our admins the first time ( ‘no ecc faults you can’t be right ‘ ) but I won the bet.

Edit: very old paper by google on these topics. My issues were 6-7 years ago probably.

https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

If we’re being pragmatic, it solves enough problems that you could still call it an undisputed win for stability.

That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.

  • The ECC information is stored in separate DRAM devices on the DIMM. This is responsible for some of the increased cost of DIMMs with ECC at a given size. When marketed the extra memory for ECC are typically not included in the size for DIMMs so a 32GB DIMM with and without ECC will have differing numbers of total DRAM devices.

    There's a pretty good set of diagrams and descriptions of the faults in this paper https://dl.acm.org/doi/10.1145/3725843.3756089.

    Also to the parent: there's an updated public paper on DDR4 era fault observations https://ieeexplore.ieee.org/document/10071066

    • I think you responded to the wrong person, unless you think I was implying that the extra bits needed for ECC didn’t need extra space at all? I wasn’t suggesting that - just that they aren’t like a checksum that is stored elsewhere or something that can be ignored - the whole 72 bits are needed to decode the 64 bits of data and the 64 bits of data cannot be read independently.

      2 replies →

  • I fully agree with you ! Neither soft nor hard memory errors, nothing… but but flips ,and reproducible at that.

    We scanned all our machines following this ( a few thousand servers ) and found out that ram issues were actually quite common, as said in the paper.

I'm sorry, but I, just like your admins, don't believe this. It's theoretically possible to have "undetectable" errors, but it's very unlikely and you'd see a much higher than this incidence of detected unrecoverable errors and you'd see a much higher incidence than this of repaired errors. I just don't buy the argument of "invisible errors".

EDIT: took a look on the paper you linked and it basically says the same thing I did. The probability of these cases becomes increasingly and increasingly small and while ECC would indeed, not reduce it to _zero_ it would greatly greatly reduce it.

  • Well my admins eventually believed me , so I’m fairly comfortable with what I said.

    We also had a few thousands of physical servers with about of terabyte of ram each.

    You are right : we did see repaired errors, but we also saw (indirectly, and after testing ) unrepaired ones

    • Ok, I am sure there is _some_ amount of unrepairable errors.

      But the initial discussion was that ECC ram makes it go away and your point that it doesn't. And the vast vast majority of the errors, according to my understanding and to the paper you pointed to, are repairable. About 1 out of 400 ish errors are non-repairable. That's a huge improvement! If you had ECC ram, the failures Firefox sees here would drop from 10% to 0.025%! That is highly significant!

      Even more! 2 bit errors now you would be informed of! You would _know_ what is wrong.

      You could have 3(!) bit errors and this you might not see, but they'd be several orders of magnitude even rarer.

      So yes, it would not 100% go away, but 99.9 % go away. That's... Making it go away in my book.

      And last but not least, this paper mentions uncorrectable errors. It says nothing of undetectable ecc errors! You said _undetectable_ errors. I'm sure they happen, but would be surprised if you have any meaningful incidence of this, even at terabytes of data. It's probay on the order of 0.000625 of errors you can get ( but if you want I can do more solid math)

      4 replies →