Comment by Helmut10001
3 months ago
I don't understand why ECC memory is not the norm these days. It is only slightly more expensive, but solves all these problems. Some consumer mainboards even support it already.
No it doesn’t :-)
I’ve had plenty of servers with faulty ECC DIMMs that didn’t trigger any ECC errors and only showed faults under actual memory testing. I had a hard time convincing some of our admins the first time (‘no ECC faults, you can’t be right’), but I won the bet.
Edit: very old paper by Google on these topics. My issues were probably 6-7 years ago.
https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
If we’re being pragmatic, it solves enough problems that you could still call it an undisputed win for stability.
That shouldn’t make sense. It’s not like the ECC info is stored in additional bits separate from the data, it’s built in with the data so you can’t “ignore” it. Hmm, off to read the paper.
The ECC information is stored in separate DRAM devices on the DIMM. This is responsible for some of the increased cost of DIMMs with ECC at a given size. The extra memory for ECC is typically not included in the marketed size of a DIMM, so 32GB DIMMs with and without ECC have different total numbers of DRAM devices.
There's a pretty good set of diagrams and descriptions of the faults in this paper https://dl.acm.org/doi/10.1145/3725843.3756089.
Also to the parent: there's an updated public paper on DDR4 era fault observations https://ieeexplore.ieee.org/document/10071066
I fully agree with you! Neither soft nor hard memory errors were reported, nothing… but bit flips, and reproducible ones at that.
We scanned all our machines following this (a few thousand servers) and found that RAM issues were actually quite common, as the paper says.
I'm sorry, but I, just like your admins, don't believe this. It's theoretically possible to have "undetectable" errors, but it's very unlikely, and you'd see a much higher incidence than this of detected unrecoverable errors and of repaired errors. I just don't buy the argument of "invisible errors".
EDIT: took a look at the paper you linked and it basically says the same thing I did. The probability of these cases becomes increasingly small, and while ECC would indeed not reduce it to _zero_, it would greatly reduce it.
Well, my admins eventually believed me, so I’m fairly comfortable with what I said.
We also had a few thousand physical servers with about a terabyte of RAM each.
You are right: we did see repaired errors, but we also saw (indirectly, and after testing) unrepaired ones.
were they 3-bit flips?
It seems extremely unlikely that you’d end up with a lot of those but no smaller detectable errors.
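To make the point about 3-bit flips concrete, here is a toy sketch of a single-error-correct, double-error-detect (SECDED) code. It is an assumption-laden illustration: it protects 4 data bits with an extended Hamming (8,4) code, whereas real ECC DIMMs use much wider words and vendor-specific codes. The names `encode`, `decode` and `flip` are made up for this example.

```python
from itertools import combinations

# Positions 1, 2, 4 hold Hamming parity, position 0 holds overall parity.
DATA_POS = (3, 5, 6, 7)


def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of 0/1)."""
    code = [0] * 8
    for i, pos in enumerate(DATA_POS):
        code[pos] = (nibble >> i) & 1
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2  # makes the total number of 1s even
    return code


def decode(code):
    """Return (status, nibble); nibble is None when the error is uncorrectable."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    odd_parity = sum(code) % 2 == 1
    if syndrome == 0 and not odd_parity:
        status = "ok"
    elif odd_parity:
        # Odd number of flips: the decoder assumes exactly one and "fixes" it.
        code[syndrome] ^= 1  # syndrome 0 means the overall parity bit itself
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: an even number of flips.
        return "uncorrectable", None
    nibble = sum(code[pos] << i for i, pos in enumerate(DATA_POS))
    return status, nibble


def flip(code, positions):
    """Return a copy of the codeword with the given bit positions inverted."""
    out = list(code)
    for pos in positions:
        out[pos] ^= 1
    return out


word = encode(0b1010)
print(decode(flip(word, [5])))        # one flip: corrected, data intact
print(decode(flip(word, [3, 5])))     # two flips: flagged as uncorrectable
print(decode(flip(word, [3, 5, 6])))  # three flips: reported "corrected", data wrong
```

In this toy code a triple flip looks like a single flip to the decoder, so it shows up as a "repaired" error with wrong data rather than as nothing at all. How often that happens with a real DIMM depends on the actual code used, which this sketch does not model.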
Why? Intel making and keeping it workstation/Xeon-exclusive for a premium for too long. And AMD is still playing along not forcing the issue with their weird "yeah, Zen supports it, but your mainboard may or may not, no idea, don't care, do your own research" stance. These days it's a chicken and egg problem re: price and availability and demand. See also https://news.ycombinator.com/item?id=29838403
Maybe it's high time for some regulation?
E.g. EU enforced mandatory USB-C charging from 2025, and pushes for ending production of combustion engine cars by 2035. Why not just make ECC RAM mandatory in new computers starting e.g. from 2030?
AMD is already one step away from being compliant. So, it's not an outlandish requirement. And regulating will also force Intel to cut their BS, or risk losing the market.
OMG no. Politicians have no business making technological decisions. They make it harder to innovate, i.e. to invent the next generation of ECC with a different name.
Cost. You are about to make computers 10-20% more expensive.
Computers also aren't used much these days, and phones and tablets don't have ECC.
Thanks for the details. I agree and had the same experience trying to figure out whether an AMD motherboard supports ECC or not. It is almost impossible to know before trying it. At least we have ZFS now for parity checks on cold storage.
Bit flips do not only happen inside RAM
Also, in a game, there is a tremendously large chance that any particular bit flip will have exactly 0 effect on anything. Sure you can detect them, but one pixel being wrong for 1/60th of a second isn't exactly ... concerning.
The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully. There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.
> The chance for a bit flip to affect a critical path that is noticeable by the player is very low, and quite a bit lower if you design your game to react gracefully.
Nobody does
> There's a whole practice of writing code for radiation hardened environments that largely consists of strategies for recovering from an impossible to reach state.
And again, nobody does, except for stuff that goes to space and a few critical machines. The closest a normal user will get to code written like that is probably car ECUs; there are even automotive-targeted MCUs that not only run ECC but also run 2 cores in parallel and crash if they disagree.
Sure they do, you just have to think about it a different way.
It boils down to exception handling: you don't expect all of your bugs or security vulnerabilities to be known, so you write your code to react to unplanned states without crashing. Bugs or security vulnerabilities can look a lot like a cosmic ray: a buffer overflow putting garbage in unexpected memory locations vs. a cosmic ray putting garbage in unexpected memory locations. A lot of the mitigations are much the same.
> code for radiation hardened environments
I’m aware of code that detects bit flips via unreasonable value detection (“this counter cannot be this high so quickly”). What else is there?
For safety critical systems, one strategy is to store at least two copies of important data and compare them regularly. If they don't match, you either try to recover somehow or go into a safe state, depending on the context.
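A minimal sketch of the two-copies idea, with some assumptions: the class name `GuardedInt` is made up, and storing the second copy bit-inverted is a common variation I am adding, not something stated above. Python cannot really model raw memory corruption, so the flip is simulated by poking one copy directly.

```python
class CopyMismatch(Exception):
    """Raised when the two stored copies no longer agree."""


class GuardedInt:
    """Keeps an integer twice: once as-is and once bit-inverted."""

    def __init__(self, value):
        self.set(value)

    def set(self, value):
        self._primary = value
        # Inverted copy (an assumption of this sketch): a single stray write
        # is unlikely to corrupt both copies in a consistent way.
        self._shadow = ~value

    def get(self):
        if self._primary != ~self._shadow:
            raise CopyMismatch("copies disagree; recover or enter a safe state")
        return self._primary


speed_limit = GuardedInt(120)
speed_limit._primary ^= 1 << 4  # simulate a bit flip in one copy
try:
    speed_limit.get()
except CopyMismatch:
    print("mismatch detected, entering safe state")
```

Two copies can only detect a mismatch; they cannot tell which copy is the good one, which is why the reaction is "recover somehow or go into a safe state".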
You can have voting systems in place, where at least 2 out of 3 different code paths have to produce the same output for it to be accepted. This can be done with multiple systems (by multiple teams/vendors) or more simply with multiple tries of the same path, provided you fully reload the input in between.
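A rough sketch of 2-out-of-3 voting. The function name `vote_2oo3` and the three checksum variants are hypothetical stand-ins for "different code paths"; in a real system they would be independent implementations or hardware channels.

```python
class NoMajority(Exception):
    """Raised when no two of the three results agree."""


def vote_2oo3(a, b, c):
    """Return the value that at least two of the three results agree on."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    raise NoMajority(f"no two results agree: {a!r}, {b!r}, {c!r}")


# Three deliberately different ways to compute the same 8-bit checksum.
def checksum_a(data):
    return sum(data) % 256


def checksum_b(data):
    total = 0
    for byte in data:
        total = (total + byte) % 256
    return total


def checksum_c(data):
    return sum(data) & 0xFF


payload = bytes([200, 100, 50])
print(vote_2oo3(checksum_a(payload), checksum_b(payload), checksum_c(payload)))
print(vote_2oo3(94, 94 ^ 0b100, 94))  # one path corrupted: the other two outvote it
```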
The simplest one is a watchdog: if something stops sending its regular notifications, restart it.
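A software-only sketch of the watchdog pattern. Real watchdogs are usually hardware timers that reset the whole chip; the `Watchdog` class, its `kick`/`poll` method names and the injectable `clock` parameter are assumptions made so the example is easy to run and test.

```python
import time


class Watchdog:
    """Fires a callback when the supervised task stops checking in."""

    def __init__(self, timeout_s, on_expire, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._on_expire = on_expire
        self._clock = clock
        self._last_kick = clock()

    def kick(self):
        """Called by the supervised task to signal it is still alive."""
        self._last_kick = self._clock()

    def poll(self):
        """Called periodically by the supervisor; returns True if it fired."""
        if self._clock() - self._last_kick <= self._timeout_s:
            return False
        self._on_expire()
        self._last_kick = self._clock()  # give the restarted task a fresh window
        return True


# Demo with a fake clock so the timing is deterministic.
now = [0.0]
restarts = []
dog = Watchdog(2.0, lambda: restarts.append(now[0]), clock=lambda: now[0])

now[0] = 1.0
dog.kick()         # task checks in at t=1.0
now[0] = 2.5
print(dog.poll())  # 1.5 s since the last kick: still fine
now[0] = 4.0
print(dog.poll())  # 3.0 s since the last kick: watchdog fires
print(restarts)
```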
Interesting, I was not aware! Do you have statistics on what percentage of bit flips happen in RAM? My feeling is that it's the majority of bit flips, but I could be wrong.
IEC 61508 estimates a soft error rate of about 700 to 1200 FIT (Failure in Time, i.e. 1E-9 failures/hour).
That was in the 2000s though, and for embedded memory above 65nm. I would expect smaller sizes to be more error-prone.
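For a feel of what those FIT numbers mean, a small back-of-the-envelope sketch. The comment does not say what amount of memory the figure applies to, so `units` below is a placeholder for "however many of those units your system has", and the function names are made up.

```python
HOURS_PER_YEAR = 24 * 365


def failures_per_hour(fit, units=1):
    """Expected failures per hour; 1 FIT is 1e-9 failures per hour per unit."""
    return fit * 1e-9 * units


def mean_hours_between_failures(fit, units=1):
    """Average time between failures, assuming a constant failure rate."""
    return 1.0 / failures_per_hour(fit, units)


for fit in (700, 1200):
    hours = mean_hours_between_failures(fit)
    print(f"{fit} FIT: roughly one failure per {hours / HOURS_PER_YEAR:.0f} years per unit")
```

The rate scales linearly with `units`, so a system containing many such units sees failures proportionally more often.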
It would be quite hard to gather that data, and it would be highly dependent on the hardware and the source of the bit flip.
But there's volatile and nonvolatile memory all over a computer, and anywhere data is in flight, be it inside the CPU or in any wires, traces, or other chips along the data path, it can be subject to interference, cosmic rays, heat- or voltage-related errors, etc.
In case of Intel it's mostly coz they want to sell it as enterprise/workstation feature and make people pay extra.
AMD has been better on it but BIOS/mobo vendors not so much
Well for DDR5 that's 25% more chips which isn't great even if you don't get ripped off by market segmentation.
It's possible DDR6 will help. If it gets the ability to do ECC over an entire memory access like LPDDR, that could be implemented with as little as 3% extra chip space.
Why 25%, shouldn't it be 12.5%? 8 ECC bits for every 64 bits.
DDR5 ECC RDIMMs (R=registered) have 16 extra bits. From the specifications for Kingston's KSM64R52BS8-16MD [1]:
> x80 ECC (x40, 2 independent I/O sub channels)
On the other hand ECC UDIMMs (U=unbuffered) have only 8. From the specifications for Kingston's KSM56E46BS8KM-16HA [2]:
> x72 ECC (x36, 2 independent I/O sub channels)
Though if I remember correctly, the specifications for the older DDR4 ECC RDIMMs mention only 72 bits.
[1]: https://www.kingston.com/datasheets/KSM64R52BS8-16HA.pdf
[2]: https://www.kingston.com/datasheets/KSM56E46BS8KM-16HA.pdf
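The 25% vs 12.5% figures fall straight out of those bus widths. A quick sketch of the arithmetic, assuming 64 data bits per DIMM (two 32-bit sub-channels on DDR5); the function name `ecc_overhead` is made up.

```python
DATA_BITS = 64  # data width per DIMM: 2 x 32-bit sub-channels on DDR5


def ecc_overhead(total_bits, data_bits=DATA_BITS):
    """Extra bus width as a fraction of the data width."""
    return (total_bits - data_bits) / data_bits


for label, width in (("DDR5 ECC RDIMM (x80)", 80), ("DDR5 ECC UDIMM (x72)", 72)):
    extra = width - DATA_BITS
    print(f"{label}: {extra} ECC bits, {ecc_overhead(width):.1%} overhead")
```

This prints 16 ECC bits and 25.0% for x80, and 8 ECC bits and 12.5% for x72.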
And checksummed filesystems.
What I'm wondering, even without ECC, afaik standard ram still has a parity bit, so a single flip should be detected. With ECC it would be fixed, without ECC it would crash the system. For it to get through and cause an app to malfunction you need two bit flips at least.
I think standard RAM used to have it a long, long time ago, but not anymore. DDR5 finally re-adds it, sort of.
Yes, 30 pin SIMMs (the most common memory format from the mid-80s to the mid-90s) came in either '8 chip' or '9 chip' variants - the 9th chip being for the parity bit.
Most motherboards supported both, and the choice of which to use came down to the cost differential at the time of building a particular machine. The wild swings in DRAM prices meant that this could go from being negligible to significant within the course of a year or two!
When 72 pin SIMMs were introduced, they could in theory also come in a parity version but in reality that was fairly rare (full ECC was much better, and only a little more expensive). I don't think I ever saw an EDO 72 pin SIMM with parity, and it simply wasn't an option for DIMMs and later.
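For anyone who hasn't seen it, this is roughly what that 9th bit bought you. It is a sketch of plain even parity over one byte, with made-up function names; it is not how any particular memory controller implemented it.

```python
def parity_bit(byte):
    """Even-parity bit for one byte: 1 if the byte has an odd number of 1 bits."""
    return bin(byte & 0xFF).count("1") % 2


def passes_parity(byte, stored_parity):
    """True if the byte still matches the parity bit stored alongside it."""
    return parity_bit(byte) == stored_parity


original = 0b10110010
stored = parity_bit(original)

print(passes_parity(original, stored))               # untouched: check passes
print(passes_parity(original ^ 0b00000001, stored))  # one flipped bit: check fails
print(passes_parity(original ^ 0b00000011, stored))  # two flipped bits: check passes, error missed
```

Parity can only say "something is wrong" for an odd number of flips; it cannot say which bit, so it cannot repair anything.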
Wrong. Regular RAM has no parity bit.
Talk to someone in consumer sales about customer priorities. A slightly cheaper computer? Or one which is, in theory, more resilient against some rare, random sort of problem that customers do not see as affecting them?