Comment by nostrademons
9 years ago
While I was at Google, someone asked one of the very early Googlers (I think it was Craig Silverstein, but it may've been Jeff Dean) what was the biggest mistake in their Google career, and they said "Not using ECC memory on early servers." If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.
It saved a few bucks during a period when Google's hardware costs were rising rapidly, but the knock-on effects on system design cost much more than that in lost engineering time. Data integrity is one engineering constraint that should be pushed as low down the stack as is reasonably possible, because as you go higher up the stack, the potential causes of corrupted data multiply.
Google has done extensive studies[1]. There is roughly a 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about. However, if you are in a data center with 100K machines, each with 8 DIMMs, you are looking at about 6K machines experiencing RAM errors each day. Now, if data is being replicated, these errors can propagate corrupted data in unpredictable, unexplainable ways even when there are no bugs in your code! For example, you might find your logs containing bad line items that get aggregated into a report showing bizarre numbers, because 0x1 turned into 0x10000001. You can imagine that debugging this every day would be a huge nightmare, and developers would eventually end up inserting lots of data-consistency asserts all over the place. So ECC becomes important once you have a large-scale distributed system.
1: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
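For the curious, the 0x1 → 0x10000001 corruption above is exactly one flipped bit; a quick sketch:

```python
# A single-event upset flips one bit; here, bit 28 of the value 0x1.
value = 0x1
corrupted = value ^ (1 << 28)  # one bit flip, with no bug anywhere in the code
print(hex(corrupted))  # 0x10000001
```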
That data set covers 2006-2009, and the RAM consisted of 1-4GB DDR2 modules running at 400-800 MT/s. Back when 4GB was considered a beefy desktop, consumers could get away with a few bit flips over the lifetime of the machine. Now my phone has that much RAM, and a beefy desktop has 16-32 GB of RAM running at around 3200 MT/s.
It's time we start trading off some of the generous speed and capacity gains for error correction.
Note that the error rate is not proportional to the amount of RAM; it is proportional to the physical volume of the RAM chips. (The primary mechanism that causes errors is highly energetic particles hitting the chips, and the chance of a hit is proportional to the chips' volume.) This means that the error rate per bit goes down as density goes up.
6 replies →
That's a 3% per DIMM per year chance of at least one error. Most memory faults are persistent and cause errors until the DIMM is replaced. Also, the error rate was only that low for the smallest DDR2 DIMMs.
I have hit soft errors on every desktop machine I've used that had ECC. Either I have bad luck, ECC causes the errors, or some third thing. I think ECC should be mandated for anything except toys and video players.
> I have hit soft errors in every desktop machine that used ECC.
Not sure if I should start getting nervous or if your RAM just sucks ;) I get ECC errors only if I overclock too much, and I run my RAM overclocked all the time. That's actually one of the reasons I wanted ECC.
1 reply →
How much more expensive is ECC RAM? I don't have it and I've never experienced obvious issues. If it's a lot more expensive, it's not really worth it for the one or two actual issues the desktop will likely experience.
12 replies →
> There is roughly 3% chance of error in RAM per DIMM per year. […] with 100K machines each with 8 DIMM, you are looking at about 6K machines experiencing RAM errors each day.
Can you work out the math? I don't follow it. 3%×100K×8÷365=66 per day by my reasoning…
they've multiplied by 3 instead of 0.03
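A sketch of the arithmetic under both readings (3% per DIMM per year, 100K machines with 8 DIMMs each):

```python
machines = 100_000
dimms_per_machine = 8
p_error_per_dimm_per_year = 0.03

# Expected DIMM errors per day with the rate read correctly as 0.03:
errors_per_day = machines * dimms_per_machine * p_error_per_dimm_per_year / 365
print(round(errors_per_day))  # 66

# The ~6K/day figure falls out if "3%" is mistakenly used as 3:
wrong_errors_per_day = machines * dimms_per_machine * 3 / 365
print(round(wrong_errors_per_day))  # 6575
```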
> There is roughly 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about.
How do you make that leap?
It's an inappropriate leap. Consumers should have ECC memory too.
However the consumer market has long decided to settle for ECC nowhere and cheap everywhere.
ECC hardware comes at a premium that can easily be +100%. You need support in the memory, the motherboard, and the CPU.
Given the price difference, personal computers will have to live with the memory errors. People will not pay double for their computers. Manufacturers will not sacrifice their margin while they can segment the market and make a ton of money off ECC.
9 replies →
I'd like to know this, too.
I am guessing it's because, if RAM errors increase linearly with the number of computers, then RAM errors will make up a greater and greater proportion of total errors. This assumes other kinds of errors don't scale linearly. Someone looking through logs is looking for errors; they'd like to find fixable logic errors, not inevitable RAM errors.
A cost/benefit analysis for a system performing non-critical operations would seem to favor non-ECC memory. I suspect this is the case for the majority of people who have computers for personal use, even before taking into account that they might not be aware such a thing exists. Although I haven't compared ECC prices lately.
1 reply →
Probably assumptions about how PCs are used. I'd imagine most of the bits are media-related.
Because the market.
This makes me wonder how banks deal with this issue.
> If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.
Details of this would be very interesting, but obviously I understand if you cannot provide such details due to NDAs, etc.
I mean, I can imagine a few mitigations (pervasive checksumming, etc), but ultimately there's very little you can actually do reliably if your memory is lying to you[1]. I can imagine that probabilistic programming would be an option, but it's hardly "mainstream" nor particularly performant :)
I'm also somewhat dismayed at the price premium that Intel are charging for basic ECC support. This is a case where AMD really is a no-brainer for commodity servers unless you're looking for single-CPU performance.
[1] Incidentally also true of humans.
You need ECC /and/ pervasive checksumming. There are too many stages of processing where errors can occur - for example, disk controllers or networks. The TCP checksum is a bit of a joke at 16 bits (it will fail to detect roughly 1 in 65536 errors), and even the Ethernet CRC can fail - you need end-to-end checksums.
http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
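One concrete weakness, as a sketch (RFC 1071-style checksum, not production code): the Internet checksum is a ones'-complement sum, so reordering 16-bit words goes completely undetected:

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement sum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF

original = b"\x12\x34\x56\x78"
reordered = b"\x56\x78\x12\x34"  # the two 16-bit words swapped
assert original != reordered
# Addition is commutative, so the checksum cannot see the reordering:
assert internet_checksum(original) == internet_checksum(reordered)
```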
I did a bunch of protocol-level design in the '90s, and one of the handful of things that taught me was: _ALWAYS_ use at least a CRC with a standard polynomial. Skipping it is just not worth it. In the 2000s I relearned the lesson with data at rest (on disk, etc.). If nothing else, both of those will catch "bugs" rather than silently corrupting things and leading to mysteries long after the initial data was corrupted.
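For illustration (using Python's zlib as a stand-in): a CRC with a standard polynomial is position-sensitive, so it catches the kind of word swap that a 16-bit ones'-complement checksum misses:

```python
import zlib

original = b"\x12\x34\x56\x78"
reordered = b"\x56\x78\x12\x34"  # a word swap invisible to a ones'-complement sum

# CRC-32 (standard polynomial 0x04C11DB7) depends on byte positions,
# so reordering the data changes the checksum.
assert zlib.crc32(original) != zlib.crc32(reordered)
```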
I just had this discussion (about why TCP's checksum was a huge mistake) a couple days ago. That link is going to be useful next time it comes up.
Too many stages... for what? You haven't stated what the criteria for 'recovery' (for lack of a better word) are. What is the (intrinsic) value of the data?
Personally, I'm a bit of a hoarder of data, but honestly, if X-proportion of that data were to be lost... it probably wouldn't actually affect my life substantially even though I feel like it would be devastating.
CRC checksums can miss some multi-bit errors, e.g. inserted runs of zeros while the CRC register holds zero (they leave the polynomial computation unchanged). http://noahdavids.org/self_published/CRC_and_checksum.html
But CRC is good at detecting single-bit errors.
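The "runs of zeros" caveat applies specifically to a CRC whose register is initialized to zero: leading zero bytes leave it at zero. Standard CRC-32 (as used by zlib) initializes to all-ones for exactly this reason. A sketch, with a hand-rolled zero-init variant for comparison:

```python
import zlib

def crc32_zero_init(data: bytes) -> int:
    """Bitwise CRC-32 (polynomial 0x04C11DB7) with a zero initial register."""
    crc = 0
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ 0x04C11DB7) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc

msg = b"payload"
# A zero-initialized CRC cannot see leading zero bytes...
assert crc32_zero_init(b"\x00\x00" + msg) == crc32_zero_init(msg)
# ...while zlib's CRC-32 (all-ones init) detects them.
assert zlib.crc32(b"\x00\x00" + msg) != zlib.crc32(msg)
```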
> ultimately there's very little you can actually do reliably if your memory is lying to you
1. Implement everything in terms of retry-able jobs; ensure that jobs fail when they hit checksum errors.
2. if you've got a bytecode-executing VM, extend it to compare its modules to stored checksums, just before it returns from them; and to throw an exception instead of returning if it finds a problem. (This is a lot like Microsoft's stack-integrity protection, but for notionally "read-only" sections rather than read-write sections.)
3. Treat all such checksum failures as a reason to immediately halt the hardware and schedule it for RAM replacement. Ensure that your job-system handles crashed nodes by rescheduling their jobs to other nodes. If possible, also undo the completion of any recently-completed jobs that ran on that node.
4. Run regular "memtest monkey" jobs on all nodes that attempt to trigger checksum failures. To get this to work well, either:
4a. ensure that jobs die often enough, and are scheduled onto nodes in random-enough orders, that no job ever "pins" a section of physical memory indefinitely;
4b. or, alternately, write your own kernel memory-page allocation strategy, to map physical memory pages at random instead of linearly. (Your TLBs will be very full!)
Mind you, steps 3 and 4 only matter to catch persistent bit-errors (i.e. failing RAM); one-time cosmic-ray errors can only really be caught by steps 1 and 2, and even then, only if they happen to affect memory that ends up checksummed.
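Step 1 might look something like this minimal sketch (the `fetch`/`work` split and retry policy are hypothetical names, not any real job system's API; a real scheduler would also move retries to a different node, per step 3):

```python
import zlib

class ChecksumError(Exception):
    """Raised when every attempt saw corrupted input."""

def run_job(fetch, work, expected_crc: int, max_attempts: int = 3):
    """Re-fetch the payload on each attempt; fail the attempt on CRC mismatch."""
    for _ in range(max_attempts):
        payload = fetch()  # re-read from storage/network each time
        if zlib.crc32(payload) != expected_crc:
            continue  # corrupted read: fail this attempt and retry
        return work(payload)
    raise ChecksumError("payload failed its CRC on every attempt")

# Usage: a job that counts bytes in a (hypothetical) log payload.
payload = b"log line 42\n"
result = run_job(lambda: payload, len, zlib.crc32(payload))
assert result == len(payload)
```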
How do you calculate those checksums without relying on the memory?
4 replies →
Pervasive checksumming is going to cost a lot of CPU and touch a lot of memory. The data could also be right and the checksum wrong. ECC double-bit errors are recognized, and you can handle them however you'd like, including killing the affected process.
I agree, which is why I used the word "mitigation", as in: not a solution.
Probabilistic programming is a theoretical possibility, but not really practical.
it was indeed Craig
Given that cosmic radiation is one source of memory errors, shouldn't better computer cases reduce memory errors?
Basically a tin-foil (or lead-foil) hat for my computer?