Comment by sytelus

9 years ago

Google has done extensive studies[1]. There is roughly 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about. However, if you are in a data center with 100K machines each with 8 DIMM, you are looking at about 6K machines experiencing RAM errors each day. Now if data is being replicated, these errors can propagate corrupted data in unpredictable, unexplainable ways even when there are no bugs in your code! For example, you might find bad line items in your logs that get aggregated into a report showing bizarre numbers because 0x1 turned into 0x10000001. You can imagine that debugging this every day would be a huge nightmare, and developers would eventually end up inserting lots of asserts for data consistency all over the place. So ECC becomes important if you have a large-scale distributed system.

1: http://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf
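
To make the 0x1 versus 0x10000001 example concrete: the two values differ in exactly one bit, which is what a single bit-flip in RAM would produce. A minimal Python sketch (the variable names are mine, not from the comment):

```python
# Sketch: the 0x1 -> 0x10000001 corruption is one flipped bit.
good = 0x1
bad = 0x10000001

diff = good ^ bad                    # XOR keeps only the bits that differ
flipped_bits = bin(diff).count("1")  # how many bits changed
bit_index = diff.bit_length() - 1    # position of the highest changed bit

print(f"bits flipped: {flipped_bits}, at bit position {bit_index}")
print(f"value went from {good} to {bad}")
```

One flipped high-order bit turns a count of 1 into a number in the hundreds of millions, which is why the aggregated report looks bizarre rather than slightly off.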

That data set covers 2006-2009, and the RAM consisted of 1-4GB DDR2 running at 400-800 MT/s. Back when 4GB was considered a beefy desktop, consumers could get away with a few bit-flips during the lifetime of the machine. Now my phone has that much RAM, and a beefy desktop has 16-32 GB of RAM running at 3 GT/s.

It's time we start trading off the generous speed and capacity gains for some error correction.

  • Note that the error rate is not proportional to the amount of RAM; it is proportional to the physical volume of the RAM chips. (The primary mechanism that causes errors is highly energetic particles hitting the chips, and the chance that this happens is proportional to the volume of the chips.) This means that the error rate per bit goes down as density goes up.

    • Cosmic rays causing the errors has got me wondering whether the error rate varies with time.

      Do you get more or fewer errors in the daytime (due to the Sun)? Does the season affect it (axial tilt means you're more or less "in view" of the galactic core)?

    • Wouldn't it go up if the density increases? If the particle hits the chip there are more bits at the place where the particle hits.

      So while the chance of a hit is lower (per GB), if it does hit, its effect will be greater (more bits flipped).


That's a 3% per DIMM per year chance of at least one error. Most memory faults are persistent and cause errors until the DIMM is replaced. Also, the error rate was only that low for the smallest DDR2 DIMMs.

I have hit soft errors in every desktop machine that used ECC. Either I have bad luck, ECC causes the errors, or it's some third thing. I think ECC should be mandated for anything except toys and video players.

  • > I have hit soft errors in every desktop machine that used ECC.

    Not sure if I should start getting nervous or if your RAM just sucks ;) I get ECC errors only if I overclock too much, and I run the RAM overclocked all the time. It's actually one of the reasons I wanted ECC.

    • Different RAM, and more soft errors the older a system gets. Heh, the system should auto-overclock until it starts to get correctable soft errors and then back off. Or reduce refresh until soft errors appear and then bump it up. Max speed at the lowest power.

  • How much more expensive is ECC RAM? I don't have it and I've never experienced obvious issues. If it's a lot more expensive, it's not really worth it for the once or twice the desktop will likely experience an actual issue.

    • Should be about 1/8th more, since it's just a 72-bit bus carrying 64 bits of data and 8 check bits. Or rather, your DIMM will have 9 chips instead of 8.

      How they get you is that Intel will sell you a Xeon, which is the exact same die as an i5 in a different package, for more money.
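
The 1/8th figure can be checked with a few lines of Python. This sketch assumes the 8 check bits come from a Hamming-style SECDED code (single-error correction, double-error detection), which is the usual scheme but is not stated in the comment; the function name is hypothetical.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SECDED code (assumed scheme, see above)."""
    r = 0
    # Hamming bound for single-error correction: 2^r >= data_bits + r + 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1

data_bits = 64
check_bits = secded_check_bits(data_bits)
overhead = check_bits / data_bits

print(f"{data_bits} data bits need {check_bits} check bits")
print(f"bus width: {data_bits + check_bits} bits, overhead: {overhead:.1%}")
```

Under that assumption, 64 data bits need 8 check bits, giving the 72-bit bus and the ninth chip on the DIMM.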


    • It's significantly more expensive, usually around 30-100% more, depending on capacity. IMO not worth it on a desktop, possibly worth it on a home server or a serious workstation. Plus your CPU and motherboard have to support it, which is a pain with Intel's consumer lineup.


    • Usually it's cheaper because of the surplus from the server market's forced upgrade cycle. The problem is that it's mostly Buffered/Registered ECC, which can't be used in desktop motherboards.

> There is roughly 3% chance of error in RAM per DIMM per year. […] with 100K machines each with 8 DIMM, you are looking at about 6K machines experiencing RAM errors each day.

Can you work out the math? I don't follow it. 3%×100K×8÷365=66 per day by my reasoning…
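
Here is that arithmetic as a short Python sketch. It assumes the 3% applies independently to each DIMM and that errors are spread evenly over the year; both are simplifications of mine, not claims from the study.

```python
# Inputs taken from the quoted comment.
dimm_error_rate_per_year = 0.03
machines = 100_000
dimms_per_machine = 8

# Expected DIMMs hitting an error per day (3% x 100K x 8 / 365).
total_dimms = machines * dimms_per_machine
dimms_per_day = dimm_error_rate_per_year * total_dimms / 365

# Chance that a machine has at least one affected DIMM in a year,
# assuming its 8 DIMMs fail independently.
p_machine = 1 - (1 - dimm_error_rate_per_year) ** dimms_per_machine
machines_per_day = p_machine * machines / 365

print(f"DIMMs per day:    {dimms_per_day:.1f}")
print(f"machines per day: {machines_per_day:.1f}")
```

Either way of counting lands at roughly 60 to 66 per day, not 6K.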

> There is roughly 3% chance of error in RAM per DIMM per year. That doesn't justify buying ECC if you have just one personal computer to worry about.

How do you make that leap?

  • It's an inappropriate leap. Consumers should have ECC memory too.

    However, the consumer market has long since decided to settle for ECC nowhere and cheap everywhere.

    ECC hardware comes at a premium that can easily be +100%. You need support in the memory, the motherboard and the CPU.

    Given the price difference, personal computers will have to live with the memory errors. People will not pay double for their computers. Manufacturers will not sacrifice their margin while they can segment the market and make a ton of money off ECC.

    • Bristol Ridge does support ECC BTW, but one problem is that you can't use ECC with x16 chips (because ECC is 72-bit), so with 8GB of RAM and 8Gbit chips you have to choose between non-ECC/ECC single channel with x8 chips and non-ECC dual channel with x16 chips. 4Gbit chips don't have this problem but will become obsolete, especially when 18nm ramps up, and while DRAM prices should decline when that happens...


  • I'd like to know this, too.

    I am guessing it's because, if RAM errors increase linearly with the number of computers, then RAM errors will be a greater and greater proportion of total errors. This assumes other kinds of errors don't scale linearly. Someone looking through logs is looking for errors; they'd like to find fixable logic errors, not inevitable RAM errors.

  • A cost/benefit analysis for a system where non-critical operations are performed would seem to favor non-ECC memory. I suspect this is the case for the majority of people who have computers for their personal use, even before taking into account that they might not be aware such a thing exists. Although, I haven't compared ECC prices lately.

    • Your game machine can live without ECC.

      Your NAS had better have it, though.