Comment by lomnakkus
9 years ago
> If you look through the source code & postmortems from that era of Google, there are all sorts of nasty hacks and system design constraints that arose from the fact that you couldn't trust the bits that your RAM gave back to you.
Details of this would be very interesting, but obviously I understand if you cannot provide such details due to NDAs, etc.
I mean, I can imagine a few mitigations (pervasive checksumming, etc), but ultimately there's very little you can actually do reliably if your memory is lying to you[1]. I can imagine that probabilistic programming would be an option, but it's hardly "mainstream" nor particularly performant :)
I'm also somewhat dismayed at the price premium that Intel are charging for basic ECC support. This is a case where AMD really is a no-brainer for commodity servers unless you're looking for single-CPU performance.
[1] Incidentally also true of humans.
You need ECC /and/ pervasive checksumming. There are too many stages of processing where errors can occur, for example disk controllers or networks. The TCP checksum is a bit of a joke at 16 bits (it will fail to detect roughly 1 in 65,536 random corruptions), and even the Ethernet CRC can fail. You need end-to-end checksums.
http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html
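To make the weakness concrete, here is a minimal sketch of the 16-bit one's-complement sum that TCP uses (the RFC 1071 style), compared with CRC-32 from Python's `zlib`. The function name `internet_checksum` and the sample payloads are made up for illustration. Because the sum is just addition of 16-bit words, swapping two words leaves it unchanged, while a CRC notices:

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement sum in the style of RFC 1071 (illustrative)."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


original = b"ABCDEFGH"
swapped = b"CDABEFGH"  # the first two 16-bit words have traded places

# Addition is commutative, so the reordering is invisible to the checksum.
same_sum = internet_checksum(original) == internet_checksum(swapped)

# CRC-32 depends on bit positions, so it catches this reordering.
same_crc = zlib.crc32(original) == zlib.crc32(swapped)

print(same_sum, same_crc)
```

This only shows one class of undetected error; it is not a measurement of real-world failure rates.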
I did a bunch of protocol-level design in the '90s, and one of the handful of things that taught me was to _ALWAYS_ use at least a CRC with a standard polynomial. It's just not worth skipping. In the 2000s I relearned the lesson for data at rest (on disk, etc.). If nothing else, both of those will catch "bugs" rather than silently corrupting things and leading to mysteries long after the initial data was corrupted.
I just had this discussion (about why TCP's checksum was a huge mistake) a couple days ago. That link is going to be useful next time it comes up.
Too many stages... for what? You haven't stated what the criteria for 'recovery' (for lack of a better word) are. What is the (intrinsic) value of the data?
Personally, I'm a bit of a hoarder of data, but honestly, if X-proportion of that data were to be lost... it probably wouldn't actually affect my life substantially even though I feel like it would be devastating.
CRC checksums can miss some multi-bit errors, such as runs of zeros: while the CRC register is all zeros, feeding in more zero bits leaves it at zero, so extra or missing zeros go undetected. http://noahdavids.org/self_published/CRC_and_checksum.html
But a CRC is good for catching single-bit errors.
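One reading of the runs-of-zeros problem, sketched below under stated assumptions: with a CRC whose register starts at zero, extra leading zero bytes never change the register, so they go unnoticed. The `crc16` helper is a hand-written bitwise CRC-16 with polynomial 0x1021 (not any particular library's API), and the message bytes are invented for the example. Starting the register at a non-zero value, as many standard CRCs do, makes the same insertion visible:

```python
def crc16(data: bytes, init: int = 0x0000, poly: int = 0x1021) -> int:
    """Bitwise MSB-first CRC-16 (illustrative helper, polynomial 0x1021)."""
    crc = init
    for byte in data:
        crc ^= byte << 8  # bring the next byte into the top of the register
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ poly) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


message = b"payload"
padded = b"\x00\x00" + message  # two spurious zero bytes at the front

# Zero initial value: the register stays at zero through the extra bytes.
missed = crc16(message, init=0x0000) == crc16(padded, init=0x0000)

# Non-zero initial value: the extra zero bytes disturb the register.
caught = crc16(message, init=0xFFFF) != crc16(padded, init=0xFFFF)

print(missed, caught)
```

Single-bit flips are still caught in both configurations, which matches the point above.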
> ultimately there's very little you can actually do reliably if your memory is lying to you
1. Implement everything in terms of retry-able jobs; ensure that jobs fail when they hit checksum errors.
2. If you've got a bytecode-executing VM, extend it to compare its modules to stored checksums just before it returns from them, and to throw an exception instead of returning if it finds a problem. (This is a lot like Microsoft's stack-integrity protection, but for notionally "read-only" sections rather than read-write sections.)
3. Treat all such checksum failures as a reason to immediately halt the hardware and schedule it for RAM replacement. Ensure that your job-system handles crashed nodes by rescheduling their jobs to other nodes. If possible, also undo the completion of any recently-completed jobs that ran on that node.
4. Run regular "memtest monkey" jobs on all nodes that attempt to trigger checksum failures. To get this to work well, either:
4a. ensure that jobs die often enough, and are scheduled onto nodes in random-enough orders, that no job ever "pins" a section of physical memory indefinitely;
4b. or, alternately, write your own kernel memory-page allocation strategy, to map physical memory pages at random instead of linearly. (Your TLBs will be very full!)
Mind you, steps 3 and 4 only matter to catch persistent bit-errors (i.e. failing RAM); one-time cosmic-ray errors can only really be caught by steps 1 and 2, and even then, only if they happen to affect memory that ends up checksummed.
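A rough sketch of how steps 1 and 3 might fit together, assuming a toy scheduler in which a "node" is just a function that hands data to the job. Every name here (`ChecksumError`, `run_with_retries`, `flaky_node`, and so on) is hypothetical, and real job systems are far more involved:

```python
import zlib


class ChecksumError(Exception):
    """Raised when data no longer matches its stored checksum."""


def verify(payload: bytes, expected_crc: int) -> bytes:
    if zlib.crc32(payload) != expected_crc:
        raise ChecksumError("payload does not match stored CRC-32")
    return payload


def healthy_node(data: bytes) -> bytes:
    return data  # memory returns what was stored


def flaky_node(data: bytes) -> bytes:
    corrupted = bytearray(data)
    corrupted[0] ^= 0x01  # simulate a single flipped bit in RAM
    return bytes(corrupted)


def make_job(payload: bytes, expected_crc: int):
    def job(node):
        data = node(payload)        # the node may corrupt the data
        verify(data, expected_crc)  # step 1: fail the job on a mismatch
        return data.upper()
    return job


def run_with_retries(job, nodes):
    """Try the job on each node in turn; collect nodes that failed a check."""
    quarantined = []
    for node in nodes:
        try:
            return job(node), quarantined
        except ChecksumError:
            quarantined.append(node)  # step 3: pull the node for RAM replacement
    raise RuntimeError("job failed on every available node")


payload = b"hello"
job = make_job(payload, zlib.crc32(payload))
result, quarantined = run_with_retries(job, [flaky_node, healthy_node])
print(result, [n.__name__ for n in quarantined])
```

The sketch only catches corruption in data that is actually checksummed, which is the same limitation noted above.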
How do you calculate those checksums without relying on the memory?
The chance of the memory erroring in such a way that the checksum still matches becomes quite small.
You can't really, but you are now requiring the error to occur specifically in the memory containing your checksum, rather than anywhere in your data.
Pervasive checksumming is going to cost a lot of CPU and touch a lot of memory. It could also be that the data is right and the checksum is wrong. With ECC, double-bit errors are detected, and you can handle them however you'd like, including killing the affected process.
I agree, which is why I used the word "mitigation", as in: not a solution.
Probabilistic programming is a theoretical possibility, but not really practical.