Comment by derefr
9 years ago
> ultimately there's very little you can actually do reliably if your memory is lying to you
1. Implement everything in terms of retry-able jobs; ensure that jobs fail when they hit checksum errors.
2. If you've got a bytecode-executing VM, extend it to compare its modules to stored checksums just before it returns from them, and to throw an exception instead of returning if it finds a problem. (This is a lot like Microsoft's stack-integrity protection, but for notionally "read-only" sections rather than read-write sections.)
3. Treat all such checksum failures as a reason to immediately halt the hardware and schedule it for RAM replacement. Ensure that your job-system handles crashed nodes by rescheduling their jobs to other nodes. If possible, also undo the completion of any recently-completed jobs that ran on that node.
4. Run regular "memtest monkey" jobs on all nodes that attempt to trigger checksum failures. To get this to work well, either:
4a. ensure that jobs die often enough, and are scheduled onto nodes in random-enough orders, that no job ever "pins" a section of physical memory indefinitely;
4b. or, alternately, write your own kernel memory-page allocation strategy, to map physical memory pages at random instead of linearly. (Your TLBs will be very full!)
Mind you, steps 3 and 4 only matter to catch persistent bit-errors (i.e. failing RAM); one-time cosmic-ray errors can only really be caught by steps 1 and 2, and even then, only if they happen to affect memory that ends up checksummed.
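To make steps 1 and 3 concrete, here is a rough Python sketch. Every name in it (`run_with_retries`, `fetch`, `ChecksumError`) is hypothetical, and it assumes the known-good checksum was recorded somewhere trustworthy beforehand. A real job system would reschedule the job onto a different node rather than loop in-process; the loop only stands in for that.

```python
import hashlib


class ChecksumError(Exception):
    """Raised when data no longer matches its recorded checksum."""


def checksum(data: bytes) -> str:
    # SHA-256 is an arbitrary choice for the sketch; any strong hash would do.
    return hashlib.sha256(data).hexdigest()


def verified(data: bytes, expected: str) -> bytes:
    # Fail loudly instead of letting a job run on corrupted input.
    if checksum(data) != expected:
        raise ChecksumError("checksum mismatch")
    return data


def run_with_retries(fetch, job, expected: str, max_attempts: int = 3):
    """Run job(fetch()) and retry with a fresh fetch on checksum failure.

    fetch is a hypothetical callable that re-reads the input from its
    source, standing in for "reschedule the job on another node".
    """
    last_error = None
    for _ in range(max_attempts):
        try:
            return job(verified(fetch(), expected))
        except ChecksumError as exc:
            last_error = exc  # a real system would also flag the node here
    raise last_error
```

The point of the shape is that the job itself never sees unverified data, so a checksum failure always surfaces as a failed attempt rather than as a wrong result.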
How do you calculate those checksums without relying on the memory?
The chances of the memory erroring in such a way that the checksum still matches become quite small.
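A small illustration of that claim, as a sketch only: flip every single bit of a buffer in turn and count how many flips leave the checksum unchanged. CRC-32 via `zlib.crc32` is an assumption picked for the example, not something the thread specifies, and the helper names are made up.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    # Return a copy of data with exactly one bit inverted.
    corrupted = bytearray(data)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


def undetected_single_bit_flips(data: bytes) -> int:
    # Count single-bit corruptions whose CRC-32 still equals the original's.
    expected = zlib.crc32(data)
    return sum(
        1
        for bit in range(len(data) * 8)
        if zlib.crc32(flip_bit(data, bit)) == expected
    )


sample = bytes(range(64))
missed = undetected_single_bit_flips(sample)
```

CRC-32 is designed to catch every single-bit error, so `missed` should come out as zero here. None of this helps if the data was already corrupted before the checksum was first computed.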
You can't really, but you are now requiring the error to occur specifically in the memory containing your checksum, rather than anywhere in your data.
It's deeper than that. What are you calculating the checksum of? Is it corrupted already?
If you can't trust your RAM, you have no hard truth to rely on. It's only probabilistic programming or living with the errors.
(Although, rereading the GP, he seems to be talking about corrupted binaries. Yes, you can catch corrupted binaries, but only after they have corrupted some data.)