Comment by toast0
2 hours ago
ZFS says "once I've committed to disk, if the data changes, I'll let you know".
This holds whether or not you have RAM errors.
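To make that promise concrete, here is a toy sketch of checksum-on-write, verify-on-read. This is not how ZFS is implemented; `ToyStore`, `flip_bit`, and `corrupt_after_commit` are hypothetical names invented for illustration, and SHA-256 is used only because it ships with Python. It shows that a change after commit gets flagged, while data that was already flipped in RAM before the write is checksummed as-is and reads back without complaint.

```python
import hashlib


def flip_bit(data: bytes, bit: int) -> bytes:
    """Return a copy of data with a single bit flipped."""
    buf = bytearray(data)
    buf[bit // 8] ^= 1 << (bit % 8)
    return bytes(buf)


class ToyStore:
    """Hypothetical block store: checksum on write, verify on read."""

    def __init__(self):
        self._blocks = {}

    def write(self, key: str, data: bytes) -> None:
        # The checksum covers whatever bytes we were handed, right or wrong.
        self._blocks[key] = (data, hashlib.sha256(data).digest())

    def corrupt_after_commit(self, key: str, bit: int = 0) -> None:
        # Simulate the stored data changing after it was committed.
        data, checksum = self._blocks[key]
        self._blocks[key] = (flip_bit(data, bit), checksum)

    def read(self, key: str) -> bytes:
        data, checksum = self._blocks[key]
        if hashlib.sha256(data).digest() != checksum:
            raise IOError(f"checksum mismatch on block {key!r}")
        return data


store = ToyStore()

# Case 1: data changes after commit, so the read reports it.
store.write("a", b"hello")
store.corrupt_after_commit("a")
try:
    store.read("a")
    detected_after_commit = False
except IOError:
    detected_after_commit = True

# Case 2: a bit flipped in RAM before the write is stored faithfully.
bad_in_ram = flip_bit(b"hello", 0)
store.write("b", bad_in_ram)
read_back = store.read("b")

print(detected_after_commit, read_back == bad_in_ram)
```

The second case is the part ECC is for: the store keeps its promise, but the promise only starts at commit.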
I will say that the reported error rate of 5 bit errors per 8 GB per hour in 8% of installed RAM seems incredibly high compared to my experience running a fleet of about one to three thousand machines with 64-768 GB of ECC RAM each. At that rate, a thousand machines with 64 GB of RAM each should have been showing about 3000 bit errors per hour, but ECC reports were rare. Most machines went through their 3-5 year life without reporting any correctable errors. Of the small handful of machines that had errors, most went from no errors to a concerning number of errors in a short time and were shut down to have their RAM replaced. A few threw uncorrectable errors; most of those threw a second uncorrectable error shortly thereafter and had their RAM replaced. One or two would do about one correctable error per day, and we let those run. One, maybe two, had so many correctable errors that the machine check exceptions caused operational problems that didn't make sense until the hourly ECC report came up with a huge number.
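For anyone who wants to check the back-of-the-envelope number, here is a small sketch of the arithmetic. It assumes "8% of installed RAM" means 8% of the fleet's total capacity is affected at the quoted rate; the function name and parameter names are made up for illustration.

```python
def expected_bit_errors_per_hour(
    machines: int,
    gb_per_machine: float,
    affected_fraction: float = 0.08,  # assumed: share of installed RAM affected
    errors_per_chunk: float = 5.0,    # reported: 5 bit errors per hour...
    chunk_gb: float = 8.0,            # ...per 8 GB of affected RAM
) -> float:
    """Fleet-wide expected bit errors per hour at the reported rate."""
    total_gb = machines * gb_per_machine
    affected_gb = total_gb * affected_fraction
    return affected_gb / chunk_gb * errors_per_chunk


rate = expected_bit_errors_per_hour(machines=1000, gb_per_machine=64)
print(round(rate))  # 3200, i.e. "about 3000" per hour
```

Under that reading, 64,000 GB total gives 5,120 GB affected, which is 640 chunks of 8 GB at 5 errors each.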
The real tricky one without ECC is that one-bit-error-a-day case... that's likely to corrupt data silently, without any other symptoms. If you have a lot of bit errors, chances are the computer will operate poorly; you'll probably end up with some corrupt data, but you'll also have a lot of crashing, and you'll hopefully run a memtest and figure it out.