Comment by barrkel
10 years ago
I run a 10x3TB ZFS raidz2 array at home. I've seen 18 checksum errors at the device level in the last year - these are corruptions coming from the device that ZFS detected with a checksum and was able to correct using redundancy. If you're not checksumming at some level in your system, you should be outsourcing your storage to someone else; consumer-level hardware with commodity file systems isn't good enough.
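You don't even need ZFS to get the detection half; even a crude per-file hash manifest, rechecked periodically, will tell you when bits have rotted. A rough sketch of what I mean (file layout, hash choice and CLI are just for illustration):

```python
import hashlib, json, os, sys

def sha256_file(path, bufsize=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()

def build_manifest(root):
    """Record one SHA-256 digest per file under root."""
    manifest = {}
    for d, _, files in os.walk(root):
        for name in files:
            full = os.path.join(d, name)
            manifest[os.path.relpath(full, root)] = sha256_file(full)
    return manifest

def verify(root, manifest):
    """Re-hash everything and report mismatches (possible silent corruption)."""
    bad = [rel for rel, digest in manifest.items()
           if sha256_file(os.path.join(root, rel)) != digest]
    for rel in bad:
        print(f"checksum mismatch: {rel}", file=sys.stderr)
    return bad

if __name__ == "__main__":
    # usage: python checkfiles.py build|verify <root> <manifest.json>
    mode, root, manifest_path = sys.argv[1:4]
    if mode == "build":
        with open(manifest_path, "w") as f:
            json.dump(build_manifest(root), f, indent=1)
    else:
        with open(manifest_path) as f:
            verify(root, json.load(f))
```

Of course this only works for data that doesn't change between checks, and unlike ZFS it can only detect, not repair.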
You know, I'm not really sure I buy this. I worked for a storage company in the past, and I put a simple checksumming algorithm into our code, sort of like the ZFS one. Turns out that, two or three obscure software bugs later, that thing stopped firing randomly and started picking out kernel bugs. Once we nailed a few of those, the errors became "harder". By that I mean we stopped getting data that the drives claimed was good but we thought was bad.
Modern drives are ECC'ed to hell and back. On enterprise systems (i.e. ones with ECC RAM and ECC'ed buses), a sector that comes back bad is likely the result of a software/firmware bug somewhere, and in many cases was written (or simply not written) bad.
Put another way, if you read a sector off a disk and conclude that it was bad, and a reread returns the same data, it was probably written bad. The fun part is then taking the resulting "bad" data and looking at it.
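The triage I'm describing is roughly this (a sketch; `digest_fn` and `expected_digest` stand in for whatever checksum scheme you already have, and the reread only tells you anything if it actually hits the disk rather than a cache):

```python
import os

def classify_bad_block(fd, offset, length, expected_digest, digest_fn):
    """If a block fails its checksum, read it again and compare the raw bytes.
    Same wrong bytes twice -> most likely written bad (or never written);
    different bytes        -> more likely a transient read/transport error."""
    first = os.pread(fd, length, offset)
    if digest_fn(first) == expected_digest:
        return "ok", first
    second = os.pread(fd, length, offset)   # must bypass caches to be meaningful
    if second == first:
        return "written-bad", first
    return "transient-read-error", second
```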
Reminds me of something from early in my career: a Linux machine we were running a CVS server on reported corrupted CVS files once or twice a year, and when I looked at them I often found data from other files stuck in the middle, often in 4 KB-sized regions.
> you should be outsourcing your storage to someone else
Well, I'd need to be sure that "someone else" does things properly. My experience with various "someone elses" so far hasn't been stellar — most services I've tried were just a fancy sticker placed onto the same thing that I'm doing myself.
How does checksumming help if the data is in cache and waiting to be written? For example: I write 1 MB of data, but it stays in the buffer cache after being written, so when you compute the checksum you are computing it against the buffer cache.
On Linux you have to drop_caches and then re-read to get the checksum to be sure. As far as I know, a per-buffer or per-file drop_caches isn't available. If you do a system-wide drop_caches, you invalidate the good and the bad ones alike.
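Concretely, the only way I know of looks like this - a rough sketch, Linux-only, needs root, and it throws away cached pages for every file on the box, not just the one you care about:

```python
import hashlib, os

def drop_page_cache():
    """Flush dirty pages, then ask the kernel to drop the (clean) page cache.
    System-wide and root-only; there is no per-file variant of this knob."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("1\n")

def checksum_from_disk(path, bufsize=1 << 20):
    drop_page_cache()
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()
```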
And what if the device is maintaining a cache of its own, in addition to the buffer cache?
Can someone clarify?
How do you know you put good data into the cache in the first place?
There's always going to be a place where errors can creep in. There are no absolute guarantees; it's a numbers game. We've got more and more data, so the chance of corruption increases even if the per-bit probability stays the same. Checksumming reduces the per-bit probability across a whole layer of the stack - the layer where the data lives longest. That's the win.
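Back-of-the-envelope, with made-up numbers just to show the shape of it:

```python
import math

p_bit = 1e-15                    # hypothetical per-bit corruption probability
n_bits = 30e12 * 8               # roughly a 30 TB array
# P(at least one corrupted bit) = 1 - (1 - p_bit)**n_bits, assuming independence
p_any = -math.expm1(n_bits * math.log1p(-p_bit))
print(f"P(>=1 corrupted bit) ~ {p_any:.0%}")   # ~21% for these numbers
```

Same per-bit probability, but the more bits you keep (and the longer you keep them), the closer that number creeps towards 1.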
Agree wholeheartedly.
I was asking this with open(<file>, O_DIRECT|O_RDONLY) in mind; that bypasses the buffer cache and reads directly from the disk, which at least solves the buffer-cache part, I guess. The disk's own cache is another matter, i.e. if we disable it we are good, at the cost of performance.
I was pointing out that tests can do these kinds of things.
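Roughly what I had in mind - a sketch only; Linux-specific, and O_DIRECT's alignment rules vary by filesystem, which is why the buffer comes from an anonymous (page-aligned) mmap:

```python
import hashlib, mmap, os

def checksum_o_direct(path, block=1 << 20):
    """Hash a file while bypassing the page cache with O_DIRECT (Linux)."""
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, block)           # anonymous mmap => page-aligned buffer
    h = hashlib.sha256()
    try:
        while True:
            n = os.readv(fd, [buf])      # short read only at end of file
            if n == 0:
                break
            h.update(buf[:n])
            if n < block:                # hit EOF; stop before an unaligned read
                break
    finally:
        buf.close()
        os.close(fd)
    return h.hexdigest()
```

Note this still says nothing about the drive's own write cache; at best you're reading back what the drive thinks it stored.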
I advocate taking it further with clustered filesystems on inexpensive nodes. Good design of those can solve problems on this side plus at the system level. We probably also need inexpensive, commodity knockoffs of things like Stratus and NonStop. Where reliability tops speed, you could use high-end embedded stuff like Freescale's to keep the nodes cheap. Some of them support lock-step, too.
GlusterFS now supports its own (out-of-band) checksumming. So you could have a Btrfs brick and an XFS brick to hedge your fs bets, and also set up GlusterFS volumes to favor XFS for things like VM images and databases, and Btrfs for everything else.
Neat idea. Appreciate the tip on GlusterFS.
As a counterpoint, I have a 6x3TB ZFS raidz2 on FreeBSD at home. I scrub every month and have only had one checksum error, which I put down to a cable going bad since it hasn't repeated.
I still agree that we need checksumming filesystems, though. That, and ECC RAM, to make the data written more trustworthy.
> had one checksum error, which I put down to a cable going bad since it hasn't repeated
I wouldn't assume that it was a cable error. The SATA interface has a CRC check. So the odds are quite high that a single error would simply result in a retransmission.
Of course, a plethora of detected SATA CRC errors and resulting retransmissions means that an undetected error could readily slip through. There should be error logs reporting on the occurrence of retransmissions, but I'm not enough of a software person to know how possible / easy it is to get that information from a drive or operating system.
OTOH, as you mention later in your post, a single bit error in non-ECC RAM could easily result in a single block having a checksum error. Exactly what you saw!
Hard drives also have spare sectors, so if a defect is detected at one spot in the disk, it will probably never touch that spot again.
Simply observing that an error only occurred once does almost nothing to narrow down the possible causes. You have to also be keeping track of all the error reporting facilities (SMART, PCIe AER, ECC RAM, etc.).
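For the SATA CRC part specifically, most drives expose the retransmission count as SMART attribute 199 (UDMA_CRC_Error_Count), so it is scriptable - assuming smartmontools is installed and the drive uses the common attribute name:

```python
import subprocess

def sata_crc_error_count(device="/dev/sda"):
    """Pull SMART attribute 199; a growing raw value usually points at
    cabling/connector trouble rather than the platters."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        if "UDMA_CRC_Error_Count" in line:
            return int(line.split()[-1])   # raw value is the last column
    return None                            # attribute not reported by this drive
```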
I simply don't have the upstream bandwidth necessary to back up 1TB (my estimate of essential data) offsite - it'd take months and take my ADSL line out of use for anything else.
My power costs are also high, so running something like a Synology DS415 would cost $50 in electricity a year even while barely using it - although that's better than older models.
Did you get any details on these 18 errors? Were they single bit flips?
No, unfortunately. I can't rule out the possibility of physical bus errors (like a cable going bad or a poor physical connection - in my case there is one fairly expensive SAS cable per 4 drives, as I'm using a bunch of SAS/SATA backplanes with hotswap caddies); I do think that's probably more likely (or a non-ECC RAM bit flip) than on-disk corruption.
But the exact nature of the problem is a distinction without a huge amount of difference to me. If I was copying those files, the copies would be silently corrupt. If I was transcoding or playing videos, the output would have glitches. Etc.
With this many HDDs, there are necessarily more components in the setup, and more things that can go wrong. Meanwhile, I'm not a business customer with profitable clients I can sell extra reliability to, so it's not the most expensive kit I could buy. I went as far as getting WD Red drives, and even then they were misconfigured by default, with an overly aggressive idle timer (8 seconds!) that needed tweaking.
The main thing is: more and bigger drives means increased probability of corruption.