Comment by steven2012
10 years ago
I worked at a storage company and the scariest thing I learned is that your data can be corrupted even though the drive itself says it was written correctly. The only way to really be sure is to read your files back after writing them and check that they match. Now whenever I do a backup, I always go through it one more time and do a byte-by-byte comparison before trusting that it's okay.
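A minimal sketch of that kind of verification pass, assuming the backup has already been written out (the file names below are just placeholders):

    #include <stdio.h>

    /* Compare two files byte by byte; returns 1 if they differ or can't be
     * opened, 0 if they are identical. */
    static int files_differ(const char *a_path, const char *b_path)
    {
        FILE *a = fopen(a_path, "rb");
        FILE *b = fopen(b_path, "rb");
        if (!a || !b) {
            if (a) fclose(a);
            if (b) fclose(b);
            return 1;
        }

        int ca, cb;
        do {
            ca = fgetc(a);
            cb = fgetc(b);
            if (ca != cb) { fclose(a); fclose(b); return 1; }
        } while (ca != EOF);

        fclose(a);
        fclose(b);
        return 0;
    }

    int main(void)
    {
        /* "original.dat" and "backup.dat" are hypothetical example names. */
        return files_differ("original.dat", "backup.dat");
    }

The caveat, which comes up again further down the thread, is that a read issued right after a write may be served from the page cache rather than the medium, so "read back and compare" only proves as much as the layer it actually read from.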
This is true. Which is why we really, really need checksummed filesystems. I am very worried that this hasn't made its way into mainstream computing yet, especially given the growing drive sizes and massive CPU speed increases.
I run a 10x3TB ZFS raidz2 array at home. I've seen 18 checksum errors at the device level in the last year: corruption from the device that ZFS detected with a checksum and was able to correct using redundancy. If you're not checksumming at some level in your system, you should be outsourcing your storage to someone else; consumer-level hardware with commodity filesystems isn't good enough.
You know, I'm not really sure I buy this. I worked for a storage company in the past, and I put a simple checksumming algorithm in our code, sort of like the ZFS one. It turns out that, two or three obscure software bugs later, that thing stopped firing randomly and started picking out kernel bugs. Once we nailed a few of those, the errors became "harder". By that I mean we stopped getting data that the drives claimed was good but we thought was bad.
Modern drives are ECC'ed to hell and back. On enterprise systems (i.e., ones with ECC RAM and ECC'ed buses), a sector that comes back bad is likely the result of a software/firmware bug somewhere, and in many cases was written bad (or simply not written at all).
Put another way, if you read a sector off a disk and conclude that it was bad, and a reread returns the same data, it was probably written bad. The fun part is then taking the resulting "bad" data and looking at it.
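A rough sketch of that re-read test, assuming a Linux host; O_DIRECT is used so the second read has a chance of hitting the device rather than the page cache. The device path, block size, and offset are all placeholders:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define BLOCK 4096            /* assumed logical block size */

    int main(void)
    {
        /* "/dev/sdX" and the offset are hypothetical placeholders. */
        int fd = open("/dev/sdX", O_RDONLY | O_DIRECT);
        if (fd < 0) { perror("open"); return 1; }

        void *first, *second;
        if (posix_memalign(&first, BLOCK, BLOCK) ||
            posix_memalign(&second, BLOCK, BLOCK)) return 1;

        off_t offset = 0;         /* byte offset of the sector that failed its checksum */
        if (pread(fd, first, BLOCK, offset) != BLOCK) { perror("pread"); return 1; }
        if (pread(fd, second, BLOCK, offset) != BLOCK) { perror("pread"); return 1; }

        if (memcmp(first, second, BLOCK) == 0)
            puts("re-read returned the same data: probably written bad, not a read error");
        else
            puts("re-reads disagree: looks more like a transient read problem");

        close(fd);
        return 0;
    }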
Reminds me of early in my career: a Linux machine we were running a CVS server on reported corrupted CVS files once or twice a year, and when I looked at them I often found data from other files stuck in the middle, often in 4KB-sized regions.
> you should be outsourcing your storage to someone else
Well, I'd need to be sure that "someone else" does things properly. My experience with various "someone elses" so far hasn't been stellar — most services I've tried were just a fancy sticker placed onto the same thing that I'm doing myself.
How does checksumming help if the data is sitting in the cache waiting to be written? For example: I write 1 MB of data, but it stays in the buffer cache after the write, so when you compute the checksum you are computing it over the buffer cache.
On Linux you have to drop_caches and then re-read the data to get a checksum you can trust. As far as I know there is no per-buffer or per-file drop_caches, and if you do a system-wide drop_caches you are invalidating the good entries along with the bad ones.
And what if the device is maintaining its own cache on top of the buffer cache?
Can someone clarify ?
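(For what it's worth, a sketch of one way to narrow this down on Linux: posix_fadvise with POSIX_FADV_DONTNEED acts as a per-file hint to drop cached pages, so you can fsync, drop, and then re-read so the checksum is computed over data that came back through the kernel rather than straight out of the page cache. It is only advisory, and the drive's own cache is a separate problem. The file name is a placeholder.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        /* "backup.dat" is a placeholder for the file being verified. */
        int fd = open("backup.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        /* The writer should have fsync()'d its descriptor already so dirty
         * pages have reached the device (or at least its cache). */
        fsync(fd);

        /* Advisory: ask the kernel to drop this file's cached pages. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

        /* Re-read and checksum; these reads now have to go past the page
         * cache, though the drive's own cache may still serve them. */
        char buf[65536];
        ssize_t n;
        while ((n = read(fd, buf, sizeof buf)) > 0) {
            /* feed buf into whatever checksum is being verified */
        }

        close(fd);
        return 0;
    }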
I advocate taking it further with clustered filesystems on inexpensive nodes. A good design there can address problems on this side plus at the system level. We probably also need inexpensive, commodity knockoffs of things like Stratus and NonStop. Where reliability tops speed, you could use high-end embedded parts like Freescale's to keep the nodes cheap. Some of them support lock-step, too.
As a counterpoint, I have a 6x3TB ZFS raidz2 on FreeBSD at home. I resilver every month and have only had one checksum error, which turned out to be a cable going bad given that it hasn't repeated.
Still agree that we need checksumming filesystems, though. That and ECC RAM, to make the written data more trustworthy.
I simply don't have the upstream bandwidth necessary to back up 1TB (my estimate of essential data) offsite - it'd take months and take my ADSL line out of use for anything else.
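(Rough arithmetic, assuming a typical ~1 Mbit/s ADSL upstream: 1 TB is about 8×10^12 bits, and at 10^6 bits per second that's roughly 93 days of saturating the uplink.)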
I also have high power costs, so running something like a Synology DS415 would cost $50 a year in electricity while barely being used, although that's better than older models.
Did you get any details on these 18 errors? Were they single bit flips?
Fortunately, ZFS on Linux is excellent, and getting it is a two-liner on a modern Ubuntu LTS (add the PPA, install zfs).
Is it? I've heard several complaints about bugs in FUSE.
This assumes the underlying block device would forcibly flush those queued writes to disk and then re-read them, rather than just serving them up from the pending write queue without flushing them first.
You generally can't make that assumption about a black box, so reading back your writes guarantees nothing.
Unless you're intimately familiar with your underlying block device you really can't guarantee anything about writes going to physical hardware. All you can do is read its documentation and hope for the best.
If you need a general hack that's pretty much guaranteed to flush your writes to a physical disk, it would be something like:
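(One plausible sketch of such a hack, assuming Linux: fsync the data you care about, then write and fsync a filler file larger than the drive's cache so the earlier writes are, with luck, pushed out to the medium. The file name and sizes are placeholders.)

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define FILLER_BYTES (512UL * 1024 * 1024)   /* assumed to exceed the drive cache */
    #define CHUNK        (1024 * 1024)

    int main(void)
    {
        /* Step 1 (not shown): fsync() the descriptor holding the data you
         * actually care about. */

        /* Step 2: crowd the device cache with filler and flush it too. */
        int fd = open("filler.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) { perror("open"); return 1; }

        char *chunk = malloc(CHUNK);
        if (!chunk) return 1;
        memset(chunk, 0xA5, CHUNK);

        for (unsigned long written = 0; written < FILLER_BYTES; written += CHUNK)
            if (write(fd, chunk, CHUNK) != CHUNK) { perror("write"); return 1; }

        fsync(fd);
        close(fd);
        unlink("filler.tmp");
        free(chunk);
        return 0;
    }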
Even then you have no guarantees that those writes wouldn't be flushed to the medium while leaving the writes you care about in the block device's internal memory.
This is why end-to-end data integrity with something like T10-PI is a necessity. The kernel block layer already generates and validates the integrity metadata for us, if the underlying drive supports it, but all the major filesystems really need to start supporting it as well.
I don't think that's a necessity for all workloads. Think about it: that would require all of us to buy enterprise drives with 520- or 528-byte sectors to store the extra checksum information, plus a whole new API up to the application level to confirm, point to point, that the data in the app is the data on the drive on writes, and that the data on the drive is the data in the app on reads. T10-PI doesn't come for free just by doing any one thing; it implies changes throughout the chain.
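(For a concrete sense of the per-sector metadata being discussed: in T10-PI a 520-byte sector is 512 data bytes plus an 8-byte protection-information field, and the field's 2-byte guard tag is a CRC-16 over the data using the T10-DIF polynomial 0x8BB7. A minimal, unoptimized sketch of that guard-tag calculation:)

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* CRC-16/T10-DIF: polynomial 0x8BB7, initial value 0, no reflection. */
    static uint16_t crc16_t10dif(const uint8_t *data, size_t len)
    {
        uint16_t crc = 0;
        for (size_t i = 0; i < len; i++) {
            crc ^= (uint16_t)data[i] << 8;
            for (int bit = 0; bit < 8; bit++)
                crc = (crc & 0x8000) ? (uint16_t)((crc << 1) ^ 0x8BB7)
                                     : (uint16_t)(crc << 1);
        }
        return crc;
    }

    int main(void)
    {
        uint8_t sector[512];
        memset(sector, 0x5A, sizeof sector);  /* stand-in for real sector data */

        /* The guard tag occupies 2 of the 8 PI bytes; the application and
         * reference tags make up the rest. */
        printf("guard tag: 0x%04x\n", crc16_t10dif(sector, sizeof sector));
        return 0;
    }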