
Comment by StillBored

10 years ago

You know, I'm not really sure I buy this. I worked for a storage company in the past, and I put a simple checksumming algorithm in our code, sort of like the ZFS one. Turns out that, two or three obscure software bugs later, that thing stopped firing randomly and started picking out kernel bugs. Once we nailed a few of those, the errors became "harder". By that I mean we stopped getting data that the drives claimed was good but we thought was bad.
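
For context, the checksum doesn't have to be fancy to catch this class of bug; a fletcher-4 style sum over each block (what ZFS defaults to) is enough. Here's a minimal sketch in C, not the actual code from ZFS or from that storage product:

    /* Minimal fletcher-4 style block checksum, roughly what ZFS
     * uses as its default non-cryptographic checksum.  The block is
     * treated as an array of 32-bit words and four running sums are
     * kept; any flipped or misplaced word perturbs the sums. */
    #include <stdint.h>
    #include <stddef.h>

    struct cksum { uint64_t a, b, c, d; };

    static struct cksum fletcher4(const void *buf, size_t size)
    {
        const uint32_t *p = buf;
        const uint32_t *end = p + size / sizeof(uint32_t);
        struct cksum ck = { 0, 0, 0, 0 };

        for (; p < end; p++) {
            ck.a += *p;     /* plain sum of words             */
            ck.b += ck.a;   /* sums of sums: position matters */
            ck.c += ck.b;
            ck.d += ck.c;
        }
        return ck;
    }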

Modern drives are ECC'ed to hell and back; on an enterprise system (i.e. one with ECC RAM and ECC'ed buses), a sector that comes back bad is likely the result of a software/firmware bug somewhere, and in many cases was written bad (or simply never written) in the first place.

Put another way, if you read a sector off a disk and conclude that it was bad, and a reread returns the same data, it was probably written bad. The fun part is then taking the resulting "bad" data and looking at it.
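
A hedged sketch of that reread test, building on the fletcher4 sketch above; read_sector() and expected_cksum() are hypothetical stand-ins for whatever the real I/O path and checksum metadata look like:

    /* If the stored checksum doesn't match, read the sector again.
     * A transient error (bus glitch, bad transfer) tends to return
     * different bytes the second time; identical bytes that still
     * fail the checksum suggest the data was written bad.
     * Assumes 4K sectors. */
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>
    #include <stdbool.h>

    extern int read_sector(uint64_t lba, void *buf, size_t len);
    extern struct cksum expected_cksum(uint64_t lba);

    static bool cksum_eq(struct cksum x, struct cksum y)
    {
        return x.a == y.a && x.b == y.b && x.c == y.c && x.d == y.d;
    }

    enum verdict { GOOD, TRANSIENT_READ_ERROR, LIKELY_WRITTEN_BAD };

    static enum verdict verify_sector(uint64_t lba, size_t len)
    {
        uint8_t first[4096], second[4096];

        read_sector(lba, first, len);
        if (cksum_eq(fletcher4(first, len), expected_cksum(lba)))
            return GOOD;

        /* Mismatch: reread and compare the raw bytes. */
        read_sector(lba, second, len);
        if (memcmp(first, second, len) != 0)
            return TRANSIENT_READ_ERROR;

        return LIKELY_WRITTEN_BAD;  /* same bad bytes both times */
    }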

Reminds me of early in my career: a Linux machine we were running a CVS server on reported corrupted CVS files once or twice a year, and when I looked at them I often found data from other files stuck in the middle, often in 4k-sized regions.