Comment by huhtenberg

10 years ago

Did you get any details on these 18 errors? Were they single bit flips?

No, unfortunately. I can't rule out physical bus errors, like a cable going bad or a poor physical connection (in my case, there is one fairly expensive SAS cable per 4 drives, as I'm using a bunch of SAS/SATA backplanes with hotswap caddies). I think that, or a non-ECC RAM bitflip, is probably more likely than on-disk corruption.

But the exact nature of the problem is a distinction without a huge amount of difference to me. If I was copying those files, the copies would be silently corrupt. If I was transcoding or playing videos, the output would have glitches. Etc.
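To make the "silently corrupt copy" case concrete, here is a minimal sketch of checking a copy against its source by hashing both. The helper names `sha256_of` and `copy_matches` are made up for illustration. Note the limits: this only catches errors introduced in transit (bus, RAM, a bad read that doesn't repeat). If the data is already wrong on the source disk, both hashes agree and nothing is flagged, and a re-read served from the OS page cache may not touch the disk at all.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large video files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        # iter() with a sentinel keeps calling read() until it returns b"" (EOF).
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches(source: Path, copy: Path) -> bool:
    """True if both files currently read back with identical contents."""
    return sha256_of(source) == sha256_of(copy)
```

A single flipped bit anywhere in the file changes the digest, so a mismatch means at least one of the two reads returned different bytes.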

With this many HDDs, there are necessarily more components in the setup, and more things that can go wrong. Meanwhile, I'm not a business customer with profitable clients I can sell extra reliability to, so it's not the most expensive kit I could buy. I went as far as getting WD Red drives, and even then they were misconfigured by default, with an overly aggressive idle timer (8 seconds!) that needed tweaking.

The main thing is: more and bigger drives mean a higher probability of corruption.
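A rough sketch of why that holds, assuming each bit read has some small, independent chance p of coming back wrong: the chance of at least one bad bit in n bits is 1 - (1 - p)^n, which only goes up as n grows. The function names and the error rate passed in are placeholders for illustration, not measured figures for any drive, and real errors are not truly independent.

```python
import math


def bits_in_array(drive_count: int, drive_tb: float) -> float:
    """Total bits across the array, using decimal terabytes (1 TB = 1e12 bytes)."""
    return drive_count * drive_tb * 1e12 * 8


def p_any_corruption(bits_read: float, per_bit_error_rate: float) -> float:
    """P(at least one bad bit) = 1 - (1 - p)^n.

    Computed via log1p/expm1 because (1 - p)^n loses precision
    when p is tiny and n is huge.
    """
    return -math.expm1(bits_read * math.log1p(-per_bit_error_rate))


# The rate below is a hypothetical value chosen only to show the trend.
assumed_rate = 1e-15
for drives, size_tb in [(4, 4), (8, 4), (8, 8), (12, 8)]:
    p = p_any_corruption(bits_in_array(drives, size_tb), assumed_rate)
    print(f"{drives} x {size_tb} TB: {p:.3f}")
```

Whatever rate you plug in, the probability rises with both the drive count and the drive size, since both just increase n.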