Comment by phil21
6 days ago
I have found there is really no practical way to predict the bathtub curve for hard drive failures.
The solution is just a lot of redundancy for larger disk arrays whenever practical. I currently have a 15x1TB 7200 RPM zpool in raidz2 I use for "scratch space" for some automation projects. It writes about 500GB-1TB or so a day and has for... over 18 years. I have had exactly one drive fail from that pool, under heavy abuse. That one failed a year or two in. Prior to my personal use it was beat on (mostly reads) as backing storage for uploaded images for a large website where the drives operated at 90% or higher I/O utilization pretty much 24x7.
I have other pools of disks where I have replaced over 50% of them 6 years in, with batches of failures seemingly at random. You start to notice patterns with various drive models, but not until well after the point of purchase, when it's far too late, and nothing like vendor reputation lets you predict them up front. I've had batches of WD, Seagate, Toshiba, and HGST drives that were incredibly reliable and others that were incredibly not. Even within the same model series, different drive sizes can have wildly different reliability characteristics.
I don't bother pulling "old" drives out of production preemptively any more. The only thing I do preemptively now is pull drives with very critical SMART prefailure warnings, such as a consistently growing number of unrecoverable sector errors. That one and a couple of other attributes are worth watching for trends, but the rest are pretty pointless and really do not seem to correlate with failures. And again, which SMART attributes are worth paying attention to varies by drive model.
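To make "consistently growing" concrete, here is a minimal sketch of the kind of trend check I mean. Everything in it is an assumption for illustration: the function name, the idea of keeping one raw reading per polling interval, and the threshold of three consecutive increases are hypothetical, not anything smartctl or ZFS provides.

```python
def is_consistently_growing(samples, min_steps=3):
    """Return True if each of the last `min_steps` readings increased.

    `samples` is a chronological list of raw counter values for one
    attribute on one drive (for example, one reading per day).
    `min_steps` is an arbitrary illustrative threshold.
    """
    if len(samples) < min_steps + 1:
        # Not enough history to call it a trend.
        return False
    recent = samples[-(min_steps + 1):]
    # Compare each reading with the one that follows it.
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Hypothetical readings of an unrecoverable-sector counter.
stable_drive = [5, 5, 5, 5, 5]
failing_drive = [0, 0, 1, 3, 8]

print(is_consistently_growing(stable_drive))
print(is_consistently_growing(failing_drive))
```

A drive that has sat at a nonzero count for years does not trip this check; only one whose count keeps climbing does, which is the distinction that matters here.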
I simply treat drives as wear items that fail with little to no notice, and just make sure I can survive a number of simultaneous failures. Make sure to regularly test your monitoring!
Not power cycling drives is huge as well, as you note. For example these old 1TB spinners:
9 Power_On_Hours 0x0012 078 078 000 Old_age Always - 158328
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13
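For scale, a quick sketch that turns those two attribute lines into years of runtime and hours per power cycle. It assumes the usual ten-column `smartctl -A` attribute layout with a plain integer in the raw-value column; some drives report raw values in other formats (e.g. with an `h` suffix), which this does not handle. The `parse_attributes` helper is hypothetical.

```python
SAMPLE = """\
  9 Power_On_Hours 0x0012 078 078 000 Old_age Always - 158328
 12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 13
"""


def parse_attributes(text):
    """Map attribute name to integer raw value.

    Assumes ten whitespace-separated columns per attribute row, with
    the raw value in the tenth column as a plain integer.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and anything not starting with an ID.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attrs[fields[1]] = int(fields[9])
    return attrs


attrs = parse_attributes(SAMPLE)
hours = attrs["Power_On_Hours"]
cycles = attrs["Power_Cycle_Count"]

years = hours / (24 * 365.25)   # average year length in hours
hours_per_cycle = hours / cycles

print(f"{years:.2f} years powered on")
print(f"{hours_per_cycle:.0f} hours per power cycle on average")
```

That works out to a bit over 18 years of spinning, with well over a year of uptime between power cycles on average.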