Comment by davedunkin

3 years ago

> Double disk failure is improbable but not impossible.

It's not even improbable if the disks are the same kind purchased at the same time.

I once had a small fleet of SSDs fail because they had some uptime counters that overflowed after 4.5 years, and that somehow persistently wrecked some internal data structures. It turned them into little, unrecoverable bricks.

It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.

  • You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5 years. I am not even kidding.

    • Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

      And that they were sold by HP or Dell, and manufactured by SanDisk.

      Do I win a prize?

      (None of us win prizes on this one).
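
      For anyone who wants to check that figure, the conversion is just hours → days → 365-day years; a quick sketch in Python (the 206-day remainder assumes no leap days):

        # 40,000 power-on hours expressed as years / days / hours,
        # assuming 365-day years (no leap days).
        HOURS = 40_000
        days, hours = divmod(HOURS, 24)   # 1666 days, 16 hours
        years, days = divmod(days, 365)   # 4 years, 206 days
        print(f"{years} years, {days} days, {hours} hours")
        # -> 4 years, 206 days, 16 hours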

      29 replies →

    • It's concerning that a hosting company was unaware of the 40,000-hour situation with the SSDs it was deploying. Anyone in hosting would have been made aware of this, or at least should have kept a better grip on what was happening in the market.

      2 replies →

  • I had a similar issue, but it was a single RAID-5 array and wear or some other manufacturing defect. The drives were the same brand, model, and batch. When the first one failed and the array went into recovery mode, I ordered 3 replacements and upped the backup frequency. It was good that I did, because the two remaining drives died shortly after.

    The lesson learned: the three replacements went to different arrays, and we never again let drives from the same batch be part of the same array.

There's a principle in aviation of staggering engine maintenance on multiple-engined airplanes to avoid maintenance-induced errors leading to complete power loss.

e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf

Yep: if you buy a pair of disks together, there's a fair chance they'll both be from the same manufacturing batch, and disk defects correlate within a batch.

  • Yeah just coming here to say this. Multiple disk failures are pretty probable. I've had batches of both disks and SSDs with sequential serial numbers, subjected to the same workloads, all fail within the same ~24 hour periods.

    • Had the same experience with (identical) SSDs, two failures within 10 minutes in a RAID 5 configuration.

      (Thankfully, they didn't completely die but just put themselves into read-only)

    • Seems like only a few days ago there was a comment here from a former Dropbox engineer pointing out that a lot of the disk drives they bought when they stood up their own datacenter turned out to share a common flaw involving tiny metal slivers.

  • This makes total sense but I've never heard of it. Is there any literature or writing about this phenomenon?

    I guess in some cases proper redundancy also means having different brands of equipment.

    • I also don't know of literature on this phenomenon, but I recall HP had two different SSD recalls because the drives would fail when the uptime counter rolled over. That's not even load-dependent; it just comes down to whether you got a batch and powered them all on at the same time. Uptime getting too high and causing issues isn't that unusual for storage, unfortunately.

      It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power on time diversity. That adds a lot of variables if you need to track down issues though.

      [1] you don't want to have versions with known issues that affect you, but it's helpful to have different versions to diagnose unknown issues.
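
      If you want to actually track that diversity rather than eyeball it, here's a minimal sketch of the idea (assuming smartmontools is installed and the drives report the usual "Device Model" / "Serial Number" / "Firmware Version" lines from `smartctl -i`; the device list and field labels are illustrative and vary by drive type):

        # Sketch: collect identity info for each array member via `smartctl -i`
        # and warn when every drive shares the same model or firmware version.
        # (Serial numbers are printed so sequential runs -- same batch -- stand out.)
        import re
        import subprocess

        DRIVES = ["/dev/sda", "/dev/sdb", "/dev/sdc"]   # hypothetical array members
        FIELDS = ("Device Model", "Serial Number", "Firmware Version")

        def drive_info(dev):
            out = subprocess.run(["smartctl", "-i", dev],
                                 capture_output=True, text=True).stdout
            info = {}
            for field in FIELDS:
                m = re.search(rf"^{field}:\s*(.+)$", out, re.MULTILINE)
                info[field] = m.group(1).strip() if m else "unknown"
            return info

        infos = {dev: drive_info(dev) for dev in DRIVES}
        for dev, info in infos.items():
            print(dev, info)
        for field in ("Device Model", "Firmware Version"):
            if len({info[field] for info in infos.values()}) == 1:
                print(f"WARNING: all drives share the same {field}")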

      2 replies →

    • I don't know about literature, but in the world of RAID this is a common warning.

      Having a RAID5 array crash and burn because a second disk failed during the rebuild that followed the first disk's failure is a common story.

    • Not sure about literature, but that was a known thing in the Ops circles I was in 10 years ago: never use the same brand for disk pairs, to keep wear-and-tear related defects from surfacing at the same time.

      1 reply →

  • This is why I try to mismatch manufacturers in RAID arrays. I'm told there is a small performance hit (things run towards the speed of the slowest, separately in terms of latency and throughput), but I doubt the difference is high and I like the reduction in potential failure-during-rebuild rates. Of course I have off-machine and off-site backups as well as RAID, but having to use them to restore a large array would be a greater inconvenience than just letting the array rebuild itself (followed by checksum verifies over the whole lot for paranoia's sake).

  • Eek - now I'm glad I wait a few months before buying each disk for my NAS.

    I'm not doing it for this reason but rather for financial ones :) But since I have a totally mixed bunch of sizes I have no RAID, and a disk loss would be horrible.

    • Have to be careful doing that too or you'll end up with subtly different revisions of the same model. This may or may not cause problems depending on the drives/controller/workload but can result in you chasing down weird performance gremlins or thinking you have a drive that's going bad.

  • That's why serious SAN vendors take care to provide you with a mix of disks (e.g. on a brand new NetApp you can see that the disks are of 2-3 different types, and with quite different serial numbers).

Or even if the power supplies were purchased around the same time. I had a batch of servers that as soon as they arrived started chewing through hard drives. It took about 10 failed drives before I realized it was a problem with the power supplies.

I learned this principle by getting a ticket for a burnt out headlight 1 week after I replaced the other one.

  • Anyone familiar with car repair will tell you that if one headlight burns out you should just go ahead and replace both, because of this exact phenomenon. I suppose with LEDs we may not have to worry about it anymore

Even if they're not the same, they're written at the same time and rate, meaning they accumulate the same wear over time and are subject to the same power/heat issues, etc.

  • Hopefully, regularly checking the disks' S.M.A.R.T status will help you stay on top of issues caused by those factors.

    Also, you shouldn't wait for disks to fail to replace them. HN's disks were used for 4.5 years, which is greater than the typical disk lifetime, in my experience. They should have replaced them sooner, one by one, in anticipation of failure. This would also allow them to stagger their disk purchases to avoid similar manufacturing dates.
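
    On the "regularly checking" part: even a dumb cron job beats good intentions. A minimal sketch (the device list and alerting are placeholders; `smartctl -H` only reports the drive's overall self-assessment, which, as the reply below points out, is far from a guarantee):

      # Sketch: periodic SMART health poll, assuming smartmontools is installed.
      # `smartctl -H` prints an overall "self-assessment test result" line.
      import subprocess

      DRIVES = ["/dev/sda", "/dev/sdb"]   # placeholder device list

      for dev in DRIVES:
          out = subprocess.run(["smartctl", "-H", dev],
                               capture_output=True, text=True).stdout
          if "PASSED" not in out:
              # Wire up real alerting (mail, pager, ticket) here.
              print(f"ALERT: {dev} did not report a PASSED health status")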

    • I've seen too many dead disks with perfect SMART stats. When the numbers go down (or up) and triggers fire, then yes, you surely need to replace the disk[0], but SMART without warnings just means nothing.

      [0] My desktop ran for years entirely on disks removed from client PCs after a failure. Some of them had pretty bad SMART stats; on a couple I needed to move the start of the partition a few GB past sector 0 (otherwise they would stall pretty soon), but overall they worked fine - though I never used them as reliable storage and knew I could lose them at any time.

      Of course I don't use repurposed drives in the servers.

      PS: when I tried to post this I received "We're having some trouble serving your request. Sorry!" Sheesh.