Comment by clintonwoo
3 years ago
This makes total sense but I've never heard of it. Is there any literature or writing about this phenomenon?
I guess proper redundancy is having different brands of equipment also in some cases.
I hadn't heard of it either until disks in our storage cluster at work started failing faster than the cluster could rebuild, in an incident our ops team named SATApocalypse. It was a perfect storm of cascading failures.
https://web.archive.org/web/20220330032426/https://ops.faith...
Great read, thank you!
I also don't know of literature on this phenomenon, but I recall HP had two different SSD recalls because the drives would fail when the uptime counter rolled over. That's not even load-dependent; it just comes down to whether you got a batch and powered them all on at the same time. Unfortunately, high uptime causing issues isn't that unusual for storage.
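A minimal sketch of how that kind of rollover plays out, assuming the counter is a signed 16-bit power-on-hours field (consistent with the widely reported failures at exactly 32,768 hours; nothing below comes from actual vendor firmware):

    # Sketch: a power-on-hours counter held in a signed 16-bit register.
    # The 32,768-hour boundary matches the widely reported SSD failures;
    # the function below is made up purely for illustration.
    def firmware_sees(power_on_hours: int) -> int:
        # emulate signed 16-bit wraparound into [-32768, 32767]
        return (power_on_hours + 2**15) % 2**16 - 2**15

    for h in (32766, 32767, 32768):
        print(h, "->", firmware_sees(h))
    # 32766 -> 32766
    # 32767 -> 32767
    # 32768 -> -32768  (counter goes negative; buggy firmware falls over here)

Every drive powered on in the same maintenance window crosses that boundary within hours of each other, which is exactly the correlated-failure problem.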
It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power-on-time diversity (a quick inventory check is sketched below). That adds a lot of variables if you need to track down issues, though.
[1] You don't want to run versions with known issues that affect you, but it's helpful to have different versions around to diagnose unknown issues.
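As a rough illustration of the kind of check that advice implies (a sketch only; the inventory format and field names are assumptions, e.g. something you might scrape out of smartctl -i, not any particular tool):

    # Sketch: flag redundancy groups whose members look too similar.
    # The inventory format and field names are made up for illustration.
    from collections import Counter

    drives = [
        {"group": "raid1-a", "vendor": "X", "model": "M1", "firmware": "1.2"},
        {"group": "raid1-a", "vendor": "X", "model": "M1", "firmware": "1.2"},
        {"group": "raid1-b", "vendor": "X", "model": "M1", "firmware": "1.2"},
        {"group": "raid1-b", "vendor": "Y", "model": "N7", "firmware": "3.0"},
    ]

    def diversity_report(drives, fields=("vendor", "model", "firmware")):
        groups = {}
        for d in drives:
            groups.setdefault(d["group"], []).append(d)
        for name, members in groups.items():
            counts = Counter(tuple(m[f] for f in fields) for m in members)
            dupes = {fp: n for fp, n in counts.items() if n > 1}
            if dupes:
                print(f"{name}: identical members {dupes} -> correlated-failure risk")

    diversity_report(drives)
    # raid1-a: identical members {('X', 'M1', '1.2'): 2} -> correlated-failure risk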
The Crucial M4 had this too, but it was fixable with a firmware update.
https://www.neoseeker.com/news/18098-64gb-crucial-m4s-crashi...
That one doesn't look too bad; it seems you can still apply the firmware update after the drive fails. A lot of disk failures caused by firmware bugs end up with the disk not responding on the bus at all, so it becomes somewhere between impractical and impossible to update the firmware.
I don't know about literature, but in the world of RAID this is a common warning.
A RAID5 array crashing and burning because a second disk failed during the reconstruction phase, after the first disk had already failed, is a common story.
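A related back-of-the-envelope calculation is the textbook unrecoverable-read-error argument (not quite the outright second-drive failure above): the drive count and size here are illustrative, and the 1-per-10^14-bits URE rate is just the commonly quoted consumer-drive spec.

    # Sketch: odds of a RAID5 rebuild finishing without hitting a single
    # unrecoverable read error (URE), using the commonly quoted consumer
    # spec of 1 URE per 1e14 bits. Drive count and size are illustrative.
    ure_per_bit = 1e-14
    drive_tb = 12
    surviving_drives = 3   # every surviving member must be read in full

    bits_read = surviving_drives * drive_tb * 1e12 * 8
    p_clean = (1 - ure_per_bit) ** bits_read
    print(f"bits to read during rebuild: {bits_read:.2e}")
    print(f"chance of a URE-free rebuild: {p_clean:.1%}")
    # ~2.9e14 bits read -> only a roughly 5-6% chance of a URE-free rebuild

With same-batch drives the picture is worse still, because an outright second-drive failure during the rebuild window is no longer an independent event.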
Not sure about literature, but that was a known thing in the ops circles I was in 10 years ago: never use the same brand for disk pairs, to keep wear-and-tear-related defects from showing up at the same time.
We used to use the same brand but different models, or at least made sure they were from different manufacturing batches.
Wikipedia has a section on this. It's called "correlated failure." https://en.wikipedia.org/wiki/RAID#Correlated_failures
Not sure about literature, but there are past anecdotes and HN threads, yes.
https://news.ycombinator.com/item?id=4989579