Comment by clintonwoo

3 years ago

This makes total sense but I've never heard of it. Is there any literature or writing about this phenomenon?

I guess proper redundancy also means using different brands of equipment in some cases.

I also don't know of any literature on this phenomenon, but I recall HP had two different SSD recalls because the drives would fail when the uptime counter rolled over. That's not even load dependent; it only depends on whether you got a batch and powered them all on at the same time. Unfortunately, issues caused by high uptime aren't that unusual for storage.
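To illustrate why a rollover hits every drive in a batch at once, here is a minimal sketch. It assumes a hypothetical power-on-hours counter stored as a signed 16-bit integer; the comment above only says the counter rolled over, so the width, the names `to_int16` and `hours_until_wrap`, and the sample hours are all assumptions for illustration.

```python
def to_int16(value):
    """Interpret an integer as a signed 16-bit value (assumed counter width)."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def hours_until_wrap(power_on_hours, limit=2**15):
    """Hours left before the assumed signed 16-bit counter goes negative."""
    return max(limit - power_on_hours, 0)


# Drives powered on together share the same remaining margin,
# so they all reach the rollover in the same hour.
same_batch = [32000, 32000, 32000]           # hypothetical power-on hours
staggered = [32000, 21000, 9000]             # hypothetical power-on hours

print([hours_until_wrap(h) for h in same_batch])  # [768, 768, 768]
print([hours_until_wrap(h) for h in staggered])   # [768, 11768, 23768]
print(to_int16(32767), to_int16(32768))           # 32767 -32768
```

With staggered power-on times the same bug still exists, but the drives do not all cross the boundary together.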

It's not always easy, but if you can, you want manufacturer diversity, batch diversity, maybe firmware version diversity[1], and power-on time diversity. That adds a lot of variables if you need to track down issues, though.

[1] You don't want to have versions with known issues that affect you, but it's helpful to have different versions to diagnose unknown issues.
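A rough sketch of how you might check a redundancy group for the kinds of diversity listed above. The `Drive` record, its field names, and the 24-hour window are hypothetical; in practice the values would come from whatever inventory or SMART tooling you already have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    serial: str
    manufacturer: str
    batch: str
    firmware: str
    power_on_hours: int


def shared_attributes(drives, hours_window=24):
    """Return warnings for attributes that every drive in the group shares."""
    warnings = []
    if len(drives) < 2:
        return warnings
    for field in ("manufacturer", "batch", "firmware"):
        values = {getattr(d, field) for d in drives}
        if len(values) == 1:
            warnings.append(f"all drives share {field}={values.pop()}")
    hours = [d.power_on_hours for d in drives]
    if max(hours) - min(hours) <= hours_window:
        warnings.append(f"power-on hours within {hours_window}h of each other")
    return warnings


# Hypothetical mirror pair bought and installed together.
pair = [
    Drive("S1", "VendorA", "B100", "1.0", 12000),
    Drive("S2", "VendorA", "B100", "1.0", 12003),
]
for warning in shared_attributes(pair):
    print(warning)
```

This only flags a group where every member matches; whether a shared firmware version is a problem depends on the trade-off in [1].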

I don't know about literature, but in the world of RAID this is a common warning.

A RAID5 array crashing and burning because a second disk failed during the rebuild after the first disk failed is a common story.
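A back-of-the-envelope sketch of why the rebuild window is risky. It assumes each surviving disk fails independently with the same probability during the rebuild, which is the optimistic case; the point of this thread is that same-batch disks are correlated, which raises that probability for every survivor at once. The function name and the input numbers are made up for illustration.

```python
def p_rebuild_loss(p_disk, surviving_disks):
    """Probability that at least one surviving disk fails during the rebuild,
    assuming independent failures with probability p_disk each."""
    return 1 - (1 - p_disk) ** surviving_disks


# Hypothetical 5-disk RAID5: one disk already failed, four must survive.
print(p_rebuild_loss(0.01, 4))  # unrelated disks, small per-disk risk
print(p_rebuild_loss(0.20, 4))  # same worn batch, much higher per-disk risk
```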

Not sure about literature, but that was a known thing in the Ops circles I was in 10 years ago: never use the same brand for disk pairs, to reduce the chance of wear-and-tear defects arising at the same time.

  • We used to use the same brand but different models, or at least ensure they were from different manufacturing batches.