Comment by kabdib

3 years ago

I once had a small fleet of SSDs fail because they had some uptime counters that overflowed after 4.5 years, and that somehow persistently wrecked some internal data structures. It turned them into little, unrecoverable bricks.

It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.

You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5 years. I am not even kidding.
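
The exact firmware defect is never spelled out in the thread, so the following is only a hypothetical sketch of the failure pattern described above: a power-on-hours counter used as an unchecked index into a fixed-size persistent table, so that at hour 40,000 the write lands past the end of the table and corrupts the metadata next to it. The table size, struct layout, and function name are invented for illustration; this is not the actual SanDisk firmware.

    /* Hypothetical sketch only -- not the real drive firmware. */
    #include <stdint.h>

    #define LOG_SLOTS 40000u                 /* invented table size */

    struct persistent_state {
        uint32_t hourly_log[LOG_SLOTS];      /* one slot per power-on hour */
        uint32_t mapping_table_crc;          /* critical metadata stored right after */
    };

    static struct persistent_state flash;    /* imagine this lives in NAND */

    void on_hour_tick(uint32_t power_on_hours)
    {
        /* BUG: no bounds check. At hour 40,000 this writes one element past
         * hourly_log and clobbers the adjacent metadata; after the next reset
         * the metadata fails validation and the drive never comes back up. */
        flash.hourly_log[power_on_hours] = 1u;
    }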

  • Let me narrow my guess: They hit 4 years, 206 days and 16 hours... or 40,000 hours (the conversion is sketched just after this thread).

    And that they were sold by HP or Dell, and manufactured by SanDisk.

    Do I win a prize?

    (None of us win prizes on this one).

    • These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is between 39,984 and 40,032... I can't be precise because they are dead and I am going off of when the hardware configurations were entered into our database (could have been before they were powered on) or when we handed them over to HN, and when the disks failed.

      Unbelievable. Thank you for sharing your experience!

    • Wow. It's possible that you have nailed this.

      Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

      But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

      Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.

  • It's concerning that a hosting company was unaware of the 40,000-hour situation with the SSDs it was deploying. Anyone in hosting should have been made aware of this, or at least should have kept a better grip on what was happening in the market.
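
For reference, the conversion behind the 40,000-hour guess above works out exactly; here is a minimal arithmetic check, assuming 365-day years and using nothing from the thread beyond the numbers quoted:

    #include <stdio.h>

    int main(void)
    {
        unsigned hours = 40000;          /* the guessed power-on hours */
        unsigned days  = hours / 24;     /* 1666 full days             */
        unsigned rem_h = hours % 24;     /* 16 hours left over         */
        unsigned years = days / 365;     /* 4 (365-day years)          */
        unsigned rem_d = days % 365;     /* 206 days left over         */

        printf("%u hours = %u years, %u days, %u hours\n",
               hours, years, rem_d, rem_h);                  /* 4 years, 206 days, 16 hours */
        printf("~= %.2f years\n", hours / (24.0 * 365.25));  /* ~4.56 years */
        return 0;
    }

That matches both the "4 years, 206 days and 16 hours" guess and the rough "4.5 years" figure, and the 39,984-40,032 range quoted above is simply a 48-hour window of uncertainty around the same mark.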

I had a similar issue, but it was a single RAID-5 array and wear or some other manufacturing defect. The drives were the same brand, model, and batch. When the first one failed and the array went into recovery mode, I ordered 3 replacements and upped the backup frequency. It was good that I did, because the two remaining drives died shortly after.

The lesson stuck: the three replacements went to different arrays, and we never again let drives from the same batch be part of the same array.