← Back to context

Comment by kabdib

3 years ago

Let me narrow my guess: They hit 4 years, 206 days and 16 hours . . . or 40,000 hours.

And that they were sold by HP or Dell, and manufactured by SanDisk.

Do I win a prize?

(None of us win prizes on this one).

These were made by SanDisk (SanDisk Optimus Lightning II) and the number of hours is between 39,984 and 40,032... I can't be precise because they are dead and I am going off of when the hardware configurations were entered in to our database (could have been before they were powered on) or when we handed them over to HN, and when the disks failed.

Unbelievable. Thank you for sharing your experience!

Wow. It's possible that you have nailed this.

Edit: here's why I like this theory. I don't believe that the two disks had similar levels of wear, because the primary server would get more writes than the standby, and we switched between the two so rarely. The idea that they would have failed within hours of each other because of wear doesn't seem plausible.

But the two servers were set up at the same time, and it's possible that the two SSDs had been manufactured around the same time (same make and model). The idea that they hit the 40,000 hour mark within a few hours of each other seems entirely plausible.

Mike of M5 (mikiem in this thread) told us today that it "smelled like a timing issue" to him, and that is squarely in this territory.

  • This morning, I googled for issues with the firmware and the model of SSD, I got nothing. But now I am searching for "40000 hours SSD" and a million relevant results. Of course, why would I search for 40000 hours.

    This thread is making me feel a lot less crazy.

    • I'm hoping that deep in your spam folder is a critical firmware update notice from Dell/EMC/HP/SanDisk from 2 years ago :).

    • There are times I don't miss dealing with random hardware mystery bullshit.

      This one is just ... maddening.

  • This kind of thing is why I love Hacker News. Someone runs into a strange technical situation, and someone else happens to share their own obscure, related anecdote, which just happens to precisely solve the mystery. Really cool to see it benefit HN itself this time.

I wonder if it might be closer to 40,032 hours. The official Dell wording [1] is "after approximately 40,000 hours of usage". 2^57 nanoseconds is 40031.996687737745 hours. Not sure what's special about 57, but a power of 2 limit for a counter makes sense. That time might include some manufacturer testing too.

[1] https://www.reddit.com/r/sysadmin/comments/f5k95v/dell_emc_u...

  • It might not be nanoseconds, but something that's a power of 2 number of nanoseconds going into an appropriately small container seems likely. For example, a 62.5MHz counter going into 53 bits breaks at the same limit. Why 53 bits? That's where things start to get weird with IEEE doubles - adding 1 no longer fits into the mantissa and the number doesn't change. So maybe someone was doing a bit of fp math to figure out the time or schedule a next event? Anyway, very likely some kind of clock math that wrapped or saturated and broke a fundamental assumption.

    • 53 is indeed a magic value for IEEE doubles, but why would anybody count an inherently integer value with floating-point? That's a serious rookie mistake.

      Of course there's no law that says SSD firmware writers can't be rookies.

      1 reply →

  • See! People should register via mail for those important notifications! (Or alternatively do quarterly checks that your firmware is up to date).

    • A lot of companies have teams dedicated to hardware that don’t give a shit about it. And their managers don’t give a shit.

      Then the people under them who do give a shit, because they depend on those servers, aren’t allowed to register with HP etc for updates, or to apply firmware updates, because “separation of duties”.

      Basically, IT is cancer from the head down.