Comment by sillysaurusx

3 years ago

HN was down because the failover server also failed: https://twitter.com/HNStatus/status/1545409429113229312

Double disk failure is improbable but not impossible.

The most impressive thing is that there seems to be almost no data loss whatsoever. Whatever the backup system is, it seems rock solid.

> Double disk failure is improbable but not impossible.

It's not even improbable if the disks are the same kind purchased at the same time.

  • I once had a small fleet of SSDs fail because they had some uptime counters that overflowed after 4.5 years, and that somehow persistently wrecked some internal data structures. It turned them into little, unrecoverable bricks.

    It was not awesome seeing a bunch of servers go dark in just about the order we had originally powered them on. Not a fun day at all.
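
    (Back-of-the-envelope on what 4.5 years of continuous uptime means in power-on hours; the thresholds below come from publicly known SSD firmware bugs, and connecting them to our drives is my speculation, not something I confirmed:)

      # ~4.5 years of continuous power-on time, in hours
      hours = 4.5 * 365.25 * 24
      print(round(hours))  # 39447
      # for comparison: known firmware bugs have bricked some
      # enterprise SSDs at 32,768 (2**15) and 40,000 power-on hours
      print(2 ** 15)       # 32768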

    • You are never going to guess how long the HN SSDs were in the servers... never ever... OK... I'll tell you: 4.5 years. I am not even kidding.

      36 replies →

    • I had a similar issue, but it was a single RAID-5 array and wear or some other manufacturing defect. The drives were the same brand, model, and batch. When the first one failed and the array went into recovery mode, I ordered 3 replacements and upped the backup frequency. It was good that I did, because the two remaining drives died shortly after.

      The lesson learned: the three replacements went to different arrays, and we never again let drives from the same batch be part of the same array.

  • There's a principle in aviation of staggering engine maintenance on multiple-engined airplanes to avoid maintenance-induced errors leading to complete power loss.

    e.g. Simultaneous Engine Maintenance Increases Operating Risks, Aviation Mechanics Bulletin, September–October 1999 https://flightsafety.org/amb/amb_sept_oct99.pdf

  • Yep: if you buy a pair of disks together, there's a fair chance they'll both be from the same manufacturing batch, which correlates with disk defects.

    • Yeah, just coming here to say this. Multiple disk failures are pretty probable. I've had batches of both HDDs and SSDs with sequential serial numbers, subjected to the same workloads, all fail within the same ~24-hour period.

      2 replies →

    • This makes total sense, but I've never heard of it. Is there any literature or writing about this phenomenon?

      I guess in some cases proper redundancy also means using different brands of equipment.

      10 replies →

    • This is why I try to mismatch manufacturers in RAID arrays. I'm told there is a small performance hit (things run towards the speed of the slowest drive, separately in terms of latency and throughput), but I doubt the difference is large, and I like the reduction in potential failure-during-rebuild rates. Of course I have off-machine and off-site backups as well as RAID, but having to use them to restore a large array would be a greater inconvenience than just rebuilding the array (followed by checksum verifies over the whole lot, for paranoia's sake).
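
      (Concretely, something like this; a sketch in Python around Linux mdadm, with made-up device names, assuming the two drives were deliberately bought from different vendors:)

        import subprocess

        # hypothetical: /dev/sdb and /dev/sdc are same-capacity
        # drives from two different manufacturers
        subprocess.run(
            [
                "mdadm", "--create", "/dev/md0",
                "--level=1",         # RAID-1 mirror
                "--raid-devices=2",
                "/dev/sdb",          # vendor A
                "/dev/sdc",          # vendor B
            ],
            check=True,  # raise if mdadm exits non-zero
        )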

    • Eek - now I'm glad I wait a few months before buying each disk for my NAS.

      Not doing it for this reason but rather for financial ones :) But as I have a totally mixed bunch of sizes, I have no RAID, and a disk loss would be horrible.

      1 reply →

    • That's why serious SAN vendors take care to provide you with a mix of disks (e.g. on a brand-new NetApp you can see that the disks are of 2-3 different types, with quite different serial numbers).

  • Or even if the power supplies were purchased around the same time. I had a batch of servers that, as soon as they arrived, started chewing through hard drives. It took about 10 failed drives before I realized the problem was the power supplies.

  • I learned this principle by getting a ticket for a burnt-out headlight a week after I replaced the other one.

    • Anyone familiar with car repair will tell you that if one headlight burns out, you should just go ahead and replace both, because of this exact phenomenon. I suppose with LEDs we may not have to worry about it anymore.

  • Even if they're not the same, they're written to at the same time and rate, meaning they accumulate the same wear over time and are subject to the same power/heat issues, etc.

    • Hopefully, regularly checking the disks' S.M.A.R.T. status will help you stay on top of issues caused by those factors.

      Also, you shouldn't wait for disks to fail before replacing them. HN's disks were in use for 4.5 years, which is longer than the typical disk lifetime, in my experience. They should have been replaced sooner, one by one, in anticipation of failure. That would also have allowed staggering the disk purchases to avoid similar manufacturing dates.
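
      (A minimal sketch of the kind of check I mean, in Python; assumes smartmontools is installed, root privileges, and hypothetical device names:)

        import subprocess

        DISKS = ["/dev/sda", "/dev/sdb"]  # hypothetical device names

        for disk in DISKS:
            # `smartctl -H` prints the drive's overall SMART health
            # self-assessment (e.g. "... test result: PASSED")
            out = subprocess.run(
                ["smartctl", "-H", disk],
                capture_output=True, text=True,
            ).stdout
            status = "ok" if "PASSED" in out else "NEEDS ATTENTION"
            print(f"{disk}: {status}")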

      1 reply →

> Double disk failure is improbable but not impossible.

It's actually surprisingly common for failover hardware to fail shortly after the primary hardware. It has normally been exposed to conditions similar to whatever killed the primary, and the strain of failing over pushes it over the edge.

  • Isn't that more for load balancing than failover?

    For load balancing I would consider this very likely, because both servers are equally loaded. But "failover" I would usually take to mean a scenario where a second server just waits for the primary to fail, in which case it would be virtually unused. Like an active/passive scenario, as someone mentioned below.

    But perhaps I got my terminology mixed up. I'm not working with servers so much anymore.

    • If it's active/active failover, then they get the same wear; if it's active/passive, most of the components don't, but the storage might. Then again, if it's active/passive, flaws can "hibernate" and get triggered exactly when failing over.

      You know how they say to always test your backups? Always test your failover too.
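
      (A toy sketch of a periodic drill, in Python; the health-check endpoints are illustrative assumptions, not anyone's actual setup:)

        import urllib.request

        # hypothetical endpoints for an active/passive pair
        PRIMARY = "http://primary.internal:8080/health"
        STANDBY = "http://standby.internal:8080/health"

        def healthy(url: str) -> bool:
            """True if the node answers its health check with 200."""
            try:
                with urllib.request.urlopen(url, timeout=5) as resp:
                    return resp.status == 200
            except OSError:
                return False

        # verify the standby can actually serve *before* you need it,
        # then rehearse a real switchover in a maintenance window
        if not healthy(STANDBY):
            print("standby failed its health check; a failover would fail")
        elif healthy(PRIMARY):
            print("both nodes healthy; safe to rehearse a switchover")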

According to this comment: https://news.ycombinator.com/item?id=32024485

each server has a pair of mirrored disks, so it seems we're talking about 4 drives failing, not just 2.

On the other hand the primary seems to have gone down 6 hours before the backup server did, so the failures weren't quite simultaneous.

  • > so it seems we're talking about 4 drives failing, not just 2.

    Yes, I'm a bit unclear on what happened there, but that does seem to be the case.

If you have an active/passive HA setup and don't test it periodically (by taking the active server offline and switching roles afterwards), my guess is that double disk failures will be more common for you than single disk failures.

Still, I see no reason for prioritizing that failure mode on a site like HN.

Depends on your vendor as well.

A long time ago we had a Dell server that came from Dell with RAID pre-configured (don't ask, I didn't order it). Eventually one disk in this server died; what sucked was that the second disk in the RAID array failed only a few minutes later. We had to restore from backup, which sucked, but to our surprise, when we opened up the server, the two disks had sequential serial numbers. They came from the same batch at the same time. Not a good thing to do when you sell people pre-configured RAID systems at a markup...

By "second disk failure" do they mean that the disks on both the primary and fallback servers failed? Or that two disks (of a RAID1 or similar setup) in the fallback server failed?

The latter is understandable; the former would be quite a surprise for such a popular site. It would mean the machines have no disk redundancy and a server goes down immediately on any disk failure, leaving the fallback server as the only backup.

What was the test to determine the data loss?

  • Informal. My last upvote was pretty close to when HN went down, so I expected my karma to go down, but it didn't.

    Also I remember the "Why we're going with Rails" story on the front page from before it went down.

  • I came to the same conclusion by observing that there are posts and comments from only eight hours ago.

    • So that means data loss... probably restored from backup.

      Good news for people who were banned, or for posts that didn't get enough momentum :)

      edit: It was restored from backup... so, definitely data loss.

      12 replies →

> Double disk failure is improbable but not impossible.

Were they connected to the same power supply? I had 4 different disks fail at the same time before, but they were all in the same PC... (lightning)

  • They were in two mirrors, each mirror in a different server. The servers were in different racks in the same row, on different power circuits from different panels.

Is dang pushing changes and such on his own?

Sounds like it's run by one guy.

  • I push changes on my own all the time, but the work of getting HN running again today was overwhelmingly done by my brilliant colleague mthurman.

  • HN will be around for a hundred years. I think it's more than just a forum. We've seen lots of people coordinate during disasters, for example. Dan and his team do a good job running it. (I'm not a part of it.)

    EDIT: My response was based on some edits that are now removed.