
Comment by geoduck14

4 years ago

My company is running planned DR testing right now. One key system had some problems failing over, so they delayed failing back, then decided to reschedule the fail-back yet again.

Each time the plan changed, there was a war room to discuss the "impact" of the change. In each meeting, I'm scratching my head: this is the closest thing to an actual disaster and we are all in a tizzy.

Doing regular failovers is a healthy practice, but it's also a test of the company.

A large outage is just about the only thing that can convince management that it is important.

Reducing technical debt and staying current with technology ought to be a priority, but all too often it's not.

Netflix learned it the hard way:

https://opensource.com/article/18/4/how-netflix-does-failove...

  • They had an outage, which led them to optimize their time to recover. However, the article doesn't mention any regular testing of the new failover. If I'm reading it right, they designed their backup to not report on its health at all, so they just have to hope it works when they need it.

    Was this not the completely wrong lesson to learn?

    >Since our capacity injection is swift, we don't have to cautiously move the traffic by proxying to allow scaling policies to react. We can simply switch the DNS and open the floodgates, thus shaving even more precious minutes during an outage.

    >We added filters in the shadow cluster to prevent the dark instances from reporting metrics. Otherwise, they will pollute the metric space and confuse the normal operating behavior.

    >We also stopped the instances in the shadow clusters from registering themselves UP in discovery by modifying our discovery client. These instances will continue to remain in the dark (pun fully intended) until we trigger a failover.
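
For anyone trying to picture the behavior in those quotes, here is a rough toy model. It is only a sketch of my reading of the quoted text, not Netflix's actual code: the `Instance` and `Region` classes, the `dark` flag, and the `failover` method are all made-up names, and the "registry" and "metrics" are plain Python collections standing in for real discovery and metrics systems.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    name: str
    dark: bool = True  # hypothetical flag: shadow instance, hidden until failover


@dataclass
class Region:
    instances: list = field(default_factory=list)
    registry: set = field(default_factory=set)   # stand-in for service discovery
    metrics: list = field(default_factory=list)  # stand-in for the metrics pipeline

    def register(self, inst):
        # Dark instances never register UP, so nothing routes traffic to them.
        if not inst.dark:
            self.registry.add(inst.name)

    def report(self, inst, metric, value):
        # Filter: drop metrics from dark instances to keep the metric space clean.
        if not inst.dark:
            self.metrics.append((inst.name, metric, value))

    def add(self, inst):
        self.instances.append(inst)
        self.register(inst)

    def failover(self):
        # Flip every dark instance live and register it in one pass.
        for inst in self.instances:
            inst.dark = False
            self.register(inst)


shadow = Region()
for i in range(3):
    shadow.add(Instance(f"shadow-{i}"))

shadow.report(shadow.instances[0], "cpu", 0.1)          # silently dropped
before = (len(shadow.registry), len(shadow.metrics))    # nothing visible yet

shadow.failover()
shadow.report(shadow.instances[0], "cpu", 0.7)          # now recorded
after = (len(shadow.registry), len(shadow.metrics))
```

In this toy version, nothing about the shadow instances is observable from the outside until `failover()` runs, which is exactly the worry raised above: the first real signal about their health arrives at the moment you depend on them.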

Reminds me, my friend recently had a "scoping" meeting. It was planned for 2 hours. Lasted 9. Obviously they needed a scoping meeting to scope their scoping meeting.

Ah, I love when people decide they need committees for plans for their committees.