Comment by jimmydorry
4 years ago
They had an outage, which prompted them to optimize their time to recover. However, the article makes no mention of regularly testing the new failover. If I'm reading it right, they designed their backup not to report on its own health at all, so they just have to hope it works when they need it.
Was this not the completely wrong lesson to learn?
>Since our capacity injection is swift, we don't have to cautiously move the traffic by proxying to allow scaling policies to react. We can simply switch the DNS and open the floodgates, thus shaving even more precious minutes during an outage.
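The "switch the DNS and open the floodgates" step amounts to rewriting weighted DNS records in one atomic change rather than ramping traffic through a proxy. A minimal sketch of that planning logic, assuming hypothetical `Record`/`plan_failover` names (the article's real setup would be something like Route 53 weighted records):

```python
# Hypothetical sketch: compute new weighted DNS records that drop the
# failed region and hand its traffic share to the survivors in a single
# change, instead of gradually shifting traffic behind a proxy.
from dataclasses import dataclass

@dataclass
class Record:
    region: str
    weight: int  # relative share of traffic this region receives

def plan_failover(records, failed_region):
    """Return new weighted records sending the failed region's share
    to the surviving regions in one atomic record-set change."""
    survivors = [r for r in records if r.region != failed_region]
    if not survivors:
        raise ValueError("no surviving regions to absorb traffic")
    lost = sum(r.weight for r in records if r.region == failed_region)
    share, rem = divmod(lost, len(survivors))
    new = [Record(r.region, r.weight + share) for r in survivors]
    new[0] = Record(new[0].region, new[0].weight + rem)
    # The failed region is zeroed out entirely rather than ramped down.
    new.append(Record(failed_region, 0))
    return new
```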
>We added filters in the shadow cluster to prevent the dark instances from reporting metrics. Otherwise, they will pollute the metric space and confuse the normal operating behavior.
>We also stopped the instances in the shadow clusters from registering themselves UP in discovery by modifying our discovery client. These instances will continue to remain in the dark (pun fully intended) until we trigger a failover.
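Taken together, the two quoted changes describe a "dark" mode: suppress metric reporting and refuse to register UP in discovery until failover fires. A toy sketch of that behavior, assuming a hypothetical `DarkModeClient` (the article's actual discovery client is Netflix's Eureka, modified in-house):

```python
# Hypothetical sketch of a dark instance: it publishes no metrics and
# stays out of discovery until a failover flips it live. All names here
# are illustrative, not from the article.
class DarkModeClient:
    def __init__(self, dark=True):
        self.dark = dark          # True while in the shadow cluster
        self.registered_up = False
        self.published = []       # metrics actually sent downstream

    def report_metrics(self, metrics):
        # Filter: dark instances publish nothing, so they never
        # pollute the production metric space.
        if self.dark:
            return
        self.published.extend(metrics)

    def register(self):
        # Modified discovery client: never register UP while dark.
        self.registered_up = not self.dark

    def trigger_failover(self):
        # Failover flips the instance live: it registers UP and
        # starts reporting metrics like any normal instance.
        self.dark = False
        self.register()
```

Note the trade-off the comment points at: because a dark instance is invisible to both metrics and discovery by design, nothing routinely verifies that it is actually healthy.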