Comment by sharken
4 years ago
Doing regular failovers is a healthy practice, but it's also a test of the company.
A large outage is just about the only thing that can convince management that failover readiness is important.
Reducing technical debt and staying current with technology ought to be a priority, but all too often it isn't.
Netflix learned this the hard way:
https://opensource.com/article/18/4/how-netflix-does-failove...
They had an outage that pushed them to optimize their time to recover. However, the article makes no mention of regularly testing the new failover path. If I'm reading it right, they designed their backup capacity to not report on its health at all, so they just have to hope it works when they need it.
Wasn't that the completely wrong lesson to learn?
>Since our capacity injection is swift, we don't have to cautiously move the traffic by proxying to allow scaling policies to react. We can simply switch the DNS and open the floodgates, thus shaving even more precious minutes during an outage.
>We added filters in the shadow cluster to prevent the dark instances from reporting metrics. Otherwise, they will pollute the metric space and confuse the normal operating behavior.
>We also stopped the instances in the shadow clusters from registering themselves UP in discovery by modifying our discovery client. These instances will continue to remain in the dark (pun fully intended) until we trigger a failover.
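To make the trade-off concrete, here's a minimal, purely illustrative sketch of what a "dark" standby instance like that looks like. None of these names come from the article (Netflix uses Eureka for discovery and Atlas for metrics, not this toy code); it just shows that while the instance stays dark, nothing about its health ever leaves the box.

```python
class ShadowInstance:
    """Toy model of a dark standby instance: it runs, but stays
    invisible to metrics and service discovery until failover."""

    def __init__(self, metrics, discovery):
        self.metrics = metrics        # hypothetical metrics client
        self.discovery = discovery    # hypothetical discovery client
        self.failover_triggered = False

    def report_metrics(self, sample):
        # Filter everything out while dark, so the shadow cluster
        # does not pollute the production metric space.
        if not self.failover_triggered:
            return
        self.metrics.publish(sample)

    def register(self):
        # Stay out of discovery (never report UP) until failover,
        # mirroring the modified discovery client described above.
        if not self.failover_triggered:
            return
        self.discovery.register(status="UP")

    def trigger_failover(self):
        # Flipping this flag (plus the DNS switch) is the first time
        # anyone learns whether the dark capacity actually works.
        self.failover_triggered = True
        self.register()
```

That last comment is exactly my objection: unless you exercise trigger_failover() on a regular schedule, the first real signal about the shadow cluster's health arrives during an outage.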