Comment by solatic
5 hours ago
I don't really understand why you think this tool is needed and what exact problem/risk it's trying to solve.
Most people should start with a single-zone setup and just accept that there's a risk associated with zone failure. In a single-zone setup, you have a node group in that one zone, you have the managed database in the same zone, and you're done. Zone-wide failure is extremely rare in practice, and you would be surprised at the number (and size) of companies that run single-zone production setups to save on cloud bills. Just write the zone label selector into the node affinity section by hand (see the sketch below); you don't need a fancy admission webhook to take chance out of scheduling.
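For example, here's a minimal sketch of pinning a workload to one zone with plain node affinity. The deployment name, image, and zone are placeholders; the `topology.kubernetes.io/zone` label is the standard well-known label, but check what your provider actually sets on its nodes:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app            # hypothetical workload name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      affinity:
        nodeAffinity:
          # Hard requirement: only schedule onto nodes in the zone
          # where the database lives.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: topology.kubernetes.io/zone
                    operator: In
                    values:
                      - us-east-1a   # same zone as your managed database
      containers:
        - name: my-app
          image: my-app:latest       # placeholder image
```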
If you decide that you want to handle the additional complexity of supporting failover in case of zone failure, the easiest approach is to just set up another node group in the secondary zone (sketch below). If the primary zone fails, manually scale up the node pool in the secondary zone. Kubernetes will automatically schedule all the pods on the scaled-up node pool (remember: primary zone failure, no healthy nodes in the primary zone), and you're done.
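A rough sketch of what that looks like, assuming an EKS cluster managed with eksctl (cluster name, zones, and sizes are all placeholders): keep a second node group pinned to the failover zone at zero nodes, and scale it up only when you need it.

```yaml
# eksctl ClusterConfig sketch: a live primary node group plus a
# secondary-zone group kept at zero nodes during normal operation.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: prod              # hypothetical cluster name
  region: us-east-1
managedNodeGroups:
  - name: workers-a       # primary zone, serves all normal traffic
    availabilityZones: ["us-east-1a"]
    minSize: 3
    maxSize: 10
    desiredCapacity: 3
  - name: workers-b       # failover group, scaled up only during a zone-a outage
    availabilityZones: ["us-east-1b"]
    minSize: 0
    maxSize: 10
    desiredCapacity: 0
```

During a failover you'd run something like `eksctl scale nodegroup --cluster prod --name workers-b --nodes 3` and let the scheduler do the rest.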
If you want to handle zone failover completely automatically, this tool represents additional cost, because it forces you to have nodes running in the secondary zone during normal usage. Hopefully you are not running a completely empty, redundant set of service VMs in normal operation, because that would be a colossal waste of money. So you are presuming that, when RDS automatically fails over to zone b to ride out a zone-a failure, you will certainly be able to scale up a full-scale production environment in zone b as well, despite nearly every other AWS customer attempting more or less the same strategy: roughly half of zone a's traffic will spill over to zone b and roughly half to zone c, minus whatever traffic is zone-locked to a (e.g. single-zone databases without failover mechanisms). That is a big assumption to make. You run a serious risk of not getting sufficient capacity in what was essentially an arbitrarily chosen zone (chosen without any context on whether it has sufficient capacity for the rest of your workloads), being caught with zonal mismatches, and not knowing what to do. You may well need to fail over to another region entirely to get sufficient capacity for your full workload.
If you are cost- and latency-sensitive enough to stick to a single zone, you're likely much better off coming up with a migration plan, writing an automated runbook/script to handle it, and testing it on gamedays.
> I don't really understand why you think this tool is needed and what exact problem/risk it's trying to solve.
They lay out the problem and solution pretty well in the link. If you still don't understand after reading it, then that's okay! It just means you're not having this problem and you're not in need of this tool, so go you! But at least you'll come away with the understanding that someone was having this problem and someone needed this tool to solve it, so win win win!