Comment by toredash

10 hours ago

> Most people should start with a single-zone setup and just accept that there's a risk associated with zone failure. If you have a single-zone setup, you have a node group in that one zone, you have the managed database in the same zone, and you're done.

I don't disagree, but there is one issue with this approach and that is that RDS is a multi AZ service by itself. That means that when a maintenance event occur on your insaance, AWS will start a new instance in a new zone, and fail over to that one.

You could of course manually failover RDS afterwards to your primary zone. Not sure if that is better than manually scaling up a node pool if a zone fails.

> So you are presuming that, when RDS automatically fails over to zone b to account for zone a failure, that you will certainly be able to scale up a full scale production environment in zone b as well, in spite of nearly every other AWS customer attempting more or less the same strategy;

Thats up to the user to decide via the Kyverno policy. We used the preferredDuringSchedulingIgnoredDuringExecution affinity setting to instruct the scheduler to attempt to schedule the pods in the optimal zone.

I believe the only way to be 100% sure that you have compute capacity available in your AWS account is the use EC2 On-Demand Capacity Reservations (https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capa...). If your current zone is at full capacity, and for some reason the nodes your VMs are running on dies, that capacity is lost, and you wont get it back either.

2 comments

toredash

solatic 8 hours ago

> That means that when a maintenance event occur on your insaance, AWS will start a new instance in a new zone, and fail over to that one.

Not true for single-AZ deployments. There is downtime during the maintenance event, but this is also true in multi-AZ deployments when the instance in the second AZ is promoted; a multi-AZ maintenance window has slightly less downtime, but not much; downtime is downtime, but generally not enough to affect a 99.9% SLA anyway.

> EC2 On-Demand Capacity Reservations

Also quite expensive to maintain just for outage recovery events.

The point I'm trying to make is that formal risk analysis forces you to think about actual sources of risk, and SRE/FinOps principles force you think about how much budget you are willing to spend to address those risks. And I don't understand how a tool like this fits into formal risk analysis and where it presents an optimum solution for those risks.

toredash 6 hours ago

> And I don't understand how a tool like this fits into formal risk analysis and where it presents an optimum solution for those risks.
Seems it does not fit your risk analysis?