Comment by everfrustrated
10 hours ago
How is Azure still having faults that affect multiple regions? Clearly their region definition is bollocks.
All three hyperscalers have vulnerabilities in their control planes: each is either a single point of failure, like AWS with us-east-1, or global, meaning that a faulty release can take it down entirely. They also take AZ resilience to mean that existing compute will continue to work as before, while allocation of new resources might fail in multi-AZ or multi-region ways.
It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.
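As a rough sketch of what "enough slack" means in practice, here is the sizing arithmetic in Python. The function name, the parameters, and the 20% default headroom are all hypothetical choices for illustration, not anything a provider prescribes; the assumption is that load is spread evenly across AZs and that the surviving AZs must absorb peak traffic with no new allocations.

```python
import math


def static_instances_per_az(
    peak_rps: int,
    rps_per_instance: int,
    az_count: int,
    az_failures_tolerated: int = 1,
    headroom_pct: int = 20,  # hypothetical default, tune for your own service
) -> int:
    """Instances to pre-provision in each AZ so that peak load plus headroom
    still fits after losing `az_failures_tolerated` AZs, without autoscaling."""
    if az_count <= az_failures_tolerated:
        raise ValueError("need more AZs than the number of failures tolerated")
    surviving_azs = az_count - az_failures_tolerated
    # Total instances needed at peak including headroom, divided across
    # only the AZs that are assumed to survive.
    needed = peak_rps * (100 + headroom_pct) / (100 * rps_per_instance * surviving_azs)
    return math.ceil(needed)


per_az = static_instances_per_az(peak_rps=10_000, rps_per_instance=500, az_count=3)
total = per_az * 3
print(per_az, total)  # 12 per AZ, 36 in total
```

With these example numbers, 24 instances would cover peak plus headroom, but 36 are kept running so that any one of three AZs can disappear without touching the control plane.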
> It means that any service designed to survive a control plane outage must statically allocate its compute resources and have enough slack that it never relies on auto scaling. True for AWS/GCP/Azure.
That sounds oddly similar to owning hardware.
In a way. It means that you can get new capacity most of the time, but the transition windows in which a service gets resized (or mutated in general) have to be minimised and carefully controlled by ops.
This outage talks about what appears to be a VM control plane failure (it mentions stop not working) across multiple regions.
AWS has never had this type of outage in 20 years. Yet Azure has them constantly.
This is a total failure of engineering and has nothing to do with capacity. Azure is a joke of a cloud.
AWS had an outage that blocked all EC2 operations just a few months ago: https://aws.amazon.com/message/101925/
I do agree that Azure seems to be a lot worse: its control plane(s) seem to be much more centralized than those of the other two.