Comment by CableNinja

10 hours ago

Five nines (99.999%) is only about 5 minutes of downtime a year. They are breaking SLAs and impacting services people depend on.
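
For anyone who wants to sanity-check the five-nines figure, here is a rough sketch of the arithmetic. It assumes an average year of 365.25 days; the function name `allowed_downtime_minutes` is just made up for illustration. Under that assumption, 99.999% works out to roughly 5.26 minutes per year.

```python
# Yearly downtime budget for a given availability target.
# Assumption: an average year of 365.25 days (525,960 minutes).
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.9, 99.99, 99.999):
    print(f"{target}% -> {allowed_downtime_minutes(target):.2f} min/year")
```

Each extra nine cuts the budget by a factor of ten, which is why a single multi-hour outage burns through many years' worth of a five-nines allowance.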

Tbh though, this is sort of all the other companies' fault: "everyone" uses AWS and CF, and so others follow. Now not only are all your eggs in one basket, so are everyone else's. When the basket inevitably falls into a lake....

Providers need to be more aware of their global impact in outages, and customers need to spread themselves across more providers.

> Providers need to be more aware of their global impact in outages

So you think the problem is they aren't "aware"?

  • These kinds of outages continue to happen and continue to impact 50+% of the internet. Yes, they know they have that power, but they don't treat changes accordingly, so no, they aren't aware. Awareness would imply more care in operations like code changes and deployments.

    Outages happen, code changes occur; but you can do a lot to prevent these things on a large scale, and they simply don't.

    Where is the A/B deployment that would prevent a full outage? And internally, where was the validation before the change? Was the testing run against a prod-like environment, or against something that once resembled prod but hasn't for ages?

    They could absolutely limit the impact on the entire global infra in multiple ways, and haven't, despite their many outages.

    • They are aware. They just don't want to pay the cost; it's a cost-benefit tradeoff. Education won't help: this tradeoff is argued over heavily in every large software company.
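
To make the A/B deployment question from this thread a bit more concrete, here is a minimal sketch of a staged rollout. It is not how any particular provider deploys. Every name in it (`staged_rollout`, the `deploy`/`healthy`/`rollback` callbacks, the stage fractions, the region names) is hypothetical. The idea is only that a bad change gets caught while it is live in a small slice of regions rather than all of them.

```python
from typing import Callable, Sequence


def staged_rollout(
    regions: Sequence[str],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
    stages: Sequence[float] = (0.05, 0.25, 1.0),  # assumed fractions, purely illustrative
) -> bool:
    """Deploy to growing fractions of regions; roll everything back on the first failed check."""
    done: list[str] = []
    for fraction in stages:
        # Number of regions that should be on the new version after this stage.
        target = max(1, int(len(regions) * fraction))
        for region in regions[len(done):target]:
            deploy(region)
            done.append(region)
        # Check every region touched so far before widening the rollout.
        if not all(healthy(r) for r in done):
            for region in reversed(done):
                rollback(region)
            return False
    return True


# Toy usage: "region-1" fails its health check, so the rollout stops at stage two.
live: set[str] = set()
ok = staged_rollout(
    regions=[f"region-{i}" for i in range(10)],
    deploy=live.add,
    healthy=lambda r: r != "region-1",
    rollback=live.discard,
)
print(ok, sorted(live))
```

In the toy run only two of the ten regions ever receive the change before it is rolled back. The cost is slower deploys and more machinery to maintain, which is the tradeoff the last reply is pointing at.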