Comment by oceanplexian
3 years ago
I hear this repeated so many times at my workplace, and it's so totally and completely uninformed.
Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter aren't going to calmly accept the excuse that "AWS went down" when you can't deliver the services you contractually agreed to deliver. There are industries where having your service casually go down a few times a year is totally unacceptable (healthcare, government, finance, etc.). I worked adjacent to a department that did online retail a while ago, and even an hour of outage would lose us $1M+ in business.
I wonder if the aggregate outage time from misconfigured and over-architected high availability services is greater than the average AWS outage per year.
Similar to security, the last few 9s of availability come at steeply (roughly exponentially) increasing complexity and price. The cutoff will vary case by case, and I’m sure the decision on how many 9s you need is often irrational (CEO says it can never go down! People need their pet food delivered on time!).
> I hear this repeated so many times at my workplace, and it's so totally and completely uninformed.
> Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter...
It sounds like the idea may be bad for your workplace, but that doesn't make it uninformed here. For the average B2C or business-to-small-business application, the customer doesn't even know what a region or datacenter is. All they know is that "the internet" isn't working and your service went down with it. These customers also don't have an SLA with guaranteed uptime. The only thing they agreed to was the Terms and Conditions, which explicitly say "no warranty, express or implied".
If you're selling to large enterprises, yeah, "AWS went down" won't cut it. But in most other cases it will.
> Customers who have invested millions of dollars
> …
> an hour of outage would lose us $1M+ in business
Given that (excluding us-east-1) you’re looking at maybe an hour a year of regional outage on average, that sounds like break-even on that investment at best?
I'm going to say that an hour a year is wildly optimistic. But even then, that puts you at roughly four nines (an hour a year is about 99.989%, while a true 99.99% allows only about 52.6 minutes), which is comparatively awful. Consider that an old-fashioned telephone using technology from the 1970s achieves, on average, five nines (99.999%) of reliability, or 5.26 minutes of downtime per year, and that most IT shops operating their own infrastructure contractually expect five nines from even fairly average datacenters and transit providers.
I was amused when I joined my current company to find that our contracts only stipulate one 9 of reliability (98%). So ~30 mins a day or ~14 hours a month is permissible.
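For anyone who wants to check the downtime figures quoted in this thread, here is a minimal sketch of the arithmetic. The function names and the choice of a 365.25-day year and a 30-day month are my own assumptions, not anything from an SLA.

```python
# Assumed conventions: 365.25-day year, 30-day month.
MINUTES_PER_DAY = 24 * 60
MINUTES_PER_MONTH = 30 * MINUTES_PER_DAY
MINUTES_PER_YEAR = 365.25 * MINUTES_PER_DAY


def allowed_downtime(availability_pct, period_minutes):
    """Minutes of downtime permitted in a period at a given availability (%)."""
    return (1 - availability_pct / 100) * period_minutes


def availability_for(downtime_minutes, period_minutes):
    """Availability (%) implied by a given amount of downtime in a period."""
    return 100 * (1 - downtime_minutes / period_minutes)


if __name__ == "__main__":
    # Five nines and four nines, per year
    print(f"{allowed_downtime(99.999, MINUTES_PER_YEAR):.2f} min/year")  # 5.26
    print(f"{allowed_downtime(99.99, MINUTES_PER_YEAR):.2f} min/year")   # 52.60

    # One hour of outage per year, expressed as availability
    print(f"{availability_for(60, MINUTES_PER_YEAR):.4f} %")             # 99.9886

    # A 98% contract, per day and per month
    print(f"{allowed_downtime(98, MINUTES_PER_DAY):.2f} min/day")        # 28.80
    print(f"{allowed_downtime(98, MINUTES_PER_MONTH) / 60:.2f} h/month") # 14.40
```

So an hour a year lands just short of four nines, and 98% does work out to roughly half an hour a day or about 14 hours a month.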