Comment by andrepew
3 years ago
This is a huge one -- value in outsourcing blame. If you're down because of a major provider outage in the news, you're viewed more as a victim of a natural disaster rather than someone to be blamed.
I hear this repeated so many times at my workplace, and it's so totally and completely uninformed.
Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter aren't going to calmly accept the excuse that "AWS Went Down" when you can't deliver the services you contractually agreed to deliver. There are industries out there where having your service casually go down a few times a year is totally unacceptable (Healthcare, Government, Finance, etc). I worked adjacent to a department that did online retail a while ago and even an hour of outage would lose us $1M+ in business.
I wonder if the aggregate outage time from misconfigured and over-architected high availability services is greater than the average AWS outage per year.
Similar to security, the last few 9s of availability come at steeply increasing complexity and price. The cutoff will vary case by case, and I’m sure the decision on how many 9s you need is often irrational (CEO says it can never go down! People need their pet food delivered on time!).
> I hear this repeated so many times at my workplace, and it's so totally and completely uninformed.
> Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter...
It sounds like the idea may be bad for your workplace, but that doesn't make it uninformed here. For the average B2C or business-to-small-business application, the customer doesn't even know what a region or datacenter is; all they know is that "the internet" isn't working and your service went down with it. These customers also don't have an SLA with guaranteed uptimes. The only thing they agreed to was the Terms and Conditions, which explicitly say "no warranty, express or implied".
If you're selling to large enterprises, yeah, "AWS went down" won't cut it. But in most other cases it will.
> Customers who have invested millions of dollars
> …
> an hour of outage would lose us $1M+ in business
Given (excluding us-east-1) you’re looking at maybe an hour a year on average of regional outage, that sounds like best-case break-even on that investment?
I'm going to say that an hour a year is wildly optimistic. But even then, that puts you at roughly four nines (99.99%), which is comparatively awful. Consider that an old-fashioned telephone using technology from the 1970s achieves, on average, five nines of reliability (99.999%), or 5.26 minutes of downtime per year, and that most IT shops operating their own infrastructure contractually expect five nines from even fairly average datacenters and transit providers.
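For anyone who wants to check the nines arithmetic above, here is a minimal sketch. It assumes a 365.25-day year, and the function names are made up for illustration.

```python
# Sketch of the "nines" arithmetic, assuming a 365.25-day year.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability: float) -> float:
    """Allowed downtime per year for an availability given as a fraction (e.g. 0.9999)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def availability_from_downtime(minutes_down: float) -> float:
    """Availability fraction implied by a given number of minutes of downtime per year."""
    if not 0.0 <= minutes_down <= MINUTES_PER_YEAR:
        raise ValueError("downtime must fit within one year")
    return 1.0 - minutes_down / MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, a in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
        print(f"{label}: {downtime_minutes_per_year(a):.2f} minutes/year")
    # One hour of outage per year, expressed as an availability percentage.
    print(f"1 hour/year: {availability_from_downtime(60) * 100:.4f}%")
```

Five nines comes out at about 5.26 minutes a year, matching the figure above. One hour a year lands slightly short of four nines, which allow about 52.6 minutes.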
This seems like a recently popular exaggeration, I'd wager no one but a select few in the HN-bubble actually cares.
You will primarily be judged by how much of an inconvenience the outage was to every individual.
The best you can hope for is that the local ISP gets the blame, but honestly, it can't be more than a rounding error in the end.
I think it's more of a shield against upper management. AWS going down is treated like an act of god rendering everyone blameless. But if it's your one big server that goes down then it's your fault.
>> AWS going down is treated like an act of god rendering everyone blameless.
Someone decided to use AWS, so there is blame to go around. I'm not saying if that blame is warranted or not, just that it sounds like a valid thing to say for people who want to blame someone.
Agreed. Recently I was discussing the same point with a non-technical friend who was explaining that his CTO had decided to move from Digital Ocean to AWS, after DO experienced some outage. Apparently the CEO is furious at him and has assumed that DO are the worst service provider because their services were down for almost an entire business day. The CTO probably knows that AWS could also fail in a similar fashion, but by moving to AWS it becomes more or less an Act of God type of situation and he can wash his hands of it.
I find this entire attitude disappointing. Engineering has moved from "provide the best reliability" to "provide the reliability we won't get blamed for the failure of". Folks who have this attitude missed out on the dang ethics course their college was teaching.
If rolling your own is faster, cheaper, and more reliable (it is), then the only justification for cloud is assigning blame. But you know what you also don't get? Accolades.
I throw a little party of one here when Office 365 or Azure or AWS or whatever Google calls its cloud products this week is down but all our staff are able to work without issue. =)
"Value in outsourcing blame"
The real reason that talented engineers secretly support all of the middle management we vocally complain about.
If you work in B2B you can put the blame on Amazon, and your customers will say, "Understandable; take the necessary steps to make sure it doesn't happen again." AWS going down isn't an act of God, it's something you should've planned for, especially if it happened before.
So it does not really work in B2B.
I don't really have much to do with contracts, but my company states that we have an uptime of 99.xx%.
In terms of the contract, customers don't care whether I use Azure/AWS or keep my server in a box under the stairs. Yes, they do due diligence and would not buy my services if I kept it in a shoebox.
But then if they lose business they come to me. I can go after Azure/AWS, but I am so small they will throw some free credits at me and tell me to go away.
Maybe if you are in the B2C area then yeah, your customers will probably shrug and say it was M$ or Amazon if you write a sad blog post with excuses.
It's going to depend on the penalties for being unavailable. Small B2B customers are very different from enterprise B2B customers too, so you ultimately have to build for your context.
If you have to give service credits to customers then with "one box" you have to give 100% of customers a credit. If your services are partitioned across two "shards" then one of those shards can go down, but your credits are only paid out at 50%.
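The credit arithmetic above can be sketched roughly as follows. This assumes customers and revenue are spread evenly across shards and that a flat credit rate applies to every affected customer; the function names and dollar figures are hypothetical, not taken from any real contract.

```python
# Rough sketch of service-credit exposure under sharding.
# Assumptions: customers/revenue are split evenly across shards, and every
# affected customer receives the same flat credit rate.


def credited_fraction(num_shards: int, failed_shards: int = 1) -> float:
    """Fraction of customers owed a credit when `failed_shards` shards go down."""
    if num_shards < 1 or not 0 <= failed_shards <= num_shards:
        raise ValueError("need num_shards >= 1 and 0 <= failed_shards <= num_shards")
    return failed_shards / num_shards


def credit_payout(monthly_revenue: float, credit_rate: float,
                  num_shards: int, failed_shards: int = 1) -> float:
    """Total credit owed for one incident (hypothetical flat-rate model)."""
    return monthly_revenue * credit_rate * credited_fraction(num_shards, failed_shards)


if __name__ == "__main__":
    # Hypothetical numbers: $100k monthly revenue, 10% credit per affected customer.
    for shards in (1, 2, 4):
        print(shards, "shard(s):", credit_payout(100_000, 0.10, shards))
```

With one box every customer is credited; with two shards and one failure, only half are. A failure that hits every shard at once still pays out in full.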
Getting to this place doesn't prevent a 100% outage and it imposes complexity. This kind of design can be planned for enterprise B2B apps when the team are experienced with enterprise clients. Many B2B SaaS are tech folk with zero enterprise experience, so they have no idea of relatively simple things that can be done to enable a shift to this architecture.
Enterprise customers do care where things are hosted. They very likely have some users in the EU, or other locations, which care more about data protection and sovereignty than the average US organization. Since they are used to hosting on-prem and doing their own due diligence they will often have preferences over hosting. In industries like healthcare, you can find out what the hosting preferences are, as well as understand how the public clouds are addressing them. While not viewed as applicable by many on HN due to the focus on B2C and smaller B2B here, this is the kind of thing that can put a worse product ahead in the enterprise scenario.
Depends on scale of B2B. Between enterprises, not as much. Between small businesses, works very well (at least in my experience, we are tiny B2B).
It really varies a lot. I have seen very large, lazy sites suddenly pick up a client that wanted an RCA for each bad transaction, and suddenly get religion quickly (well, as quickly as a large org can). Those are precious clients because they force investment into useful directions of availability instead of just new features.
Because you have a vendor/customer relationship. The big thing for AWS is employer/employee relationships. If you were a larger company, and AWS goes down, who blames you? Who blames anyone in the company? At the C-level, does the CEO expect more uptime than Amazon? Of course not. And so it goes.
Whereas if you do something other than the industry standard of AWS (or Azure/GCP) and it goes down, clearly it's your fault.