Comment by ebiederm

3 days ago

I don't know if this is realistic, but as a general rule, if I were contracting with someone so that my business would have higher reliability, I would ask for a service level agreement with an agreed-upon amount the vendor will pay me for every unit of time their service is not up.

At least then your pain is their pain, and they are incentivized to prevent problems and fix them quickly.

Usually those agreements either just give you credits for the same service, pay way less than you lost, or put basically everything under force majeure.

If it works for you that's great, but when the actual shit hits the fan I don't think you should expect actual compensation.

At our scale I doubt we can get any cloud provider to write custom contracts. But if I had negotiating power, I completely agree.

  • Nobody who uses Kubernetes and random shit from GitHub would sign such an agreement if they actually had to pay out and could not weasel their way out of it. That would be signing up for near-unlimited liability and business suicide.

    Let's assume an incident costs you (the customer) ~5k, just assuming the time it takes to get a professional on very short notice to debug (since the whole promise of managed services is that you no longer need technical staff at all). That's also ignoring the actual cost to your business (lost sales, reputational risk, or missing your own SLAs).

    For the provider to be willing to pay out something like this, they'd need to charge you several times that amount every month (otherwise just one incident and they're forever underwater on the LTV). Yet such a monthly amount would make the service unaffordable to all but the most deep-pocketed customers... for whom an outage would cost even more, meaning they'd want the payouts to be even bigger, leading to a catch-22.

    High availability good enough for the provider to put 5-figure sums on the line is actually really hard (there's a reason actual critical stuff like stock exchange order processing or card transactions doesn't run on the "cloud", nor on Kubernetes for that matter), so the next best thing is make-believe "high availability" where everyone (except the occasional poor soul like you who actually believed the marketing) understands the charade and plays along (because their own SLAs are often make-believe too).

    See also: the recent Cloudflare or AWS outages.
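
A rough sketch of the break-even arithmetic in the comment above, for anyone who wants to plug in their own numbers. Only the ~5k payout comes from the thread; the function names, the 200/month fee and the margin parameter are made up for illustration, and the model ignores everything except fees and payouts.

```python
import math


def breakeven_months(payout_per_incident, monthly_fee, margin=1.0):
    """Months of fees a provider needs to recoup ONE incident payout.

    margin is the (assumed) fraction of the fee that is actually profit;
    1.0 means the whole fee is available to cover payouts.
    """
    return math.ceil(payout_per_incident / (monthly_fee * margin))


def min_monthly_fee(payout_per_incident, incidents_per_year, margin=1.0):
    """Smallest monthly fee at which a year of fees covers a year of payouts."""
    return payout_per_incident * incidents_per_year / (12 * margin)


if __name__ == "__main__":
    payout = 5000          # ~5k per incident, the figure used in the thread
    fee = 200              # hypothetical monthly price, not from the thread

    # One payout wipes out this many months of revenue from that customer.
    print(breakeven_months(payout, fee))

    # Fee needed just to break even at one incident per year.
    print(min_monthly_fee(payout, incidents_per_year=1))

    # Same, at one incident per month: the fee has to equal the payout.
    print(min_monthly_fee(payout, incidents_per_year=12))
```

Lowering `margin` below 1.0 (providers don't keep the whole fee) pushes both numbers up, which is the "forever underwater on the LTV" point.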