Comment by rubiquity
3 years ago
The line of thinking you follow is what is plaguing this industry with too much complexity and simultaneously throwing away incredible CPU and PCIe performance gains in favor of using the network.
Any technical decision about how many instances to have and how they should be spread out needs to start as a business decision and end in crisp numbers about recovery point/time objectives, and yet somehow that nearly never happens.
To answer your points:
1) Not necessarily. You can stream data backups to remote storage and recover from that on a new single server as long as that recovery fits your Recovery Time Objective (RTO). (A rough sketch of what that can look like follows this list.)
2) What's the benefit of multiple AZs if the SLA of a single AZ is greater than your intended availability goals? (Have you checked your provider's single AZ SLA?)
3) You can absolutely do rolling deploys on a single server.
4) Using one large server doesn't mean you can't complement it with smaller servers on an as-needed basis. AWS even has a service for doing this.
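To make point 1 concrete, here's a minimal sketch of that streaming-backup loop, assuming a Postgres database and an S3 bucket; the bucket name, dump path, and interval are made-up placeholders, not a prescription:

```python
# Periodically ship a full logical backup to remote object storage.
# Restore on a fresh box with: pg_restore -d mydb <dump file>
# Assumes pg_dump is on PATH and AWS credentials are configured.
import subprocess
import time

import boto3

BUCKET = "example-backups"      # hypothetical bucket name
INTERVAL_SECONDS = 15 * 60      # every 15 minutes; pick this from your RPO

s3 = boto3.client("s3")

def backup_once() -> None:
    dump_path = "/tmp/db.dump"
    # Custom-format dump so pg_restore can do selective/parallel restores.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", dump_path, "mydb"],
        check=True,
    )
    key = f"postgres/{int(time.time())}.dump"
    s3.upload_file(dump_path, BUCKET, key)

if __name__ == "__main__":
    while True:
        backup_once()
        time.sleep(INTERVAL_SECONDS)
```

The dump interval is roughly your RPO; your RTO is however long it takes to stand up a fresh box and run pg_restore, and that's the number worth actually measuring.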
Which is to say: there aren't any prescriptions when it comes to such decisions. Some businesses warrant your choices; the vast majority do not.
> Any technical decision about how many instances to have and how they should be spread out needs to start as a business decision and end in crisp numbers about recovery point/time objectives, and yet somehow that nearly never happens.
Nobody wants to admit that their business or their department actually has an SLA of "as soon as you can, maybe tomorrow, as long as it usually works". So everything is pretend-engineered to be fifteen nines of reliability (when in reality it sometimes explodes because of the "attempts" to make it robust).
Being honest about the actual requirements can be extremely helpful.
> Nobody wants to admit that their business or their department actually has an SLA of "as soon as you can, maybe tomorrow, as long as it usually works". So everything is pretend-engineered to be fifteen nines of reliability (when in reality it sometimes explodes because of the "attempts" to make it robust).
I have yet to see my principal technical frustrations summarized so concisely. This is at the heart of everything.
If the business and the engineers could get over their ridiculous obsession with statistical outcomes and strict determinism, they would be able to arrive at a much more cost-effective, simple, and human-friendly solution.
The businesses that are actually sensitive to >1 minute of annual downtime are already running on top of IBM mainframes and have been for decades. No one's business is as important as the Federal Reserve or the Pentagon, but they don't want to admit it to themselves or others.
> The businesses that are actually sensitive to >1 minute of annual downtime are already running on top of IBM mainframes and have been for decades.
Are there any?
My bank certainly has way less than five nines of availability, and it's not a problem at all. Credit/debit card processors seem to stay around five nines, and nobody is losing sleep over it. As long as your unavailability doesn't all land on the Christmas promotion day, I've never seen anybody lose sleep over web-store downtime. The Fed probably doesn't have five nines of availability either; that's way overkill for a central bank, even one that processes online interbank transfers (which the Fed doesn't).
The organizations that need more than five nines are probably all in the military and science sectors. And those aren't using mainframes; they use good old redundancy of equipment with simple failure modes.
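For concreteness, the arithmetic behind those nines (nothing provider-specific, just minutes per year):

```python
# Allowed downtime per year for N nines of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for nines in range(3, 7):
    availability = 1 - 10 ** -nines
    downtime_min = MINUTES_PER_YEAR * (1 - availability)
    print(f"{nines} nines ({availability:.6%}): {downtime_min:.2f} minutes/year")

# 3 nines ~ 525.96 minutes/year (~8.8 hours)
# 4 nines ~  52.60 minutes/year
# 5 nines ~   5.26 minutes/year
# 6 nines ~   0.53 minutes/year (~32 seconds)
```

So being "sensitive to >1 minute of annual downtime" means needing something past six nines, which very few businesses can honestly claim.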
> simultaneously throwing away incredible CPU and PCIe performance gains
We really need to double down on this point. I worry that some developers believe they can defeat the laws of physics with clever protocols.
The amount of time it takes to round trip the network in the same datacenter is roughly 100,000 to 1,000,000 nanoseconds.
The amount of time it takes to round trip L1 cache is around half a nanosecond.
A trip down PCIe isn't much worse, relatively speaking. Maybe hundreds of nanoseconds.
Lots of assumptions and hand waving here, but L1 cache can be around 1,000,000x faster than going across the network. SIX orders of magnitude of performance are instantly sacrificed to the gods of basic physics the moment you decide to spread that SQLite instance across US-EAST-1. Sure, it might not wind up a million times slower on a relative basis, but you'll never get access to those zeroes again.
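Back-of-the-envelope with the hand-wavy numbers above (all assumptions, and real workloads amortize round trips, so treat it as the ceiling on what's being traded away):

```python
# Rough latency comparison in nanoseconds (order-of-magnitude figures only).
L1_CACHE_NS = 0.5          # L1 cache hit
PCIE_NS = 500              # "hundreds of nanoseconds" for a PCIe round trip
NETWORK_NS = 500_000       # same-DC network round trip, 100,000-1,000,000 ns

print(f"PCIe vs L1 cache:    ~{PCIE_NS / L1_CACHE_NS:,.0f}x slower")
print(f"Network vs L1 cache: ~{NETWORK_NS / L1_CACHE_NS:,.0f}x slower")
# PCIe vs L1 cache:    ~1,000x slower
# Network vs L1 cache: ~1,000,000x slower
```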
> 2) What's the benefit of multiple AZs if the SLA of a single AZ is greater than your intended availability goals? (Have you checked your provider's single AZ SLA?)
… my provider's single-AZ SLA is less than my company's intended availability goals.
(IMO our goals are nuts too, but it is what it is.)
Our provider, in the worst case (a VM using a managed hard disk), has an SLA of 95% within a month (I … think. Their SLA page uses incorrect units on the top-line items. The examples in the legalese — examples are normative, right? — use a unit of % / mo…).
You're also assuming a provider (a) typically meets their SLAs and (b) if they don't, honors them. IME, (a) is highly service dependent, with some services being just stellar at it, and (b) is usually "they will if you can prove to them with your own metrics they had an outage, and push for a credit". There's also (c) the service failing in a way that's impactful but not covered by the SLA. (E.g., I had a cloud provider once whose SLA was over "the APIs should return 2xx", and during the outage the APIs always returned "2xx, I'm processing your request". You then polled the API and got "2xx, your request is pending". Nothing was happening, because they were having an outage, but that outage could continue indefinitely without impacting the SLA! That was a fun support call… There's a sketch of the kind of end-to-end probe that catches this below.)
There's also (d): AZ isolation is a myth; I've seen multiple global outages. E.g., when something like the global authentication service falls over and takes basically every other service with it. (Because nothing can authenticate. What's even better is the provider then listing those services as "up" / not in an outage, because technically it's not that service that's down, it's just the authentication service. Because God forbid you'd have to give out that credit. But a provider calling a service "up" that is failing 100% of the requests sent its way is just rich, from the customer's view.)
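The only thing that has ever worked for me there is an end-to-end probe: submit real work, poll it to a terminal state, and count it as an outage if it doesn't finish within our own deadline. A rough sketch, with made-up endpoint paths and field names rather than any real provider API:

```python
# End-to-end probe: submit work and poll until it actually completes.
# A "2xx, request accepted" response alone never marks the check healthy.
# URLs and JSON fields below are placeholders, not a real provider API.
import time

import requests

BASE_URL = "https://api.example-cloud.test"   # hypothetical provider endpoint
DEADLINE_SECONDS = 120                        # what *we* consider an outage

def probe() -> bool:
    started = time.monotonic()
    resp = requests.post(f"{BASE_URL}/jobs", json={"kind": "noop"}, timeout=10)
    resp.raise_for_status()                   # the provider's SLA stops here...
    job_id = resp.json()["id"]

    while time.monotonic() - started < DEADLINE_SECONDS:
        status = requests.get(f"{BASE_URL}/jobs/{job_id}", timeout=10).json()
        if status["state"] == "done":         # ...ours stops here
            return True
        time.sleep(5)
    return False                              # accepted but never finished

if __name__ == "__main__":
    print("healthy" if probe() else "outage (by our definition)")
```

Log those results somewhere outside that provider; it's exactly the evidence you need when you push for a credit.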
I agree! Our "distributed cloud database" just went down last night for a couple of HOURS. Well, not entirely down. But there were connection issues for hours.
Guess what never, never had this issue? The hardware I keep in a datacenter lol!
> The line of thinking you follow is what is plaguing this industry with too much complexity and simultaneously throwing away incredible CPU and PCIe performance gains in favor of using the network.
It will die out naturally once people realize how much the times have changed and that the old solutions based on weaker hardware are no longer optimal.
Ok, so to your points.
"It depends" is the correct answer to the question, but the least informative.
One Big Server or multiple small servers? It depends.
It always depends. There are many workloads where one big server is the perfect size. There are many workloads where many small servers are the perfect solution.
My point is that the ideas put forward in the article are flawed for the vast majority of use cases.
I'm saying that multiple small servers are a better solution on a number of different axes.
For 1) "One Server (Plus a Backup) is Usually Plenty" Now I need some kind of remote storage streaming system and some kind of manual recovery, am I going to fail over to the backup (and so it needs to be as big as my "One server" or will I need to manually recover from my backup?
2) Yes it depends on your availability goals, but you get this as a side effect of having more than one small instance
3) Maybe I was ambiguous here. I don't just mean rolling deploys of code. I also mean changing the server code, restarting, upgrading and changing out the server. What happens when you migrate to a new server (when you scale up by purchasing a different box). Now we have a manual process that doesn't get executed very often and is bound to cause downtime.
4) Now we have "Use one Big Server - and a bunch of small ones"
I'm going to add a final point on reliability. By far the biggest risk factor for reliability is me, the engineer. I'm responsible for bringing down my own infra way more than any software bug or hardware issue. The probability of me messing everything up when there is one server that everything depends on is much, much higher, speaking from experience.
So, like I said, I could have said "It depends", but instead I tried to give a response that was in some way illuminating and helpful, especially given the strong opinions expressed in the article.
I'll give a little color with the current setup for a site I run.
moustachecoffeeclub.com runs on ECS
I have 2 on-demand instances and 3 spot instances:
One tiny instance running my caches (redis, memcache)
One "permanent" small instance running my web server
Two small spot instances running the web server
One small spot instance running background jobs
("small" being about 3 GB and 1024 CPU units)
And an RDS instance with backup, about $67 / month
All in I'm well under $200 per month including database.
So you can do multiple small servers inexpensively.
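If anyone's curious what "about 3 GB and 1024 CPU units" looks like in practice, here's a rough boto3 sketch of registering one of those task definitions; the family, container name, and image are placeholders, not my actual config:

```python
# Registering a small ECS task definition: 1024 CPU units (one vCPU), ~3 GB RAM.
# Family name, container name, and image are placeholders for illustration.
import boto3

ecs = boto3.client("ecs")

ecs.register_task_definition(
    family="web-server",
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/web:latest",
            "cpu": 1024,        # 1024 CPU units = 1 vCPU
            "memory": 3072,     # hard limit in MiB, roughly 3 GB
            "essential": True,
            "portMappings": [{"containerPort": 3000, "protocol": "tcp"}],
        }
    ],
)
```

(The on-demand vs. spot split is handled at the cluster/service level, not in the task definition.)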
Another aspect is that I appreciate being able to go on vacation for a couple of weeks, go camping, or take a plane flight without worrying whether my one server is going to fall over while I'm away and my site is going to be down for a week. In a big company maybe there is someone paid to monitor this, but with a small company I could come back to a smoking hulk of a company, and that wouldn't be fun.
> All in I'm well under $200 per month including database.
You forgot all the crucial numbers, like QPS. My blog runs on 0 to 1 Cloud Run instances and costs < $3 per month, including the database.
You should be using One Big Server, mate :)