Comment by davidmurdoch
8 hours ago
Speaking of 5 9s, how would you achieve 5 9s for a basic CRUD app that doesn't need to scale but still needs to be globally accessible? No auth, microservices, email, or 3rd-party services. Just a classic backend connected to a db (any db tech, hosted wherever) that serves up some HTML.
It depends on the infrastructure you're running on. There was a post yesterday that goes into fairly deep detail on how you do such calculations: https://authress.io/knowledge-base/articles/2025/11/01/how-w...
You probably cannot achieve this with a single node, so you'll at least need to replicate it a few times to combat the 2-3 9s you normally get from a single node. But then you've got load balancers and DNS, which can also become single points of failure, as seen with Cloudflare.
It varies with the database type and choice. If you've got a single node of Postgres, you can likely never achieve more than 2-3 9s (AWS guarantees 3 9s for multi-AZ RDS). But with multi-master CockroachDB, or with Spanner, you can maybe achieve 5 9s on the database layer alone. To get 5 9s overall, though, you'll need quite a bit of redundancy in every layer going to and from your app and data, with the database and DNS being the most difficult.
Roughly: a reliable DNS provider with 5 9s of uptime guarantees -> multi-master load balancers, each with 3 9s -> each load balancer serving 3 or more app instances, each with 3 9s of availability -> a database (or databases) with 5 9s.
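To see how those layers compose, here's a back-of-the-envelope sketch (the per-layer numbers are the ones assumed above, not guarantees from any particular vendor): replicas within a layer fail independently, and the layers themselves fail in series.

```python
# Rough availability math for a layered stack: redundant replicas within a
# layer are "parallel" (one survivor is enough), layers are "serial".
def parallel(per_node: float, n: int) -> float:
    """Availability of n independent replicas where any one surviving is enough."""
    return 1 - (1 - per_node) ** n

def serial(*layers: float) -> float:
    """Availability of a chain where every layer must be up."""
    total = 1.0
    for a in layers:
        total *= a
    return total

dns  = 0.99999             # 5 9s DNS provider
lbs  = parallel(0.999, 2)  # two load balancers, 3 9s each
apps = parallel(0.999, 3)  # three app instances, 3 9s each
db   = 0.99999             # 5 9s multi-region database

total = serial(dns, lbs, apps, db)
minutes_down = (1 - total) * 365.25 * 24 * 60
print(f"{total:.6f} availability ~= {minutes_down:.0f} minutes of downtime/year")
# Prints roughly 0.999979 ~= 11 minutes/year: with these assumed numbers the
# two 5-9s layers dominate the error budget, and the stack as a whole still
# falls short of 5 9s (~5.3 minutes/year), which is why every layer needs
# redundancy or better per-node numbers.
```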
This page from Google shows their uptime guarantees for Bigtable: 3 9s for a single region with one cluster, 4 9s for multi-cluster, and 5 9s for multi-region:
https://docs.cloud.google.com/architecture/infra-reliability...
In general it doesn't really matter what you're running; it's all about redundancy, whether that's instances, cloud vendors, regions, zones, etc.
Part of the uptime solution is keeping as much of your app and infrastructure within your control as possible, rather than being at the behest of mega-providers, as we've witnessed in the past month with Cloudflare and AWS.
Probably:
- a couple of tower servers, running Linux or FreeBSD, backed up by a UPS and an auto-run generator with 24 hours' worth of diesel (depending on where you are and the local area's propensity for natural disasters, maybe 72 hours);
- Caddy for a reverse proxy, Apache for the web server, PostgreSQL for the database (a sample Caddy config is sketched after this list);
- behind a router with sensible security settings that can also load-balance between the two servers (for availability rather than scaling);
- on static WAN IPs;
- with dual redundant WAN connections (different ISPs/network providers);
- a regular and strictly followed patch and hardware maintenance cycle;
- located in an area resistant to wildfire, civil unrest, and riverine or coastal flooding.
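For the Caddy piece, a minimal sketch of what the reverse-proxy/load-balancing config could look like if you let Caddy (rather than the router) spread traffic across the two Apache backends; the hostnames, port, and health-check path are placeholders:

```
yoursite.example {
	reverse_proxy server1.lan:8080 server2.lan:8080 {
		lb_policy first      # prefer the first backend, fail over to the second
		health_uri /healthz  # active health checks mark a dead backend as down
		fail_duration 30s    # passive health checks back off a failing backend
	}
}
```

With `lb_policy first`, the second server only takes traffic once the first stops answering its health checks, which matches the availability-not-scaling goal.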
I'd say that'd get you close to five 9s (no more than ~5 minutes of downtime per year), though I'd pretty much guarantee five 9s (maybe even six 9s: no more than ~32 seconds of downtime per year) if the two machines were physically separated from each other by a few hundred kilometres, each with its own supporting infrastructure as above, sans the load balancing (see below), through two separate network routes.
Load balancing would become human-driven in this 'physically separate' example (cheaper, less complex): if your-site-1.com fails, simply re-point your browser to your-site-2.com, which routes to the other redundant server on a different network.
The hard part now will be picking network providers that don't share the same pipes/cables or upstream dependencies, e.g. both relying on Cloudflare or AWS...
Keep the WAN IPs written down in case DNS fails.
PostgreSQL can do master-master replication, but I understand it's a pain to set up.
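For a sense of what that looks like: PostgreSQL 16+ can approximate a two-node master-master setup with built-in bidirectional logical replication, roughly along these lines (hostnames and names are placeholders, and conflict handling plus sequence management are still on you, which is where the pain lives):

```
-- On node A (mirror the same statements on node B with the hostnames swapped):
CREATE PUBLICATION app_pub FOR ALL TABLES;

-- Subscribe to the other node; origin = none (PostgreSQL 16+) stops changes
-- from being echoed back and forth between the two nodes.
CREATE SUBSCRIPTION app_sub
    CONNECTION 'host=node-b.lan dbname=app user=replicator'
    PUBLICATION app_pub
    WITH (origin = none, copy_data = false);
```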
What if you could create a super virtual server of sorts? Imagine a new cloud provider like Vercel, but called something else. When you create a server on their service, they create three servers: one on AWS, one on GCP, and one on Azure. Behind the scenes they are three separate servers, but to the end user they appear as a single server. The end user gets to control how many cloud providers are involved. When AWS goes down, no worries: it switches over to the GCP one.
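The switching logic doesn't even have to live with the provider; here's a toy client-side sketch of the same idea (the mirror URLs and /health path are made up):

```python
# Hypothetical client-side failover across three clouds: try each mirror's
# health endpoint in order and use the first one that answers.
import urllib.request

MIRRORS = [
    "https://app-aws.example.com",
    "https://app-gcp.example.com",
    "https://app-azure.example.com",
]

def first_healthy(mirrors=MIRRORS, timeout=2):
    """Return the base URL of the first mirror whose health check responds."""
    for base in mirrors:
        try:
            with urllib.request.urlopen(f"{base}/health", timeout=timeout) as resp:
                if resp.status == 200:
                    return base
        except OSError:
            continue  # that cloud is down or unreachable; try the next one
    raise RuntimeError("all mirrors are down")

print(first_healthy())
```

In practice a provider like that would presumably hide this behind DNS failover or anycast, so the end user never sees three hostnames.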
A stock VPS somewhere like OVH or Hetzner, with a replica at a different provider?
Doesn't Hetzner carry the risk of getting kicked off on a whim? The only time I hear about them is when someone gets kicked out.