Use One Big Server (2022)

1 month ago (specbranch.com)

One of the more detrimental aspects of the Cloud Tax is that it constrains the types of solutions engineers even consider.

Picking an arbitrary price point of $200/mo, you can get 4(!) vCPUs and 16GB of RAM at AWS. Architectures are different etc., but this is roughly a mid-spec dev laptop of 5 or so years ago.

At Hetzner, you can rent a machine with 48 cores and 128GB of RAM for the same money. It's hard to overstate how far apart these machines are in raw computational capacity.

There are approaches to problems that make sense with 10x the capacity that don't make sense on the much smaller node. Critically, those approaches can sometimes save engineering time that would otherwise go into building a more complex system to manage around artificial constraints.

Yes, there are other factors like durability etc. that need to be designed for. But going the other way, dedicated boxes can deliver more consistent performance without worries of noisy neighbors.
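
To put numbers on the gap, here is a rough sketch of the arithmetic using only the figures quoted above ($200/mo, 4 vCPUs and 16GB at AWS, 48 cores and 128GB at Hetzner). The `unit_costs` helper and the dictionary layout are made up for illustration, and a vCPU is not the same thing as a physical core, so treat the ratios as ballpark only.

```python
# Figures as quoted in the comment above; both are approximate list prices.
MONTHLY_BUDGET_USD = 200

offers = {
    "aws": {"cores": 4, "ram_gb": 16},        # 4 vCPUs, 16GB RAM
    "hetzner": {"cores": 48, "ram_gb": 128},  # 48 cores, 128GB RAM
}


def unit_costs(cores: int, ram_gb: int, budget: float = MONTHLY_BUDGET_USD):
    """Return (USD per core, USD per GB of RAM) per month for one offer."""
    return budget / cores, budget / ram_gb


for name, spec in offers.items():
    per_core, per_gb = unit_costs(**spec)
    print(f"{name}: ${per_core:.2f}/core/mo, ${per_gb:.2f}/GB/mo")

# How many times more hardware the same money buys.
core_ratio = offers["hetzner"]["cores"] / offers["aws"]["cores"]
ram_ratio = offers["hetzner"]["ram_gb"] / offers["aws"]["ram_gb"]
print(f"{core_ratio:.0f}x the cores, {ram_ratio:.0f}x the RAM")
```

On these figures the same budget buys roughly an order of magnitude more cores and RAM, which is the "10x the capacity" the comment refers to.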

  • It's more than that - it's all the latency that you can remove from the equation with your bare-metal server.

    No network latency between nodes, less memory-bandwidth latency and contention than in VMs, and no separate caching layer to add latency: you can just tell e.g. Postgres to use gigs of RAM and let Linux's disk cache take care of the rest.

    • The difference between a fairly expensive ($300) RDS instance + EC2 in the same region vs a $90 dedicated server with an NVMe drive and Postgres in a container is absolutely insane.

  • In 2025 if you need convenience and no red tape you've got fly.io in the general case and maybe Vercel or something on a particular framework (there are some good ones for a particular stack).

    If your needs go beyond that? Then you need real computers with real configuration and you have OVH/Hetzner/Latitude who will rent you MONSTER machines for the cost of some cheap-ass surplus 2017 Intel on The Cloud.

    And if you just want a blog or whatever? Zillion VPS options.

    The traditional cloud is for regulatory/process/corruption capture extraction in 2025: its machine economics and developer productivity use cases are fucking zero, as far as I've seen. Maybe there's some edge case where a completely unencumbered team is better off with DMV trip permissions theatre, remnant Intel racked with noisy neighbors at massive markup, and no support recourse.

    • (1) How does fly.io reliability compare to AWS, GCP, or maybe Linode or DO?

      (2) What do you do if your large Hetzner server starts to show signs of malfunction? How soon would you be able to replace it, and how easily?

      (2a) What do you do when your large Hetzner server just dies? I see that this happens rarely, but what's your contingency plan, if any?

      (3) What do you do when your load is highly spiky? Do you reserve bare metal capacity for the biggest peak you expect to serve, because it's so much cheaper than running an elastic serverless architecture of the same capacity anyway?

      (4) Considering that your stack still includes many components, how do you manage them, and how expensive is the management overhead? Do you need an extra SRE?

      These are not rhetorical questions; I'd love to hear from real practitioners! (E.g. Stack Overflow used to do deep dives into their few-big-servers architecture.)

  • I don't get why people are so hell-bent on going to AWS, for the most minor applications, without looking at simpler options!

    I am not even within thousands of km of the level of what you are doing, but my client was paying $100/m for an AWS server, SQS and an S3 bucket, for a small PHP-based web application that uses the Amazon Seller API and the Keepa API for the products he ships. It used MySQL for data storage.

    I implemented the whole thing in Python, Django, and PostgreSQL (initially SQLite) and put it on a $25/m unmanaged VPS.

    I have not got any complaints about performance, and it's running continuously updating product prices, details, processing PDF invoices using OCR, finding missing products in shipments, while also serving the website, and a 4 core server with 6GB RAM is handling it just fine.

    The load is not going to be so high to require AWS and friends, for now. It's a small internal app, probably won't even get over 100 users, and if it ever does, it's extremely simple to migrate, because the app is so compact, even though not exactly monolithic.

    And still, it probably won't need a $100 AWS server, unless we are scaling up much larger.

    • AWS is useful for big business. Automatic multi-region failover and hosted databases may be expensive, but they're a massive pain to configure manually and an easy footgun if you're not used to doing that sort of thing. Plus, with Amazon you already have public toolkits to use those features with all of your services, so you don't need to figure out how to integrate them or which open source system to use to accomplish all of that. Plus, if you go for your own physical server, you need to arrange parts and maintenance windows for any hardware that will eventually fail.

      If all you need is "good enough" reliability and basic compute power (which I think is good enough for many businesses, considering AWS isn't exactly outage free either), you're probably better off getting a server or renting one from a cheap cloud host. If you're promising five nines of uptime for some reason, you may want to reconsider.

    • Without understanding the architecture and use case better, at first read, my gut says that isn’t an AWS problem - it sounds like a solutions architecture problem.

      There are cheaper ways of building that use case on AWS.

      Most AWS sticker shock I’ve seen results from someone who doesn’t really understand cloud trying to build on the cloud. Cost has to be designed in from the start (in addition to security, operational overhead, etc).

      In general, I’ve found two types of engineering teams who don’t use the cloud: the mugs and the superstars. And since superstars are few and far between, that means…

  • 100% this. Add an embedded database like SQLite, optimise writes into batches, and you can go really, really far with Hetzner. It's also why I find the "what about overprovisioning" argument silly (once you look outside of AWS you can get an insane cost/perf ratio).

    Also in my experience more complex systems tend to have much less reliability/resilience than simple single node systems. Things rarely fail in isolation.
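
As a minimal sketch of what batching writes can look like with an embedded database, the snippet below uses Python's standard `sqlite3` module and commits once per chunk instead of once per row. The `prices` table, the `insert_batched` helper and the batch size are assumptions made for illustration, not anything described in the comment above.

```python
import sqlite3


def insert_batched(conn, rows, batch_size=1000):
    """Insert rows in chunks, committing once per chunk rather than per row."""
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # "with conn" wraps the chunk in one transaction and commits on success.
        with conn:
            conn.executemany(
                "INSERT INTO prices (sku, cents) VALUES (?, ?)", chunk
            )


conn = sqlite3.connect(":memory:")  # in-memory database, just for the demo
conn.execute("CREATE TABLE prices (sku TEXT, cents INTEGER)")

rows = [(f"sku-{i}", i) for i in range(5000)]  # hypothetical data
insert_batched(conn, rows)

count = conn.execute("SELECT COUNT(*) FROM prices").fetchone()[0]
print(count)
conn.close()
```

On a file-backed database each commit normally has to be synced to disk, so fewer, larger transactions are usually where the speedup comes from; the right batch size depends on the workload.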

  • I think it’s the other way around. I’m a huge fan of Hetzner for small sites with a few users. However, for bigger projects, the cloud seems to offer a complete lack of constraints. For projects that can pay for my time, $200/m or $2000/m in hosting costs is a negligible difference. What’s the development cost difference between AWS CDK / Terraform + GitHub Actions vs. Docker / K8s / Ansible + any CI pipeline? I don’t know; in my experience, I don’t see how “bare metal” saves much engineering time. I also don’t see anything complicated about using an IaC Fargate + RDS template.

    Now, if you actually need to decouple your file storage and make it durable and scalable, or need to dynamically create subdomains, or any number of other things… The effort of learning and integrating different dedicated services at the infrastructure level to run all this seems much more constraining.

    I’ve been doing this since before the “Cloud,” and in my view, if you have a project that makes money, cloud costs are a worthwhile investment that will be the last thing that constrains your project. If cloud costs feel too constraining for your project, then perhaps it’s more of a hobby than a business—at least in my experience.

    Just thinking about maintaining multiple cluster filesystems and disk arrays—it’s just not what I would want to be doing with most companies’ resources or my time. Maybe it’s like the difference between folks who prefer Arch and setting up Emacs just right, versus those happy with a MacBook. If I felt like changing my kernel scheduler was a constraint, I might recommend Arch; but otherwise, I recommend a MacBook. :)

    On the flip side, I’ve also tried to turn a startup idea into a profitable project with no budget, where raw throughput was integral to the idea. In that situation, a dedicated server was absolutely the right choice, saving us thousands of dollars. But the idea did not pan out. If we had gotten more traction, I suspect we would have just vertically scaled for a while. But it’s unusual.

    • > I really don't see how "bare metal" saves any engineering time

      This is because you are looking only at provisioning/deployment. And you are right -- node size does not impact DevOps all that much.

      I am looking at the solution space available to the engineers who write the software that ultimately gets deployed on the nodes. And that solution space is different when the nodes have 10x the capability. Yes, cloud providers have tons of aggregate capability. But designing software to run on a fleet of small machines is very different from accomplishing the same tasks on a single large machine.

      It would not be controversial to suggest that targeting code at an Apple Watch or Raspberry Pi imposes constraints on developers that do not exist when targeting desktops. I am saying the same dynamic now applies to targeting cloud providers.

      This isn't to say there's a single best solution for everything. But there are tradeoffs that are not always apparent. The art is knowing when it makes sense to pay the Cloud Tax, and whether to go 100% Cloud vs some proportion of dedicated.

    • > I’m a huge fan of Hetzner ... I don’t see how “bare metal” saves much engineering time.

      I think you confuse Hetzner with bare metal. Hetzner has Hetzner Cloud, which is like AWS EC2 but much cheaper. (They also have bare metal servers, which are even cheaper.) With Hetzner Cloud, you can use Terraform, GitHub Actions and whatever else you mentioned.

  • > types of solutions engineers even consider

    I think the issue is actually the opposite.

    With the cloud, the engineers fail to see the actual cost of their inefficient scaled-out code, because someone else (the CFO) pays the bill; and the answer to any issue, is simply adding more "workers" and more "cloud", since they're basically "free" from the perspective of the employee. (And the more "cloud" something is, like, the serverless, the more "free", completely inverting the economics of making a profit on the service — when the CFO tells you that your AWS bill is too high, you move everything from the EC2 to AWS Lambda, since the salesperson from AWS tells you that serverless is far cheaper, only for the bill to get even higher, for reasons unknown, of course.)

    Those whom the cloud tax actually constrains are the entrepreneurs and solo-preneurs. If you have to pay $5000/mo to AWS just for the infra, you can only go so long without lots of revenue, and you'd need a whopping $5k/mo+ of revenue before breaking even. Yet with a $200/mo server like those at OVH or Hetzner, you can afford to let it grow at negligible cost to yourself, and it can basically start being profitable with the first few users.

    Don't believe this? Look at the blog entries by the guy who bought Yahoo!'s Delicious, written before they went bankrupt and were up for sale. He was basically pointing out that the services have roughly the same number of users, and require the same engineering resources, yet one is being operated at a loss, whereas the other one makes a profit (guess which one, and guess why).

    * https://en.wikipedia.org/wiki/Delicious_(website)

    * https://en.wikipedia.org/wiki/Pinboard_(website)

    * https://news.ycombinator.com/from?site=blog.pinboard.in

    So, literally, the difference between the cloud and renting One Big Server, is making a loss and going out of business, and remaining in business and purchasing your underwater competitor for pennies on the dollar.
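
The break-even point made above can be sketched in a few lines. The $5000/mo and $200/mo infrastructure figures come from the comment; the $10/mo price per user and the `users_to_break_even` helper are purely hypothetical, chosen only to make the arithmetic concrete.

```python
import math


def users_to_break_even(infra_usd_per_month: float, revenue_per_user: float) -> int:
    """Smallest number of paying users whose revenue covers the infra bill."""
    return math.ceil(infra_usd_per_month / revenue_per_user)


PRICE_PER_USER = 10.0  # hypothetical subscription price, USD per month

cloud_users = users_to_break_even(5000, PRICE_PER_USER)
dedicated_users = users_to_break_even(200, PRICE_PER_USER)

print(f"cloud: {cloud_users} users, dedicated: {dedicated_users} users")
```

Whatever price is assumed, the number of paying users needed scales with the bill, so the 25x difference in infrastructure cost becomes a 25x difference in how many customers it takes just to cover hosting.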

  • I agree that AWS EC2 is probably too expensive on the whole. It also doesn't really provide any of the greater benefits of the cloud that come from "someone else's server".

    However, to the point of microservices as the article mentions, you probably should look at Lambda (or Fargate, or a mix) unless you can really saturate the capacity of multiple servers.

    When we swapped from ECS+EC2 running microservices over to Lambda, our costs dropped sharply. Even serving millions of requests a day, we spend a lot of the time in between idle, especially spread across the services.

    Additionally, we have 0 outages now from hardware in the last 5 years. As an engineer, this has made my QoL significantly better.

    • > I agree that AWS EC2 is probably too expensive on the whole.

      Probably? It's about 5-10X more expensive than equivalent services from Hetzner.

  • It really depends on what you are doing. But when you factor in the network features, the ability to scale the solution, etc., you get a lot of stuff inside that $200/mo EC2 instance. The product is more than the VM.

    That said, with a defined workload without a ton of variation or segmentation needs there are lots of ways to deliver a cheaper solution.

  • I don’t disagree but “cores” is not a good measure of computational power.

    • True, but the cores on a dedicated Hetzner box obliterate the cores on an EC2 machine every time I’ve tested them. So, if anything, it understates the massive performance gap.

  • > At Hetzner, you can rent a machine with 48 cores and 128GB of RAM for the same money.

    The problem that Hetzner and a lot of hardware providing hosts have, is the lack of affordable flexibility.

    Hetzner's design is based on a base range of standardized products. These can only be upgraded within a pre-approved range of upgrade options (limited to storage/memory).

    Upgrades are often a mixed bag of carefully designed "upgrade paths". As you can expect, upgrades are not cheap. Doubling the storage on a base server often increases the price of your server by 50 to 75%. The typical customizing will cost you dearly.

    This is where AWS wins a lot more. Yes, they are expensive as hell, but you often are not stuck with a base config and a limited upgrade path. The ability to scale beyond what Hetzner can offer is there, and you're not forced to overbuy from the start. Transferring between servers is a few buttons and done. With Hetzner, if you did not overspec from the start, you're going to do those fun server migrations.

    The ironic part is that buying your own hardware and running it yourself often pays for itself within an 8-12 month period (not counting electricity / internet). And you maintain a lot more flexibility.

    * You want to use bifurcation, go for it.

    * You want to use consumer 4TB NVMe drives for second layer read storage (which Hetzner refuses to offer, as they limit those to 2TB and only on a few servers), go for it.

    * You want a 10Gbit interlink between your servers, go for it. No need to pay a monthly fee! No need to reserve "future space".

    * Oh, you want 25Gbit, go for it (Hetzner = not possible).

    * You want 50Gbit ...

    * You want to chuck in a few LLM capable GPUs without breaking the bank...

    It's ironic that we are in 2025 and Hetzner is still limited to a 1Gbit connection on its hardware, when just about any consumer level hardware has had 2.5Gbit by default for years.

    Your own hardware gives you the flexibility of AWS and cost savings beyond Hetzner. Maybe it's just my environment, but I see more and more small to medium companies going back to their own locally run servers. Not even colocation.

    The increase in consumer level fiber, which used to be expensive or not available, has opened the doors for businesses. Most companies do not need insane backbones.

    The fact that you can get 10Gbit business fiber for a 100 Euro price in some EU countries (of course never the north) is insane. I've even seen some folks combining fiber with Starlink & 5G as backup in case their fiber fails/is out.

    As long as you fit within a specific use case that is being offered by Hetzner, they are cheap. But it's the moment you step outside that comfort zone, ... This is one of Hetzner's weaknesses and where AWS or self-hosting comes back.

    • Almost reminds me of Rackspace back in...2011

      We had a leased server from them, running VMware, and we had Linux virtual machines for our application.

      We ran out of RAM. We only had 16 or 32GB at the time. Hey, can we double this? Sure, but our payment would nearly double. How does that make any sense?

      If this were a co-located box we owned, I could buy a pair of $125 chips from Crucial (or $250 Dell chips from CDW) and there we go. But we're expected to pay this much more per month?

      Their answer was "you can do more with the server so that's what you're paying for"

      Storage was a similar situation, we were still on RAID with spinning drives and we wanted to go SSD, not even NVME. Wasn't going to happen. And if we went to a new server we'd have to get all new IP's and stuff. Ugh.

      And 10Gb...that was a pipe dream. Costs were insane.

      We ended up having to decide between two things:

      1. Move to a co-lo and buy a couple servers, ala StackExchange. This is what I wanted to do.

      2. Tweak the current application stack, and re-write the next version to run on AWS.

      What did we end up doing? Some half-assed solution using the existing server for DB and NGINX proxy, while running the sites on (very slow) Slicehost instances (which Rackspace had recently acquired and roughly integrated into their network). So we still had downtime issues, slow databases, etc.

    • > Doubling the storage on a base server, often increases the price of your server by 50 to 75%

      For storage, Hetzner does offer Volumes, which you can attach to your VM and you can choose exactly how large you want them to be and are charged separately. But your argument about doubling resources and doubling prices still holds for RAM.

  • On AWS if you want raw computational capacity you use Lambda and not EC2. EC2 is for legacy type workloads and doesn't have nearly the same scaling power and speed that Lambda does.

    I have several workloads that just invoke Lambda in parallel. Now I effectively have a 1000 core machine and can blast through large workloads without even thinking about it. I have no VM to maintain or OS image to consider or worry about.

    Which highlights the other difference that you failed to mention. Hetzner charges a "one time setup" fee to create that VM. That puts a lot of back pressure on infrastructure decisions and removes any scalability you could otherwise enjoy in the cloud.

    If you want to just rent a server then Hetzner is great. If you actually want to run "in the cloud" then Hetzner is a non-starter.

    • Strong disagree here. Lambda is significantly more expensive per vCPU hour and introduces tight restrictions on your workflow and architecture, one of the most significant being maximum runtime duration.

      Lambda is a decent choice when you need fast, spiky scaling for a lot of simple self-contained tasks. It is a bad choice for heavy tasks like transcoding long videos, training a model, data analysis, and other compute-heavy tasks.

    • That's fine, except for all of Lambda's weird limitations: request and response sizes, deployment .zip sizes, max execution time, etc. For anything complicated you'll eventually run into all this stuff. Plus you'll be locked into AWS.

    • > [Hetzner] charges a "one time setup" fee to create that VM. That puts a lot of back pressure on infrastructure decisions and removes any scalability you could otherwise enjoy in the cloud.

      Hetzner Cloud, then! In the US, $0.53/hr / $333.59/mo for 48 vCPU/192GB RAM/960GB NVMe. Includes 8 TB/mo traffic, when 8 TB egress would cost $720 on EC2; more traffic is $1.20/TB when the first tier of AWS egress is $90/TB. No setup fee. Not that it's EC2 but there's clearly flexibility there.

      More generally, if you want AWS, you want AWS; if you want servers you have options.
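
The egress comparison above is easy to check. The sketch below uses only the rates quoted in that comment ($90/TB for the first AWS tier, 8 TB included and $1.20/TB beyond that at Hetzner Cloud). The function names are made up, and it deliberately ignores AWS volume tiers and any free allowance, so it is a ballpark and not a bill.

```python
def aws_egress_cost(tb: float, usd_per_tb: float = 90.0) -> float:
    """Egress cost at a flat first-tier rate (volume discounts ignored)."""
    return tb * usd_per_tb


def hetzner_egress_cost(
    tb: float, included_tb: float = 8.0, usd_per_extra_tb: float = 1.20
) -> float:
    """Only traffic beyond the included allowance is charged."""
    return max(0.0, tb - included_tb) * usd_per_extra_tb


for tb in (8, 20):
    print(
        f"{tb} TB/mo: AWS ${aws_egress_cost(tb):.2f}, "
        f"Hetzner ${hetzner_egress_cost(tb):.2f}"
    )
```

At 8 TB this reproduces the $720 figure quoted above, against nothing extra on the Hetzner side because that traffic is already included in the monthly price.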

HN uses two—one live and one backup, so we can fail over if there's a hardware issue or we need to upgrade something.

It's a nice pattern. Just don't make them clones of each other, or they might go BLAM at the same time!

https://news.ycombinator.com/item?id=32028511 (<-- this is where it got figured out)

---

Edit: both these points are mentioned in the OP.

  • Any stats on HN downtime over the years? I remember one or two outages in the last decade or so, but I would guess the uptime is about 99.99%.

    • We don't specifically track that, no. The worst one was when we went down for (IIRC) a couple days because of a disk failure, I think in Jan 2014. It was after that that we added a failover box.

      HN goes down when we restart the server process, usually as part of updating the code - but only for a few seconds. The message "Restarting the server. Shouldn't take long." displays when that is happening.

      There are also, to my exasperation, still moments of brownout during certain traffic spikes or moments of obscure resource contention. But these are at least rarer than they used to be.
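
For a sense of scale on the 99.99% guess, here is a small sketch converting between an availability percentage and hours of downtime. The two-day outage and the ten-year window are illustrative inputs loosely based on the exchange above ("a couple days", "the last decade or so"), not measured HN statistics.

```python
HOURS_PER_YEAR = 365.25 * 24


def allowed_downtime_hours(availability_pct: float, years: float = 1.0) -> float:
    """Hours of downtime permitted by an availability target over a period."""
    return (1 - availability_pct / 100) * HOURS_PER_YEAR * years


def availability_pct(downtime_hours: float, years: float) -> float:
    """Availability implied by a total amount of downtime over a period."""
    return 100 * (1 - downtime_hours / (HOURS_PER_YEAR * years))


budget = allowed_downtime_hours(99.99, years=10)
implied = availability_pct(downtime_hours=48, years=10)

print(f"99.99% over 10 years allows about {budget:.1f} hours of downtime")
print(f"one 48-hour outage in 10 years implies about {implied:.3f}% availability")
```

Four nines over a decade leaves room for under nine hours of downtime in total, so a single multi-day outage inside the window would put the figure closer to three nines.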

I’ve found that it’s hard to even hire engineers who aren’t all in on cloud and who even know how to build without it.

Even the ones who do know have been conditioned to tremble with fear at the thought of administrating things like a database or storage. These are people who can code cryptography kernels and network protocols and kernel modules, but the thought of running a K8S cluster or Postgres fills them with terror.

“But what if we have downtime!” That would be a good argument if the cloud didn’t have downtime, but it does. Most of our downtime in previous years has been the cloud, not us.

“What if we have to scale!” If we are big enough to outgrow a 256 core database with terabytes of SSD, we can afford to hire a full time DBA or two and have them babysit a cluster. It’ll still be cheaper.

“What if we lose data?” Ever heard of backups? Streaming backups? Hot spares? Multiple concurrent backup systems? None of this is complex.

“But admin is hard!” So is administrating cloud. I’ve seen the horror of Terraform and Helm and all that shit. Cloud doesn’t make admin easy, just different. It promised simplicity and did not deliver.

… and so on.

So we pay about 1000X what we should pay for hosting.

Every time I look at the numbers I curse myself for letting the camel get its nose under the tent.

If I had it to do over again I’d forbid use of big cloud from day one, no exceptions, no argument, use it and you’re fired. Put it in the articles of incorporation and bylaws.

  • I have also found this happening. It's actually really funny, because I think even I'm less inclined to run Postgres myself these days, when I used to run literally hundreds of instances with not much more than pg_dump, cron and two read-only replicas.

    These days probably the best way of getting these 'cloudy' engineers on board is just to tell them it's Kubernetes and run all of your servers as K3s.

    • I’m convinced that cloud companies have been intentionally shaping dev culture. Microservices in particular seem like a pattern designed to push managed cloud lock in. It’s not that you have to have cloud to use them, but it creates a lot of opportunities to reach for managed services like event queues to replace what used to be a simple function call or queue.

      Dev culture is totally fad driven and devs are sheep, so this works.

I helped bootstrap a company that made an enterprise automation engine. The team wanted to make the service available as SaaS for boosting sales.

They could have got the job done by hosting the service on a VPS with a multi-tenant database schema. Instead, they went about learning Kubernetes and drilling deep into the "cloud-native" stack. They spent a year trying to set up the perfect devops pipeline.

Not surprisingly the company went out of business within the next few years.

  • > Not surprisingly the company went out of business within the next few years.

    But the engineers could find new jobs thanks to their acquired k8s experience.

  • This is my experience too—there’s too much time wasted trying to solve a problem that might exist 5 years down the road. So many projects and early-stage companies would be just fine either with a PaaS or nginx in front of a docker container. You’ll know when you hit your pain point.

  • Yep, this is why I'm a proponent of PaaS until the bill actually hurts. Just pay the Heroku/Render/Fly tax and focus on product market fit. Or, play with servers and K8s, burning your investors' money, then move on to the next gig and repeat...

    • The moment I sign up for a PaaS the bill hurts. I can never get over the fact I can get 1000x more compute for the same price, never mind that I never use it and have to set everything up myself. I’ll just never pay to lock myself in to something so restricted. My dedicated server allows me to do anything I want or need.

    • > Or, play with servers and K8s, burning your investors money, then move on to the next gig and repeat...

      I mean, of the two, the PaaS route certainly burns more money, the exception being the rare shop that is so incompetent they can't even get their own infrastructure configured correctly, like in GP's situation.

      There are guaranteed more shops that would be better off self-hosting and saving on their current massive cloud bills than the rare one-offs that actually save so much time using cloud services, it takes them from bankruptcy to being functional.

    • Yeah, same. Vercel + Neon and then if you actually have customers and actually end up paying them enough money that it becomes significant, then you can refactor and move platforms, but until you do, there are bigger fish to fry.

I've been doing hybrid colo+public cloud for over a decade and it's always been the most cost effective route at a certain scale. That specific break even point is lowering over time with the density and cost effectiveness of hardware.

Sure you need net/infra admins but the software and hardware these days are pretty management friendly and you'll find you still need (often more expensive "cloud") admins so you're not offsetting much management cost there.

Colocation is plentiful and providers often aggregate and resell bandwidth from their preferred carriers.

At one point we were up to 8 Dell VRTX clusters and a few SANs, with 500+ VMs ranging from huge MS SQL servers to kube clusters. The public cloud bill would have been well into the 6 figures even with preferred pricing and reserved instances. Our colocation bill was $2400/mo, and that was mostly for power. The one thing that always surprised me was how much faster everything was: every time we had to scale over into the cloud, the public cloud node was noticeably slower, even for identical CPU generations and vCPU counts.

You need to be very keen about server deals, updates, support contracts and licenses - but it's really manageable and interconnecting with the cloud is trivial at this point - you can get a "cloud connect" fiber drop to your preferred cloud provider and connect your colo infra to your vpc.

  • Colocation to me means you buy your own hardware and rent only the rack space (and power and connectivity) from the datacenter. Is that really what you're talking about? If so, why do you choose this over renting bare metal servers?

    • Not always - you can lease your servers from the vendor as well, in which case you're renting the rack space, power and cooling from the datacenter and you're renting the servers from the vendor - most of the leases are designed so you can refresh your hardware every 4-5 years and it's usually still cheaper than renting from a dedicated hosting company.

      Once you have an established baseline for your server needs - it's almost always more capital friendly to buy the servers and keep them running for the ~5 reliable years you'll get out of them - usually break even here is 2-3 years vs renting from a provider. If you're running your servers until they fail you'll get 7-10 years out of them, provided the power cost is still worth running them (usually that is also around the 8-10 year mark depending on your power cost).

      So there are many reasons you'd buy vs rent - including capital deductions and access to cheap interest rates. You can also get some pretty crazy deals (like 33% of new price) by buying 2-3 year old equipment, then continue to run them for another 4-5 years, which is the lowest cost scenario if you don't need bleeding edge.
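
A rough sketch of the buy-versus-rent break-even described above. Every dollar figure here is hypothetical, picked only so that the result lands in the 2-3 year range the comment mentions; `break_even_months` is an illustrative helper, and it ignores financing, tax treatment, power-cost drift and resale value.

```python
def break_even_months(purchase_usd, colo_usd_per_month, rental_usd_per_month):
    """Months until buying plus colocation is cheaper than renting.

    Returns None when renting is never more expensive per month.
    """
    monthly_saving = rental_usd_per_month - colo_usd_per_month
    if monthly_saving <= 0:
        return None
    return purchase_usd / monthly_saving


# Hypothetical numbers: a $6000 server, $100/mo colocation, $300/mo rental.
new_server = break_even_months(6000, 100, 300)

# Same setup, but bought used at roughly a third of the new price.
used_server = break_even_months(2000, 100, 300)

print(f"new: {new_server:.0f} months, used: {used_server:.0f} months")
```

With these made-up inputs the new server pays for itself in two and a half years and the used one in under a year, which is why buying 2-3 year old equipment comes out as the cheapest scenario when the latest hardware is not needed.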

    • Because it's your hardware in the colo, so if money becomes dire, you can extend the servers' lifetime beyond the standard depreciation schedule. Your rented bare metal servers might be slightly cheaper than a respective EC2 instance, but if you stop paying that bill, they're gonna go poof, same as the EC2 instance.

    • I went with buying and colocation because I found I sleep better this way than when I used to rent servers in a distant datacenter and have to count on techs I'd never met working on hardware I'd never seen if anything went wrong. In my case, I live near the datacenter, so I can be hands-on fairly quickly if something goes wrong that I can't handle remotely.

      And I can do whatever I want with the hardware. When I bought my servers, they came with disk controllers with non-optional RAID, as almost all of them do. I wanted to run RAIDz2 in FreeBSD/ZFS, so I swapped in non-RAID controllers. They were just a few bucks, but having that ability meant I could choose from a wider range of servers.

A lot of the time businesses just aren't that important. I've lost count of the places I've seen that stress over uptime when nothing they run is at all critical. Hell, you could drop the production environment in the middle of the day and yes, it would suck and you'd get a few phone calls, but life would go on.

These companies all ended up massively increasing their budgets switching to cloud workloads when a simple server in the office was easily enough for their 250 users. Cloud is amazing for some uses and pure marketing BS for others but it seems like a lot of engineers aim for a perfect scalable solution instead of one that is good enough.

  • I had a team member who would reiterate that during tough times. They come from much more consequential work, so they would often remark that at least nobody dies when we fuck up.

    • Every corporate meeting should start with reminding ourselves that we're all going to die. And it most likely won't be from anything happening at the office.

A thoroughly good article. It's probably worth also considering adding a CDN if you take this approach at scale. You get to use their WAF and DNS failover.

A big pain point that I personally don't love is that this non-cloud approach normally means running my own database. It's worth considering a provider who also provides cloud databases.

If you go for an 'active/passive' setup, consider saving even more money by using a cloud VM with auto scaling for the 'passive' part.

In terms of pricing, the deals available these days on servers are amazing: you can get 4GB RAM VPSs with decent CPU and bandwidth for ~$6, or bare metal with 32GB RAM and a quad core for ~$90. It's worth using sites like serversearcher.com to compare.

  • What’s the issue with running Postgres inside a docker container + regular backups? Never had problem and relatively easy to manage.

    • No PITR (point-in-time recovery), but mostly it's just hassle. For the application server I literally don't need backups, just automated provisioning/docker container etc. Adding postgres then means I need full backups including PITR, because I don't even want to lose an hour's data.

      6 replies →

  • If you're running on a single machine then you'll get way more performance with something like sqlite (instead of postgres/MySQL) which also makes managing the database quite trivial.

    • SQLite has serious concurrency concerns which have to be evaluated. You should consider running postgres or mysql/mariadb even if it's on the same server.

      SQLite uses one reader/writer lock over the whole database. When any thread is writing the database, no other thread is reading it. If one thread is waiting to write, new reads can't begin. Additionally, every read transaction starts by checking if the database has changed since last time, and then re-loading a bunch of caches.

      This is suitable for SQLite's intended use case. It's most likely not suitable for a server with 256 hardware threads and a 50Gbps network card. You need proper transaction and concurrency control for heavy workloads.

      Additionally, SQLite lacks a bunch of integrity checks, like data types and various kinds of constraints. And things like materialised views, etc.

      SQLite is lite. Use it for lite things, not heavy things.
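      A minimal sketch of the data-type point above, using Python's built-in `sqlite3` module. The `users` table and its columns are hypothetical, and the table is declared the ordinary way, so SQLite applies type affinity rather than rejecting mismatched values:

```python
import sqlite3


def demo_flexible_typing():
    """Insert text into an INTEGER column and report what SQLite stored."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema for illustration only.
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER)")
    # Neither insert raises: the column type is an affinity, not a constraint.
    conn.execute("INSERT INTO users (age) VALUES (?)", ("not a number",))
    conn.execute("INSERT INTO users (age) VALUES (?)", ("42",))
    rows = conn.execute(
        "SELECT age, typeof(age) FROM users ORDER BY id"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for value, stored_type in demo_flexible_typing():
        print(repr(value), stored_type)
```

      The first value is kept as text in an INTEGER column, while the second is converted to an integer because it looks like one. Whether that matters depends on how much validation the application already does.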

      7 replies →

Just today I wasted some time due to an unexpected Tailscale key expiry and some other issues related to running a container cluster: https://blog.kronis.dev/blog/the-great-container-crashout

Right now, my plan is to move from a bunch of separate VPSes, to one dedicated server from Hetzner and run a few VMs inside of it with separate public IPs assigned to them alongside some resource limits. You can get them for pretty affordable prices, if you don't need the latest hardware: https://www.hetzner.com/sb/

That way I can limit the blast range if I mess things up inside of a VM, but at the same time benefit from an otherwise pretty simple setup for hosting personal stuff, a CPU with 8 threads and 64 GB of RAM ought to be enough for most stuff I might want to do.

  • That's the worst part of stringing a bunch of cloud together. Auth, keys, config, credentials expiring, logging back into everything all day. It smooths out the brain.

    Give me a box, trust me with ssh keys and things are so much easier. Simple is good for the soul and the wallet.

Regardless of the cost and capacity analysis, it's just hard to fight the industry trends. The benefits of "just don't think about hardware" are real. I think there is a school of thought that capex should be avoided at all costs (and server hardware is expensive up front). And above all, if an AWS region goes down, it doesn't seem like your org's fault, but if your bespoke private hosting arrangement goes down, then that kinda does seem like your org's fault.

  • > and server hardware is expensive up front

    You don't need to buy server hardware(!), the article specifically mentions renting from eg Hetzner.

    > The benefits of "just don't think about hardware" are real

    Can you expand on this claim, beyond what the article mentioned?

    • > Can you expand on this claim, beyond what the article mentioned?

      I run a lambda behind a load balancer, hardware dies, it's redundant, it gets replaced. I have a database server fail, while it reprovisions it doesn't saturate read IO on the SAN causing noisy neighbor issues.

      I don't deal with any of it, I don't deal with depreciation, I don't deal with data center maintenance.

      5 replies →

  • > I think there is a school of thought that capex should be avoided at all costs

    Yep, and it's mostly caused by the VC funding model - if your investors are demanding hockey-stick growth, there is no way in hell a startup can justify (or pay for) the resulting Capex.

    Whereas a nice, stable business with near-linear growth can afford to price in regular small Capex investments.

  • > I think there is a school of thought that capex should be avoided at all costs (and server hardware is expensive up front).

    Yes, there is.

    Honestly, it looks to me that this school of thought is mostly adopted by people that can't do arithmetic or use a calculator. But it does absolutely exist.

    That said, no, servers are not nearly expensive enough to move the needle on a company nowadays. The room that often goes around them is, and that's why way more people rent the room than the servers in it.

    • Connectivity is a problem, not the room.

      I ran the IT side of a media company once, and it all worked on a half-empty rack of hardware in a small closet... except for the servers that needed bandwidth. These were colocated. Until we realized that the hoster did not have enough bandwidth, at which point we migrated to two bare metal servers at Hetzner.

      2 replies →

  • If you rent dedicated servers, then you're not worrying about any of the capex or maintenance stuff.

  • the benefits of don't write a distributed system unless you really have to are also very real

    • Exactly, same for microservices I feel. Why have enterprise org problems if you don't have an enterprise org.

  • I think you hit the nail on the head. What enterprises are paying for is abstraction of responsibility. Suits would never criticise going with Microsoft or Amazon.

  • > if an AWS region goes down, it doesn't seem like your org's fault, but if your bespoke private hosting arrangement goes down, then that kinda does seem like your org's fault.

    Never underestimate the price people are willing to pay to evade responsibility. I estimate this is a multi-billion dollar market.

  • For anything up to about 128GB RAM you can still easily avoid capex by just renting servers. Above that it gets a bit trickier

    • It's not like it's a huge capex for that level of server anyway. Probably less than the cost of one employee's laptop.

    • Renting (hosted) servers above 128GB RAM is still pretty easy, but I agree pricing levels out. 128GB RAM server ~$200/Month, 384 GB ~$580, 1024 GB ~$940/Month
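      To make the comparison easier to read, here is the per-GB arithmetic for the three price points quoted above. The figures come straight from the comment; the function name is just for illustration:

```python
# Monthly rental quotes from the comment above: RAM in GB -> USD per month.
QUOTES = {128: 200, 384: 580, 1024: 940}


def dollars_per_gb(quotes):
    """Return the monthly price per GB of RAM for each quoted server size."""
    return {ram_gb: price / ram_gb for ram_gb, price in quotes.items()}


if __name__ == "__main__":
    for ram_gb, per_gb in dollars_per_gb(QUOTES).items():
        print(f"{ram_gb:>5} GB: ${per_gb:.2f} per GB per month")
```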

  • To be clear - this isn't an endorsement on my part, just observations of why cloud-only deployment seems common. I guess we shouldn't neglect the pressure towards resume-oriented development either, as it undoubtedly plays a part in infra folks' careers. It probably makes you sound obsolete to be someone who works in a physical data center.

    I for one really miss being able to go see the servers that my code runs on. I thought data centers were really interesting places. But I don't see a lot of effort to decide things based on pure dollar cost analysis at this point. There's a lot of other industry forces besides the microeconomics that predetermine people's hosting choices.

This isn't even the end game for "one big server". AMD will give the most bang per rack, but there are other factors.

An IBM z17 is effectively one big server too, but provides levels of reliability that are simply not available in most IT environments. It won't outperform the AMD rack, but it will definitely keep up for most practical workloads.

If you sit down and really think honestly about the cost of engineering your systems to an equivalent level of reliability, you may find the cost of the IBM stack to be competitive in a surprising number of cases.

  • At what cost politically? I would expect political battles to be far more intense than any of the technical ones.

    • That’s because 75% (citation: wild-ass estimate) of tech workers are incapable of critical thinking, and blindly parrot whatever they’ve heard / read. The number of times I’ve seen something on HN, thought “that doesn’t sound right,” and then spent a day disproving it locally is too damn high. Of course, by then no one gives a shit, and they’ve all moved on patting each other on the back about how New Shiny is better.

      2 replies →

  • No. In the short time I worked at a z/OS shop, they had to IPL twice. And the IPL takes ages...

    Now, if you can live with the weird environment and your people know how to program what is essentially a distributed system described in terms no one else uses: I guess it's still ok, given the competition is all executing IBM's playbook too.

    • Entire mainframe IPL, or just LPAR?

      My understanding is that usually you subdivide into few LPARs and then reboot the production ones on schedule to prevent drift and ensure that yes, unplanned IPLs will work

The complexity you introduce trying to achieve 100% uptime will often undermine that goal. Most businesses can tolerate an hour or two of downtime or data loss occasionally. If you set this expectation early on, you can engineer a much simpler system. Simpler systems are more reliable.

  • We had single-datacenter resiliency (meaning n+1 on power, cooling, network + isp, servers) and it was fine. You still need offsite DRS strategy here - this is one of the things having that hybrid cloud is great for: you can replicate your critical workloads like databases and services to the cloud in no-load standby, or delta-copy your backups to a cheap cloud provider for simplified recovery in a disaster scenario (ie: entire datacenter gets taken out). The cost of this is relatively low since data into the cloud is free and you're only really incurring costs in a disaster recovery scenario. Most virtualized platforms (veeam etc) support offsite secondary incremental backups with relative ease, recovery is also pretty straightforward.

    That being said I've lost a lot of VMs on ec2 and had entire regions go down in gcp and aws in the last 3 years alone, so going to the public cloud isn't a solves it all solution - knock on wood the colo we've been using hasn't been down once in 12+ years.

  • Much less expensive too.

    I think in general that expectation is NOT acceptable though especially around data loss. Because the non engineering stakeholders don't believe it is.

    Engineers don't make decisions in a vacuum, if you can manage the expectations, good for you. But in most cases that's very much an uphill battle which might make you look incompetent because you cannot guarantee no data loss.

I run on VPSs as well. I ditched cloud a long time ago. Once my project starts making money, I will definitely buy my own hardware and colocate. Cloud is like dating apps. We had fun for a decade but it's time to get serious and get some things actually done and be productive again.

  • > I will definitely buy my own hardware and colocate.

    Even colocation is often fraught with issues. I shall not mention the plethora of dead hardware from datacenter electricity failures. Ironically, my home has more stable electricity than some datacenters lol.

    Unless you're running a business where a few minutes of downtime will cost you millions, most companies can literally run their own servers from their basements. I often see how much people overestimate their need for 99.999% uptime, or bandwidth requirements.

    It's not like colocation is that much cheaper. The electricity prices you're paying are often more expensive than even business/home electricity. That leaves only internet/fiber, and there is a plethora of commercial fiber these days.

    I used to get a minimum quoted price of 2k for 1Gbit business fiber years ago (not including install costs). Now in some countries you get 5 or 10Gbit business fiber for 100 Euro.

    • Sometimes I wonder why I'm not running my servers from my home, considering my 1Gb fiber has 3ms latency, and a good UPS would get me through all but a couple of the longest power outages I've had in the last 15 years. As long as I'm hosting small business web sites or something like that, and not critical banking or hospital systems, there's no reason it wouldn't be fine.

      When my colo contract runs out in a couple years, I may seriously consider it, especially since they're already talking about offering bigger bandwidth packages.

      1 reply →

Bare-metal servers sound super cheap when you look at the price tag, and yeah, you get a lot of raw power for the money. But once you’re in an enterprise setup, the real cost isn’t the hardware at all, it’s the people needed to keep everything running.

If you go this route, you’ve got to build out your own stack for security, global delivery, databases, storage, orchestration, networking ... the whole deal. That means juggling a bunch of different tools, patching stuff, fixing breakage at 3 a.m., and scaling it all when things grow. Pretty soon you need way more engineers, and the “cheap” servers don’t feel so cheap anymore.

  • A single, powerful box (or a couple, for redundancy) may still be the right choice, depending on your product / service. Renting is arguably the most approachable option: you're outsourcing the most tedious parts + you can upgrade to a newer generation whenever it becomes operationally viable. You can add bucket storage or CDN without dramatically altering your architecture.

    Early Google rejected big iron and built fault tolerance on top of commodity hardware. WhatsApp used to run their global operation employing only 50 engineering staff. Facebook ran on Apache+PHP (they even served index.php as plain text on one occasion). You can build enormous value through simple means.

  • If you use a cloud, you need a solution for security (ever heard of “shared responsibility”?), global delivery (a big cloud will host you all over, and this requires extra effort on your part, kind of like how having multiple rented or owned servers requires extra effort), storage (okay, I admit that S3 et al are nice and that non-big-cloud solutions are a bit lacking in this department), orchestration (the cloud handles only the lowest level — you still need to orchestrate your stuff on top of it), fixing breakage at 3 a.m. (the cloud can swap you onto a new server, subject to availability; so can a provider like Hetzner. You still need to fail over to that server successfully), patching stuff (other than firmware, the cloud does not help you here).

  • I used to say "oh yeah just run qemu-kvm" until my girlfriend moved in with me and I realized you do legitimately need some kind of infrastructure for managing your "internal cloud" if anyone involved isn't 100% on the same page and then that starts to be its own thing you really do have to manage.

    Suddenly I learned why my employer was willing to spend so much on OpenStack and Active directory.

    • > until my girlfriend moved in with me

      lol, why was this the defining moment? She wasn't too keen on hearing the high pitch wwwwhhhhuuuuurrrrrrr of the server fans?

      1 reply →

Microservices vs not is (almost) orthogonal to N servers vs one. You can make 10 microservices and rent a huge server and run all 10 services. It's more an organizational thing than a deployment thing. You can't do the opposite though, make a monolith and spread it out on 10 servers.

  • > You can't do the opposite though, make a monolith and spread it out on 10 servers.

    You absolutely can, and it has been the most common practice for scaling them for decades.

    • That’s just _duplicating_ the nodes horizontally, which wasn't what I meant.

      That’s obviously possible and common.

      What I meant was actually butchering the monolith into separate pieces and deploying it, which is - by the definition of monolith - impossible.

      3 replies →

  • > You can't do the opposite though, make a monolith and spread it out on 10 servers.

    Yes you can. It's called having multiple application servers. They all run the same application, just more of them. Maybe they connect to the same DB, maybe not, maybe you shard the DB.

I often wonder if my home NAS/Server would be better off put onto a rented box or a cloud server somewhere, especially since I now have 1gbit/s internet. Even now, 20TB of drive space and 6 cores with 32GB on a Hetzner dedicated server is about twice the price of buying the hardware over a 5 year period. I suspect the hardware will actually last longer than that, and it's the same level of redundancy (RAID) on a rented dedicated server, so the backup cost is the same between the two.

Using cloud and box storage on Hetzner is more expensive than the dedicated server, 4x owning the hardware and paying the power bill. AWS and Azure are just nuts, >100x the price, because they charge so much for storage even with hard drives. Neither Contabo nor Netcup can do this; it's too much storage for them.

Every time I look at this I come to the same basic conclusion: the overhead of renting someone else’s machine is quite high compared to the hardware and power cost, and it would be a worse solution than having that performance on the local network for bandwidth and latency. The problem isn't so much the compute performance, which is relatively fairly priced; it's the storage costs and data transfer that bite.

Not really what the article was about, but cloud is sort of meant to be good for low-end hardware and it's actually kind of not; the storage costs are just too high, even with a Hetzner Storage Box.

  • It really depends on your power costs. In certain parts of Europe, power is so expensive that Hetzner actually works out cheaper (despite them providing you the entire machine and datacenter-grade internet connection).

    • Trust me, even with 35 cent/kWh (Germany), it's easy to make it work. Just do not buy enterprise hardware. People are obsessed with running racks full of often obsolete hardware that is not designed around energy efficiency.

      Here is a fun one ...

      https://www.reddit.com/r/selfhosted/comments/1dqq3h8/my_12x_...

      Dude is running 12x AMD 6600HS with a power draw between 300 and 400W. The compute alone is easily 3x that of an equivalent Hetzner 48c server. Not to mention that this includes 768GB of memory (people underestimate how much power high capacity RDIMMs draw).

      The main issue with Hetzner is, as long as you only use their base configuration servers, they are very competitive. The issue is, if you start to step a little bit out of line, the prices simply skyrocket. Try adding more storage to some servers, memory, or you need a higher interconnect between your servers (limited to 10Gbit).

      Even basic consumer hardware comes with 2.5Gbit, yet Hetzner is in the stone ages with 1Gbit. I remember the time when Hetzner introduced 1Gbit. Hetzner was innovation and progress. But that has been slowly vanishing. Hetzner has been getting more and more lazy. You see the issue also with their cloud offerings' storage. Look at Netcup, even Strato etc... They barely introduce anything new anymore, and when something comes it's often less competitive or broken. The whole S3 costing Backblaze price levels and non-stop issues.

      You can tell they are the only company that ever pushed for consumer hardware hosting on a mass scale, which made them a small monopoly in the market. And it shows if you're an old customer and know their history. Hey, do people remember the price increases for the auction hardware because of the Ukraine invasion? Do not worry folks, when the electricity prices go down, we will adjust them down. Oh, we have been at pre-war prices for almost 2 years. Where are those promised price drops? Hehehe ...

      2 replies →

  • I think I’ve settled on both being the answer - Hetzner is affordable enough that I can have a full backup of my NAS (using ZFS snapshots and incremental backups), and as a bonus can host some services there instead of at home. My home network still has much lower latency and so is preferable for ie. my Lightroom library.

These days we have more meta-software than software. Instead of Apache with virtualhosts, we have a VM running Docker instances, each with an nginx of its own, all connected by a separate Docker of nginx acting as a proxy.

How much waste is there from all this meta-software?

In reality, I host more on Raspberry Pis with USB SSDs than some people host on hundred-plus watt Dells.

At the same time, people constantly compare colo and hardware costs with the cost per month of cloud and say cloud is "cheaper". I don't even bother to point out the broken thinking that leads to that. In reality, we can ignore gatekeepers and run things out of our homes, using VPSes for public IPs when our home ISPs won't allow certain services, and we can still have excellent uptimes, often better than cloud uptimes.

Yes, we can consolidate many, many services in to one machine because most services aren't resource heavy constantly.

Two machines on two different home ISP networks backing each other up can offer greater aggregate uptime than a single "enterprise" (a misnomer, if you ask me, if you're talking about most x86 vendors) server in colo. A single five minute reboot of a Dell a year drops uptime from 100% to 99.999%.
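The arithmetic behind that claim can be checked with a short sketch. The combined figure assumes the two machines fail independently and that either one can serve traffic on its own, which is an idealisation; the 99% input in the usage lines is a made-up number for illustration:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability(downtime_minutes_per_year):
    """Fraction of the year a machine is up, given its yearly downtime."""
    return 1 - downtime_minutes_per_year / MINUTES_PER_YEAR


def combined_availability(a, b):
    """Availability of a pair where either machine alone is enough.

    Assumes failures are independent, so the pair is down only when
    both machines happen to be down at the same time.
    """
    return 1 - (1 - a) * (1 - b)


if __name__ == "__main__":
    # One five-minute reboot per year.
    print(f"{availability(5) * 100:.3f}%")
    # Two hypothetical home machines at 99% each.
    print(f"{combined_availability(0.99, 0.99) * 100:.2f}%")
```

A single five-minute reboot works out to roughly 99.999% for the year, and two independent 99% machines combine to about 99.99%.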

Cloud is such bullshit that it's exhausting even just engaging with people who "but what if" everything, showing they've never even thought about it for more than a minute themselves.

I did this (well, a large-r VPS for $120/month) for my Rails-based sports streaming website. I had a significant amount of throughput too, especially at peak (6-10pm ET).

My biggest takeaway was to have my core database tables (user, subscription, etc) backed up every 10 minutes, and the rest every hour, and test their restoration. (When I shut down the site it was 1.2TB.) Having a script to quickly provision a new node—in case I ever needed it—would have something up within 8 minutes from hitting enter.

When I compare this to the startups I’ve consulted for, who choose k8s because it’s what Google uses yet they only push out 1000s of database queries per day with a handful of background jobs and still try to optimize burn, I shake my head.

I’d do it again. Like many of us I don’t have the need for higher-complexity setups. When I did need to scale, I just added more vCPUs and RAM.

  • Is there somewhere I can read more about your setup/experience with your streaming site? I currently run a (legal :) streaming site but have it hosted on AWS and have been exploring moving everything over to a big server. At this point it just seems like more work to move it than to just pay the cloud tax.

    • Do a search for HeheStreams on your favorite search engine.

      The technical bits aren’t all there, though, and there’s a plethora of noise and misinformation. Happy to talk via email though.

      1 reply →

I’ve been having those discussions with friends for the last 3 or 4 years. The downside of having local infra is pretty much having someone that has the experience to do it right. While this article covered the higher end, the math on the lower end tends to work out at 1 year of ownership depending on what you already have because you will probably already have a small rack and some networking gear.

My main concern is that at current cloud premium rates, I will be better off even if I need to hire someone specifically for managing the local infra.

This was written in 2022, but it looks like it's mostly still relevant today. It would be interesting to see updated numbers on the expected costs of various hosting providers.

I work for a cloud provider and I'll tell you, one of the reasons for the cloud premium is that it is a total pain in the ass to run hardware. Last week I installed two servers and between them had four mysterious problems that had to be solved by reseating cards, messing with BIOS settings, etc. Last year we had to deal with a 7 site, 5 country RMA for 150 100gb copper cables with incorrect coding in their EEPROMs.

I tell my colleagues: it's a good thing that hardware sucks: the harder it is to run bare metal, the happier our customers are that they choose the cloud. :)

(But also: this is an excellent article, full of excellent facts. Luckily, my customers choose differently.)

  • Fortunately, companies like Hetzner/OVH/etc will handle all this bullshit for you for a flat monthly fee.

I used a colo once a few years ago at a small datacenter in the midwest. I was shocked at how unprofessional everything was: machines lying in the hallway, a guy sleeping in one of the offices. They let me set up my server and I was left unattended several times; I could have just poked the power button on a nearby server or moved a cable or whatever. It was a 1.5 hour drive away, and I wasn't running anything serious, so I just went with it but pulled my stuff out after my 1 year subscription was up.

The problem is sizing and consistency. When you're small, it's not cost effective to overprovision 2-3 big servers (for HA).

And when you need to move fast (or things break), you can't wait a day for a dedicated server to come up, or worse, have your provider run out of capacity (or have to pick a different specced server)

IME, having to go multi cloud/provider is a way worse problem to have.

  • Most industries are not bursty. Overprovisioning is not expensive for most businesses. You can handle 30000+ updates a second on a $15 VPS.

    A multi node system tends to be less reliable and have more failure points than a single box system. Failures rarely happen in isolation.

    You can do zero downtime deployment with a single machine if you need to.

    • > A multi node system tends to be less reliable and have more failure points than a single box system. Failures rarely happen in isolation.

      Just like a lot of problems exist between keyboard and chair, a lot of problems exist between service A and service B.

      The zero downtime deployment for my PHP site consisted of symlinking from one directory to another.
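      One way the symlink switch described above can be sketched, assuming a POSIX filesystem and a layout where each release lives in its own directory and the web server's document root is a symlink. The function and path names are hypothetical; the comment does not say how the original swap was scripted:

```python
import os


def activate_release(release_dir, current_link):
    """Point `current_link` at `release_dir` without a gap in between.

    A new symlink is created under a temporary name and then renamed
    over the live one. On POSIX the rename is atomic, so a request
    sees either the old release or the new one, never a missing link.
    """
    tmp_link = current_link + ".tmp"
    # Remove a leftover temporary link from an interrupted earlier run.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(release_dir, tmp_link)
    # os.replace renames the link itself; it does not follow it.
    os.replace(tmp_link, current_link)
```

      Usage would look like `activate_release("/srv/app/releases/42", "/srv/app/current")`, with both paths being placeholders. Rolling back is the same call with the previous release directory.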

      2 replies →

  • There are a number of providers who provision dedicated servers via API in minutes these days. Given a dedicated server starts at around $90/Month it probably does make sense for a lot of people.

    • A $20 dedicated server from OVH can outperform $144 VPSs from Linode in my testing, on passmark.

Don't forget the cost of managing your one big server and the risk of having such single point of failure.

  • My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted over-engineered replication, or split brain errors than actual hardware failures. One server is the simplest and most reliable setup, and if you have backup and automated provisioning you can just re-deploy your entire environment in less than the time it takes to debug a complex multi-server setup.

    I'm not saying everybody should do this. There are of course a lot of services that can't afford even a minute of downtime. But there are also a lot of companies that would benefit from a simpler setup.

    • Yep. I know people will say, “it’s just a homelab,” but hear me out: I’ve run positively ancient Dell R620s in a Proxmox cluster for years. At least five. Other than moving them from TX to NC, the cluster has had 100% uptime. When I’ve needed to do maintenance, I drop one at a time, and it maintains quorum, as expected. I’ll reiterate that this is on circa-2012 hardware.

      In all those years, I’ve had precisely one actual hardware failure: a PSU went out. They’re redundant, so nothing happened, and I replaced it.

      Servers are remarkably resilient.

      EDIT: 100% uptime modulo power failure. I have a rack UPS, and a generator, but once I discovered the hard way that the UPS batteries couldn’t hold a charge long enough to keep the rack up while I brought the generator online.

      2 replies →

    • My single on-premise Exchange server is drastically more reliable than Microsoft's massive globally resilient whatever Exchange Online, and it costs me a couple hours of work on occasion. I probably have half their downtime, and most of mine is scheduled when nobody needs the server anyhow.

      I'm not a better engineer, I just have drastically fewer failure modes.

      2 replies →

    • A lot of this attitude comes from the bad old days of 90s and early 2000s spinning disk. Those things failed a lot. It made everyone think you are going to have constant outages if you don’t cluster everything.

      Today’s systems don’t fail nearly as often if you use high quality stuff and don’t beat the absolute hell out of SSD. Another trick is to overprovision SSD to allow wear leveling to work better and reduce overall write load.

      Do that and a typical box will run years and years with no issues.

    • > My experience after 20 years in the hosting industry is that customers in general have more downtime due to self-inflicted over-engineered replication, or split brain errors than actual hardware failures.

      I think you misread OP. "Single point of failure" doesn't mean the only failure modes are hardware failures. It means that if something happens to your nodes whether it's hardware failure or power outage or someone stumbling on your power/network cable, or even having a single service crashing, this means you have a major outage on your hands.

      These types of outages are trivially avoided with a basic understanding of well-architected frameworks, which explicitly address the risk represented by single points of failure.

      12 replies →

    • In my experience, my personal services have gone down exactly zero times. Actually not entirely true, but every time they stopped working the servers had simply run out of disk space.

      The number of production incidents on our corporate mishmash of lambda, ecs, rds, fargate, ec2, eks etc? It’s a good week when something doesn’t go wrong. Somehow the logging setup is better on the personal stuff too.

    • I have also seen the opposite somewhat frequently: some team screws up the server, and unrelated stable services that have been running forever (on the same server) are now affected because the environment got messed up.

  • The last 4-5 years taught me that my most frequent single point of failure, where I can't do a thing, is Cloudflare, not my on-premise servers.

  • > Don't forget the cost of managing your one big server

    Is that more, less than or about the same as having an AWS/Azure/GCP consultant?

    What's the difference in labour per hour?

    > the risk of having such single point of failure.

    At the prices they charge I can have two hot failovers in two other datacenters and still come out ahead.

  • The predictable cost, you mean, making business planning way easier? And you usually have two, because sometimes kernels do panic or whatever.

  • AWS has also been a single point of failure multiple times in history, and there's no reason to believe this will never happen again.

I'm a big server proponent myself, but usually, for one reason or another, there is a need to introduce some socket-style communication to the frontend, and that becomes impossible on a single machine after a certain threshold.

Is there something obvious that I'm missing?

  • I've had 100k+ users connected to mid range linode boxes. Do you have that many?

    Even still at that point you just round robin to a set of big machines. Easy
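    A toy sketch of the round-robin idea, just to show how little logic is involved. The backend names are made up, and in practice this job is usually done by a load balancer or DNS rather than application code:

```python
import itertools


class RoundRobin:
    """Hand out backends in rotation, one per new connection."""

    def __init__(self, backends):
        backends = list(backends)
        if not backends:
            raise ValueError("need at least one backend")
        self._cycle = itertools.cycle(backends)

    def pick(self):
        """Return the next backend in the rotation."""
        return next(self._cycle)


if __name__ == "__main__":
    # Hypothetical host names for illustration.
    balancer = RoundRobin(["big-box-1", "big-box-2", "big-box-3"])
    for _ in range(4):
        print(balancer.pick())
```

    For long-lived socket connections the choice only happens once per connection, so each client stays on the machine it was first assigned to.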

I'm in the process of breaking up a legacy deployment on "one big server" into something cloud native like Kubernetes.

The problem with one big server is that few customers have ONE (1) app that needs that much capacity. They have many small apps that add up to that much capacity, but that's a very different scenario with different problems and solutions.

For example, one of the big servers I'm in the process of teasing apart has about 100 distinct code bases deployed to it, written by dozens of developers over decades.

If any one of those apps gets hacked and this is escalated to a server takeover, the other 99 apps get hacked too. Some of those apps deal with PII or transfer money!

Because a single big server uses a single shared IP address for outbound comms[1] this means that the firewall rules for 100 apps end up looking like "ALLOW: ANY -> ANY" for two dozen protocols.

Because upgrading anything system-wide on the One Big Server is a massive Big Bang Change, nobody has had the bravery to put their hand up and volunteer for this task. Hence it has been kept alive running 13 year old platform components because 2 or 3 of the 100 apps might need some of those components... but nobody knows which two or three apps those are, because testing this is also big-bang and would need all 100 apps tested all at once.

It actually turned out that even Two Big (old) Servers in a HA pair aren't quite enough to run all of the apps so they're being migrated to newer and better Azure VMs.

During the interim migration phase, instead of Two Big Servers there are Four Big Servers... in PRD. And then four more in TST, etc... Each time a SysOps person deploys a new server somewhere, they have to go tell each of the dozens of developers where they need to deploy their apps today.

Don't think DevOps automation will rescue you from this problem! For example in Azure DevOps those 100 apps have 100 projects. Each project has 3 environments (=300 total) and each of those would need a DevOps Agent VM link to the 2x VMs = 600 VM registrations to keep up to date. These also expire every 6 months!
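
To make that arithmetic concrete, here is a rough sketch of the bookkeeping. The counts are the ones quoted above; the function names are made up for illustration and have nothing to do with any Azure DevOps API.

```python
# Counts taken from the comment above; names are illustrative only.
APPS = 100
ENVIRONMENTS_PER_APP = 3
VMS_PER_ENVIRONMENT = 2
EXPIRY_MONTHS = 6


def registration_count(apps, environments, vms):
    """Every app/environment pair needs a link to every VM."""
    return apps * environments * vms


def renewals_per_year(registrations, expiry_months):
    """Each registration has to be renewed every `expiry_months` months."""
    return registrations * 12 // expiry_months


registrations = registration_count(APPS, ENVIRONMENTS_PER_APP, VMS_PER_ENVIRONMENT)
print(registrations, "registrations,",
      renewals_per_year(registrations, EXPIRY_MONTHS), "renewals per year")
```

The count scales with the product of all three numbers, so doubling the VM count during a migration doubles the registrations as well.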

Kubernetes, Azure App Service, AWS App Runner, and GCP App Engine serve a purpose: They solve these problems.

They provide developers with a single stable "place" to dump their code even if the underlying compute is scaled, rebuilt, or upgraded.

They isolate tiny little apps but also allow the compute to be shared for efficient hosting.

They provide per-app networking and firewall rules.

Etc...

[1] It's easy to bind distinct ingress IP addresses on even a single NIC (or multiple), but it's weirdly difficult to split the outbound path. Maybe this is easier on Linux, but on Windows and IIS it is essentially impossible.
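
For what it's worth, at the plain socket level an application can choose its outbound address by binding before it connects. The sketch below is a minimal illustration of that idea, assuming the address is already configured on the host; `connect_from` and `demo` are hypothetical helpers, and this says nothing about what IIS or any other app platform exposes.

```python
import socket


def connect_from(source_ip, dest_host, dest_port, timeout=2.0):
    """Open a TCP connection whose source address is `source_ip`.

    Port 0 lets the OS pick an ephemeral source port.
    """
    return socket.create_connection(
        (dest_host, dest_port),
        timeout=timeout,
        source_address=(source_ip, 0),
    )


def demo(source_ip="127.0.0.1"):
    """Connect to a throwaway local listener and report the source IP it saw."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))  # any free port on loopback
        server.listen(1)
        port = server.getsockname()[1]
        with connect_from(source_ip, "127.0.0.1", port):
            conn, peer = server.accept()
            conn.close()
            return peer[0]


if __name__ == "__main__":
    print("listener saw source address:", demo())
```

Per-app firewall rules keyed on source IP would only work if every app (and every library it uses) opened its sockets this way, which is probably why it is rarely practical on a shared host.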

  • Finally, someone said it.

    > 100 distinct code bases deployed to it

    I've worked in a company where the owner would spend money on anything except hosting. The admin guy would end up deploying a new app on whatever VPS had the most RAM free at that time.

    Ironically, consolidating this mess to "one big server", which was my thankless job for many months, fixed many issues. Though, it was done by slicing the host into tiny KVM virtual machines.

    • > slicing the host into tiny KVM virtual machines.

      That's my other option: a bunch of Azure VM Scale Sets using the tiniest size that will run Windows Server, such as B2as_v2. A handful of closely related apps on each so that firewall rules can be restricted to something sane. Shared Azure Files for the actual app deployments so that devs never need to know the VM names. However, this feels an awful lot like reinventing Kubernetes... but worse.

      My job would be sooo much simpler if Microsoft just got off their high horse and supported their own Active Directory in App Service instead of pretending it no longer exists.

A lot of these articles look at on-demand pricing for AWS. But you're rarely paying on-demand prices 24/7. If you have a stable workload, you probably buy reserved instances or a compute savings plan. At larger scales, you use third party services to get better deals with more flexibility.

A while back I looked into renting hardware, and found that we would save about 20% compared to what we actually paid AWS, partially because location and RAM requirements made the rental more expensive than anticipated, and partially because we were paying a lot less than on-demand price for AWS.

20% is still significant, but it's a lot less than the ~80% that this and other articles suggest.
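
A toy calculation of how the headline number shrinks. Every dollar figure here is hypothetical, picked only so the percentages match the ~80% and ~20% mentioned above; they are not our real bills.

```python
def savings_pct(current_monthly, alternative_monthly):
    """Percentage saved by switching from `current_monthly` to `alternative_monthly`."""
    return round(100 * (current_monthly - alternative_monthly) / current_monthly)


# All figures below are hypothetical.
ON_DEMAND_BILL = 10_000    # AWS list price for the workload
NAIVE_RENTAL_QUOTE = 2_000  # rental price before location/RAM requirements
ACTUAL_AWS_BILL = 5_000     # after reserved instances / savings plans
ACTUAL_RENTAL_QUOTE = 4_000  # rental price once requirements are included

print("headline savings:", savings_pct(ON_DEMAND_BILL, NAIVE_RENTAL_QUOTE), "%")
print("realistic savings:", savings_pct(ACTUAL_AWS_BILL, ACTUAL_RENTAL_QUOTE), "%")
```

Both sides of the comparison move: the AWS baseline drops with discounts, and the rental quote rises with real requirements.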

  • This is usually only true if you lift and shift your AWS setup exactly as-is, instead of looking at what hardware will run your setup most efficiently.

    The biggest cost with AWS also isn't compute, but egress - for bandwidth heavy setups you can sometimes finance the entirety of the servers from a fraction of the savings in egress.

    I cost optimize setups with guaranteed caps at a proportion of savings a lot of the time, and I've yet to see a setup where we couldn't cut the cost far more than that.

    • I'd definitely be curious to hear how you'd approach our overall situation. We don't have significant egress costs, nor has any place I've worked with before. Our AWS costs are about 80% EC2 and Fargate, with the rest scattered over various services. Roughly half our spend is on 24/7 reserved instances, while the other half is in bursty analytics workloads.

      Our workloads are primarily memory-bound, and AWS offers pretty good options there, e.g. x2gd instances have 16gb RAM/cpu, while most rental options we found were much more CPU focused (and charged for it.)

>Unfortunately, since all of your services run on servers (whether you like it or not), someone in that supply chain is charging you based on their peak load.

This seems fundamentally incorrect to me? If I need 100 units of peak compute during 8 hours of work hours and I get that from Big Cloud, and they have two other clients needing the same in offset timezones, then in theory the aggregate cost of that is 1/3rd of everyone buying their own peak needs.

Whether big cloud passes on that saving is another matter, but it's there.

i.e. big cloud throws enough small customers together so that they don't have "peak" per se just a pretty noisy average load that is in aggregate mostly stable
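
A quick sketch of that argument, using the idealized case from above: three customers, 100 units each, 8 working hours, perfectly offset. The helper names are made up, and real demand curves are obviously messier than a square wave.

```python
def hourly_demand(peak_units, start_hour, duration=8):
    """24 hourly values: `peak_units` during working hours, 0 otherwise."""
    return [peak_units if (hour - start_hour) % 24 < duration else 0
            for hour in range(24)]


def sum_of_peaks(profiles):
    """Capacity needed if every customer buys hardware for their own peak."""
    return sum(max(profile) for profile in profiles)


def peak_of_sum(profiles):
    """Capacity a shared provider needs to cover the combined load."""
    return max(sum(hour) for hour in zip(*profiles))


# Three customers in perfectly offset timezones (the idealized case).
offset = [hourly_demand(100, start) for start in (0, 8, 16)]
print(sum_of_peaks(offset), "units bought individually vs",
      peak_of_sum(offset), "units shared")

# Partially overlapping working hours shrink the benefit.
overlapping = [hourly_demand(100, start) for start in (0, 4, 8)]
print(sum_of_peaks(overlapping), "units bought individually vs",
      peak_of_sum(overlapping), "units shared")
```

The shared figure can never exceed the sum of individual peaks; how far below it lands depends entirely on how much the customers' busy hours overlap.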

  • But they generally don't. Most people don't have large enough daily fluctuations for these demand curves to flatten out enough. And the providers also need enough capacity to handle unforeseen spikes. Which is also why none of them will let you scale however far you want - they still impose limits so they can plan the excess they need.

    • > And the providers also need enough capacity to handle unforeseen spikes.

      Indeed, but the headroom the cloud needs overall is less than every customer's individual worst-case scenario added up. They’d take a percentage of that total, because statistically a situation where 100% of customers are at 100% of their peak at exactly the same point in time is improbable.

      Must admit I'm a little surprised this logic isn’t self-evident.

  • In which cloud can I book a machine with a guaranteed (up to general uptime SLA) end/termination time that's fixed for both?

And now consider that 6th Gen EPYC will have 256 cores. Also, you can have 32 hot-swap SSDs with 10 million plus random write IOPS and 60 million plus random read IOPS in a single 2U box.

The one big box assumes that you know how to configure everything for high performance. I suspect that skill has been lost, for the most part.

You really need to tweak the TCP/IP stack, buffer sizes, and various other things to get everything to work really well under heavy load. I'm not sure if the various sites that used to talk about this have been updated in the last decade or so, because I don't do that anymore.

I mean, you'll run out of file descriptors pretty quickly if you try to handle a few hundred simultaneous connections. Doesn't matter how big your box is at that point.
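
For anyone who wants to check where their own box stands: the descriptor limit is a per-process setting, and a process can inspect it and raise its soft limit up to the hard limit. A minimal sketch using Python's standard `resource` module (Unix only); `fd_limits` and `raise_soft_limit` are names I made up for this example.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft descriptor limit toward `target`, capped at the hard limit.

    Returns the soft limit in effect afterwards. Never lowers the limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft  # already high enough
    new_soft = target if hard == resource.RLIM_INFINITY else min(target, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit:", soft, "hard limit:", hard)
```

Each open socket costs one descriptor, so the soft limit is a rough ceiling on simultaneous connections per process. Going past the hard limit needs system-level configuration, which is outside what this sketch covers.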

Ah, the folksy wisdom of the armchair. Sounds convincing, doesn't it? I mean, it includes math! And prices! The quoted prices are more expensive for the cloud. And he makes folksy claims that make sense, like the "fragile complexity" of having "more than one computer". It makes sense! Right??

But is he right? How do we know? Well for starters, look at his CV. He has never managed servers for a living. The closest he's come is working on FPGAs. So what's he basing all these opinions on? Musings? Thoughts? Feelings? Hope?

He makes a couple of claims that aren't obviously bunk, so I'll address them here, in reverse order.

"microservice architectures in general add a lot of overhead to a system for dubious gain when you are running on one big server" - Microservices architectures are not about overhead or efficiency. They are an attempt to use good software design principles to address Conway's Law. If you design the microservice correctly, you can enable many different groups in an organization to develop software independently, and come up with a highly effective and flexible organization and stable products. Proof? Amazon. But the caveat is, you have to design them correctly. Almost everyone fails at this.

"It's impossible to get the benefits of a CDN, both in latency improvements and bandwidth savings, with one big server" - This is so dumb I'm not sure I have to refute it? But, uh, no, CDNs absolutely give a heap of benefits whether you have 1 server or 1,000. And CloudFlare Free Plan is Free.

"My Workload is Really Bursty - Cloud away." - Unless your workload involves massive amounts of storage or ingress/egress and your profit margin is tiny, in which case you may save more by building out a small fleet of unreliable poorly-maintained colocated servers (emphasis on may).

"The "high availability" architectures you get from using cloudy constructs and microservices just about make up for the fragility they add due to complexity. ... Remember that we are trying to prevent correlated failures. Cloud datacenters have a lot of parts that can fail in correlated ways. Hosting providers have many fewer of these parts. Similarly, complex cloud services, like managed databases, have more failure modes than simple ones (VMs)." - Argument from laziness, or ignorance? He's trying to say that because something is complex it's also less reliable. Which completely ignores the reliability engineering aspect of that complexity. You mitigate higher numbers of failure modes by designing the system to fail over reliably. And you also have warm bodies running around replacing the failing parts, which fights entropy. You don't get that in a single server; once your power supply, disk, motherboard, network interface, RAM, etc fails, and assuming your server has a redundant pair, you have a ticking clock to repair it until the redundant pair fails. How lucky do you feel? (oh, and you'll need downtime to repair it.)

As usual, the cloud costs quoted are MSRP, and if you're paying retail, you're a fool. Almost all cloud costs can be brought down by 25%-75%, spot instances are a fraction of the on-demand server cost, and efficient use of cheaper cloud services reduces your need to buy compute at all.

"The big drawback of using a single big server is availability. Your server is going to need downtime, and it is going to break. Running a primary and a backup server is usually enough, keeping them in different datacenters. A 2x2 configuration should appease the truly paranoid: two servers in a primary datacenter (or cloud provider) and two servers in a backup datacenter will give you a lot of redundancy. If you want a third backup deployment, you can often make that smaller than your primary and secondary." - Wait... so One Big Server isn't enough? Huh. So this was a clickbait article? I'm shocked!

"One Server (Plus a Backup) is Usually Plenty" - Plenty for what? I mean we haven't even talked system architecture or application design. But let's assume it's a single microservice that gets 1RPS. Is your backup server a hot spare, cold spare, or live mirror? If it's live, it's experiencing the same wear, meaning it will fail at about the same time. If it's hot, there's less wear, but it's still experiencing some. If it's cold, you get less wear, but you're less sure it'll boot up again. And then there's system configuration. The author mentions the "complexity" of managing a cluster, but actually it's less complex than managing just two servers. With a fleet of servers, you know you have to use automation, so you spend the time to automate their setup and run updates frequently. With a backup, you probably won't do any maintenance on the backup, and you definitely won't perform the same operations on the backup as on the server. So the system state will drift wildly, and the backup's software will be useless. It would be better to just have it as a spare part.

The author never talks about the true failure modes of "one big server". When parts start to need replacing, it's never cheap. Smart hands cost, cost of the parts+shipping, cost of the downtime. And often you'll find there are delays - delays in getting smart hands to actually repair it correctly, delays in shipping, delays in part ordering/availability. Running out of power, running out of space, temperatures too high, "flaky" parts you can't diagnose, backups and restores, datacenter issues, routing issues, backbone issues. You'll tell yourself these are "probably rare" - but these are all failure modes, and as the author tells us, you should be wary of lots of failure modes. And anecdotes will tell you somebody has run a server for 10 years with no issue, while another person had a server with 3 faults in a month. To say nothing of the need to run "burn-in" on a new server to discover faults once it's racked.

Go ahead and do whatever you want. Cloud, colo, one server, multiple. There will be failures and complexity no matter what. You want to tell yourself a comforting story that there is "one piece of advice" to follow, some black and white world where only one piece of folksy wisdom applies. But here's my folksy wisdom: design your application, design your system to fit it, try not to pinch every penny, build something, and become educated enough to know what problems to expect and how to deal with them. Or if not, pay someone who can, and listen to them.

And then boom, all your services are gone due to a pesky capacitor on the motherboard. Also good luck trying to change even one software component of that monolith without disrupting and jeopardizing the whole operation.

While it is useful advice for some people in certain conditions, it should be taken with a grain of salt.

> Part of the "cloud premium" for load balancers, serverless computing, and small VMs is based on how much extra capacity your cloud provider needs to build in order to handle their peak load. You're paying for someone's peak load anyway!

Eh, sort of. The difference is that the cloud can go find other workloads to fill the trough from off peak load. They won’t pay as much as peak load does, but it helps offset the cost of maintaining peak capacity. Your personal big server likely can’t find paying workloads for your troughs.

I also have recently come to the opposite conclusion for my personal home setup. I run a number of services on my home network (media streaming, email, a few personal websites and games I have written, my frigate NVR, etc). I had been thinking about building out a big server for expansion, but after looking into the costs I bought 3 mini pcs instead. They are remarkably powerful for their cost and size, and I am able to spread them around my house to minimize footprint and heat. I just added them all to my home Kubernetes cluster, and now I have capacity and the ability to take nodes down for maintenance and updates. I don’t have to worry about hardware failures as much. I don’t have a giant server heating up one part of my house.

It has been great.

Those servers are mainly designed for enterprise use cases. For hobby projects, I can understand why someone would choose Hetzner over AWS.

For enterprise environments, however, there is much more to consider. One of the biggest costs you face is your operations team. If you go with Hetzner, you essentially have to rebuild a wide range of infrastructure components yourself (WAF, globally distributed CDN, EFS, RDS, EKS, Transit Gateways, Direct Connect and more).

Of course, you can create your own solutions for all of these. At my company, a mid-size enterprise, we once tried to do exactly that.

WAF: https://github.com/TecharoHQ/anubis

CDN: Hetzner nodes with cache in Finland, the USA and Germany

RDS: Self-hosted MySQL from Bitnami

EFS: https://github.com/rook/rook

EKS: https://github.com/vitobotta/hetzner-k3s

and 20+ more moving targets of infra software stack and support systems

The result was hiring more than 10 freelancers in addition to 5 of our DevOps engineers to build it all, handle the complexity of such a setup, and keep everything up to date, spending hundreds of thousands of dollars. Meanwhile, our AWS team, consisting of only three people working with Terraform, proved far more cost-effective. Not in terms of dollars per CPU core, but in terms of average spending per project once staff costs and everything else were included.

I think many of the HN posts that say things like "I saved 90% of my infra bill by moving from AWS to a single Hetzner server" are a bit misleading.

  • Most of those things you listed are work arounds for having a slow server/system.

    For example, if you serve your assets from the server you can skip a CORS round trip. If you use an embedded database like SQLite you can shave off 50ms, use a dedicated CPU (another 50ms), and now you don't need to serve anything from the edge, because your global latency is much better.

    Managing a single VPS is trivial compared to AWS.