Comment by throwaway787544

3 years ago

I have been doing this for two decades. Let me tell you about bare metal.

Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.

We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.

But you can't just buy 900 servers. You always need more capacity, so you have to predict what your peak load will be and buy for that. And you have to do it well in advance, because it takes a long time to build and ship 900 servers and then assemble them, run burn-in, replace the duds, and prep the OS, firmware, and software. And you have to do this every 3 years (minimum), because old hardware gets obsolete and slow, hardware dies, disks die, and support contracts expire. But not all at once, because you never know what logistics problems you'll run into, and you might not get all the machines in time to meet your projected peak load.

If back then you told me I could turn on 900 servers for 1 month and then turn them off, no planning, no 3 year capital outlay, no assembly, burn in, software configuration, hardware repair, etc etc, I'd call you crazy. Hosting providers existed but nobody could just give you 900 servers in an hour, nobody had that capacity.

And by the way: cloud prices are retail prices. Get on a savings plan or reserve some instances and the cost can be half. Spot instances are a quarter or less the price. Serverless is pennies on the dollar with no management overhead.
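To put rough numbers on those multipliers (the hourly rate and fleet size below are made up; only the ratios are the ones claimed above):

```python
# Rough cost sketch of the discounts described above. The hourly rate and
# fleet size are hypothetical; only the multipliers come from the comment
# (reserved/savings plan ~ half, spot ~ a quarter or less of on-demand).

ON_DEMAND_HOURLY = 0.40   # assumed retail price per instance-hour
INSTANCES = 100
HOURS_PER_MONTH = 730

def monthly_cost(multiplier: float) -> float:
    return ON_DEMAND_HOURLY * multiplier * INSTANCES * HOURS_PER_MONTH

for name, mult in [("on-demand", 1.0), ("reserved / savings plan", 0.5), ("spot", 0.25)]:
    print(f"{name:>24}: ${monthly_cost(mult):,.0f}/month")
```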

If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you, as it can take up to several days for some cloud vendors to get some hardware classes in some regions. And I pray you were doing daily disk snapshots, and can get your dead disks replaced quickly.

That sounds like you have burst load. Per the article, cloud away, great fit.

The point was that most people don't have that, and even their bursts can fit on a single server. This is my experience as well.

  • The thing that confuses me is, isn't every publicly accessible service bursty on a long timescale? Everything looks seasonal and predictable until you hit the front page of Reddit, and you don't know what day that will be. You don't decide how much traffic you get, the world does.

    • Hitting the front page of reddit is insignificant; it's not like you'll get anywhere near thousands upon thousands of requests each second. If you have a somewhat normal website and you're not doing something weird, then it's easily handled by a single low-end server.

      If I get so much traffic that scaling becomes a problem then I'll be happy as I would make a ton of money. No need to build to be able to handle the whole world at the same time, that's just a waste of money in nearly all situations.

    • If you can't handle traffic from reddit or a larger site, you configured static pages and caching incorrectly, or you run your site on a Raspberry Pi, I guess.

> I have been doing this for two decades. Let me tell you about bare metal.

> Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.

> We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.

I started working with real (bare metal) servers on real internet loads in 2004 and retired in 2019. While there's truth here, there's also missing information. In 2004, all my servers had 100M ethernet, but in 2019, all my new servers had 4x10G ethernet (2x public, 2x private); actually some of them had 6x, but with 2x unconnected, I dunno why. In the meantime, CPUs, NICs, and operating systems have improved such that if you're not getting line rate for full-MTU packets, it's probably because your application uses a lot of CPU, or you've hit a pathological case in the OS (which happens, but if you're running 1000 servers, you've probably got someone to debug that).
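For a sense of what "line rate for full-MTU packets" means in packets per second, here's the back-of-the-envelope arithmetic (standard Ethernet framing assumed; the numbers are illustrative, not from any particular NIC):

```python
# Back-of-the-envelope packet rates for one 10G port. Assumes standard
# Ethernet: 18 bytes of frame header + FCS, and 20 bytes of preamble +
# inter-frame gap on the wire. Purely illustrative arithmetic.

LINK_BPS = 10e9  # one 10 Gbit/s port

def packets_per_second(payload_bytes: int) -> float:
    on_wire = payload_bytes + 18 + 20   # payload + header/FCS + preamble/IFG
    return LINK_BPS / (on_wire * 8)

print(f"full-MTU (1500B payload) packets: ~{packets_per_second(1500):,.0f}/sec")
print(f"minimum (46B payload) packets:    ~{packets_per_second(46):,.0f}/sec")
```

Roughly 800k packets/sec per port at full MTU, versus nearly 15M packets/sec if the link fills with minimum-size packets, which is why the application and OS behavior matter so much.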

If you still need 1000 beefy 10G servers, you've got a pretty formidable load, but splitting it up into many more smaller servers is asking for problems of different kinds. Otoh, if your load really scales to 10x for a month, and you're at that scale, cloud economics are going to work for you.

My seasonal loads were maybe 50% more than normal, but usage trends (and development trends) meant that the seasonal peak would become the new normal soon enough; cloud-managing the peaks would help a bit, but buying for the peak and keeping it running for the growth was fine. Daily peaks were maybe 2-3x the off-peak usage, 5 or 6 days a week; tightly managed cloud provisioning could reduce costs here, but probably not enough to compete with having bare metal for the full day.
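A crude sketch of that last trade-off, with entirely made-up prices (only the load shape, a 2-3x daily peak over a baseline, comes from the above):

```python
# Crude comparison of "buy bare metal for the daily peak" vs "autoscale cloud
# instances to the daily curve". All prices and counts are assumptions; only
# the peak shape follows the comment above.

PEAK_FACTOR = 3            # daily peak is ~3x the off-peak load
PEAK_HOURS = 8             # hours per day at peak
BASELINE_SERVERS = 10      # servers needed off-peak (assumed)

BARE_METAL_MONTHLY = 300   # per server, amortized hardware + colo (assumed)
CLOUD_HOURLY = 1.00        # per equivalent instance, on-demand (assumed)
DAYS_PER_MONTH = 30

# Option A: own enough bare metal for the peak and run it all day.
bare_metal = BASELINE_SERVERS * PEAK_FACTOR * BARE_METAL_MONTHLY

# Option B: cloud instances scaled hour by hour to the load curve.
cloud = DAYS_PER_MONTH * CLOUD_HOURLY * (
    BASELINE_SERVERS * (24 - PEAK_HOURS)
    + BASELINE_SERVERS * PEAK_FACTOR * PEAK_HOURS
)

print(f"bare metal sized for peak: ${bare_metal:,.0f}/month")
print(f"cloud scaled to the curve: ${cloud:,.0f}/month")
```

With these assumed numbers the always-on bare metal still comes out ahead, which is the point: a 2-3x daily swing isn't enough headroom for per-hour pricing to win.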

Let me take you back to March 2020, when millions of Americans woke up to find out there was a pandemic and they would be working from home now. Not a problem, I'll just call up our cloud provider and request more cloud compute. You join a queue of a thousand other customers calling in that morning for the exact same thing. A few hours on hold and the CSR tells you they aren't provisioning any more compute resources. east-us is tapped out, central-europe tapped out hours ago, California got a clue and already called to reserve, so you can't have that either.

I use cloud all the time, but there are also black swan events where your IaaS can't do any more for you.

  • I never had this problem on AWS, though I did see some startups struggle with some more specialized instances. Are midsize companies actually running into issues with non-specialized compute on AWS?

    • Our problem was we had less than 24 hours to transition to work from home. Someone came down with COVID symptoms and spread it to the office, and no one wanted to come in. We didn't have enough laptops for 250+ employees. Developer-equivalent instances (16-core, 32GB RAM, plus GPU) are radically different from general-compute web front ends. And we couldn't get enough of them. We had to tell some staff to hang tight while checking AWS+Azure daily.

      These weren't the typical cheap scale-out, general-compute instances, but virtualized workstations to replace physical, in-office equivalents.

    • The company I was at in March 2020 had no issues getting more general purpose compute, and our growth was massive.

That's a good point about cloud services being retail. My company gets a very large discount from one of the most well-known cloud providers. This is available to everybody - typically, if you commit to 12 months of a minimum usage, you can get substantial discounts. So far, everything we've migrated to the cloud has resulted in significantly reduced total costs, increased reliability, and improved scalability, and is easier to enhance and remediate. Faster, cheaper, better - that's been a huge win for us!

The entire point of the article is that your dated example no longer applies: servers are now powerful enough that the vast majority of common loads fit on a single machine.

Redundancy concerns are also addressed in the article.

> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you

You are taking this a bit too literally. The article itself says one server (and backups). So "one" here just means a small number, not literally no fallback/backup, etc. (obviously... even people you disagree with are usually not morons)

> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you

There's intermediate ground here: rent one big server as a reserved instance. It's cloudy in the sense that you get the benefits of the cloud provider's infrastructure skills, experience, and uptime, plus easy backup provisioning; non-cloudy in that you can treat that one server instance like your own hardware, running (more or less) your own preferred OS/distro with "traditional" services on it (e.g. in our case: nginx, gitea, discourse, mantis, ssh).

> Hosting providers existed but nobody could just give you 900 servers in an hour, nobody had that capacity

> it can take up to several days for some cloud vendors to get some hardware classes in some regions.

I wonder how these two can be true at the same time…

I handled an 8x increase in traffic to my website from a YouTuber reviewing our game by increasing the cache timer and fixing the wiki so it stopped creating session table entries for logged-out users (the wiki required an account to edit anyway).

We were already getting multiple millions of page hits a month before this happened.

The server had 8 cores, but 5 of them were reserved for the game servers running on the same machine, which push 10TB a month in bandwidth.
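For what it's worth, here's a minimal sketch of the session fix as generic WSGI middleware; the cookie name and `create_session` hook are hypothetical, not the wiki's actual code. The idea is simply that anonymous read traffic should stay stateless, and therefore cacheable:

```python
# Minimal sketch: only create server-side session state for logged-in users.
# `create_session` and the "login_token" cookie name are hypothetical stand-ins
# for whatever the real wiki does.

def session_only_when_logged_in(app, create_session):
    def middleware(environ, start_response):
        cookies = environ.get("HTTP_COOKIE", "")
        if "login_token=" in cookies:           # hypothetical login cookie
            environ["app.session"] = create_session(cookies)
        # Anonymous requests get no session row and no Set-Cookie header,
        # so the page stays cacheable by the reverse proxy / CDN.
        return app(environ, start_response)
    return middleware
```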

If you needed 1,000 physical computers to run your webapp, you fucked up somewhere along the line.