Comment by BirAdam
17 hours ago
Just to be honest for a bit here... we should also be asking: what kind of scale?
Quite a while ago, before containers were a thing at all, I did systems for some very large porn companies. They were doing streaming video at scale before almost anyone, and the only other outfit working on video at that scale was YouTube.
The general setup for the largest players in that space was haproxy in front of nginx in front of several PHP servers in front of a MySQL database with one read/write primary and one read-only replica. Storage (at that time) was usually done with GlusterFS. This was scalable enough at the time for hundreds of thousands of concurrent users, though the video quality was quite a bit lower than what people expect today.
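For anyone who hasn't seen that shape, the front of it is only a few lines of haproxy config. A minimal sketch (names and addresses are made up; each nginx box would proxy to its local PHP pool, which hits the MySQL primary for writes and the replica for reads):

    defaults
        mode http
        timeout connect 5s
        timeout client  30s
        timeout server  30s

    frontend public
        bind *:80
        default_backend nginx_tier

    backend nginx_tier
        balance roundrobin
        server nginx1 10.0.0.11:80 check
        server nginx2 10.0.0.12:80 check

The "check" keyword gets you health checks for free, which is most of what people reach for a managed load balancer to get.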
Today at AWS, it is easily possible for people to spend a multiple of the cost of that hardware setup every month for far less compute power and storage.
THANK YOU. People look at me like I'm insane when I tell them that their overly-complicated pipeline could be easily handled by a couple of beefy servers. Or, at best, they'll argue that "this way, they don't have to manage infrastructure." Except you do - you absolutely do. It's just been partially abstracted away, and some parts like OS maintenance are handled (not that that was ever the difficult part of managing servers), but you still have to configure and monitor whatever XaaS you're renting.
I do consulting in this space, and I'm torn: I make much more money managing infrastructure for clients who insist on AWS. But it's much more enjoyable to work with people who know how to keep it simple.
I worked on a project for my company (a low volume basic web app) and I suggested we could just start the whole thing on one server. They brought on some Azure consultants and the project ballooned out to months of work and all kinds of services. I’m convinced most of the consultants were just piling on services so they could make more money.
7 replies →
What I've always found concerning about these managed setups is that the "platform" teams could never explain, in simple terms, how the application was actually deployed.
It was so complex I gave up after a while. That’s never a good sign.
Anyone who says "they don't have to manage infrastructure" I would invite to deal with a multi-environment Terraform setup and then tell me again what they don't have to manage.
While Terraform is not ideal, it is much, much easier to deal with managed services in AWS than with on-premises bare-metal servers.
Most are biased because they like dealing with the kinds of issues that come with on-premises setups.
They like dealing with the performance regressions, heat maps, kernel issues, etc. Because why not? You are a developer and you need some way to exercise your skills. AWS takes that away and makes you focus on the product. Issues arising from AWS only require talking to support. Most developers got into this industry for the love of solving these problems, not for solving product requirements.
AWS takes away what devs like and brings in more "actual" work.
12 replies →
Those are the ones that also usually tell you you can just stitch together a few SaaS products and it's magic.
3 replies →
I've certainly done some things where outsourcing hosting meant I didn't have to manage infrastructure. But for services running on VM instances in GCP vs. services running on bare-metal managed hosts, there's not a whole lot of difference in terms of management, IMHO.
But any infrastructure that the product I support uses is infrastructure I need to manage; having it outside my control just makes it that much harder to manage. If it's outside my control, the people who control it had better do a much better job of managing it than I would, otherwise it's going to be a much bigger pain.
I'll play devil's advocate a little bit here. But to be clear, I hate AWS and all of their crazy concepts and exorbitant pricing, so ultimately I think I'm on your side.
OS maintenance honestly is a bit hard for me. I need to know what to install for monitoring, I need to maintain scripts or Ansible playbooks. I need to update these and make sure they don't break my setup.
And the big kicker is compliance. I always work under SOC2, ISO27001, PCI-DSS, HIPAA, you name it. These require even more things like intrusion detection, antivirus, very detailed logging, backups, backup testing, web application firewall. When you just use AWS Lambda with DynamoDB, the compliance burden goes down a lot.
Yes, you need to write Ansible initially. But honestly, it’s not that much for your average application server. Turn on unattended-upgrades with anything critical to your application blacklisted, and you won’t have to touch it other than to bump version pins whenever you make a new golden image.
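For concreteness, the blacklist part is a couple of lines of apt configuration. A sketch of what might go in /etc/apt/apt.conf.d/50unattended-upgrades, assuming (hypothetically) that postgresql and nginx are the packages you can't afford to have bumped out from under you:

    // security updates keep flowing, but these are never touched automatically
    Unattended-Upgrade::Package-Blacklist {
        "postgresql-";
        "nginx";
    };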
Re: compliance, other than SOC2 being a giant theater of bullshit, agreed that it adds additional work. My point is that the claim of "not having to manage infrastructure" is highly misleading. You get to skip some stuff, yes, but you are paying through the nose to avoid writing some additional config files.
Have always felt the same.
I’ve seen an entire company proudly proclaim a modern multicore Xeon with 32GB RAM can do basic monitoring tasks that should have been possible with little more than an Arduino.
Except the 32GB Xeon was far too slow for their implementation...
Let me guess: database tables with no indexes, full scans everywhere?
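It's the classic failure mode, and cheap to demonstrate. A self-contained sketch using Python's built-in sqlite3 (table and column names invented) that shows the planner flipping from a full scan to an index search:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (user_id INTEGER, ts INTEGER, payload TEXT)")

    # without an index, the filter below is a full table scan: "SCAN events"
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchall())

    con.execute("CREATE INDEX idx_events_user ON events(user_id)")

    # now: "SEARCH events USING INDEX idx_events_user (user_id=?)"
    print(con.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM events WHERE user_id = 42").fetchall())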
I swear, before I finished reading your comment, this thought jumped into my mind: 'oh my, they host everything on a computer similar to my work machine [pretty old, by the way, but still beefy]! Impressive!'
Which, I still believe, is perfectly possible to do.
Then I was like, 'what?!'
How did they implement it? That's horrendous.
Working on various teams operating on infrastructure that ranged from a rack in the back of the office, a few beefy servers in a colo, a fleet of Chef-managed VMs, GKE, ECS, and various PaaSes, what I've liked the most about the cloud and containerized workflows is that they wind up being a forcing function for reproducibility, at least to a degree.
While it's absolutely 100% possible to have a "big beefy server architecture" that's reasonably portable, reproducible, and documented, it takes discipline and policy to avoid the "there's a small issue preventing {something important}, I can fix it over SSH with this one-liner and totally document it/add it to the config management tooling later once we've finished with {something else important}" pattern, and once people have been doing that for a while it's a total nightmare to unwind down the line.
Sometimes I want to smash my face into my monitor the 37th time I push an update to some CI code and wait 5 minutes for it to error out, wishing I could just make that band-aid fix, but at the end of the day I can't forget to write down what I did, since it's in my Dockerfile or deploy.yaml or entrypoint.sh or Terraform or whatever.
You'd have to remove admin rights from your admins, then, because scrappy enough DevOps/platform engineers/whatever will totally hand-edit your AWS infra or Kubernetes deployments. I suffered that first hand. And it's even worse than in the old days, because back then it was at least expected.
8 replies →
I'm still a pretty big fan of Docker (compose) behind Caddy as a reverse-proxy... I think that containers do offer a lot in terms of application support... even if it's a slightly bigger hoop to get started with in some ways.
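For reference, the whole thing is about ten lines of compose YAML. A minimal sketch (image names and the port are hypothetical); Caddy reaches the app container by its service name, and a one-line Caddyfile like "example.com { reverse_proxy app:3000 }" gets you automatic HTTPS:

    services:
      caddy:
        image: caddy:2
        ports: ["80:80", "443:443"]
        volumes:
          - ./Caddyfile:/etc/caddy/Caddyfile:ro

      app:
        image: my-app:latest   # hypothetical application image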
1 reply →
I totally agree. So much complexity for generally no good reason [0]. I saw so much of this that I ended up starting a company doing the exact opposite. I figured I could do it better and cheaper, so that's now what we do!
If anyone wants to bail out of AWS et al and onto a few beefy servers, save some money, and gain a DevOps team in the process, then drop us an email (adam at domain in bio).
[0] My pet theory about the real reason: the hyper-scalers hire all the engineers who have the skills to deploy-to-a-few-beefy-servers, and then charge a 10x multiplier for compute. Companies can then choose between impossible hiring, or paying more. Paying more is easier to stomach, and plenty of rationalisations are available.
> My pet theory about the real reason: the hyper-scalers hire all the engineers who have the skills to deploy-to-a-few-beefy-servers, and then charge a 10x multiplier for compute.
This is also my pet theory, and it’s maddening. They’ve successfully convinced an entire generation of devs that physical servers are super scary and they shouldn’t ever have to look at them.
On the other hand, I know a lot of people who spend more time/salary messing around with their infra than the couple hundred bucks they've saved by not pressing a couple of buttons on Vercel/Cloudflare.
There's a time and place for just deploying quickly to a cloud provider versus trying to manage your infra. It's a nuanced tradeoff that rarely has a clear winner.
Docker compose on a couple nice VPS’s can do a LOT
I look at what I can do with an old Mac mini (2011) and it's quite good. I think the only issue with hardware is technical maintenance, but at the scale of a small company that's probably solved by a support contract with Dell and co.
Small companies should never forget to ask Dell, etc for discounts. The list prices at many of these companies are aspirational and, even at very small scale, huge discounts are available.
I think it depends on what you are optimizing for. If you are a VC funded startup trying to get to product market fit, spending a bit more on say AWS probably makes sense so you can be “agile”. The opportunity cost there might outweigh infrastructure cost. If you are bootstrapped and cost matters a lot, then different story.
The problem with onsite or colo is always the same: you have to keep fighting the same battle again and again and again. In 5 years, when the servers need replacing, you'll be re-justifying the purchase even though you have already proven it saves orders of magnitude in costs.
I've never once been rewarded for saving 100k+ a month, even though I have done exactly that. I have been punished by having to constantly re-justify the decision, though. I just don't care anymore. I let the "BIG BRAIN MBAs" go ahead and set money on fire in the cloud. It's easier for me. Now I get to hire a team of "cloud architects" to do the infra, at eye-bleeding cost increases, for a system that will never ever see more than a few thousand users.
You can get a server now with, like, five hundred cores and fifty terabytes of RAM. It's expensive, but you can get one.
A used server with sixty cores and one terabyte of RAM is a lot cheaper. Couple thousand bucks. I mean, that's still a lot of bucks, but a terabyte for only four digits?
What I say is that we massively underestimate just how fast computers are these days
On the other hand, there is a real crossroad that pops up that HNers tend to dismiss.
A common story is that since day one you just have lightweight app servers handling HTTP requests, doing 99% I/O. And your app servers can be deployed on a cheap box anywhere, since they're just doing I/O. Maybe they're on Google Cloud Run or a small cluster of $5 VPSes. You've built them so that they have zero deps on the machine they're running on.
But then one day you need to do some sort of computations.
One incremental option is to create a worker that sits on a machine that can crunch the tasks, plus a pipeline to feed it. This can be seen as operationally complex compared to one machine, but it's also simple in other ways.
Another option is to do everything on one beefy server where your app servers just shell out the work on the same machine. This can be operationally simple in some ways, but not necessarily in all ways.
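To make the second option concrete, a minimal Python sketch of the "crunch on the same box" shape (names invented): the request path stays I/O-bound, and a local process pool, sized to the machine's cores by default, absorbs the computation:

    from concurrent.futures import ProcessPoolExecutor

    def crunch(n: int) -> int:
        # stand-in for the CPU-heavy computation
        return sum(i * i for i in range(n))

    if __name__ == "__main__":
        # defaults to one worker per core on the beefy server
        with ProcessPoolExecutor() as pool:
            results = list(pool.map(crunch, [10_000_000] * 8))
            print(results[0])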
In 2010 I was managing 100 servers, with many Oracle and Postgres DBs, PHP, Apache, all on Solaris and Sun hardware. I was constantly amazed at how unable people were to make even roughly correct capacity estimates. I had a discussion with my boss: he wanted to buy 8 servers; I argued one was more than enough. The system, after growing massively, was in 2020 still managing the load with just 3 servers. So I would argue this held not only today, but 15 years ago already.
Most younger devs just have no concept of how limited the hardware we used to run services on was...
I used to run a webmail system with 2m accounts on hardware with less total capacity (ram, disk, CPU throughput) than my laptop...
What's more: it was a CGI (so a new process for every request), and the storage backend spawned separate processes per user.
If you know anything about hardware and look at the typical instances AWS is serving up (other than the ludicrously expensive ones) it's Skylake and older.
I think people have a warped perception of performance, if only because the cloud providers are serving up a shared VM on equipment I'd practically class as vintage computing. You could throw some of the same parts together from eBay and buy the whole system with less than a few months worth of the hourly on-demand cost.
Indeed - they are incredibly fast, it's just buried under layers upon layers of stuff
No worries, another fifteen layers of software abstraction will soak that up pronto.
Depending on your regulatory environment, it can be cost-effective to not have to maintain your own data center with 24/7 security response, environmental monitoring, fire suppression systems, etc. (of course, the majority of businesses are probably not interested in things like SOC 2)
This argument comes up a lot, but it feels a bit silly to me. If you want a beefy server, you start out by renting one. $150/month will get you a server with a 24-core Xeon and 256GB of RAM, in a data center with everything you mentioned plus a 24/7 hands-on technician you can book. Preferably rent two servers, for reliability. Once you outgrow renting servers, you start renting rack space in a certified data center with all the same amenities. Once you outgrow that, you start renting entire racks, then rows of racks or small rooms inside the DC. Then you start renting portions of the DC. Only once you have outgrown that do you have to seriously worry about maintaining your own data center. But at that point you have so much scale that this will be the least of your worries.
14 replies →
The only companies directly dealing with that type of stuff are the ones already at such a scale where they need to actually build their own data centers. Everyone else is just renting space somewhere that already takes care of those things and you just need to review their ISO/SOC reports.
This kind of argument comes from the cloud provider marketing playbook, not reality.
This is handled by colo.
Around 2013 I was handling bursts of up to thousands of requests per second for multi-megabyte file downloads with dynamic authentication, using just PHP5, Apache2, and Haproxy, with single-node MySQL (or it may have been MariaDB by then?) as the database and Redis for caching. On a single mid-range rented server. And Haproxy was only there for operational convenience; you could cut it out and it'd work just as well. No CDN. Rock solid.
My joke but not-actually-a-joke is that the Cloud is where you send a workload that's fast on your laptop, if you need it to be way slower. The performance of these fussy, over-complicated, hard-to-administer[1] systems is truly awful for the price.
[1] They're hypothetically simpler and easier to administer, but I've never seen this in the wild. If anything, we always seem to end up with more hours dedicated to care & feeding of this crap, and more glitchiness and failures, than we would with a handful of rented servers with maybe a CDN in front.
> My joke but not-actually-a-joke is that the Cloud is where you send a workload that's fast on your laptop, if you need it to be way slower.
Not to forget: where you send a workload that is free on your laptop, in order to be charged for it.
Exactly this! The educational product I work on is used by hundreds of thousands of students a day, and the secret to our success is how simple our architecture is. PHP monoliths + Cache (Redis/Memcached) scale super wide basically for free. We don't really think about scalability, it just happens.
I have a friend whose startup had a super complicated architecture that was falling apart at 20 requests per second. I used to be his boss a lifetime ago, and he brought me in for a meeting with his team to talk about it. I was just there flabbergasted: "Why is any of this so complicated?!" It was hundreds of microservices, many of them black boxes they'd paid for without access to the source. Your app is essentially an async chat app, a fancy forum. It could have been a simple CRUD app.
I basically told my friend I couldn't help, if I can't get to the source of the problematic nodes. They'll need to talk to the vendor. I explained that I'd probably rewrite it from the ground up. They ran out of runway and shut down. He's an AI influencer now...
A CRUD app only carries an async chat app so far. When you start getting a lot of customers (companies, etc.), big chat rooms, and so on, things get complicated.
I saw this kind of system that started as a simple CRUD app, and many years later the developers are still trying to resolve some of the original sins.
If it takes that long to resolve the "original sins", it wasn't simple enough to begin with.
1 reply →
> The general setup for the largest players in that space was haproxy in front of nginx in front of several PHP servers in front of a MySQL database with one read/write primary and one read-only replica.
You'd be surprised how many of the most stable setups today are run this way. The problem is that this way it's hard to attract investors; they'll assume you are running on old or outdated tech. Everything should be serverless, agentic and, at least on paper, hyperscalable, because that sells better.
> Today at AWS, it is easily possible for people to spend a multiple of the cost of that hardware setup every month for far less compute power and storage.
That is actually the goal of the hyperscalers: they charge you a premium for far inferior results. Also, the article stated a very cold truth: "every engineer wants a fashionable CV that will help her get the next job" - and you definitely won't get a job by saying "I moved everything off AWS and put it behind haproxy on one bare-metal box for a $100/mo infra bill".
> The problem is that this way it's hard to attract investors; they'll assume you are running on old or outdated tech. Everything should be serverless, agentic and, at least on paper, hyperscalable, because that sells better.
Investors don't give a shit about your stack
Many do. For most it's not the biggest concern (that would be quite weird). AFAIK it's mostly about reducing risk (avoiding complete garbage / duct-taped setups).
Source: I know a person who does tech DD for investors, and I've also been asked this question in DD processes.
Are those over-engineered systems even actually scalable? I know teams who designed a CQRS architecture using message queues and a distributed NoSQL database and failed to sustain 10 req/s for a read in what is basically a CRUD application. Heck, once someone literally said, "But we use Kafka, why aren't we fast?!"
Exactly this. Every time I see Kafka or similar, it's a web of 10M microprocesses that take more time in invocation alone than if you just ran the program in one go.
How very Kafkaesque.
Eh, they scale between $1000 and $10000 per month fairly easily. I’m not sure about the requests though.
I watched in amusement as the architecture team at $JOB eagerly did a PoC of a distributed RDBMS, only to eventually conclude that the latency was too high. Gee… if only someone had told you that would happen when you mentioned the idea. Oh wait.
Exactly... it was a lot different when a typical server had 2-4 CPUs and cost more than a luxury car... today you get hundreds of simultaneous threads and upwards of a terabyte of RAM for even less, not counting inflation.
You can go a very, very, very long way on 2-3 modern servers with a fast internet connection and a good backup strategy.
Even with a traditional RDBMS like MS-SQL/PostgreSQL, you're no longer bottlenecked by 1-2GHz CPUs and spinning-rust hard drives. You can easily get to millions of users for a typical site/app with a couple of servers, keeping the second just as a read replica for redundancy. As much as I happen to like some of the ergonomics of Mongo from a developer standpoint, or appreciate the scale of Cassandra/ScyllaDB or even Cockroach... it's just not always necessary early on, or ever.
I've historically been more than happy to reach for RabbitMQ or Redis when you need queueing or caching... but that's still so much simpler than where some microservice architectures have gone. And while I appreciate what Apollo and GraphQL bring to the table, it's over the top for the vast majority of applications.
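As an illustration of how little ceremony the simple version takes, a cache-aside sketch with the redis-py client (load_profile_from_db is a hypothetical stand-in, and this assumes a Redis instance on localhost):

    import json
    import redis

    r = redis.Redis()  # assumes a local Redis on the default port

    def load_profile_from_db(user_id: int) -> dict:
        # hypothetical stand-in for the real database query
        return {"id": user_id, "name": "example"}

    def get_profile(user_id: int) -> dict:
        key = f"profile:{user_id}"
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        profile = load_profile_from_db(user_id)
        r.setex(key, 300, json.dumps(profile))  # expire after 5 minutes
        return profile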
I've seen an application, a 95% CRUD application, which had about 100-1000 users across the UK, users who would only be on it from 9am-5:30pm and, even then, barely interacting with it. This was backed by literally the most sophisticated and complex architecture I have ever seen in my entire life.
There were 3 instances of Cognito, plus RDS, DynamoDB and S3. The entire architecture diagram would only be legible on an A2 (heck, maybe even A1) page. And that was the high-level diagram. The central A4 part of that diagram was a bunch of microservices handling different portions of this CRUD application.
This company could afford a system architect as well as a team of developers to work on this full time.
I was genuinely baffled, but this company was in an extremely lucrative industry, so I guess in this case it's fine to just take some of your profits and burn them.
Akin to buying a high performance sports car and never driving it. Maybe it has social value, maybe you just feel good having it.
I’m as likely to talk about human scale as hardware scale, and one of the big issues with human scale is what the consequences are of having the wrong team size in either direction.
When you reduce the man hours per customer you can get farther down your backlog. You can carve people off for new prospective business units. You can absorb the effects of a huge sale or bad press better because you aren’t trying to violate Brooks’ Law nor doing giant layoffs that screw your business numbers.
You have time for people to speculate on big features or to work further on reducing costs. If you don't tackle this work early you end up with the Red Queen problem: running as fast as you can just to stay still.
Around 5 years ago the metagame was to make everything horizontally scalable.
Now it seems things are swinging back the other direction and articles like "Use One Big Server" are getting re-discussed: https://news.ycombinator.com/item?id=45085029
I thought I knew about scaled deployments before I started working where I do now. After starting here, I realized I had no idea what an environment of huuuuge scale actually was. I'd been part of multi-site deployments and scaled infra, but it was basically potatoes comparatively. We have a team whose platform we, on IT, call the DoS'er of the company. It's responsible for processing hundreds of thousands of test runs a day, and the data is fed to a plethora of services afterwards. The scale is so large that they are able to take down critical services, or deeply impact them, purely through throughput, if a developer goes too far (like, say, uploading a million small logs to an S3 bucket every minute).
We've also been contacted by AWS asking what the hell we were doing with a specific set of operations. We do a huge prep for some operations, and the prep feeds massive amounts of data through some AWS services - so much so that they thought we were under attack or had been compromised. Nope, just doin' data ingestion!
It is a mistake to confuse scalability with resiliency.
Yes, we can run Twitter on a single server (https://thume.ca/2023/01/02/one-machine-twitter/). No, we do not want to run Twitter on a single server.
I would argue that even resiliency is a metric that should not be overemphasized in the early stages of development. I would rather have a system that suffers occasional outages than one with perfect resiliency but added complexity, with its tradeoffs in cost and development velocity. In the early stages, the risk of not reaching product-market fit quickly enough is bigger than the risk of losing customers over short outages - except, of course, if the selling point is resiliency.
Of course this should not be overdone, but there is something to be said for single-server + backup setups, and rewriting for scale + resiliency once traction has been established.
It's much easier to build a resilient system with a simple architecture. E.g. run the application on a decent VM or even bare metal server and mirror the whole system between a few different data centers.
The architecture you describe is ok because in the end it is a fairly simple website. Little user interaction, limited amount of content (at most a few million records), few content changes per day. The most complex part is probably to have some kind of search engine but even with 10 million videos an ElasticSearch index is probably no larger than 1GB.
The only problem is that there is a lot of video data.
This is probably also true for 98% of startups.
I think most people don't realise that "10 million" records is small, for a computer.
(That said, I have had to deal with code that included an O(n^2) de-duplication where the test data had n ~= 20,000, causing app startup to take 20 minutes; the other developer insisted there was no possible way to speed this up, later that day I found the problem, asked the CTO if there was a business reason for that de-duplication, removed the de-duplication, and the following morning's stand-up was "you know that 20 minute startup you said couldn't possibly be sped up? Yeah, well, I sped it up and now it takes 200ms")
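For anyone wondering what that bug pattern looks like: a membership test against a growing list is itself O(n), so the loop is O(n^2) overall; tracking seen items in a set makes it O(n). (In the story above the operation was simply deleted, but this is the standard fix when the de-duplication is actually needed.) A sketch:

    def dedupe_quadratic(items):
        out = []
        for item in items:
            if item not in out:   # O(n) list scan -> O(n^2) overall
                out.append(item)
        return out

    def dedupe_linear(items):
        seen, out = set(), []
        for item in items:
            if item not in seen:  # O(1) hash lookup -> O(n) overall
                seen.add(item)
                out.append(item)
        return out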
I thought you were going to say you reduced O(n^2) to O(n log n), but you just deleted the operation. Normally I'd say that's great, but just how much duplicate data is being left around now? Is that OK?
3 replies →
As opposed to what problem?
Like, I honestly have trouble listing many business problems/areas that would fail to scale with their expected user count, given reasonable hardware and technical competence.
Like, YouTube and Facebook are absolute outliers. Famously, Stack Overflow used to run on a single beefy machine (and the reason they changed their architecture was not scaling issues), and "your" startup ain't needing more scale than SO.
Scaling to a lot of reads is relatively easy, but you get into weird architectural territory once you hit a certain volume of writes. Anything involving monitoring or real-time event analysis can get hairy. That's when stuff like kafka becomes really valuable.
In streaming, your website is typically totally divorced from your media serving. Media serving is just a question of cloud storage and pointing at an HLS/DASH manifest in that object store. Once playback starts, the website itself does almost nothing. Live streaming adds more complexity, but it's still not much of a website problem.
Maintaining the media lifecycle (receiving, transcoding, making it available, and removing it) is the big task, but that's not real-time; it's best-effort batch/event processing.
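To make "the website does almost nothing" concrete: the player fetches a manifest like the toy HLS example below straight from object storage, then pulls the segments itself; the app servers never touch a byte of video. (Segment names and durations are invented.)

    #EXTM3U
    #EXT-X-VERSION:3
    #EXT-X-TARGETDURATION:6
    #EXTINF:6.0,
    seg00001.ts
    #EXTINF:6.0,
    seg00002.ts
    #EXT-X-ENDLIST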
The biggest challenge with streaming is maintaining the content catalogue, which isn't just a few million records but rich metadata about the lifecycle and content relationships. Then user management and payments tend to also have significant overhead, especially when you're talking about international payment processing.
This was before HTML5 and before the browser magically handled a lot of this… so there was definitely a bit more to it. Every company also wanted statistics on where people scrub to and all of that. It wasn't super simple, but yeah, it also wasn't crazy complex. The point is, scale is achievable without complex infra.