Yep, there's a premium on making your architecture more cloudy. However, the best case for Use One Big Server is not necessarily your big monolithic API server, but your database.
Use One Big Database.
Seriously. If you are a backend engineer, nothing is worse than breaking up your data into self-contained service databases, where everything is passed over REST/RPC. Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
It is so much easier to do these joins efficiently in a single database than fanning out RPC calls to multiple different databases, not to mention dealing with inconsistencies, lack of atomicity, etc. etc. Spin up a specific reader of that database if there needs to be OLAP queries, or use a message bus. But keep your OLTP data within one database for as long as possible.
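To make that concrete with a minimal sketch (the users/orders schema below is hypothetical), a typical product ask is one indexed join when the data lives in one database, and an RPC fan-out when it doesn't:

    -- One database: a cross-domain ask is a single query.
    SELECT u.email, o.id, o.total
    FROM users u
    JOIN orders o ON o.user_id = u.id
    WHERE u.country = 'DE'
      AND o.created_at > now() - interval '30 days';
    -- Split into a users service and an orders service, the same ask becomes:
    -- fetch matching users over RPC, then fetch orders per user, then merge
    -- and handle partial failures in application code.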
You can break apart a stateless microservice, but there are few things as stagnant in the world of software as data. Keeping it in one database will keep you nimble for new product features. The boxes that cloud vendors offer today for managed databases are giant!
> Seriously. If you are a backend engineer, nothing is worse than breaking up your data into self-contained service databases, where everything is passed over REST/RPC. Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
This works until it doesn't and then you land in the position my company finds itself in where our databases can't handle the load we generate. We can't get bigger or faster hardware because we are using the biggest and fastest hardware you can buy.
Distributed systems suck, sure, and they make querying across systems a nightmare. However, by giving those aspects up, what you gain is the ability to add new services, features, etc. without running into Scotty yelling "She can't take much more of it!"
Once you get to that point, it becomes SUPER hard to start splitting things out. All of a sudden you have 10000 "just a one off" queries against several domains that are broken by trying to carve out a domain into a single owner.
I don't know what the complexity of your project is, but more often than not the feeling of doom that comes from hitting that wall is bigger than the actual effort it takes to solve it.
People often feel they should have anticipated and avoided the scaling issues altogether, but moving from a single DB to a master/replica model, and/or shards or other solutions, is fairly doable, and it doesn't come with worse tradeoffs than if you had sharded/split services from the start. It always feels fragile and bolted on compared to the elegance of the single DB, but you'd also have many dirty hacks to make a multi-DB setup work properly.
Also, you do that from a position where you usually have money, resources and a good knowledge of your core parts, which is not true when you're still growing full speed.
I've basically been building CRUD backends for websites and later apps since about 1996.
I've fortunately/unfortunately never yet been involved in a project that we couldn't comfortably host using one big write master and a handful of read slaves.
Maybe one day a project I'm involved with will approach "FAANG scale" where that stops working, but you can 100% run 10s of millions of dollars a month in revenue with that setup, at least in a bunch of typical web/app business models.
Early on I did hit the "OMG, we're cooking our database" stage, where we needed to add read caching. When I first did that, memcached was still written in Perl. So that joined my toolbox very early on (sometime in the late 90s).
Once read caching started to not keep up, it was easy enough to make the read cache/memcached layer understand and distribute reads across read slaves. I remember talking to Monty Widenius at The Open Source Conference, I think in San Jose around 2001 or so, about getting MySQL replication to use SSL so I could safely replicate to read slaves in Sydney and London from our write master in PAIX.
I have twice committed the sin of premature optimisation and sharded databases "because this one was _for sure_ going to get too big for our usual database setup". It only ever brought unneeded grief and never actually proved necessary.
Many databases can be distributed horizontally if you put in the extra work, would that not solve the problems you're describing? MariaDB supports at least two forms of replication (one master/replica and one multi-master), for example, and if you're willing to shell out for a MaxScale license it's a breeze to load balance it and have automatic failover.
Shouldn't your company have started to split things out and plan for hitting the limit of hardware a couple of box sizes back? I feel there is a happy middle ground between "spend months making everything a service for our 10 users" and "welp, looks like we can't upsize the DB anymore, guess we should split things off now?"
That is, one huge table keyed by (for instance) alphabet and when the load gets too big you split it into a-m and n-z tables, each on either their own disk or their own machine.
Then just keep splitting it like that. All of your application logic stays the same … everything stays very flat and simple … you just point different queries to different shards.
I like this because the shards can evolve from their own disk IO to their own machines… and later you can reassemble them if you acquire faster hardware, etc.
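For what it's worth, here is a sketch of that a-m / n-z split using Postgres declarative range partitioning (table and column names are made up; the same idea works with hand-rolled shard tables, and partitions can be placed in tablespaces on separate disks):

    CREATE TABLE customers (
        username text NOT NULL,
        email    text,
        PRIMARY KEY (username)
    ) PARTITION BY RANGE (username);

    -- a-m (plus anything sorting before 'a')
    CREATE TABLE customers_a_m PARTITION OF customers
        FOR VALUES FROM (MINVALUE) TO ('n');
    -- n-z (and beyond)
    CREATE TABLE customers_n_z PARTITION OF customers
        FOR VALUES FROM ('n') TO (MAXVALUE);

    -- Application logic stays the same: queries hit "customers" and the
    -- planner routes them to the right shard.
    SELECT * FROM customers WHERE username = 'alice';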
> Once you get to that point, it becomes SUPER hard to start splitting things out.
Maybe, but if you split it from the start you die by a thousand cuts, and likely pay the cost up front, even if you’d never get to the volumes that’d require a split.
> Once you get to that point, it becomes SUPER hard to start splitting things out. All of a sudden you have 10000 "just a one off" queries against several domains that are broken by trying to carve out a domain into a single owner.
But that's survivorship bias, looking back at things from the perspective of your current problems.
You know what's the least future-proof and scalable project? The one that gets canceled because it failed to deliver any value in a reasonable time in the early phase. Once you get to "huge project status" you can afford a glacial pace. Most of the time you can't afford that early on - so even if by some miracle you knew what scaling issues you were going to have long term and invested in fixing them early on - it's rarely been a good tradeoff in my experience.
I've seen more projects fail because they tangle themselves up in unnecessary complexity early on and fail to execute on the core value proposition than I've seen fail from being unable to manage the tech debt 10 years in. Developers like to complain about the second, but they get fired over the first kind. Unfortunately, in today's job market they just resume-pad their failures as "relevant experience" and move on to the next project - so there is no correcting feedback.
I'd be curious to know what your company does which generates this volume of data (if you can disclose), what database you are using and how you are planning to solve this issue.
You can get a machine with multiple terabytes of ram and hundreds of CPU cores easily. If you can afford that, you can afford a live replica to switch to during maintenance.
FastComments runs on one big DB in each region, with a hot backup... no issues yet.
Before you go to microservices you can also shard, as others have mentioned.
This is absolutely true - when I was at Bitbucket (ages ago at this point) and we were having issues with our DB server (mostly due to scaling), almost everyone we talked to said "buy a bigger box until you can't any more" because of how complex (and indirectly expensive) the alternatives are - sharding and microservices both have a ton more failure points than a single large box.
I'm sure they eventually moved off that single primary box, but for many years Bitbucket was run off 1 primary in each datacenter (with a failover), and a few read-only copies. If you're getting to the point where one database isn't enough, you're either doing something pretty weird, are working on a specific problem which needs a more complicated setup, or have grown to the point where investing in a microservice architecture starts to make sense.
One issue I've seen with this is that if you have a single, very large database, it can take a very, very long time to restore from backups. Or for that matter just taking backups.
I'd be interested to know if anyone has a good solution for that.
I'm glad this is becoming conventional wisdom. I used to argue this in these pages a few years ago and would get downvoted below the posts telling people to split everything into microservices separated by queues (although I suppose it's making me lose my competitive advantage when everyone else is building lean and mean infrastructure too).
But also it is about pushing the limits of what is physically possible in computing. As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw ), doing distance over network wires involves hard latency constraints, not to mention dealing with congestion on these wires.
Physical efficiency is about keeping data close to where it's processed. Monoliths can make much better use of the L1, L2, L3, and RAM caches than distributed systems, for speedups often on the order of 100X to 1000X.
Sure it's easier to throw more hardware at the problem with distributed systems but the downsides are significant so be sure you really need it.
Now there is a corollary to using monoliths. Since you only have one DB, that DB should be treated as somewhat sacred; you want to avoid wasting resources inside it. This means being a bit more careful about how you are storing things: using the smallest data structures, normalizing when you can, etc. This is not to save disk; disk is cheap. This is to make efficient use of L1, L2, L3, and RAM.
I've seen boolean true or false values saved as large JSON documents. {"usersetting1": true, "usersetting2": false, "setting1name": "name", etc.} with 10 bits of data ending up as a 1 KB JSON document. Avoid this! Storing documents means the keys (effectively the full table schema) are repeated in every row. It has its uses, but if you can predefine your schema and use the smallest types needed, you gain a lot of performance, mostly through much higher cache efficiency!
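A minimal before/after sketch of that point, reusing the setting names from the example (everything else is assumed):

    -- 10 bits of data stored as a ~1 KB document per row: the keys travel
    -- with every row and blow out the caches.
    CREATE TABLE user_settings_doc (
        user_id  bigint PRIMARY KEY,
        settings jsonb  -- {"usersetting1": true, "usersetting2": false, ...}
    );

    -- Predefined schema with the smallest types needed: one byte per flag,
    -- fixed-width rows, far better cache efficiency.
    CREATE TABLE user_settings (
        user_id      bigint  PRIMARY KEY,
        usersetting1 boolean NOT NULL DEFAULT false,
        usersetting2 boolean NOT NULL DEFAULT false,
        setting1name text
    );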
It's not though. You're just seeing the most popular opinion on HN.
In reality it is nuanced like most real-world tech decisions are. Some use cases necessitate a distributed or sharded database, some work better with a single server and some are simply going to outsource the problem to some vendor.
My hunch is that computers caught up. Back in the early 2000s, horizontal scaling was the only way. You simply couldn't handle even reasonably mediocre loads on a single machine.
As computing becomes cheaper, horizontal scaling is starting to look more and more like unnecessary complexity for even surprisingly large/popular apps.
I mean you can buy a consumer off-the-shelf machine with 1.5TB of memory these days. 20 years ago, when microservices started gaining popularity, 1.5TB RAM in a single machine was basically unimaginable.
'over the wire' is less obvious than it used to be.
If you're in a k8s pod, those calls are really kernel calls. Sure, you're serializing and process switching where you could be just making a method call, but we had to do something.
I'm seeing fewer 'balls of mud' with microservices. That's not zero balls of mud, but it's not a given for almost every code base I wander into.
>"I'm glad this is becoming conventional wisdom. "
Yup, this is what I've always done and it works wonders. Since I do not have bosses, just clients, I do not give a flying fuck about the latest fashion and do what actually makes sense for me and said clients.
I've never understood this logic for webapps. If you're building a web application, congratulations, you're building a distributed system, you don't get a choice. You can't actually use transactional integrity or ACID compliance because you've got to send everything to and from your users via HTTP request/response. So you end up paying all the performance, scalability, flexibility, and especially reliability costs of an RDBMS, being careful about how much data you're storing, and getting zilch for it, because you end up building a system that's still last-write-wins and still loses user data whenever two users do anything at the same time (or you build your own transactional logic to solve that - exactly the same way as you would if you were using a distributed datastore).
Distributed systems can also make efficient use of cache, in fact they can do more of it because they have more of it by having more nodes. If you get your dataflow right then you'll have performance that's as good as a monolith on a tiny dataset but keep that performance as you scale up. Not only that, but you can perform a lot better than an ACID system ever could, because you can do things like asynchronously updating secondary indices after the data is committed. But most importantly you have easy failover from day 1, you have easy scaling from day 1, and you can just not worry about that and focus on your actual business problem.
Relational databases are largely a solution in search of a problem, at least for web systems. (They make sense as a reporting datastore to support ad-hoc exploratory queries, but there's never a good reason to use them for your live/"OLTP" data).
> As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw ), doing distance over network wires involves hard latency constraints, not to mention dealing with congestion on these wires.
Even accounting for CDNs, a distributed system is inherently more capable of bringing data closer to geographically distributed end users, thus lowering latency.
I think a strong test a lot of "let's use Google scale architecture for our MVP" advocates fail is: can your architecture support a performant paginated list with dynamic sort, filter and search where eventual consistency isn't acceptable?
Pretty much every CRUD app needs this at some point and if every join needs a network call your app is going to suck to use and suck to develop.
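As a rough illustration (hypothetical orders/customers schema), this is the kind of query that stays trivial, and strongly consistent, when the joins live in one database, and turns into N network calls plus client-side merging when they don't:

    SELECT o.id, o.created_at, o.total, c.name AS customer
    FROM orders o
    JOIN customers c ON c.id = o.customer_id
    WHERE o.status = 'open'               -- dynamic filter
      AND c.name ILIKE '%acme%'           -- search
    ORDER BY o.created_at DESC, o.id DESC -- dynamic sort
    LIMIT 25 OFFSET 50;                   -- pagination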
I’ve found the following resource invaluable for designing and creating “cloud native” APIs where I can tackle that kind of thing from the very start without a huge amount of hassle https://google.aip.dev/general
I don't believe you. Eventual consistency is how the real world works, what possible use case is there where it wouldn't be acceptable? Even if you somehow made the display widget part of the database, you can't make the reader's eyeballs ACID-compliant.
> if every join needs a network call your app is going to suck to use and suck to develop.
And yet developers do this every single day without any issue.
It is bad practice to have your authentication database be the same as your app database. Or you have data coming from SaaS products, third party APIs or a cloud service. Or even simply another service in your stack. And with complex schemas often it's far easier to do that join in your application layer.
I've seen this evolve into tightly coupled microservices that could be deployed independently in theory, but required exquisite coordination to work.
If you want them to be on a single server, that's fine, but having multiple databases or schemas will help enforce separation.
And, if you need one single place for analytics, push changes to that space asynchronously.
Having said that, I've seen silly optimizations being employed that make sense when you are Twitter, and to nobody else. Slice services up to the point they still do something meaningful in terms of the solution and avoid going any further.
I have done both models. My previous job we had a monolith on top of a 1200 table database. Now I work in an ecosystem of 400 microservices, most with their own database.
What it fundamentally boils down to is that your org chart determines your architecture. We had a single team in charge of the monolith, and it was ok, and then we wanted to add teams and it broke down. On the microservices architecture, we have many teams, which can work independently quite well, until there is a big project that needs coordinated changes, and then the fun starts.
Like always there is no advice that is absolutely right. Monoliths, microservices, function stores. One big server vs kubernetes. Any of those things become the right answer in the right context.
Although I’m still in favor of starting with a modular monolith and splitting off services when it becomes apparent they need to change at a different pace from the main body. That is right in most contexts I think.
To clarify the advice, at least how I believe it should be done…
Use One Big Database Server…
… and on it, use one software database per application.
For example, one Postgres server can host many databases that are mostly* independent from each other. Each application or service should have its own database and be unaware of the others, communicating with them via the services if necessary. This makes splitting up into multiple database servers fairly straightforward if needed later. In reality most businesses will have a long tail of tiny databases that can all be on the same server, with only bigger databases needing dedicated resources.
*you can have interdependencies when you’re using deep features sometimes, but in an application-first development model I’d advise against this.
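A minimal sketch of that layout on a single Postgres server (database and role names are invented):

    -- One Postgres server ("cluster"), one database per application.
    CREATE DATABASE auth_service;
    CREATE DATABASE billing;
    CREATE DATABASE catalog;

    -- Each app connects as its own role and only to its own database,
    -- which keeps a later move to separate servers straightforward.
    CREATE ROLE billing_app LOGIN PASSWORD 'changeme';
    REVOKE CONNECT ON DATABASE billing FROM PUBLIC;
    GRANT CONNECT ON DATABASE billing TO billing_app;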
There's no need for "microservices" in the first place then. That's just logical groupings of functionality that can be separate as classes, namespaces or other modules without being entirely separate processes with a network boundary.
Breaking apart a stateless microservice and then basing it around a giant single monolithic database is pretty pointless - at that stage you might as well just build a monolith and get on with it as every microservice is tightly coupled to the db.
Note that quite a bit of the performance problems come from writing stuff. You can get away with A LOT if you accept that 1. the current service doesn't do (much) writing and 2. it can live with slightly old data. Which I think covers 90% of use cases.
So you can end up with those services living on separate machines and connecting to read only db replicas, for virtually limitless scalability. And when it realizes it needs to do an update, it either switches the db connection to a master, or it forwards the whole request to another instance connected to a master db.
(1) Different programming languages, e.g. you've written your app in Java but now you need to do something for which the perfect Python library is available.
(2) Different parts of your software need different types of hardware. Maybe one part needs a huge amount of RAM for a cache, but other parts are just a web server. It'd be a shame to have to buy huge amounts of RAM for every server. Splitting the software up and deploying the different parts on different machines can be a win here.
I reckon the average startup doesn't need any of that, not suggesting that monoliths aren't the way to go 90% of the time. But if you do need these things, you can still go the microservices route, but it still makes sense to stick to a single database if at all possible, for consistency and easier JOINs for ad-hoc queries, etc.
Agree. Nothing worse than having different programs changing data in the same database. The database should not be an integration point between services.
I disagree. Suppose you have an enormous DB that's mainly written to by workers inside a company, but has to be widely read by the public outside. You want your internal services on machines with extra layers of security, perhaps only accessible by VPN. Your external facing microservices have other things like e.g. user authentication (which may be tied to a different monolithic database), and you want to put them closer to users, spread out in various data centers or on the edge. Even if they're all bound to one database, there's a lot to recommend keeping them on separate, light cheap servers that are built for http traffic and occasional DB reads. And even more so if those services do a lot of processing on the data that's accessed, such as building up reports, etc.
yah, this is something i learned when designing my first server stack (using sun machines) for a real business back during the dot-com boom/bust era. our single database server was the beefiest machine by far in the stack, 5U in the rack (we also had a hot backup), while the other servers were 1U or 2U in size. most of that girth was for memory and disk space, with decent but not the fastest processors.
one big db server with a hot backup was our best tradeoff for price, performance, and reliability. part of the mitigation was that the other servers could be scaled horizontally to compensate for a decent amount of growth without needing to scale the db horizontally.
Definitely use a big database, until you can't. My advice to anyone starting with a relational data store is to use a proxy from day 1 (or some point before adding something like that becomes scary).
When you need to start sharding your database, having a proxy is like having a super power.
We see both use cases: single large database vs multiple small, decoupled ones. I agree with the sentiment that a large database offers simplicity, until access patterns change.
We focus on distributing database data to the edge using caching. Typically this eliminates read-replicas and a lot of the headache that goes with app logic rewrites or scaling "One Big Database".
Yep, with a passive replica or online (log) backup.
Keeping things centralized can reduce your hardware requirement by multiple orders of magnitude. The one huge exception is a traditional web service, those scale very well, so you may not even want to get big servers for them (until you need them).
If you do this then you'll have the hardest possible migration when the time comes to split it up. It will take you literally years, perhaps even a decade.
Shard your datastore from day 1, get your dataflow right so that you don't need atomicity, and it'll be painless and scale effortlessly. More importantly, you won't be able to paper over crappy dataflow. It's like using proper types in your code: yes, it takes a bit more effort up-front compared to just YOLOing everything, but it pays dividends pretty quickly.
This is true IFF you get to the point where you have to split up.
I know we're all hot and bothered about getting our apps to scale up to be the next unicorn, but most apps never need to scale past the limit of a single very high-performance database. For most people, this single huge DB is sufficient.
Also, for many (maybe even most) applications, designated outages for maintenance are not only acceptable, but industry standard. Banks have had, and continue to have designated outages all the time, usually on weekends when the impact is reduced.
Sure, what I just wrote is bad advice for mega-scale SaaS offerings with millions of concurrent users, but most of us aren't building those, as much as we would like to pretend that we are.
I will say that TWO of those servers, with some form of synchronous replication, and point in time snapshots, are probably a better choice, but that's hair-splitting.
(and I am a dyed in the wool microservices, scale-out Amazon WS fanboi).
> If you do this then you'll have the hardest possible migration when the time comes to split it up. It will take you literally years, perhaps even a decade.
At which point a new OneBigServer will be 100x as powerful, and all your upfront work will be for nothing.
It’s never one big database. Inevitably there are backups, replicas, testing environments, staging, development. In an ideal unchanging world where nothing ever fails and workload is predictable, the one big database is also ideal.
What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
> It’s never one big database. Inevitably there are backups, replicas, testing environments, staging, development. In an ideal unchanging world where nothing ever fails and workload is predictable, the one big database is also ideal.
But if you have many small databases, you need
> backups, replicas, testing environments, staging, development
all times `n`. Which doesn't sound like an improvement.
> What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
Bad engineering orgs will snatch defeat from the jaws of victory no matter what the early architectural decisions were. The one vs many databases/services question is almost entirely moot.
Just FYI, you can have one big database, without running it on one big server. As an example, databases like Cassandra are designed to be scaled horizontally (i.e. scale out, instead of scale up).
There are trade-offs when you scale horizontally even if a database is designed for it. For example, DataStax's Storage Attached Indexes or Cassandra's hidden-table secondary indexing allow for indexing on columns that aren't part of the clustering/partitioning, but when you're reading you're going to have to ask all the nodes to look for something if you aren't including clustering/partitioning criteria to narrow it down.
You've now scaled out, but you now have to ask each node when searching by secondary index. If you're asking every node for your queries, you haven't really scaled horizontally. You've just increased complexity.
Now, maybe 95% of your queries can be handled with a clustering key and you just need secondary indexes to handle 5% of your stuff. In that case, Cassandra does offer an easy way to handle that last 5%. However, it can be problematic if people take shortcuts too much and you end up putting too much load on the cluster. You're also putting your latency for reads at the highest latency of all the machines in your cluster. For example, if you have 100 machines in your cluster with a mean response time of 2ms and a 99th percentile response time of 150ms, you're potentially going to be providing a bad experience to users waiting on that last box on secondary index queries.
This isn't to say that Cassandra isn't useful - Cassandra has been making some good decisions to balance the problems engineers face. However, it does come with trade-offs when you distribute the data. When you have a well-defined problem, it's a lot easier to design your data for efficient querying and partitioning. When you're trying to figure things out, the flexibility of a single machine and much cheaper secondary index queries can be important - and if you hit a massive scale, you figure out how you want to partition it then.
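A CQL sketch of that trade-off (table and column names are invented):

    -- Partition-key queries go to the replicas that own that partition.
    CREATE TABLE orders (
        customer_id uuid,
        order_id    timeuuid,
        status      text,
        total       decimal,
        PRIMARY KEY ((customer_id), order_id)
    );
    SELECT * FROM orders WHERE customer_id = ?;   -- one partition

    -- A secondary index lets you query a non-key column, but the read
    -- fans out across the nodes and inherits the slowest node's latency.
    CREATE INDEX ON orders (status);
    SELECT * FROM orders WHERE status = 'pending';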
Cassandra may be great when you have to scale a database that you no longer develop significantly. The problem with this DB system is that you have to know all the queries before you can define the schema.
A relative worked for a hedge fund that used this idea. They were a C#/MSSQL shop, so they just bought whatever was the biggest MSSQL server at the time, updating frequently. They said it was a huge advantage, where the limit in scale was more than offset by productivity.
I think it's an underrated idea. There's a lot of people out there building a lot of complexity for datasets that in the end are less than 100 TB.
But it also has limits. Infamously Twitter delayed going to a sharded architecture a bit too long, making it more of an ugly migration.
I do, and it is running on the same (relatively) big server as my native C++ backend talking to the database. The performance smokes your standard cloudy setup big time: serving a thousand requests per second on 16 cores without breaking a sweat. I am all for monoliths running on real, non-cloudy hardware. As long as the business scale is reasonable and does not approach FAANG (like for 90% of businesses), this solution is superior to everything else money-, maintenance-, and development-time-wise.
I agree with this sentiment, but it is often misunderstood as a mandate to force everything into a single database schema. More people need to learn about logically separating schemas within their database servers!
Another area for consolidation is auth. Use one giant Keycloak, with an individual realm for each of the apps you are running. Your Keycloak is backed by your one giant database.
I agree that 1BDB is a good idea, but having one ginormous schema has its own costs. So I still think data should be logically partitioned between applications/microservices - in PG terms, one “cluster” but multiple “databases”.
We solved the problem of collecting data from the various databases for end users by having a GraphQL layer which could integrate all the data sources. This turned out to be absolutely awesome. You could also do something similar using FDW. The effort was not significant relative to the size of the application.
The benefits of this architecture were manifold but one of the main ones is that it reduces the complexity of each individual database, which dramatically improved performance, and we knew that if we needed more performance we could pull those individual databases out into their own machine.
I'd say, one big database per service. Oftentimes there are natural places to separate concerns and end up with multiple databases. If you ever want to join things for offline analysis, it's not hard to make a MapReduce pipeline of some kind that reads from all of them and gives you that boundless flexibility.
Then if/when it comes time for sharding, you probably only have to worry about one of those databases first, and you possibly shard it in a higher-level logical way that works for that kind of service (e.g. one smaller database per physical region of customers) instead of something at a lower level with a distributed database. Horizontally scaling DBs sound a lot nicer than they really are.
>>(they don't know how your distributed databases look, and oftentimes they really do not care)
Nor should they, it's the engineer's/team's job to provide the database layer to them with high levels of service without them having to know the details
I'm pretty happy to pay a cloud provider to deal with managing databases and hosts. It doesn't seem to cause me much grief, and maybe I could do it better but my time is worth more than our RDS bill. I can always come back and Do It Myself if I run out of more valuable things to work on.
Similarly, paying for EKS or GKE or the higher-level container offerings seems like a much better place to spend my resources than figuring out how to run infrastructure on bare VMs.
Every time I've seen a normal-sized firm running on VMs, they have one team who is responsible for managing the VMs, and either that team is expecting a Docker image artifact or they're expecting to manage the environment in which the application runs (making sure all of the application dependencies are installed in the environment, etc) which typically implies a lot of coordination between the ops team and the application teams (especially regarding deployment). I've never seen that work as smoothly as deploying to ECS/EKS/whatever and letting the ops team work on automating things at a higher level of abstraction (automatic certificate rotation, automatic DNS, etc).
That said, I've never tried the "one big server" approach, although I wouldn't want to run fewer than 3 replicas, and I would want reproducibility so I know I can stand up the exact same thing if one of the replicas go down as well as for higher-fidelity testing in lower environments. And since we have that kind of reproducibility, there's no significant difference in operational work between running fewer larger servers and more smaller servers.
"Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care)."
This isn't a problem if state is properly divided along business domain boundaries and the people who need to access the data have access to it. In fact many use cases require it - publicly traded companies can't let anyone in the organization access financial info and healthcare companies can't let anyone access patient data. And of course there are performance concerns as well if anyone in the organization can arbitrarily execute queries on any of the organization's data.
I would say YAGNI applies to data segregation as well and separations shouldn't be introduced until they are necessary.
"combine these data sources" doesn't necessarily mean data analytics. Just as an example, it could be something like "show a badge if it's the user's birthday", which if you had a separate microservice for birthdays would be much harder than joining a new table.
At my current job we have four different databases, so I concur with this assessment. I think it's okay to have some data in different DBs if they're significantly different; say, the user login data could be in its own database. But anything we do that combines e-commerce and testing/certification should be in one big database so I can write reasonable queries for the information we need. This doesn't include two other databases we have on-prem: one is a Salesforce setup and another is an internal application system that essentially marries Salesforce to that. It's a weird, wild environment to navigate when adding features.
> Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
I'm not sure how to parse this. What should "asks" be?
Mostly agree, but you have to be very strict with the DB architecture. Have a very reasonable schema. Punish long-running queries. If some dev group starts hammering the DB, cut them off early on; don't let them get away with it and then refuse to fix their query design.
The biggest nemesis of big DB approach are dev teams who don't care about the impact of their queries.
Also move all the read-only stuff that can be a few minutes behind to a separate (smaller) server with custom views updated in batches (e.g. product listings). And run analytics out of peak hours and if possible in a separate server.
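One way to express those batch-updated views in Postgres terms (a sketch with assumed table names; the refresh runs on the primary and the result streams to the read-only replica):

    CREATE MATERIALIZED VIEW product_listing AS
    SELECT p.id, p.name, p.price, count(r.id) AS review_count
    FROM products p
    LEFT JOIN reviews r ON r.product_id = p.id
    GROUP BY p.id, p.name, p.price;

    -- Needed so the refresh can run without locking out readers.
    CREATE UNIQUE INDEX ON product_listing (id);

    -- Run from a scheduler, out of peak hours.
    REFRESH MATERIALIZED VIEW CONCURRENTLY product_listing;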
The rule is: Keep related data together. Exceptions are: Different customers (who usually don't require each other's data) can be isolated. And if the database becomes the bottleneck you can separate unrelated services.
Surely having separate DBs all sit on the One Big Server is preferable in many cases. For cases where you really need to extract large amounts of data derived from multiple DBs, there's no real harm in having some cross-DB joins defined in views somewhere. If there are sensible logical ways to break a monolithic service into component stand-alone services, and good business reasons to do so (or it's already been designed that way), then having each talk to its own DB on a shared server should be able to scale pretty well.
If you get your services right, there is little or no communication between the services, since a microservice should have all the data it needs in its own store.
Hardware engineers are pushing the absolute physical limits of getting state (memory/storage) as close as possible to compute. A monumental accomplishment as impactful as the invention of agriculture and the industrial revolution.
Software engineers: let's completely undo all that engineering by moving everything apart as far as possible. Hmmm, still too fast. Let's next add virtualization and software stacks with shitty abstractions.
Fast and powerful browser? Let's completely ignore 20 years of performance engineering and reinvent...rendering. Hmm, sucks a bit. Let's add back server rendering. Wait, now we have to render twice. Ah well, let's just call it a "best practice".
The mouse that I'm using right now (an expensive one) has a 2GB desktop Electron app that seems to want to update itself twice a week.
The state of us, the absolute garbage that we put out, and the creative ways in which we try to justify it. It's like a mind virus.
Actually, those who push for these cloudy solutions do so in part to bring data close to you. I am talking mostly about CDNs; I don't think YouTube and Netflix would have been possible without them.
Google is a US company, but you don't want people in Australia to connect to the other side of the globe every time they need to access Google services, it would be an awful waste of intercontinental bandwidth. Instead, Google has data centers in Australia to serve people in Australia, and they only hit US servers when absolutely needed. And that's when you need to abstract things out. If something becomes relevant in Australia, move it in there, and move it out when it no longer matters. When something big happens, copy it everywhere, and replace the copies by something else as interest wanes.
Big companies need to split everything, they can't centralize because the world isn't centralized. The problem is when small businesses try to do the same because "if Google is so successful doing that, it must be right". Scale matters.
Agreed and I think it's easier to compare tech to the movie industry. Just look at all the crappy movies they produce with IMDB ratings below 5 out of 10, that is movies that nobody's going to even watch; then there are the shitty blockbusters with expensive marketing and greatly simplified stories optimized for mindless blockbuster movie goers; then there are rare gems, true works of art that get recognized at festivals at best but usually not by the masses. The state of the movie industry is overall pathetic, and I see parallels with the tech here.
> Software engineers: let's completely undo all that engineering by moving everything apart as far as possible. Hmmm, still too fast. Let's next add virtualization and software stacks with shitty abstractions.
That's because the concept which is even more impactful than agriculture and the computer, and makes them and everything else in our lives, is abstraction. It makes it possible to reason about large and difficult problems, to specialize, to have multiple people working on them.
Computer hardware is as full of abstraction and separation and specialization as software is. The person designing the logic for a multiplier unit has no more need to know how transistors are etched into silicon than a javascript programmer does.
Heh, there's a mention here of Andy and Bill's Law, "What Andy giveth, Bill taketh away," which is a reference to Andy Grove (Intel) and Bill Gates (Microsoft).
Since I have a long history with Sun Microsystems, upon seeing "Andy and Bill's Law" I immediately thought this was a reference to Andy Bechtolsheim (Sun hardware guy) and Bill Joy (Sun software guy). Sun had its own history of software bloat, with the latest software releases not fitting into contemporary hardware.
> The mouse that I'm using right now (an expensive one) has a 2GB desktop Electron app that seems to want to update itself twice a week.
I'm using a Logitech MX Master 3, and it comes with the "Logi Options+" to configure the mouse. I'm super frustrated with the cranky and slow app. It updates every other day and crashes often.
The experience is much better when I can configure the mouse with an open-source driver [^0] while using Linux.
I use Logi Options too, but while it's stable for me, it still uses a bafflingly high amount of CPU. But if I don't run Logi Options, then mouse buttons 3+4 stop working :-/
It's been like that for years.
Logitech's hardware is great, so I don't know why they think it's OK to push out such shite software.
Let me add fuel to the fire. When I started my career, users were happy to select among a handful of 8x8 bitmap fonts. Nowadays, users expect to see a scalable male-doctor-skin-tone-1 emoji. The former can be implemented by blitting 8 bytes from ROM. The latter requires an SVG engine -- just to render one character.
While bloatware cannot be excluded, let's not forget that user expectations have tremendously increased.
We're not a very serious industry. Despite uhm, it pretty much running the world. We're a joke. Sometimes I feel it doesn't even earn the term "engineering" at all, and rather than improving, it seems to get ever worse.
Which really is a stunning accomplishment in a backdrop of spectacular hardware advances, ever more educated people, and other favorable ingredients.
Software engineers don't want to be managing physical hardware and often need to run highly available services. When a team lacks the skill, geographic presence or bandwidth to manage physical servers but needs to deliver a highly-available service, I think the cloud offers legitimate improvements in operations with downsides such as increased cost and decreased performance per unit of cost.
> However, cloud providers have often had global outages in the past, and there is no reason to assume that cloud datacenters will be down any less often than your individual servers.
A nice thing about being in a big provider is when they go down a massive portion of the internet goes down, and it makes news headlines. Users are much less likely to complain about your service being down when it's clear you're just caught up in the global outage that's affecting 10 other things they use.
This is a huge one -- value in outsourcing blame. If you're down because of a major provider outage in the news, you're viewed more as a victim of a natural disaster rather than someone to be blamed.
I hear this repeated so many times at my workplace, and it's so totally and completely uninformed.
Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter aren't going to calmly accept the excuse that "AWS Went Down" when you can't deliver the services you contractually agreed to deliver. There are industries out there where having your service casually go down a few times a year is totally unacceptable (Healthcare, Government, Finance, etc). I worked adjacent to a department that did online retail a while ago and even an hour of outage would lose us $1M+ in business.
Agreed. Recently I was discussing the same point with a non-technical friend who was explaining that his CTO had decided to move from Digital Ocean to AWS, after DO experienced some outage. Apparently the CEO is furious at him and has assumed that DO are the worst service provider because their services were down for almost an entire business day. The CTO probably knows that AWS could also fail in a similar fashion, but by moving to AWS it becomes more or less an Act of God type of situation and he can wash his hands of it.
I find this entire attitude disappointing. Engineering has moved from "provide the best reliability" to "provide the reliability we won't get blamed for the failure of". Folks who have this attitude missed out on the dang ethics course their college was teaching.
If rolling your own is faster, cheaper, and more reliable (it is), then the only justification for cloud is assigning blame. But you know what you also don't get? Accolades.
I throw a little party of one here when Office 365 or Azure or AWS or whatever Google calls its cloud products this week is down but all our staff are able to work without issue. =)
If you work in B2B you can put the blame on Amazon and your customers will ask "understandable, take the necessary steps to make sure it doesn't happen again". AWS going down isn't an act of God, it's something you should've planned for, especially if it happened before.
I don't really have much to do with contracts - but my company states that we have uptime of 99.xx%.
In terms of the contract, customers don't care if I have Azure/AWS or I keep my server in a box under the stairs. Yes, they do due diligence and would not buy my services if I kept it in a shoebox.
But then if they lose business they come to me... I can go after Azure/AWS, but I am so small they will just throw some free credits at me and tell me to go away.
Maybe if you are in the B2C area then yeah - your customers will probably shrug and say it was M$ or Amazon if you write a sad blog post with excuses.
Users are much more sympathetic to outages when they're widespread. But, if there's a contractual SLA then their sympathy doesn't matter. You have to meet your SLA. That usually isn't a big problem as SLAs tend to account for some amount of downtime, but it's important to keep the SLA in mind.
There is also the consideration that this isn't even an argument of "other things are down too!" or "outsourcing blame" as much as, depending on what your service is of course, you are unlikely to be operating in a bubble. You likely have some form of external dependencies, or you are an external dependency, or have correlated/cross-dependency usage with another service.
Guaranteeing isolation between all of these different moving parts is very difficult. Even if you're not directly affected by a large cloud outage, it's becoming less and less common that you, or your customers, are truly isolated.
As well, if your AWS-hosted service mostly exists to service AWS-hosted customers, and AWS is down, it doesn't matter if you are down. None of your customers are operational anyways. Is this a 100% acceptable solution? Of course not. But for 95% of services/SaaS out there, it really doesn't matter.
Depends on how technical your customer base is. Even as a developer I would tend not to ascribe too much signal to that message. All it tells me is that you don't use AWS.
"We stayed online when GCP, AWS, and Azure go down" is a different story. On the other hand, if those three go down simultaneously, I suspect the state of the world will be such that I'm not worried about the internet.
You also have to factor in the complexity of running thousands of servers vs running just one server. If you run just one server, it's unlikely to go down even once in its lifetime. Meanwhile, cloud providers are guaranteed to have outages due to the sheer complexity of managing thousands of servers.
When migrating from [no-name CRM] to [big-name CRM] at a recent job, the manager pointed out that when [big-name CRM] goes down, it's in the Wall Street Journal, and when [no-name] goes down, it's hard to get their own Support Team to care!
No. Your users have no idea that you rely on AWS (they don't even know what it is), and they don't think of it as a valid or reasonable excuse as to why your service is down.
If you are not maxing out or even getting above 50% utilization of 128 physical cores (256 threads), 512 GB of memory, and 50 Gbps of bandwidth for $1,318/month, I really like the approach of multiple low-end consumable computers as servers. I have been using arrays of Intel NUCs at some customer sites for years with considerable cost savings over cloud offerings. Keep an extra redundant one in the array ready to swap out a failure.
Another often overlooked option is that in several fly-over states it is quite easy and cheap to register as a public telecommunication utility. This allows you to place a powered pedestal in the public right-of-way, where you can get situated adjacent to an optical meet point and get considerable savings on installation costs of optical Internet, even from a tier 1 provider. If your server bandwidth is peak utilized during business hours and there is an apartment complex nearby you can use that utility designation and competitively provide residential Internet service to offset costs.
> competitively provide residential Internet service to offset costs.
I uh. Providing residential Internet for an apartment complex feels like an entire business in and of itself and wildly out of scope for a small business? That's a whole extra competency and a major customer support commitment. Is there something I'm missing here?
It depends on the scale - it does not have to be a major undertaking. You are right, it is a whole extra competency and a major customer support commitment, but for a lot of the entrepreneurial folk on HN quite a rewarding and accessible learning experience.
The first time I did anything like this was in late 1984 in a small town in Iowa where GTE was the local telecommunication utility. Absolutely abysmal Internet service, nothing broadband from them at the time or from the MSO (Mediacom). I found out there was a statewide optical provider with cable going through the town. I incorporated an LLC, became a utility and built out less than 2 miles of single-mode fiber to interconnect some of my original software business customers at first. Our internal motto was "how hard can it be?" (more as a rebuke to GTE). We found out. The whole 24x7 public utility thing was very difficult for just a couple of guys. But it grew from there. I left after about 20 years and today it is a thriving provider.
Technology has made the whole process so much easier today. I am amazed more people do not do it. You can get a small rack-mount sheet metal pedestal with an AC power meter and an HVAC unit for under $2k. Being a utility will allow you to place that on a concrete pad or vault in the utility corridor (often without any monthly fee from the city or county). You place a few bollards around it so no one drives into it. You want to get quotes from some tier 1 providers [0]. They will help you identify the best locations to engineer an optical meet and those are the locations you run by the city/county/state utilities board or commission.
For a network engineer wanting to implement a fault tolerant network, you can place multiple pedestals at different locations on your provider's/peer's network to create a route diversified protected network.
After all, when you are buying expensive cloud based services that literally is all your cloud provider is doing ... just on a completely more massive scale. The barrier to entry is not as high as you might think. You have technology offerings like OpenStack [1], where multiple competitive vendors will also help you engineer a solution. The government also provides (financial) support [2].
The best perk is the number of parking spaces the requisite orange utility traffic cone opens up for you.
You're missing "apartment complex" - you as the service provider contract with the apartment management company to basically cover your costs, and they handle the day-to-day along with running the apartment building.
Done right, it'll be cheaper for them (they can advertise "high speed internet included!" or whatever) and you won't have much to do assuming everything on your end just works.
The days where small ISPs provided things like email, web hosting, etc, are long gone; you're just providing a DHCP IP and potentially not even that if you roll out carrier-grade NAT.
I have only done a few midwestern states. Call them and ask [0] - (919) 733-7328. You may want to first call your proposed county commissioner's office or city hall (if you are not rural), and ask them who to talk with about a new local business providing Internet service. If you can show the Utilities Commission that you are working with someone at the local level I have found they will treat you more seriously. In certain rural counties, you can even qualify for funding from the Rural Utilities Service of the USDA.
EDIT: typos + also most states distinguish between facilities-based ISPs (i.e. with physical plant in the regulated public right-of-way) and other ISPs. Tell them you are looking to become a facilities-based ISP.
We have a different take on running "one big database." At ScyllaDB we prefer vertical scaling because you get better utilization of all your vCPUs, but we still will keep a replication factor of 3 to ensure that you can maintain [at least] quorum reads and writes.
So we would likely recommend running 3x big servers. For those who want to plan for failure, though, they might prefer to have 6x medium servers, because then the loss of any one means you don't take as much of a "torpedo hit" when any one server goes offline.
So it's a balance. You want to be big, but you don't want to be monolithic. You want an HA architecture so that no one node kills your entire business.
I also suggest that people planning systems create their own "torpedo test." We often benchmark to determine maximal optimum performance, presuming that everything is going to go right.
But people who are concerned about real-world outage planning may want to "torpedo" a node to see how a 2-out-of-3-nodes-up cluster operates, versus a 5-out-of-6-nodes-up cluster.
This is like engine-out planning for jets: seeing whether you can keep flying with 2 of 3 engines, or 1 of 2.
Obviously, if you have 1 engine, there is nothing you can do if you lose that single point of failure. At that point, you are updating your resume, and checking on the quality of your parachute.
I think this is the right approach, and I really admire the work you do at ScyllaDB. For something truly critical, you really do want to have multiple nodes available (at least 2, and probably 3 is better). However, you really should want to have backup copies in multiple datacenters, not just the one.
Today, if I were running something that absolutely needed to be up 24/7, I would run a 2x2 or 2x3 configuration with async replication between primary and backup sites.
Exactly. Regional distribution can be vital. Our customer Kiwi.com had a datacenter fire. 10 of their 30 nodes were turned to a slag heap of ash and metal. But 20 of 30 nodes in their cluster were in completely different datacenters so they lost zero data and kept running non-stop. This is a rare story, but you do NOT want to be one of the thousands of others that only had one datacenter, and their backups were also stored there and burned up with their main servers. Oof!
Well said. Caring about vertical scale doesn't mean you have to throw out a lot of the lessons learned about still being horizontally scalable or high availability.
Some comments wrongly equate bare metal with on-premise. Bare-metal servers can be rented, colocated, or installed on-premise.
Also, when renting, the provider takes care of hardware failures. Furthermore, as hard disk failures are the most common issue, you can have hot spares and opt to let damaged disks rot instead of replacing them.
For example, in ZFS, you can mirror disks 1 and 2, while having 3 and 4 as hot spares, with the following command:
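A sketch of such a command (pool name and device paths are placeholders):

    # Mirror the first two disks; keep the other two as hot spares.
    zpool create tank mirror /dev/disk1 /dev/disk2 spare /dev/disk3 /dev/disk4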
Disregarding the security risks of multi-tenant cloud instances, bare-metal is more cost-effective once your cloud bill exceeds $3,000 per year, which is the cost of renting two bare-metal servers.
---
Here's how you can create a two-server infrastructure:
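One possible shape, assuming two rented bare-metal boxes running PostgreSQL (hostnames, paths and credentials below are placeholders): server 1 is the primary, server 2 is a hot standby kept in sync via streaming replication.

    # On the primary: create a replication role and allow the standby
    # host to connect in pg_hba.conf.
    psql -c "CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'changeme';"

    # On the standby: clone the primary and start it in standby mode.
    # -R writes the standby configuration, -X stream copies WAL during the clone.
    pg_basebackup -h primary.example.com -U replicator \
        -D /var/lib/postgresql/16/main -R -X stream
    systemctl start postgresql

If the primary box dies, promote the standby (pg_ctl promote) and point the application at it; keep off-site backups of both machines either way.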
IMO microservices primarily solve organizational problems, not technical problems.
They allow a team to release independently of other teams that have or want to make different risk/velocity tradeoffs. Also smaller units being released means fewer changes and likely fewer failed releases.
I have been doing this for two decades. Let me tell you about bare metal.
Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
But you can't just buy 900 servers. You always need more capacity, so you have to predict what your peak load will be, and buy for that. And you have to do it well in advance because it takes a long time to build and ship 900 servers and then assemble them, run burn-in, replace the duds, and prep the OS, firmware, software. And you have to do this every 3 years (minimum) because old hardware gets obsolete and slow, hardware dies, disks die, support contracts expire. But not all at once, because who knows what logistics problems you'd run into and possibly not get all the machines in time to make your projected peak load.
If back then you told me I could turn on 900 servers for 1 month and then turn them off, no planning, no 3 year capital outlay, no assembly, burn in, software configuration, hardware repair, etc etc, I'd call you crazy. Hosting providers existed but nobody could just give you 900 servers in an hour, nobody had that capacity.
And by the way: cloud prices are retail prices. Get on a savings plan or reserve some instances and the cost can be half. Spot instances are a quarter or less the price. Serverless is pennies on the dollar with no management overhead.
If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you, as it can take up to several days for some cloud vendors to get some hardware classes in some regions. And I pray you were doing daily disk snapshots, and can get your dead disks replaced quickly.
The thing that confuses me is, isn't every publicly accessible service bursty on a long timescale? Everything looks seasonal and predictable until you hit the front page of Reddit, and you don't know what day that will be. You don't decide how much traffic you get, the world does.
> I have been doing this for two decades. Let me tell you about bare metal.
> Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
> We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
I started working with real (bare metal) servers on real internet loads in 2004 and retired in 2019. While there's truth here, there's also missing information. In 2004, all my servers had 100M ethernet, but in 2019, all my new servers had 4x10G ethernet (2x public, 2x private), actually some of them had 6x, but with 2x unconnected, I dunno why. In the meantime, cpu, nics, and operating systems have improved such that if you're not getting line rate for full mtu packets, it's probably because your application uses a lot of cpu, or you've hit a pathological case in the OS (which happens, but if you're running 1000 servers, you've probably got someone to debug that).
If you still need 1000 beefy 10G servers, you've got a pretty formidable load, but splitting it up into many more smaller servers is asking for problems of different kinds. Otoh, if your load really scales to 10x for a month, and you're at that scale, cloud economics are going to work for you.
My seasonal loads were maybe 50% more than normal, but usage trends (and development trends) meant that the seasonal peak would become the new normal soon enough; cloud managing the peaks would help a bit, but buying for the peak and keeping it running for the growth was fine. Daily peaks were maybe 2-3x the off-peak usage, 5 or 6 days a week; a tightly managed cloud provisioning could reduce costs here, but probably not enough to compete with having bare metal for the full day.
Let me take you back to March, 2020. When millions of Americans woke up to find out there was a pandemic and they would be working from home now. Not a problem, I'll just call up our cloud provider and request more cloud compute. You join a queue of a thousand other customers calling in that morning for the exact same thing. A few hours on hold and the CSR tells you they aren't provisioning any more compute resources. east-us is tapped out, central-europe tapped out hours ago, California got a clue and they already called to reserve so you can't have that either.
I use cloud all the time but there are also black swan events where your IaaS can't do any more for you.
I never had this problem on AWS though I did see some startups struggle with some more specialized instances. Are midsize companies actually running into issues with non-specialized compute on AWS?
That's a good point about cloud services being retail. My company gets a very large discount from one of the most well-known cloud providers. This is available to everybody - typically if you commit to 12 months of a minimum usage then you can get substantial discounts. What I know is so far everything we've migrated to the cloud has resulted in significantly reduced total costs, increased reliability, improved scalability, and is easier to enhance and remediate. Faster, cheaper, better - that's been a huge win for us!
The entire point of the article is that your dated example no longer applies: you can fit the vast majority of common loads on a single server now, they are this powerful.
Redundancy concerns are also addressed in the article.
> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you
You are taking this a bit too literally. The article itself says one server (and backups).
So "one" here just means a small number not literally no fallback/backup etc. (obviously... even people you disagree with are usually not morons)
> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you
There's intermediate ground here. Rent one big server, reserved instance. Cloudy in the sense that you get the benefits of the cloud provider's infrastructure skills and experience, and uptime, plus easy backup provisioning; non-cloudy in that you can just treat that one server instance like your own hardware, running (more or less) your own preferred OS/distro, with "traditional" services running on it (e.g. in our case: nginx, gitea, discourse, mantis, ssh)
I handled an 8x increase in traffic to my website from a YouTuber reviewing our game by increasing the cache timer and fixing the wiki creating session-table entries for logged-out users, even though the wiki required an account to edit.
We were already getting multiple millions of page hits a month before this happened.
This server had 8 cores, but 5 of them were reserved for the game servers (about 10 TB a month in bandwidth) running on the same machine.
If you needed 1,000 physical computers to run your webapp, you fucked up somewhere along the line.
I didn't want to write a top-level comment and I'm sure few people will see this, but I scrolled down very far in this thread and didn't see this point made anywhere:
The article focuses almost entirely on technical questions, but the technical considerations are secondary; the reason so many organizations prefer cloud services, VMs, and containers is to manage the challenges of scaling organizationally, not technically.
Giving every team the tools necessary to spin up small or experimental services greases the skids of a large or quickly growing organization. It's possible to set this up on rented servers, but it's an up front cost in time.
The article makes perfect sense for a mature public facing service with a lot of predictable usage, but the sweet spot for cloud services is sprawling organizations with lots of different teams doing lots of different mostly-internally facing things.
I agree with almost everything you said, except that the article offers extremely valuable advice for small startups going the cloud / rented VM route: yearly payments, or approaching a salesperson, can lead to much lower costs.
(I should point out that yesterday, in Azure, I added a VM in a matter of seconds and it took all of 15 minutes to boot up and start running our code. My employer is far too small to have dedicated ops; the cost of cloud VMs is much cheaper than hiring another ops / devops / whatever.)
Yep. To be clear, I thought it was a great article with lots of great advice, just too focused on the technical aspects of cloud benefits, whereas I think the real value is organizational.
Interesting write-up that acknowledges the benefits of cloud computing while starkly demonstrating the value proposition of just one powerful, on-prem server. If it's accurate, I think a lot of people are underestimating the mark-up cloud providers charge for their services.
I think one of the major issues I have with moving to the cloud is a loss of sysadmin knowledge. The more locked in you become to the cloud, the more that knowledge atrophies within your organization. Which might be worth it to be nimble, but it's a vulnerability.
I like One Big (virtual) Server until you come to software updates. At a current project we have one server running the website in production. It runs an old version of Centos, the web server, MySQL and Elasticsearch all on the one machine.
No network RTTs when doing too many MySQL queries on each page - great! But when you want to upgrade one part of that stack... we end up cloning the server, upgrading it, testing everything, and then repeating the upgrade in-place on the production server.
I don't like that. I'd far rather have separate web, DB and Elasticsearch servers where each can be upgraded without fear of impacting the other services.
You could just run system containers (eg. lxd) for each component, but still on one server. That gets you multiple "servers" for the purposes of upgrades, but without the rest of the paradigm shift that Docker requires.
Which is great until there's a security vuln in an end-of-life piece of core software (the distro, the kernel, lxc, etc) and you need to upgrade the whole thing, and then it's a 4+ week slog of building a new server, testing the new software, fixing bugs, moving the apps, finding out you missed some stuff and moving that stuff, shutting down the old one. Better to occasionally upgrade/reinstall the whole thing with a script and get used to not making one-off changes on servers.
If I were to buy one big server, it would be as a hypervisor. Run Xen or something and that way I can spin up and down VMs as I choose, LVM+XFS for snapshots, logical disk management, RAID, etc. But at that point you're just becoming a personal cloud provider; might as well buy smaller VMs from the cloud with a savings plan, never have to deal with hardware, make complex changes with a single API call. Resizing an instance is one (maybe two?) API call. Or snapshot, create new instance, delete old instance: 3 API calls. Frickin' magic.
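For example, resizing an EBS-backed EC2 instance comes down to a stop, a modify, and a start (instance ID and instance type below are placeholders):

```
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type "{\"Value\": \"m6i.4xlarge\"}"
aws ec2 start-instances --instance-ids i-0123456789abcdef0
```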
"the EC2 Instance Savings Plans offer up to 72% savings compared to On-Demand pricing on your Amazon EC2 Instances" - https://aws.amazon.com/savingsplans/
I use LXC a lot for our relatively small production setup. And yes, I'm treating the servers like pets, not cattle.
What's nice is that I can snapshot a container and move it to another physical machine. Handy for (manual) load balancing and upgrades to the physical infrastructure. It is also easy to run a snapshot of the entire server and then run an upgrade, then if the upgrade fails, you roll back to the old snapshot.
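Roughly this, with LXD (container and remote names are illustrative):

```
lxc snapshot web pre-upgrade     # cheap point-in-time snapshot before touching anything
lxc copy web other-host:web      # copy the container to another physical machine
lxc restore web pre-upgrade      # roll back if the upgrade goes sideways
```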
Doesn't the container only help with versioning the software inside it? You're still tied to the host computer's operating system, so when you upgrade that you have to test every single container to see if anything broke?
Whereas if running a VM you have a lot more OS upgrades to do, but you can do them individually and they have no other impact?
This is the bit I've never understood with containers...
In the paper on Twitter’s “Who to Follow” service they mention that they designed the service around storing the entire twitter graph in the memory of a single node:
> An interesting design decision we made early in the Wtf project was to assume in-memory processing on a single server. At first, this may seem like an odd choice, running counter to the prevailing wisdom of "scaling out" on cheap, commodity clusters instead of "scaling up" with more cores and more memory. This decision was driven by two rationales: first, because the alternative (a partitioned, distributed graph processing engine) is significantly more complex and difficult to build, and, second, because we could! We elaborate on these two arguments below.
> Requiring the Twitter graph to reside completely in memory is in line with the design of other high-performance web services that have high-throughput, low-latency requirements. For example, it is well-known that Google's web indexes are served from memory; database-backed services such as Twitter and Facebook require prodigious amounts of cache servers to operate smoothly, routinely achieving cache hit rates well above 99% and thus only occasionally require disk access to perform common operations. However, the additional limitation that the graph fits in memory on a single machine might seem excessively restrictive.
I always wondered if they still do this and if this influenced any other architectures at other companies.
Yeah I think single machine has its place, and I once sped up a program by 10000x by just converting it to Cython and having it all fit in the CPU cache, but the cloud still does have a place! Even for non-bursty loads. Even for loads that theoretically could fit in a single big server.
Uptime.
Or are you going to go down while you wait for all your workers to drain? Long-lived connections? Etc.
It is way easier to gradually hand over across multiple API servers as you do an upgrade than it is to figure out what to do with a single beefy machine.
I'm not saying it is always worth it, but I don't even think about the API servers when a deploy happens anymore.
Furthermore, if you build your whole stack this way, it will be non-distributed-by-default code. Easy to transition for some things, hell for others. Some access patterns or algorithms are fine when everything is in a CPU cache or memory but would fall over completely across multiple machines. Part of the appeal of starting cloud-first is that it is generally easier to scale to billions of people afterwards.
That said, I think the original article makes a nuanced case with several great points and I think your highlighting of the Twitter example is a good showcase for where single machine makes sense.
I have gone well beyond this figure by doing clever tricks in software and batching multiple transactions into IO blocks where feasible. If your average transaction is substantially smaller than the IO block size, then you are probably leaving a lot of throughput on the table.
The point I am trying to make is that even if you think "One Big Server" might have issues down the road, there are always some optimizations that can be made. Have some faith in the vertical.
This path has worked out really well for us over the last ~decade. New employees can pick things up much more quickly when you don't have to show them the equivalent of a nuclear reactor CAD drawing to get started.
> batching multiple transactions into IO blocks where feasible. If your average transaction is substantially smaller than the IO block size, then you are probably leaving a lot of throughput on the table.
Could you expand on this? A quick Google search didn't help. Link to an article or a brief explanation would be nice!
Sure. If you are using some micro-batched event processing abstraction, such as the LMAX Disruptor, you have an opportunity to take small batches of transactions and process them as a single unit to disk.
For event sourcing applications, multiple transactions can be coalesced into a single IO block & operation without much drama using this technique.
Surprisingly, this technique also lowers the amount of latency that any given user should experience, despite the fact that you are "blocking" multiple users to take advantage of small batching effects.
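A rough Python sketch of the batching idea (the setup described here is JVM/LMAX Disruptor; the names and block size below are illustrative):

```python
import os

BLOCK_SIZE = 4096  # assume each serialized transaction is much smaller than one IO block

def flush_batch(log_fd: int, pending: list[bytes]) -> None:
    """Coalesce a small batch of serialized transactions into one block-aligned write."""
    buf = b"".join(pending)
    buf += b"\x00" * (-len(buf) % BLOCK_SIZE)  # pad to the next block boundary
    os.write(log_fd, buf)
    os.fsync(log_fd)  # one durable IO covers every transaction in the batch
    pending.clear()

# Collect incoming transactions until roughly a block's worth (or a tiny timeout)
# has accumulated, then call flush_batch once, instead of one write+fsync each.
```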
As per usual, don't copy Google if you don't have the same requirements. Google Search never goes down. HN goes down from time to time and nobody minds. Google serves tens (hundreds?) of thousands of queries per second. HN serves ten. HN is fine with one server because it's small. How big is your service going to be? Do that boring math :)
Correct. I like to ask "how much money do we lose if the site goes down for 1hr? a day?" etc.. and plan around that. If you are losing 1m an hour, or 50m if it goes down for a day, hell yeah you should spend a few million on making sure your site stays online!
But, it is amazing how often c-levels cannot answer this question!
I think Elixir/Erlang is uniquely positioned to get more traction in the inevitable microservice/kubernetes backlash and the return to single server deploys (with a hot backup). Not only does it usually sip server resources but it also scales naturally as more cores/threads are available on a server.
Going from an Erlang "monolith" to a java/k8s cluster, I was amazed at how much more work it takes to build a "modern" microservice. Erlang still feels like the future to me.
While individual Node.js processes are single-threaded, Node.js ships a standard API (the cluster module) for distributing load across multiple processes, and therefore cores.
Don't be scared of 'one big server' for reliability. I'd bet that if you hired a big server today in a datacenter, the hardware will have more uptime than something cloud-native with az-failover hosted on AWS.
Just make sure you have a tested 30 minute restoration plan in case of permanent hardware failure. You'll probably only use it once every 50 years on average, but it will be an expensive event when it happens.
The way I code now after 10 years: Use one big file.
No executable I'm capable of writing on my own is complex enough to need 50 files spread across a 3-layers-deep directory tree. Doesn't matter if it's a backend, a UI, or what. There's no way your React or whatever tutorial example code needs that either. And you don't gain any meaningful organization splitting into files when there are already namespaces, classes, structs, comments, etc. I don't want to waste time reorganizing it, dealing with imports, or jumping around different files while I code.
Oh, there's some custom lib I want to share between executables, like a Postgres client? Fine, it gets its own new file. Maybe I end up with 4 files in the end.
This is sorta how our team does things, and so far it hasn't presented issues. Each service has the vast majority of its real logic in a single file. Worst case, one day this stops working, and someone takes 10 minutes to split things into a separate file.
On the other side, I've seen people spend hours preemptively deciding on a file structure. It often stops making sense a month later, and every code review has a back and forth argument about what to name a new file.
Reminds me of a company I used to work at which took a similar approach. We used a one-file-per-person policy: each developer had their own file that contained functionality developed by them, named like firstName_lastName.ext - everyone owned their file so we didn't have to worry about merge conflicts.
On the team at my day job, it'd be very bad for each person to strictly "own" their code like that because things get handed off all the time, but in some other situations I can see it making sense.
There are some Firebase specific annoyances to put up with, like the local emulator is not as nice and "isomorphic" as say running postgresql locally.
But the main problem (and I think this is shared by what I call loosely "distributed databases") is you have to think really hard about how the data is structured.
You can't structure it as nicely from a logical perspective compared to a relational DB. Because you can't join without pulling data from all over the place. Because the data isn't in one place. It is hard to do joins both in terms of performance and in terms of developer ergonomics.
I really miss SELECT A.X, B.Y FROM A JOIN B ON A.ID = B.AID; when using Firebase.
You have to make data storage decisions early on, and it is hard to change your mind later. It is hard to migrate (and may be expensive if you have a lot of existing data).
I picked Firebase for the wrong reason (I thought it would make MVP quicker to set up). But the conveniences it provides are outweighed by having to structure your data for distribution across servers.
Instead next time I would go relational, then when I hit a problem do that bit distributed. Most tables have 1000s of records. Maybe millions. The table with billions might need to go out to something distributed.
Market gap??:
Let me rent real servers, but expose it in a "serverless" "cloud-like" way, so I don't have to upgrade the OS and all that kind of stuff.
In my opinion the best argument for RDBMSs came, ironically, from Rick Houlihan, who was at that time devrel for DynamoDB. Paraphrasing from memory, he said "most data is relational, because relationships are what give data meaning, but relational databases don't scale."
Which, maybe if you're Amazon, RDBMSs don't scale. But for a pleb like me, I've never worked on a system even close to the scaling limits of an RDBMS—Not even within an order of magnitude of what a beefy server can do.
DynamoDB, Firebase, etc. require me to denormalize data, shape it to conform to my access patterns—And pray that the access patterns don't change.
No. I think I'll take normalized data in an RDBMS, scaling be damned.
> Let me rent real servers, but expose it in a "serverless" "cloud-like" way, so I don't have to upgrade the OS and all that kind of stuff.
I think you're describing platform-as-a-service? It does exist, but it didn't eat cloud's lunch, rather the opposite I expect.
It's hard to sell a different service when most technical people in medium-big companies are at the mercy of non-technical people who just want things to be as normal as possible. I recently encountered this problem where even using Kubernetes wasn't enough, we had to use one of the big three, even though even sustained outages wouldn't be very harmful to our business model. What can I say, boss want cloud.
Yes, it's very hard to beat Postgres IMO. You can use Firebase without using its database, and you can certainly run a service with a Postgres database without having to rent out physical servers.
At various points in my career, I worked on Very Big Machines and on Swarms Of Tiny Machines (relative to the technology of their respective times). Both kind of sucked. Different reasons, but sucked nonetheless. I've come to believe that the best approach is generally somewhere in the middle - enough servers to ensure a sufficient level of protection against failure, but no more to minimize coordination costs and data movement. Even then there are exceptions. The key is don't run blindly toward the extremes. Your utility function is probably bell shaped, so you need to build at least a rudimentary model to explore the problem space and find the right balance.
1) you need to get over the hump and build multiple servers into your architecture from the get-go (the author says you need two servers minimum), so really we are talking about two big servers.
2) having multiple small servers allows us to spread our service into different availability zones
3) multiple small servers allows us to do rolling deploys without bringing down our entire service
4) once we use the multiple small servers approach it's easy to scale our compute up and down by adding or removing machines. With one server it's difficult to scale up or down without buying more machines. Small servers we can add incrementally, but with the large server approach scaling up requires downtime and buying a new server.
The line of thinking you follow is what is plaguing this industry with too much complexity and simultaneously throwing away incredible CPU and PCIe performance gains in favor of using the network.
Any technical decisions about how many instances to have and how they should be spread out need to start as a business decision and end in crisp numbers about recovery point/time objectives, and yet somehow that nearly never happens.
To answer your points:
1) Not necessarily. You can stream data backups to remote storage and recover from that on a new single server, as long as that recovery fits your Recovery Time Objective (RTO); see the sketch after this list.
2) What's the benefit of multiple AZs if the SLA of a single AZ is greater than your intended availability goals? (Have you checked your provider's single AZ SLA?)
3) You can absolutely do rolling deploys on a single server.
4) Using one large server doesn't mean you can't complement it with smaller servers on an as-needed basis. AWS even has a service for doing this.
Which is to say: there aren't any prescriptions when it comes to such decisions. Some businesses warrant your choices, the vast majority do not.
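On point 1, a sketch of what that streaming can look like (assuming Postgres and an S3-compatible bucket; the specific tools and bucket name are illustrative, not something prescribed above):

```
# base backup streamed straight to object storage, e.g. nightly from cron
pg_basebackup -D - -Ft -X fetch | gzip | aws s3 cp - s3://example-backups/base-$(date +%F).tar.gz
```

Recovery on a fresh box is then a download, an unpack, and a server start, which is easy to rehearse against your RTO.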
> Any technical decisions about how many instances to have and how they should be spread out need to start as a business decision and end in crisp numbers about recovery point/time objectives, and yet somehow that nearly never happens.
Nobody wants to admit that their business or their department actually has a SLA of "as soon as you can, maybe tomorrow, as long as it usually works". So everything is pretend-engineered to be fifteen nines of reliability (when in reality it sometimes explodes because of the "attempts" to make it robust).
Being honest about the actual requirements can be extremely helpful.
> simultaneously throwing away incredible CPU and PCIe performance gains
We really need to double down on this point. I worry that some developers believe they can defeat the laws of physics with clever protocols.
The amount of time it takes to round trip the network in the same datacenter is roughly 100,000 to 1,000,000 nanoseconds.
The amount of time it takes to round trip L1 cache is around half a nanosecond.
A trip down PCIe isn't much worse, relatively speaking. Maybe hundreds of nanoseconds.
Lots of assumptions and hand waving here, but L1 cache can be around 1,000,000x faster than going across the network. SIX orders of magnitude of performance are instantly sacrificed to the gods of basic physics the moment you decide to spread that SQLite instance across US-EAST-1. Sure, it might not wind up a million times slower on a relative basis, but you'll never get access to those zeroes again.
> 2) What's the benefit of multiple AZs if the SLA of a single AZ is greater than your intended availability goals? (Have you checked your provider's single AZ SLA?)
… my providers single AZ SLA is less than my company's intended availability goals.
(IMO our goals are also nuts, too, but it is what it is.)
Our provider, in the worst case (a VM using a managed hard disk) has an SLA of 95% within a month (I … think. Their SLA page uses incorrect units on the top line items. The examples in the legalese — examples are normative, right? — use a unit of % / mo…).
You're also assuming a provider a.) typically meets their SLAs and b.) if they don't, honors them. IME, (a) is highly service dependent, with some services being just stellar at it, and (b) is usually "they will if you can prove to them with your own metrics they had an outage, and push for a credit. Also (c.) the service doesn't fail in a way that's impactful, but not covered by SLA. (E.g., I had a cloud provider once whose SLA was over "the APIs should return 2xx", and the APIs during the outage, always returned "2xx, I'm processing your request". You then polled the API and got "2xx your request is pending". Nothing was happening, because they were having an outage, but that outage could continue indefinitely without impacting the SLA! That was a fun support call…)
There's also (d) AZs are a myth; I've seen multiple global outages. E.g., when something like the global authentication service falls over and takes basically every other service with it. (Because nothing can authenticate. What's even better is the provider then listing those services as "up" / not in an outage, because technically it's not that service that's down, it is just the authentication service. Cause God forbid you'd have to give out that credit. But the provider calling a service "up" that is failing 100% of the requests sent its way is just rich, from the customer's view.)
I agree! Our "distributed cloud database" just went down last night for a couple of HOURS. Well, not entirely down. But there were connection issues for hours.
Guess what never, never had this issue? The hardware I keep in a datacenter lol!
> The line of thinking you follow is what is plaguing this industry with too much complexity and simultaneously throwing away incredible CPU and PCIe performance gains in favor of using the network.
It will die out naturally once people realize how much the times have changed and that the old solutions based on weaker hardware are no longer optimal.
"It depends" is the correct answer to the question, but the least informative.
One Big Server or multiple small servers? It depends.
It always depends. There are many workloads where one big server is the perfect size. There are many workloads where many small servers are the perfect solution.
What my point is, is that the ideas put forward in the article are flawed for the vast majority of use cases.
I'm saying that multiple small servers are a better solution on a number of different axes.
For 1) "One Server (Plus a Backup) is Usually Plenty":
Now I need some kind of remote storage streaming system and some kind of manual recovery. Am I going to fail over to the backup (and so it needs to be as big as my "one server"), or will I need to manually recover from my backup?
2) Yes it depends on your availability goals, but you get this as a side effect of having more than one small instance
3) Maybe I was ambiguous here. I don't just mean rolling deploys of code. I also mean changing the server code, restarting, upgrading and changing out the server. What happens when you migrate to a new server (when you scale up by purchasing a different box)? Now we have a manual process that doesn't get executed very often and is bound to cause downtime.
4) Now we have "Use one Big Server - and a bunch of small ones"
I'm going to add a final point on reliability. By far the biggest risk factor for reliability is me the engineer. I'm responsible for bringing down my own infra way more than any software bug or hardware issue. The probability of me messing up everything when there is one server that everything depends on is much much higher, speaking from experience.
So. Like I said, I could have said "It depends" but instead I tried to give a response that was someway illuminating and helpful, especially given the strong opinions expressed in the article.
I'll give a little color with the current setup for a site I run.
moustachecoffeeclub.com runs on ECS
I have 2 on-demand instances and 3 spot instances
One tiny instance running my caches (redis, memcache)
One "permanent" small instance running my web server
Two small spot instances running web server
One small spot instance running background jobs
small being about 3 GB and 1024 CPU units
And an RDS instance with backup about $67 / month
All in I'm well under $200 per month including database.
So you can do multiple small servers inexpensively.
Another aspect is that I appreciate being able to go on vacation for a couple of weeks, go camping or take a plane flight without worrying if my one server is going to fall over when I'm away and my site is going to be down for a week. In a big company maybe there is someone paid to monitor this, but with a small company I could come back to a smoking hulk of a company and that wouldn't be fun.
> you need to get over the hump and build in multiple servers into your architecture from the get go (the author says you need two servers minimum), so really we are talking about two big servers.
Managing a handful of big servers can be done manually if needed - it's not pretty but it works and people have been doing it just fine before the cloud came along. If you intentionally plan on having dozens/hundreds of small servers, manual management becomes unsustainable and now you need a control plane such as Kubernetes, and all the complexity and failure modes it brings.
> having multiple small servers allows us to spread our service into different availability zones
So will 2 big servers in different AZs (whether cloud AZs or old-school hosting providers such as OVH).
> multiple small servers allows us to do rolling deploys without bringing down our entire service
Nothing prevents you from starting multiple instances of your app on one big server, nor from doing rolling deploys with big bare-metal, assuming one server can handle the peak load (so you take your first server out of the LB, upgrade it, put it back in the LB, then do the same for the second and so on).
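A minimal sketch of the load-balancer side of that dance (nginx here, with illustrative addresses): two app instances, one marked down while it is being upgraded, then a reload and swap.

```
upstream app {
    server 10.0.0.1:8080 down;   # temporarily out of rotation while it is upgraded
    server 10.0.0.2:8080;
}
```

`nginx -s reload` picks up the change without dropping in-flight connections, and the same trick works with two app processes on different ports of a single big server.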
> once we use the multiple small servers approach it’s easy to scale up and down our compute by adding or removing machines. Having one server it’s difficult to scale up or down without buying more machines. Small servers we can add incrementally but with the large server approach scaling up requires downtime and buying a new server.
True but the cost premium of the cloud often offsets the savings of autoscaling. A bare-metal capable of handling peak load is often cheaper than your autoscaling stack at low load, therefore you can just overprovision to always meet peak load and still come out ahead.
I manage hundreds of servers, and use Ansible. It's simple and it gets the job done. I tried to install Kubernetes on a cluster and couldn't get it to work. I mean I know it works, obviously, but I could not figure it out and decided to stay with what works for me.
On a big server, you would probably be running VMs rather than serving directly. And then it becomes easy to do most of what you're talking about - the big server is just a pool of resources from which to make small, single purpose VMs as you need them.
It completely depends on what you're doing. This was pointed out in the first paragraph of the article:
> By thinking about the real operational considerations of our systems, we can get some insight into whether we actually need distributed systems for most things.
I'm building an app with Cloudflare serverless and you can emulate everything locally with a single command and debug directly... It's pretty amazing.
But the way their offerings are structured means it will be quite expensive to run at scale without a multi cloud setup. You can't globally cache the results of a worker function in CDN, so any call to a semi dynamic endpoint incurs one paid invocation, and there's no mechanism to bypass this via CDN caching because the workers live in front of the CDN, not behind it.
Despite their messaging about lowering cloud costs, they have explicitly designed their products to keep people in a cost structure similar to, but different from, egress fees. And in fact it's quite easily bypassed by using a non-Cloudflare CDN in front of Cloudflare serverless.
Anyway, I reached a similar conclusion that for my app a single large server instance works best. And actually I can fit my whole dataset in RAM, so disk/JSON storage and load on startup is even simpler than trying to use multiple systems and databases.
Further, can run this on a laptop for effectively free, and cache everything via CDN, rather than pay ~$100/month for a cloud instance.
When you're small, development time is going to be your biggest constraint, and I highly advocate all new projects start with a monolithic approach, though with a structure that's conducive to decoupling later.
As someone who has only dabbled with serverless (Azure functions), the difficulty in setting up a local dev environment was something I found really off-putting. There is no way I am hooking up my credit card to test something that is still in development. It just seems crazy to me. Glad to hear Cloudflare workers provides a better experience. Does it provide any support for mocking commonly used services?
Yes, you can run your entire serverless infrastructure locally with a single command and close to 0 config.
It's far superior to other cloud offerings in that respect.
You can even run it live in dev mode and remote debug the code. Check out miniflare/Wrangler v2
Just wish they had the ability to run persistent objects. Everything is still request-driven, yet I want to schedule things on sub-minute schedules. You can do it today, but it requires hacks.
Yes, but the worker is in front of the cache (have to pay for an invocation even if cached), and the worker only interacts with the closest cache edge node, not the entire CDN.
But yeah, there are a few hacky ways to work around things. You could have two different URLs and have the client check if the item is stale, if so, call the worker which updates it.
I'm doing something similar with durable objects. I can get it to be persistent by having a cron that calls it every minute and then setting an alarm loop within the object.
It's just super awkward. It feels like a design decision to drive monetization. Cloudflare would be perfect if they let you have a persistent durable object instance that could update global CDN content
It's still the best serverless dev experience for me. Can do everything via JS while having transactional guarantees and globally distributed data right at the edge
One of the first experiences in my professional career was a situation where the "one big server" serving the money-making system failed on a Friday, and HP's warranty meant it would be one or two business days to get a replacement.
The whole thing ended up in a conference call with multiple department directors deciding which server from other systems to cannibalize (even if it was underpowered) to get the system going again.
Since then I've been quite skeptical about "one", and to me this is one of the big benefits of cloud providers: most likely there is another instance available, and stockouts are much rarer.
Science advances as RAM on a single machine increases.
For many years, genomics software was non-parallel and depended on having a lot of RAM - often a terabyte or more - to store data in big hash tables. Converting that to distributed computing was a major effort and to this day many people still just get a Big Server With Lots of Cores, RAM, and SSD.
Personally, after many years of working with distributed, I absolutely enjoy working on a big fat server that I have all to myself.
On the other hand in science, it sure is annoying that the size of problems that fit in a single node is always increasing. PARDISO running on a single node will always be nipping at your heels if you are designing a distributed linear system solver...
As someone who's worked in cloud sales and no longer has any skin in the game, I've seen firsthand how cloud native architectures improve developer velocity, offer enhanced reliability and availability, and actually decrease lock-in over time.
Every customer I worked with who had one of these huge servers introduced coupling and state in some unpleasant way. They were locked in to persisted state, and couldn't scale out to handle variable load even if they wanted to. Beyond that, hardware utilization became contentious at any mid-enterprise scale. Everyone views the resource pool as theirs, and organizational initiatives often push people towards consuming the same types of resources.
When it came time to scale out or do international expansion, every single one of my customers who had adopted this strategy had assumptions baked into their access patterns that made sense given their single server. When it came time to store some part of the state in a way that made sense for geographically distributed consumers, it was months not sprints of time spent figuring out how to hammer this in to a model that's fundamentally at odds.
From a reliability and availability standpoint, I'd often see customers tell me that 'we're highly available within a single data center' or 'we're split across X data centers' without considering the shared failure modes that each of these data centers had. Would a fiber outage knock out both of your DCs? Would a natural disaster likely knock something over? How about _power grids_? People often don't realize the failure modes they've already accepted.
This is obviously not true for every workload. It's tech, there are tradeoffs you're making. But I would strongly caution any company that expects large growth against sitting on a single-server model for very long.
Could confirmation bias affect your analysis at all?
How many companies went cloud-first and then ran out of money? You wouldn't necessary know anything about them.
Were the scaling problems your single-server customers called you to solve unpleasant enough to put their core business in danger? Or was the expense just a rounding error for them?
From this and the other comment, it looks like I wasn't clear about talking about SMB/ME rather than a seed/pre-seed startup, which I understand can be confusing given that we're on HN.
I can tell you that I've never seen a company run out of money from going cloud-first (sample size of over 200 that I worked with directly). I did see multiple businesses scale down their consumption to near-zero and ride out the pandemic.
The answer to scaling problems being unpleasant enough to put the business in danger is yes, but that was also during the pandemic when companies needed to make pivots to slightly different markets. Doing this was often unaffordable from an implementation cost perspective at the time when it had to happen. I've seen acquisitions fall through due to an inability to meet technical requirements because of stateful monstrosities. I've also seen top-line revenue get severely impacted when resource contention causes outages.
The only times I've seen 'cloud-native' truly backfire were when companies didn't have the technical experience to move forward with these initiatives in-house. There are a lot of partners in the cloud implementation ecosystem who will fleece you for everything you have. One such example was a k8s microservices shop with a single contract developer managing the infra and a partner doing the heavy lifting. The partner gave them the spiel on how cloud-native provides flexibility and allows for reduced opex and the customer was very into it. They stored images in a RDBMS. Their database costs were almost 10% of the company's operating expenses by the time the customer noticed that something was wrong.
The common element in the above is scaling and reliability. While lots of startups and companies are focused on the 1% chance that they are the next Google or Shopify, the reality is that nearly all aren't, and the overengineering and redundancy-first model that cloud pushes does cost them a lot of runway.
It's even less useful for large companies; there is no world in which Kellogg is going to increase sales by 100x, or even 10x.
But most companies aren't startups. Many companies are established, growing businesses with a need to be able to easily implement new initiatives and products.
The benefits of cloud for LE are completely different. I'm happy to break down why, but I addressed the smb and mid-enterprise space here because most large enterprises already know they shouldn't run on a single rack.
Wound up spawning off a separate thread from our would-be stateless web api to run recurring bulk processing jobs.
Then coupled our web api to the global singleton-esque bulk processing jobs thread in a stateful manner.
Then wrapped actors on top of actors on top of everything to try to wring as much performance as possible out of the big server.
Then decided they wanted to have a failover/backup server but it was too difficult due to the coupling to the global singleton-esque bulk processing job.
[I resigned at this point.]
So yeah, color me skeptical. I know every project's needs are different, but I'm a huge fan of dumping my code into some cloud host that auto-scales horizontally, and then getting back to writing more code that provides some freaking business value.
If you are at all cost sensitive, you should have some of your own infrastructure, some rented, and some cloud.
You should design your stuff to be relatively easily moved and scaled between these. Build with docker and kubernetes and that's pretty easy to do.
As your company grows, the infrastructure team can schedule which jobs run where, and get more computation done for less money than just running everything in AWS, and without the scaling headaches of on-site stuff.
This post raises small issues like reliability, but misses a lot of much bigger issues like testing, upgrades, reproducibility, backups and even deployments. Also, the author is comparing on-demand pricing, which to me doesn't make sense if you are paying for the server with reserved pricing. Still, I agree there would be a difference of 2-3x (unless your price is dominated by AWS egress fees), but most fixed-workload servers, even for very popular but simple sites, could be run for $1k/month in the cloud, less than 10% of one developer's salary. For non-fixed workloads like ML training, you would need some cloudy setup anyway.
One thing that has helped me grow over the last few years building startups is: microservices software architecture and microservice deployment are two different things.
You can logically break down your software into DDD bounded contexts and have each own its data, but that doesn't mean you need to do Kubernetes with Kafka and dozens of tiny database instances, communicating via json/grpc. You can have each "service" live in its own thread/process, have its own database (in the "CREATE DATABASE" sense, not the instance sense), communicate via a simple in-memory message queue, and communicate through "interfaces" native to your programming language.
Of course it has its disadvantages (need to commit to a single software stack, still might need a distributed message queue if you want load balancing, etc) but for the "boring business applications" I've been implementing (where DDD/logical microservices makes sense) it has been very useful.
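A tiny Python sketch of that shape (the module names and events are made up for illustration):

```python
import queue

# One in-process bus shared by every "service" living in the same deployable.
bus: "queue.Queue[tuple[str, dict]]" = queue.Queue()

class Billing:
    """Bounded context with its own schema; other contexts never touch its tables."""
    def place_invoice(self, order_id: str) -> None:
        # ... write to the `billing` database/schema here ...
        bus.put(("invoice_created", {"order_id": order_id}))

class Notifications:
    """Separate context that reacts to events instead of querying Billing's tables."""
    def run_once(self) -> None:
        event, payload = bus.get()
        if event == "invoice_created":
            print(f"emailing customer about order {payload['order_id']}")

Billing().place_invoice("42")
Notifications().run_once()
```

Swapping the in-memory queue for a real broker later is a contained change, because nothing outside a context reaches into its data.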
I didn't see the point made that cloudy services are easier to manage. If some team gets a capital budget to buy that one big server, they will put everything on it, no matter your architectural standards: cron jobs editing state on disk, tmux sessions shared between teams, random web servers doing who knows what, non-DBA-team Postgres installs, etc. At least in the cloud you can limit certain features and do charge-back calculations.
Not sure if that is a net win for cloud or physical, of course, but I think it is a factor
One of our projects uses 1 big server and indeed, everyone started putting everything on it (because it's powerful): the project itself, a bunch of corporate sites, a code review tool, and god knows what else. Last week we started having issues with the projects going down because something is overloading the system and they still can't find out what exactly without stopping services/moving them to a different machine (fortunately, it's internal corporate stuff, not user-facing systems). The main problem I've found with this setup is that random stuff can accumulate with time and then one tool/process/project/service going out of control can bring down the whole machine. If it's N small machines, there's greater isolation.
I believe that the "one big server" is intended for an application rather than trying to run 500 applications.
Does your application run on a single server? If yes, don't use a distributed system for its architecture or design. Simply buy bigger hardware when necessary, because the top end of servers is insanely big and fast.
It does not mean, IMHO, throw everything on a single system without suitable organization, oversight, isolation, and recovery plans.
I don't agree with EVERYTHING in the article, such as getting 2 big servers rather than multiple smaller ones, but that is really just a cost/requirements issue.
The biggest cost I've noticed with enterprises who go full cloud is that they are locked in for the long term. I don't mean contractually, though; basically, the way they design and implement any system or service MUST follow the provider's "way". This can be very detrimental when leaving the provider, or, god forbid, when the provider decides to sunset certain service versions, etc.
That said, for enterprise it can make a lot of sense and the article covers it well by admitting some "clouds" are beneficial.
For anything I've ever done outside of large businesses the go to has always been "if it doesn't require a SRE to maintain, just host your own".
> Why Should I Pay for Peak Load? [...] someone in that supply chain is charging you based on their peak load
Oh it's even worse than that: this someone oversubscribes your hardware a little during your peak and a lot during your trough, padding their great margins at the expense of extra cache misses/perf degradation of your software that most of the time you won't notice if they do their job well.
This is one of the reasons why large companies such as my employer (Netflix) are able to invest into their own compute platforms to reclaim some of these gains back, so that any oversubscription & collocation gains materialize into a lower cloud bill - instead of having your spare CPU cycles be funneled to a random co-tenant customer of your cloud provider, the latter capturing the extra value.
A consequence of one-big-server is decreased security. You become discouraged from applying patches because you must reboot. Also if one part of the system is compromised, every service is now compromised.
Microservices on distinct systems offer damage control.
> In comparison, buying servers takes about 8 months to break even compared to using cloud servers, and 30 months to break even compared to renting.
Can anyone help me understand why the cloud/renting is still this expensive? I'm not familiar with this area, but it seems to me that big data centers must have some pretty big cost-saving advantages (maintenance? heat management?). And there are several major providers all competing in a thriving marketplace, so I would expect that to drive the cost down. How can it still be so much cheaper to run your own on-prem server?
- The price for on-prem conveniently omits costs for power, cooling, networking, insurance and building space, it's only the purchase price.
- The price for the cloud server includes (your share of) the costs of replacing a broken power supply or hard drive, which is not included in the list price for on-prem. You will have to make sure enough of your devs know how to do that or else hire a few sysadmin types.
- As the article already mentions, the cloud has to provision for peak usage instead of average usage. If you buy an on-prem server you always have the same amount of computing power available and can't scale up quickly if you need 5x the capacity because of a big event. That kind of flexibility costs money.
Not included in the break-even calculation was the cost of colocation, the cost of hiring someone to make sure the computer is in working order, or the reduced hassle when hardware fails.
Also, as the author even mentions in the article, a modern server basically obsoletes a 10-year-old server, so you're going to have to replace your server at least every 10 years. The break-even in the case of renting makes sense when you consider that the server depreciates really quickly.
Renting is not very expensive. 30 months is a large share of a computer's lifetime, and you are paying for space, electricity, and internet access too.
You're paying a premium for flexibility. If you don't need that then there are far cheaper options like some managed hosting from your local datacenter.
I didn't see the COST paper linked anywhere in this thread [0].
Excerpt from abstract:
We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.
Last year I did some consulting for a client using Google cloud services such as Spanner and cloud storage. Storing and indexing mostly timeseries data with a custom index for specific types of queries. It was difficult for them to define a schema to handle the write bandwidth needed for their ingestion. In particular it required a careful hashing scheme to balance load across shards of the various tables. (It seems to be a pattern with many databases to suck at append-often, read-very-often patterns, like logs).
We designed some custom in-memory data structures in Java, but also used some of the standard high-performance concurrent data structures. Some reader/writer locks. gRPC and some pub/sub to get updates on the order of a few hundred or thousand qps. In the end, we ended up with JVM instances that had memory requirements in the 10GB range. Replicate that 3-4x for failover, and we could serve queries at higher rates and lower latency than hitting Spanner. The main thing cloud was good for was the storage of the underlying timeseries data (600GB maybe?) for fast server startup, so that they could load the index off disk in less than a minute. We designed a custom binary disk format to make that blazingly fast, and then just threw binary files into a cloud filesystem.
If you need to serve < 100GB of data and most of it is static...IMHO, screw the cloud, use a big server and replicate it for fail-over. Unless you got really high write rates or have seriously stringent transactional requirements, then man, a couple servers will do it.
I find disk io to be a primary reason to go with bare metal. The vm abstractions just kill io performance. In a single server you can fill up the PCI lanes with flash and hit some ridiculous throughput numbers.
The former, mostly. You don't necessarily have to use EC2, but that's easy to do. There are many other, smaller providers if you really want to get out from under the big 3. I have no experience managing hardware, so I personally wouldn't take that on myself.
Currently using two old computers as servers in my homelab: 200 GE Athlons with 35 W TDP, ~20 GB of value RAM (can't afford ECC), a few 1TB HDDs. As CI servers and test nodes for running containers, they're pretty great, as well as nodes for pulling backups from any remote servers (apart from the ECC aspect), or even something to double as a NAS (on separate drives).
I actually did some quick maths and it would appear that a similar setup on AWS would cost over 600$ per month, Azure, GCP and others also being similarly expensive, which I just couldn't afford.
Currently running a few smaller VPSes on Time4VPS as well (though Hetzner is also great), for the stuff that needs better availability and better networking. Would I want everything on a single server? Probably not, because that would mean needing something a bit better than a homelab setup behind a residential Internet connection (even if parts of it can be exposed to the Internet through a cheap VPS as a proxy, a la Cloudflare).
One thing to keep in mind is separation. The prod environment should be completely separated from the dev ones (plural, it should be cheap/fast to spin up dev environments). Access to production data should be limited to those that need it (ideally for just the time they need it). Teams should be able to deploy their app separately and not have to share dependencies (i.e operating system libraries) and it should be possible to test OS upgrades (containers do not make you immune from this). It's kinda possible to sort of do this with 'one big server' but then you're running your own virtualized infrastructure which has it's own costs/pains.
Definitely also don't recommend one big database, as that becomes a hairball quickly - it's possible to have several logical databases for one physical 'database 'server' though.
people don't account for the cpu & wall-time cost of encode-decode. I've seen it take up 70% of cpu on a fleet. That means 700/1000 servers are just doing encode decode.
You can see high efficiency setups like stackexchange & hackernews are orders of magnitude more efficient.
Not to be nasty, but we used to call them mainframes. A mainframe is still a perfectly good solution if you need five nines of uptime, with transparent failover of pretty much every part of the machine, the absolute fastest single-thread performance and the most transaction throughput per million dollars in the market.
I would not advise anyone to run them as a single machine, however, but to have it partitioned into smaller slices (they call them LPARs) and host lots of VMs in there (you can oversubscribe like crazy on those machines).
Managing a single box is cheaper, even if you have a thousand little goldfish servers in there (remember: cattle, not pets) and this is something the article only touches lightly.
The author missed the most important factor in why cloud is dominating the world today. It is never about the actual hardware cost; it is the cost of educating people to be able to use that big server. I can guarantee you will need to pay at least $40k a month to hire someone who can write and deploy software that actually realizes the performance he claims on that big server. And your chance of finding one within 2 months is close to 0, at least in today's job market. Even if you find one, he can leave you in a year for somewhere else, and your business will be dead.
10 years ago I had a site running on an 8 GB RAM VM ($80/mo?) serving over 200K daily active users, completely dynamic, written in PHP with MySQL running locally. Super fast and never went down!
Could you share how long you maintained this website?
No problem with the db (schema updates, backups, replication, etc...)?
No problem with your app updates (downtime, dependencies updates, code updates)?
Did you work alone, or with a team?
Did you setup a CI/CD?
...
I wrote down some questions, but really I just think it would be interesting to understand what your setup was in a bit more detail. You probably made some concessions and it seems they worked well for you. Would be interesting to know which ones!
Yeah, I've been saying this for a long long time now, an early blog post of mine http://drupal4hu.com/node/305.html and this madness just got worse because of Kubernetes et al. Kubernetes is a Google solution. Are you sure Google-sized solutions are right for your organization?
Also, an equally pseudo-controversial viewpoint: it's almost always cheaper to be down than to engineer an HA architecture. Take a realistic look at the downtime causes outside of your control (for example, your DDoS shield provider going down, etc.), then consider how much downtime a hardware failure actually adds, and think. Maybe a manual-failover master-slave is enough, or perhaps even that's overkill? How much money does the business lose by being down versus how much it costs to protect against it? And can you really protect against it? Are you going to have regular drills to practice the failover -- and, absurdly, will the inevitable downtime from failing a few of those be larger than a single server's downtime? I rarely see posts about weighing these, while the general advice of avoiding single points of failure -- which is very hard -- is abundant.
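A back-of-the-envelope version of that weighing exercise, with every number invented; plug in your own revenue and downtime estimates.

```python
# Expected yearly cost of downtime with one server vs. the cost of building
# and drilling an HA setup. All figures below are assumptions for illustration.
revenue_per_hour = 500.0          # what an hour of downtime costs the business
single_server_downtime_hrs = 12   # expected unplanned downtime per year, one box
ha_downtime_hrs = 2               # expected downtime per year with failover
ha_engineering_cost = 60_000      # yearly cost of building, maintaining and drilling HA

single_server_loss = single_server_downtime_hrs * revenue_per_hour
ha_loss = ha_downtime_hrs * revenue_per_hour + ha_engineering_cost

print(f"one big server:  ~${single_server_loss:,.0f}/yr in expected downtime cost")
print(f"HA architecture: ~${ha_loss:,.0f}/yr (downtime + engineering)")
```

With these made-up numbers the HA setup loses badly; the point is only that the comparison is worth doing explicitly.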
I'm a huge advocate of cloud services, and have been since 2007 (not sure where this guy got 2010 as the start of the "cloud revolution"). That out of the way, there is something to be said for starting off with a monolith on a single beefy server. You'll definitely iterate faster.
Where you'll get into trouble is if you get popular quickly. You may run into scaling issues early on, and then have to scramble to scale. It's just a tradeoff you have to consider when starting your project -- iterate quickly early and then scramble to scale, or start off more slowly but have a better ramping up story.
One other nitpick I had is that OP complains that even in the cloud you still have to pay for peak load, but while that's strictly true, it's amortized over so many customers that you really aren't paying for it unless you're very large. The more you take advantage of auto-scaling, the less of the peak load you're paying. The customers who aren't auto-scaling are the ones who are covering most of that cost.
You can run a pretty sizable business in the free tier on AWS and let everyone else subsidize your peak (and base!) costs.
It really depends on the service, how it is used, the shape of the data generated/consumed, what type of queries are needed, etc.
I've worked for a startup that hit scaling issues with ~50 customers, and I have seen services with a million+ users on a single machine.
And what does "quickly" and "popular" even mean? It also depends a lot on the context. We need to start discussing about mental models for developers to think of scaling in a contextual way.
Sure but only if you architect it that way, which most people don't if they're using one big beefy server, because the whole reason they're doing that is to iterate quickly. It's hard to build something that can bust to the cloud while moving quickly.
Also, the biggest issue is where your data is. If you want to bust to the cloud, you'll probably need a copy of your data in the cloud. Now you aren't saving all that much money anymore and adding in architectural overhead. If you're going to bust to the cloud, you might as well just build in the cloud. :)
It was all good until NUMA came along, and now you have to carefully rethink your process or you get lots of performance issues in your (otherwise) well-threaded code. Speaking from first-hand experience: when our level editor ended up being used by artists on a server-class machine, the supposedly 4x faster machine was actually 2x slower. Why? Lots of std::shared_ptr<> use on our side (or any atomic reference counting) caused slowdowns, as the caches (my understanding) had to be synchronized between the two physical CPUs, each having 12 threads.
That's really not the only issue, just pointing out that you can't expect everything to scale smoothly there unless it's well thought out: for instance, ask your OS to allocate your threads/memory only on one of the physical CPUs (and its threads), put big disconnected parts of your process(es) on the other one(s), and make sure the communication between them is minimal... which actually wants a micro-services-style design again at that level.
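A minimal sketch of the "keep the process on one socket" workaround on Linux. The core numbering is an assumption about the machine's topology (check lscpu or /sys/devices/system/node/); in practice you might just use numactl instead.

```python
# Linux only: confine this process to the cores of one NUMA node so atomics
# and shared cache lines never bounce between sockets.
import os

# Hypothetical node-0 cores plus their SMT siblings; verify against lscpu.
NODE0_CORES = set(range(0, 12)) | set(range(24, 36))

available = os.sched_getaffinity(0)
node0 = (NODE0_CORES & available) or available   # fall back if the assumption doesn't match
os.sched_setaffinity(0, node0)                   # 0 == current process
print("now running on CPUs:", sorted(os.sched_getaffinity(0)))

# Memory placement follows first-touch on Linux, so pages allocated after this
# point will mostly land on node 0 as well; for strict placement you'd reach
# for `numactl --cpunodebind=0 --membind=0` or libnuma instead.
```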
> The big drawback of using a single big server is availability. Your server is going to need downtime, and it is going to break. Running a primary and a backup server is usually enough, keeping them in different datacenters.
What about replication? I assume the 70k postgres IOPS fall to the floor when needing to replicate the primary database to a backup server in a different region.
Great article overall with many good points worth considering. Nothing is one size fits all so I won't get into the crux of the article: "just get one big server". I recently posted a comment breaking down the math for my situation:
It blows my mind that people are spending $2,000+ per month for a server they can get used for a $4,000-5,000 one-time cost.
VMware + Synology Business Backup + Synology C2 backup is our way of doing business and it has never failed us in over 7 years. Why do people spend so much money on cloud when they can host it themselves for less than 5% of the cost (assuming 2 years of usage)?
They have been around forever and their $400 deal is good, but that is for 42U, 1G and only 15 amps. With beefier servers, you will need more than that (both bandwidth and amperage) if you intend on filling the rack.
The number of applications I have inherited that were messes falling apart at the seams because of misguided attempts to avoid "vendor lock-in" with the cloud cannot be overstated. There is something ironic about people paying to use a platform but not using it, because they feel that using it too much will compel them to stay there. It's basically starving yourself so you don't get too familiar with eating regularly.
Kids, this PSA is for you. Auto Scaling Groups are just fine, as are all the other "Cloud Native" services. Most business partners will tell you a dollar of growth is worth 5x-10x a dollar of savings. Building a huge tall computer will be cheaper, but if it isn't 10x cheaper (and that is total cost of ownership, not the cost of the metal) and you are moving more slowly than you otherwise would, it's almost a certainty you are leaving money on the table.
Aggressively avoiding lock-in is something I've never quite understood. Unless your provider of choice is also your competitor (like Spotify with Amazon) it shouldn't really be a problem. I'm not saying I'm a die-hard cloud fan in all aspects, but if you're going with it you may as well use it. Trying to avoid vendor lock-in typically ends up more expensive in the long run: you start avoiding the cheaper services (Lambda for background job processing) because of what may never really become a problem.
The one place I can see avoiding vendor lock-in as really useful is it often makes running things locally much easier. You're kind of screwed if you want to properly run something locally that uses SQS, DynamoDB, and Lambda. But that said, I think this is often better thought of as "keep my system simple" rather than "avoid vendor lock-in" as it focuses on the valuable side rather than the theoretical side.
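A small sketch of that "keep my system simple" framing: code against a tiny queue interface so the whole stack runs locally against an in-memory queue, and only production binds it to SQS. The class and function names here are invented, not a real library API.

```python
# Tiny queue abstraction: local dev and tests use the in-memory version,
# production would use a thin wrapper with the same two methods.
from collections import deque
from typing import Optional, Protocol


class Queue(Protocol):
    def send(self, body: str) -> None: ...
    def receive(self) -> Optional[str]: ...


class InMemoryQueue:
    """Good enough for local dev and tests -- no AWS required."""
    def __init__(self) -> None:
        self._q: deque = deque()

    def send(self, body: str) -> None:
        self._q.append(body)

    def receive(self) -> Optional[str]:
        return self._q.popleft() if self._q else None


# In production you'd swap in a wrapper around boto3's sqs send_message /
# receive_message calls exposing the same interface; the application code
# never knows the difference.

def process_next(q: Queue) -> None:
    msg = q.receive()
    if msg is not None:
        print("processing", msg)


q = InMemoryQueue()
q.send("job-1")
process_next(q)
```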
The whole argument comes down to bursty vs. non-bursty workloads. What type of workloads make up the fat part of the distribution? If most use cases are bursty (which I would argue they are) then the author's argument only applies for specific applications. Therefore, most people do indeed see cost benefits from the cloud.
I really don't understand microservices for most businesses. They're great if you put the effort into them, but most businesses don't have the scale required.
Big databases and big servers serve most businesses just fine. And past that NFS and other distributed filesystem approaches get you to the next phase by horizontally scaling your app servers without needing to decompose your business logic into microservices.
The best approach I've ever seen is a monorepo codebase with non-micro services built into it all running the same way across every app server with a big loadbalancer in front of it all.
No thanks. I have a few hobby sites, a personal vanity page, and some basic CPU-expensive services that I use.
Moving to AWS serverless has saved me so much headache with system updates, certificate management, archival and backup, networking, and so much more. Not to mention that with my low-but-spiky load, my break-even is a long way off.
A big benefit is some providers will let you resize the VM bigger as you grow. The behind-the-scenes implementation is they migrate your VM to another machine with near-zero downtime. Pretty cool tech, and takes away a big disadvantage of bare metal which is growth pains.
I've started augmenting one big server with iCloud (CloudKit) storage, specifically syncing local Realm DBs to the user's own iCloud storage. That means I can avoid taking custody of PII/problematic data, can include non-custodial privacy in the product's value/marketing, and can charge enough of a premium for the one big server to keep it affordable. I know how to scale servers in and out, so I feel the value of avoiding all that complexity. This is a business approach that leans into that, with a way to keep the business growing with domain complexity/scope/adoption (iCloud storage; there are probably other good APIs like this to work with along similar lines).
> Populated with specialized high-capacity DIMMs (which are generally slower than the smaller DIMMs), this server supports up to 8 TB of memory total.
At work we're building a measurement system for wind tunnel experiments, which should be able to sustain 500 MB/sec for minutes on end, preferably while simultaneously reading/writing from/to disk for data format conversion.
We bought a server with 1TB of RAM, but I wonder how much slower these high-capacity DIMMs are. Can anyone point me to information regarding latency and throughput? More RAM for disk caching might be something to look at.
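I don't have DIMM latency numbers to offer, but a quick sizing check (numbers taken from the comment above plus an assumed usable-RAM fraction) suggests RAM buffering alone covers a lot of ground before disk speed even matters.

```python
# How long can a sustained 500 MB/s stream be buffered entirely in RAM
# before anything must hit disk? Usable fraction is an assumption.
ingest_rate_mb_s = 500
ram_gb = 1024                 # the 1 TB box
usable_fraction = 0.8         # leave headroom for the OS, page cache, and the app itself

usable_mb = ram_gb * 1024 * usable_fraction
seconds = usable_mb / ingest_rate_mb_s
print(f"~{seconds / 60:.0f} minutes of capture fit in RAM before disk becomes the bottleneck")
```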
I am using a semi big cloud VPS to host all my live services. It's 'just' a few thousand users per day over 10+ websites.
The combination of Postgres, Nginx, Passenger and Cloudflare makes this an easy experience. The cloud (in this case Vultr) allows on-demand scaling and backups, and so far I've had zero downtime because of them.
In the past I've run a mixture of cloud and some dedicated servers and since migrating I have less downtime and way less work and no worse load times.
Being cloudy without being too cloudy, as per the article, I've gone with a full stack in containers under Docker Compose on one EC2 server, including the database. Services are still logically separated and have a robust CI/CD setup, but the cost is a third of what an ECS setup with load balancers and RDS for the database would have been. It's also simpler. I have scripted the server setup, with regular backups/snapshots, but admit I would like DB replication in there.
If you're hosting on-prem then you have a cluster to configure and manage, you have multiple data centers you need to provision, and you need data backups you have to manage plus the storage required for all those backups. Data centers also require power, cooling, real estate taxes, administration - and you need at least two of them to handle systemic outages. Now you have to manage and coordinate your data between those data centers. None of this is impossible of course; companies have been doing this every day for decades now. But let's not pretend it doesn't all have a cost - and unless your business is running a data center, none of these costs are aligned with your business' core mission.
If you're running a start-up it's pretty much a no-brainer you're going to start off in the cloud.
What's the real criteria to evaluate on-prem versus the cloud? Load consistency. As the article notes, serverless cloud architectures are perfect for bursty loads. If your traffic is highly variable then the ability to quickly scale-up and then scale-down will be of benefit to you - and there's a lot of complexity you don't have to manage to boot! Generally speaking such a solution is going to be cheaper and easier to configure and manage. That's a win-win!
If your load isn't as variable and you therefore have cloud resources always running, then it's almost always cheaper to host those applications on-prem - assuming you have on-prem hosting available to you. As I noted above, building data centers isn't cheap and it's almost always cheaper to stay in the cloud than it is to build a new data center, but if you already have data center(s) then your calculus is different.
Another thing to keep in mind at the moment is even if you decide to deploy on-prem you may not be able to get the hardware you need. A colleague of mine is working on a large project that's to be hosted on-prem. It's going to take 6-12 months to get all the required hardware. Even prior to the pandemic the backlog was 3-6 months because the major cloud providers are consuming all the hardware. Vendors would rather deal with buyers buying hardware by the tens of thousands than a shop buying a few dozen servers. You might even find your hardware delivery date getting pushed out as the "big guys" get their orders filled. It happens.
You know you can run a server in the cellar under your stairs.
You know that if you are a startup, you can just keep servers in a closet and hope that no one turns on the coffee machine while the aircon is running, because that will pop the circuit breakers and take down your server. Or maybe you at least have a UPS, so maybe not :)
I have read horror stories about companies having such setups.
While they don't need multiple data centers, power, cooling and redundancy sound to them like some kind of STD; getting a cheap VPS should be the default for such people. That is a win as well.
Many people will respond that "one big server" is a massive single point of failure, but in doing so they miss that it is also a single point of success. If you have a distributed system, you have to test and monitor lots of different failure scenarios. With a SPOS, you only have one thing to monitor. For a lot of cases the reliability of that SPOS is plenty.
Bonus: Just move it to the cloud, because AWS is definitely not its own SPOF and it never goes down taking half the internet with it.
"In total, this server has 128 cores with 256 simultaneous threads. With all of the cores working together, this server is capable of 4 TFLOPs of peak double precision computing performance. This server would sit at the top of the top500 supercomputer list in early 2000. It would take until 2007 for this server to leave the top500 list. Each CPU core is substantially more powerful than a single core from 10 years ago, and boasts a much wider computation pipeline."
I may be misunderstanding, but it looks like the micro-services comparison here is based on very high usage. Another use for micro-services, like lambda, is exactly the opposite. If you have very low usage, you aren't paying for cycles you don't use the way you would be if you either owned the machine, or rented it from AWS or DO and left it on all the time (which you'd have to do in order to serve that randomly-arriving one hit per day!)
If you have microservices that truly need to be separate services and have very little usage, you probably should use things like serverless computing. It scales down to 0 really well.
However, if you have a microservice with very little usage, turning that service into a library is probably a good idea.
Let's be clear here: everything you can do in a "cloudy" environment, you could do on big servers yourself - but at what engineering and human-resource cost? That's something many, if not most, hardware and 'on-prem' infra-focused people seem to miss. While cloud might seem expensive, most of the time humans will be even more expensive (unless you're in a very niche market like HPC).
You could also have those big servers in the cloud (I think this is what many are doing; I certainly have). That gives you a lot of the cloud services e.g. for monitoring, but you get to not have to scale horizontally or rebuild for serverless just yet. Works great for Kubernetes workloads, too – have a single super beefy node (i.e. single-node node pool) and target just your resource-heavy workload onto that node.
As far as costs are concerned, however, I've found that for medium+ sized orgs, cloud doesn't actually save money in the HR department, the HR spend just shifts to devops people, who tend to be expensive and you can't really leave those roles empty since then you'll likely get an ungovernable mess of unsecured resources that waste a huge ton of money and may expose you to GDPR fines and all sorts of nasty breaches.
If done right, you get a ton of execution speed. Engineers have a lot of flexibility in terms of the services they use (which they'd otherwise have to buy through processes that tend to be long and tedious), scale as needed when needed, shift work to the cloud provider, while the devops/governance/security people have some pretty neat tools to make sure all that's done in a safe and compliant manner. That tends to be worth it many times over for a lot of orgs, if done effectively with that aim, though it may not do much for companies with relatively stagnant or very simple products. If you want to reduce HR costs, cloud is probably not going to help much.
It seems like lots of companies start in the cloud due to low commitments, and then later when they have more stability and demand and want to save costs, making bigger cloud commitments (RIs, enterprise agreements etc) are a turnkey way to save money but always leave you on the lower-efficiency cloud track. Has anyone had good experiences selectively offloading workloads from the cloud to bare metal servers nearby?
One advantage I didn't see in the article was the performance costs of network latency. If you're running everything on one server, every DB interaction, microservice interaction, etc. would not necessarily need to go over the network. I think it is safe to say, IO is generally the biggest performance bottleneck of most web applications. Minimizing/negating that should not be underestimated.
I see these debates and wish there was an approach that scaled better.
A single server (and a backup) really _is_ great. Until it's not, for whatever reason.
We need more frameworks that scale from a single box to many boxes without starting over from scratch. There are solid approaches: Erlang/Elixir and the actor model come to mind. But that approach is not perfect, and it's far from commonplace.
> We need more frameworks that scale from a single box to many boxes, without starting over from scratch.
I'm not sure I really understand what you're saying here. I suppose most applications are some kind of CRUD app these days, not all sure, but an awful lot. If we take that as an example, how is it difficult to go from one box to multiple?
It's not something you get for free; you need to put in time to provision any new infra (be it bare metal or some kind of cloud instance), but the act of scaling out is pretty straightforward.
Perhaps you're talking about stateful applications?
I recommend the whitepaper Scalability! But at what cost?
My experience with Microservices is that they are very slow due to all the IO. We kind of want the development and developer scalability of decoupled services in addition to the computational and storage scalability in a disaggregated architecture.
One big server, one big program, and one big 10x developer. Deploy WebSphere when you need isolation. The industry truly is going in a spiral. Although I must admit, cloud providers really overplayed their hand when it comes to performance per buck and complexity.
What holds me back from doing this is: how do I reduce latency for calls coming from the other side of the world when OVHcloud seemingly does not have datacenters all over the world? There is a noticeable lag when it comes to multiplayer games or even web applications.
So... I guess these folks haven't heard of latency before? Fairly sure you have to have "one big server" in every country if you do this. I feel like that would get rather costly compared to geographically distributed cloud services long term.
As opposed, to "many small servers" in every country? The vast majority of startups out there run out of a single AWS region with a CDN caching read-only content. You can apply the same CDN approach to a bare-metal server.
Yeah, but if I'm a startup running only a small server, the cloud hosting costs are minimal. I'm not sure how you think it's cheaper to host tiny servers in lots of countries and pay someone to manage that for you. You'll need IT in every one of those locations to handle the servicing of your "small servers".
I run services globally for my company, there is no way we could do it. The fact that we just deploy containers to k8s all over the world works very well for us.
Before you give me the "oh k8s, well you don't know bare metal" please note that I'm an old hat that has done the legacy C# ASP.NET IIS workflows on bare metal for a long time. I have learned and migrated to k8s on AWS/GCloud and it is a huge improvement compared to what I used to deal with.
Lastly, as for your CDN discussion, we don't just host CDN's globally. We also host geo-located DB + k8s pods. Our service uses web sockets and latency is a real issue. We can't have 500 ms ping if we want to live update our client. We choose to host locally (in what is usually NOT a small server) so we get optimal ping for the live-interaction portion of our services that are used by millions of people every day.
This is one of those problems that basically no one has. RTT from Japan to Washington D.C. is 160ms. There's very few applications where that amount of additional latency matters.
It adds up surprisingly quickly when you have to do a TLS handshake, download many resources on page load, etc. A fresh connection alone costs several round-trips before the first byte: one for TCP plus one or two for the TLS handshake (TLS 1.3 vs. 1.2).
I once fired up an Azure instance with 4TB of RAM and hundreds of cores for a performance benchmark.
htop felt incredibly roomy, and I couldn't help thinking how my three previous projects would fit in with room to spare (albeit lacking redundancy, of course).
The problem with "one big server" is, you really need good IT/ops/sysadmin people who can think in non-cloud terms. (If you catch them installing docker on it, throw them into a lava pit immediately).
One server is for a hobby, not a business. Maybe that's fine, but keep that in mind. Backups at that level are something that keeps you from losing all data, not something that keeps you running and gets you up in any acceptable timeframe for most businesses.
That doesn't mean you need to use the cloud; it just means one big piece of hardware, with all its single points of failure, is often not enough. Two servers get you so much more than one. You can make one a hot spare, or actually split services between them and have each be ready to take over specific services for the other, greatly increasing your burst-handling capability and giving you time to put more resources in place to keep n+1 redundancy going if you're using more than half of a server's resources.
Do they actually say they don't have a slave to that database ready to take over? I seriously doubt Let's Encrypt has no spare.
Note I didn't say you shouldn't run one service (as in daemon) or set of services from one box, just that one box is not enough and you need that spare.
If Let's Encrypt actually has no spare for their database server and they're one hardware failure away from being down for what may be a large chunk of time (I highly doubt it), then I wouldn't want to use them even if free. Thankfully, I doubt your interpretation of what that article is saying.
That says they use a single database, as in a logical MySQL database. I don't see any claim that they use a single server. In fact, the title of the article you've linked suggests they use multiple.
> But if I use Cloud Architecture, I Don’t Have to Hire Sysadmins
> Yes you do. They are just now called “Cloud Ops” and are under a different manager. Also, their ability to read the arcane documentation that comes from cloud companies and keep up with the corresponding torrents of updates and deprecations makes them 5x more expensive than system administrators.
I don't believe "Cloud Ops" is more complex than system administration, having studied for the CCNA so being on the Valley of Despair slope of the Dunning Kruger effect. If keeping up with cloud companies updates is that much of a challenge to warrant a 5x price over a SysAdmin then that's telling you something about their DX...
/tg/station, the largest open-source multiplayer video game on GitHub, gets cloudheads trying to help us "modernize" the game server for the cloud all the time.
Here's how that breaks down:
The servers (sorry, I mean compute) cost the same to host one game server (before bandwidth, more on that at the bottom) as we pay, amortized, per game server to host 5 game servers on a rented dedicated server ($175/month for the rented server with 64 GB of RAM and a 10 Gbit uplink).
They run twice as slow, because high-core-count, low-clock-speed servers aren't all they're cracked up to be and our game engine is single threaded. Even if it weren't, there is an overhead to multithreading things which, combined with most high-core-count servers also having slow clock speeds, rarely squares out to an actual increase in real-world performance.
You can get the high-clock-speed units; they are two to three times as expensive. And they still run 20% slower than Windows VMs on rented bare metal, because the sad fact is that enterprise CPUs from either Intel or AMD have slower clock speeds and single-threaded performance than their gaming-CPU counterparts, and getting gaming CPUs in rented servers is piss easy, but next to impossible for cloud servers.
Each game server uses 2 TB of bandwidth to host 70-player high-pops. This works with 5 servers on 1 machine because our hosting provider includes 15 TB of bandwidth in the price of the server.
Well, now the cloud bill just got a new zero. 10 to 30x more expensive once you remember to price in bandwidth isn't looking too great.
"but it would make it cheaper for small downstreams to start out" until another youtuber mentions our tiny game, and every game server is hitting the 120 hard pop cap, and a bunch of downstreams get a surprise 4 digit bill for what would normally run 2 digits.
The take away from this being that even adding in docker or k8s deployment support to the game server is seen as creating the risk some kid bankrupts themselves trying to host a game server of their favorite game off their mcdonalds paycheck, and we tell such tech "pros" to sod off with their trendy money wasters.
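For what it's worth, a rough version of that bandwidth math, using the figures quoted in this thread; the cloud egress rate is an assumption based on typical public list prices (often in the $0.05-0.09/GB range), not a quote from any specific provider.

```python
# Compare cloud egress pricing against the rented dedicated box from the thread.
servers = 5
tb_per_server = 2
egress_price_per_gb = 0.09          # assumed cloud egress list price

cloud_bandwidth = servers * tb_per_server * 1000 * egress_price_per_gb
dedicated_total = 175               # $/month for the rented server, 15 TB of transfer included

print(f"cloud egress alone: ~${cloud_bandwidth:,.0f}/month")
print(f"rented dedicated box (compute + bandwidth): ${dedicated_total}/month")
```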
Hetzner's PX line offers 64 GB ECC RAM, a Xeon CPU, and dual 1 TB NVMe for < $100/month. A dedicated 10 Gbit link (plus 10 Gbit NIC) is then an extra ~$40/month on top (includes 20 TB/month of traffic, with overage billed at $1/TB).
All your eggs in one basket? A single host, really? Curmudgeonly opinions about microservices, cloud, and containers? Nostalgia for the time before 2010? All here. All you are missing is a rant about how the web was better before JavaScript.
It’s sad to see this kind of engineering malpractice voted to the top of HN. It’s even sadder to see how many people agree with it.
Yep, there's a premium on making your architecture more cloudy. However, the best point for Use One Big Server is not necessarily running your big monolithic API server, but your database.
Use One Big Database.
Seriously. If you are a backend engineer, nothing is worse than breaking up your data into self contained service databases, where everything is passed over Rest/RPC. Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
It is so much easier to do these joins efficiently in a single database than fanning out RPC calls to multiple different databases, not to mention dealing with inconsistencies, lack of atomicity, etc. etc. Spin up a specific reader of that database if there needs to be OLAP queries, or use a message bus. But keep your OLTP data within one database for as long as possible.
You can break apart a stateless microservice, but there are few things as stagnant in the world of software than data. It will keep you nimble for new product features. The boxes that they offer on cloud vendors today for managed databases are giant!
> Use One Big Database.
> Seriously. If you are a backend engineer, nothing is worse than breaking up your data into self contained service databases, where everything is passed over Rest/RPC. Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
This works until it doesn't and then you land in the position my company finds itself in where our databases can't handle the load we generate. We can't get bigger or faster hardware because we are using the biggest and fastest hardware you can buy.
Distributed systems suck, sure, and they make querying cross systems a nightmare. However, by giving those aspects up, what you gain is the ability to add new services, features, etc without running into scotty yelling "She can't take much more of it!"
Once you get to that point, it becomes SUPER hard to start splitting things out. All the sudden you have 10000 "just a one off" queries against several domains that are broken by trying carve out a domain into a single owner.
I don't know what's the complexity of your project, but more often than not the feeling of doom coming from hitting that wall is bigger than the actual effort it takes to solve it.
People often feel they should have anticipated and avoid the scaling issues altogether, but moving from a single DB to master/replica model, and/or shards or other solutions is fairly doable, and it doesn't come with worse tradeoffs than if you sharded/split services from the start. It always feels fragile and bolt on compared to the elegance of the single DB, but you'd also have many dirty hacks to have a multi DB setup work properly.
Also, you do that from a position where you usually have money, resources and a good knowledge of your core parts, which is not true when you're still growing full speed.
I've basically been building CRUD backends for websites and later apps since about 1996.
I've fortunately/unfortunately never yet been involved in a project that we couldn't comfortably host using one big write master and a handful of read slaves.
Maybe one day a project I'm involved with will approach "FAANG scale" where that stops working, but you can 100% run 10s of millions of dollars a month in revenue with that setup, at least in a bunch of typical web/app business models.
Early on I did hit the "OMG, we're cooking our database" where we needed to add read cacheing. When I first did that memcached was still written in Perl. So that joined my toolbox very early on (sometime in the late 90s).
Once read caching started to not keep up, it was easy enough to make the read cache/memcached layer understand and distribute reads across read slaves. I remember talking to Monty Widenius at The Open Source Conference, I think in San Jose around 2001 or so, about getting MySQL replication to use SSL so I could safely replicate to read slaves in Sydney and London from our write master in PAIX.
I have twice committed the sin of premature optimisation and sharded databases "because this one was _for sure_ going to get too big for our usual database setup". It only ever brought unneeded grief and never actually proved necessary.
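A minimal cache-aside sketch of the read-cache layer described above; the dict stands in for memcached and db_fetch_user is a placeholder for a real read-replica query (names are invented).

```python
# Cache-aside reads: check the cache first, fall back to the database on a miss.
import time

CACHE: dict = {}
TTL_SECONDS = 60


def db_fetch_user(user_id: str) -> dict:
    # placeholder for a SELECT against a read replica
    return {"id": user_id, "name": "example"}


def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    hit = CACHE.get(key)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]                      # cache hit: no database touched
    user = db_fetch_user(user_id)          # cache miss: go to the replica
    CACHE[key] = (time.time(), user)
    return user


print(get_user("42"))  # miss, fills the cache
print(get_user("42"))  # hit
```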
Many databases can be distributed horizontally if you put in the extra work, would that not solve the problems you're describing? MariaDB supports at least two forms of replication (one master/replica and one multi-master), for example, and if you're willing to shell out for a MaxScale license it's a breeze to load balance it and have automatic failover.
Shouldn't your company have started to split things out and plan for hitting the limit of the hardware a couple of box sizes back? I feel there is a happy middle ground between "spend months making everything a service for our 10 users" and "welp, looks like we can't upsize the DB anymore, guess we should split things off now?"
Isn't this easily solved with sharding?
That is, one huge table keyed by (for instance) alphabet and when the load gets too big you split it into a-m and n-z tables, each on either their own disk or their own machine.
Then just keep splitting it like that. All of your application logic stays the same … everything stays very flat and simple … you just point different queries to different shards.
I like this because the shards can evolve from their own disk IO to their own machines… and later you can reassemble them if you acquire faster hardware, etc.
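A minimal sketch of that split-by-key-range idea, using sqlite files as stand-ins for "their own disk or their own machine"; the routing table is the only thing that changes when you resplit. Table and file names are invented.

```python
# Route each query to the shard whose key range covers it.
import sqlite3

SHARDS = {
    ("a", "m"): sqlite3.connect("users_a_m.db"),
    ("n", "z"): sqlite3.connect("users_n_z.db"),
}

def shard_for(username: str) -> sqlite3.Connection:
    first = username[0].lower()
    for (lo, hi), conn in SHARDS.items():
        if lo <= first <= hi:
            return conn
    raise ValueError(f"no shard covers {username!r}")

for conn in SHARDS.values():
    conn.execute("CREATE TABLE IF NOT EXISTS users (name TEXT PRIMARY KEY)")

# Application code stays flat: pick the shard, run the same query.
shard_for("alice").execute("INSERT OR REPLACE INTO users VALUES ('alice')")
shard_for("zoe").execute("INSERT OR REPLACE INTO users VALUES ('zoe')")
print(shard_for("zoe").execute("SELECT * FROM users").fetchall())
```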
> Once you get to that point, it becomes SUPER hard to start splitting things out.
Maybe, but if you split it from the start you die by a thousand cuts, and likely pay the cost up front, even if you’d never get to the volumes that’d require a split.
>Once you get to that point, it becomes SUPER hard to start splitting things out. All the sudden you have 10000 "just a one off" queries against several domains that are broken by trying carve out a domain into a single owner.
But that's survivorship bias, and looking back at things from the perspective of current problems.
You know what the least future-proof and scalable project is? The one that gets canceled because it failed to deliver any value in a reasonable time in the early phase. Once you get to "huge project status" you can afford a glacial pace. Most of the time you can't afford that early on, so even if by some miracle you knew what scaling issues you were going to have long term and invested in fixing them early on, it's rarely been a good tradeoff in my experience.
I've seen more projects fail because they tangle themselves up in unnecessary complexity early on and fail to execute on the core value proposition than I've seen fail from being unable to manage the tech debt 10 years in. Developers like to complain about the second kind, but they get fired over the first kind. Unfortunately, in today's job market they just resume-pad their failures as "relevant experience" and move on to the next project, so there is no correcting feedback.
I'd be curious to know what your company does which generates this volume of data (if you can disclose), what database you are using and how you are planning to solve this issue.
You can get a machine with multiple terabytes of ram and hundreds of CPU cores easily. If you can afford that, you can afford a live replica to switch to during maintenance.
FastComments runs on one big DB in each region, with a hot backup... no issues yet.
Before you go to microservices you can also shard, as others have mentioned.
Do you have Spectre countermeasures active in the kernel of that machine?
Why can’t the databases handle the load? That is to say, did you see this coming from a while away or was it a surprise?
Wonder if this would be a good use case for Tidalscale?
This is absolutely true - when I was at Bitbucket (ages ago at this point) and we were having issues with our DB server (mostly due to scaling), almost everyone we talked to said "buy a bigger box until you can't any more" because of how complex (and indirectly expensive) the alternatives are - sharding and microservices both have a ton more failure points than a single large box.
I'm sure they eventually moved off that single primary box, but for many years Bitbucket was run off 1 primary in each datacenter (with a failover), and a few read-only copies. If you're getting to the point where one database isn't enough, you're either doing something pretty weird, are working on a specific problem which needs a more complicated setup, or have grown to the point where investing in a microservice architecture starts to make sense.
One issue I've seen with this is that if you have a single, very large database, it can take a very, very long time to restore from backups. Or for that matter just taking backups.
I'd be interested to know if anyone has a good solution for that.
What if your product simply stores a lot of data (i.e. a search engine)? How is that weird?
I'm glad this is becoming conventional wisdom. I used to argue this in these pages a few years ago and would get downvoted below the posts telling people to split everything into microservices separated by queues (although I suppose it's making me lose my competitive advantage when everyone else is building lean and mean infrastructure too).
In my mind, reasons involve keeping transactional integrity, ACID compliance, better error propagation, avoiding the hundreds of impossible to solve roadblocks of distributed systems (https://groups.csail.mit.edu/tds/papers/Lynch/MIT-LCS-TM-394...).
But it is also about pushing the limits of what is physically possible in computing. As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw), covering distance over network wires involves hard latency constraints, not to mention dealing with congestion on those wires.
Physical efficiency is about keeping data close to where it's processed. Monoliths can make much better use of L1, L2, L3, and ram caches than distributed systems for speedups often in the order of 100X to 1000X.
Sure it's easier to throw more hardware at the problem with distributed systems but the downsides are significant so be sure you really need it.
Now there is a corollary to using monoliths. Since you only have one db, that db should be treated as somewhat sacred, you want to avoid wasting resources inside it. This means being a bit more careful about how you are storing things, using the smallest data structures, normalizing when you can etc. This is not to save disk, disk is cheap. This is to make efficient use of L1,L2,L3 and ram.
I've seen boolean true-or-false values saved as large JSON documents: {"usersetting1": true, "usersetting2": false, "setting1name": "name", ...} with 10 bits of data ending up as a 1 KB JSON document. Avoid this! Storing documents means the keys, effectively the full table schema, are repeated in every row. It has its uses, but if you can predefine your schema and use the smallest types needed, you gain a lot of performance, mostly through much higher cache efficiency!
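A small illustration of the size difference: the same ten booleans as a JSON document (schema repeated per row) versus packed into a small integer column. The setting names are invented.

```python
# Compare per-row storage of ten boolean settings: JSON document vs. packed bits.
import json

settings = {f"usersetting{i}": (i % 2 == 0) for i in range(10)}

as_json = json.dumps(settings)
as_bits = sum(1 << i for i, v in enumerate(settings.values()) if v)

print(f"JSON document: {len(as_json)} bytes per row")            # keys repeated in every row
print(f"packed bitfield: {as_bits.bit_length()} bits, fits a 2-byte SMALLINT")
```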
> I'm glad this is becoming conventional wisdom
It's not though. You're just seeing the most popular opinion on HN.
In reality it is nuanced like most real-world tech decisions are. Some use cases necessitate a distributed or sharded database, some work better with a single server and some are simply going to outsource the problem to some vendor.
> I'm glad this is becoming conventional wisdom
My hunch is that computers caught up. Back in the early 2000's horizontal scaling was the only way. You simply couldn't handle even reasonably mediocre loads on a single machine.
As computing becomes cheaper, horizontal scaling is starting to look more and more like unnecessary complexity for even surprisingly large/popular apps.
I mean you can buy a consumer off-the-shelf machine with 1.5TB of memory these days. 20 years ago, when microservices started gaining popularity, 1.5TB RAM in a single machine was basically unimaginable.
'over the wire' is less obvious than it used to be.
If you're in a k8s pod, those calls are really kernel calls. Sure, you're serializing and process-switching where you could be just making a method call, but we had to do something.
I'm seeing fewer 'balls of mud' with microservices. That's not zero balls of mud, but it's not a given for almost every code base I wander into.
>"I'm glad this is becoming conventional wisdom. "
Yup, this is what I've always done and it works wonders. Since I do not have bosses, just clients, I do not give a flying fuck about the latest fashion and do what actually makes sense for me and said clients.
I've never understood this logic for webapps. If you're building a web application, congratulations, you're building a distributed system, you don't get a choice. You can't actually use transactional integrity or ACID compliance because you've got to send everything to and from your users via HTTP request/response. So you end up paying all the performance, scalability, flexibility, and especially reliability costs of an RDBMS, being careful about how much data you're storing, and getting zilch for it, because you end up building a system that's still last-write-wins and still loses user data whenever two users do anything at the same time (or you build your own transactional logic to solve that - exactly the same way as you would if you were using a distributed datastore).
Distributed systems can also make efficient use of cache, in fact they can do more of it because they have more of it by having more nodes. If you get your dataflow right then you'll have performance that's as good as a monolith on a tiny dataset but keep that performance as you scale up. Not only that, but you can perform a lot better than an ACID system ever could, because you can do things like asynchronously updating secondary indices after the data is committed. But most importantly you have easy failover from day 1, you have easy scaling from day 1, and you can just not worry about that and focus on your actual business problem.
Relational databases are largely a solution in search of a problem, at least for web systems. (They make sense as a reporting datastore to support ad-hoc exploratory queries, but there's never a good reason to use them for your live/"OLTP" data).
>As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw ) doing distance over network wires involves hard latency constraints, not to mention dealing with congestions over these wires.
Even accounting for CDNs, a distributed system is inherently more capable of bringing data closer to geographically distributed end users, thus lowering latency.
I think a strong test a lot of "let's use Google scale architecture for our MVP" advocates fail is: can your architecture support a performant paginated list with dynamic sort, filter and search where eventual consistency isn't acceptable?
Pretty much every CRUD app needs this at some point and if every join needs a network call your app is going to suck to use and suck to develop.
I’ve found the following resource invaluable for designing and creating “cloud native” APIs where I can tackle that kind of thing from the very start without a huge amount of hassle https://google.aip.dev/general
The patterns section covers all of this and more
I don't believe you. Eventual consistency is how the real world works, what possible use case is there where it wouldn't be acceptable? Even if you somehow made the display widget part of the database, you can't make the reader's eyeballs ACID-compliant.
> if every join needs a network call your app is going to suck to use and suck to develop.
And yet developers do this every single day without any issue.
It is bad practice to have your authentication database be the same as your app database. Or you have data coming from SaaS products, third party APIs or a cloud service. Or even simply another service in your stack. And with complex schemas often it's far easier to do that join in your application layer.
All of these require a network call and join.
> Pretty much every CRUD app needs this at some point and if every join needs a network call your app is going to suck to use and suck to develop.
_at some point_ is the key word here.
Most startups (and businesses) can likely get away with this well into Series A or Series B territory.
thanks a lot for this comment. I will borrow this as an interview question :)
> Use One Big Database.
I emphatically disagree.
I've seen this evolve into tightly coupled microservices that could be deployed independently in theory, but required exquisite coordination to work.
If you want them to be on a single server, that's fine, but having multiple databases or schemas will help enforce separation.
And, if you need one single place for analytics, push changes to that space asynchronously.
Having said that, I've seen silly optimizations being employed that make sense when you are Twitter, and to nobody else. Slice services up to the point they still do something meaningful in terms of the solution and avoid going any further.
I have done both models. My previous job we had a monolith on top of a 1200 table database. Now I work in an ecosystem of 400 microservices, most with their own database.
What it fundamentally boils down to is that your org chart determines your architecture. We had a single team in charge of the monolith, and it was ok, and then we wanted to add teams and it broke down. On the microservices architecture, we have many teams, which can work independently quite well, until there is a big project that needs coordinated changes, and then the fun starts.
Like always there is no advice that is absolutely right. Monoliths, microservices, function stores. One big server vs kubernetes. Any of those things become the right answer in the right context.
Although I’m still in favor of starting with a modular monolith and splitting off services when it becomes apparent they need to change at a different pace from the main body. That is right in most contexts I think.
To clarify the advice, at least how I believe it should be done…
Use One Big Database Server…
… and on it, use one software database per application.
For example, one Postgres server can host many databases that are mostly* independent from each other. Each application or service should have its own database and be unaware of the others, communicating with them via the services if necessary. This makes splitting up into multiple database servers fairly straightforward if needed later. In reality most businesses will have a long tail of tiny databases that can all be on the same server, with only bigger databases needing dedicated resources.
*you can have interdependencies when you’re using deep features sometimes, but in an application-first development model I’d advise against this.
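A minimal sketch of that setup: every service gets its own DSN, which today all point at the same host, so splitting a service onto its own server later is a config change rather than a code change. Hostnames and credentials here are placeholders, and psycopg2 is just one common Postgres driver.

```python
# One big database server, one database per application.
import psycopg2  # assumes the standard psycopg2 driver is installed

DSNS = {
    "billing":  "host=db.internal dbname=billing  user=billing_svc",
    "catalog":  "host=db.internal dbname=catalog  user=catalog_svc",
    "accounts": "host=db.internal dbname=accounts user=accounts_svc",
}

def connect(service: str):
    # Each service only ever sees its own database; cross-service data goes
    # through that service's API, not a cross-database join.
    return psycopg2.connect(DSNS[service])

with connect("billing") as conn, conn.cursor() as cur:
    cur.execute("SELECT 1")
    print(cur.fetchone())
```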
There's no need for "microservices" in the first place then. That's just logical groupings of functionality that can be separate as classes, namespaces or other modules without being entirely separate processes with a network boundary.
Yeah... Dividing your work into microservices while your data is in an interdependent database doesn't lead to great results.
If you are creating microservices, you must segment them all the way through.
Breaking apart a stateless microservice and then basing it around a giant single monolithic database is pretty pointless - at that stage you might as well just build a monolith and get on with it as every microservice is tightly coupled to the db.
Note that quite a bit of the performance problem comes from writes. You can get away with A LOT if you accept that 1. the current service doesn't do (much) writing and 2. it can live with slightly old data. Which I think covers 90% of use cases.
So you can end up with those services living on separate machines and connecting to read only db replicas, for virtually limitless scalability. And when it realizes it needs to do an update, it either switches the db connection to a master, or it forwards the whole request to another instance connected to a master db.
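A minimal sketch of that read/write split, assuming Postgres with psycopg2; the DSNs are placeholders, and in practice a proxy or your framework's router often does this for you.

```python
# Reads go to any replica; writes go to the primary.
import random
import psycopg2

PRIMARY_DSN = "host=db-primary.internal dbname=app user=app"
REPLICA_DSNS = [
    "host=db-replica-1.internal dbname=app user=app",
    "host=db-replica-2.internal dbname=app user=app",
]

def connect(readonly: bool = True):
    dsn = random.choice(REPLICA_DSNS) if readonly else PRIMARY_DSN
    conn = psycopg2.connect(dsn)
    conn.set_session(readonly=readonly)   # fail loudly if a "read" path tries to write
    return conn

# Read path: tolerate slightly stale data, scale out across replicas.
# Write path: one primary keeps the invariants simple.
```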
That's true, unless you need
(1) Different programming languages e.g. you're written your app in Java but now you need to do something for which the perfect Python library is available.
(2) Different parts of your software need different types of hardware. Maybe one part needs a huge amount of RAM for a cache, but other parts are just a web server. It'd be a shame to have to buy huge amounts of RAM for every server. Splitting the software up and deploying the different parts on different machines can be a win here.
I reckon the average startup doesn't need any of that, not suggesting that monoliths aren't the way to go 90% of the time. But if you do need these things, you can still go the microservices route, but it still makes sense to stick to a single database if at all possible, for consistency and easier JOINs for ad-hoc queries, etc.
No disagreement here. I love a good monolith.
Agree. Nothing worse than having different programs changing data in the same database. The database should not be an integration point between services.
Why would you break apart a microservice? Any why do you need to use/split into microservices anyway?
99% of apps are best fit as monolithic apps and databases and should focus on business value rather than scale they'll never see.
I disagree. Suppose you have an enormous DB that's mainly written to by workers inside a company, but has to be widely read by the public outside. You want your internal services on machines with extra layers of security, perhaps only accessible by VPN. Your external facing microservices have other things like e.g. user authentication (which may be tied to a different monolithic database), and you want to put them closer to users, spread out in various data centers or on the edge. Even if they're all bound to one database, there's a lot to recommend keeping them on separate, light cheap servers that are built for http traffic and occasional DB reads. And even more so if those services do a lot of processing on the data that's accessed, such as building up reports, etc.
Absolutely. I know someone who considers "different domains" (as in web domains) to count as a microservice!
What is the point of that? it doesn't add anything. Just more shit to remember and get right (and get wrong!)
> "Use One Big Database."
Yah, this is something I learned when designing my first server stack (using Sun machines) for a real business back during the dot-com boom/bust era. Our single database server was the beefiest machine by far in the stack, 5U in the rack (we also had a hot backup), while the other servers were 1U or 2U in size. Most of that girth was for memory and disk space, with decent but not the fastest processors.
One big DB server with a hot backup was our best tradeoff for price, performance, and reliability. Part of the mitigation was that the other servers could be scaled horizontally to compensate for a decent amount of growth without needing to scale the DB horizontally.
Definitely use a big database, until you can't. My advice to anyone starting with a relational data store is to use a proxy from day 1 (or some point before adding something like that becomes scary).
When you need to start sharding your database, having a proxy is like having a super power.
Disclaimer: I am the founder of PolyScale [1].
We see both use cases: a single large database vs. multiple small, decoupled ones. I agree with the sentiment that a large database offers simplicity, until access patterns change.
We focus on distributing database data to the edge using caching. Typically this eliminates read-replicas and a lot of the headache that goes with app logic rewrites or scaling "One Big Database".
[1] https://www.polyscale.ai/
Are there postgres proxies that can specifically facilitate sharding / partitioning later?
> Use One Big Database
Yep, with a passive replica or online (log) backup.
Keeping things centralized can reduce your hardware requirement by multiple orders of magnitude. The one huge exception is a traditional web service, those scale very well, so you may not even want to get big servers for them (until you need them).
If you do this then you'll have the hardest possible migration when the time comes to split it up. It will take you literally years, perhaps even a decade.
Shard your datastore from day 1, get your dataflow right so that you don't need atomicity, and it'll be painless and scale effortlessly. More importantly, you won't be able to paper over crappy dataflow. It's like using proper types in your code: yes, it takes a bit more effort up-front compared to just YOLOing everything, but it pays dividends pretty quickly.
This is true IFF you get to the point where you have to split up.
I know we're all hot and bothered about getting our apps to scale up to be the next unicorn, but most apps never need to scale past the limit of a single very high-performance database. For most people, this single huge DB is sufficient.
Also, for many (maybe even most) applications, designated outages for maintenance are not only acceptable, but industry standard. Banks have had, and continue to have designated outages all the time, usually on weekends when the impact is reduced.
Sure, what I just wrote is bad advice for mega-scale SaaS offerings with millions of concurrent users, but most of us aren't building those, as much as we would like to pretend that we are.
I will say that TWO of those servers, with some form of synchronous replication, and point in time snapshots, are probably a better choice, but that's hair-splitting.
(and I am a dyed in the wool microservices, scale-out Amazon WS fanboi).
> If you do this then you'll have the hardest possible migration when the time comes to split it up. It will take you literally years, perhaps even a decade.
At which point a new OneBigServer will be 100x as powerful, and all your upfront work will be for nothing.
> Shard your datastore from day 1
What about using something like CockroachDB from day 1?
> Use One Big Database.
It’s never one big database. Inevitably there are backups, replicas, testing environments, staging, development. In an ideal, unchanging world where nothing ever fails and workload is predictable, the one big database is also ideal.
What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
> It’s never one big database. Inevitably there are backups, replicas, testing environments, staging, development. In an ideal, unchanging world where nothing ever fails and workload is predictable, the one big database is also ideal.
But if you have many small databases, you need
> backups, replicas, testing environments, staging, development
all times `n`. Which doesn't sound like an improvement.
> What happens in the real world is that the one big database becomes such a roadblock to change and growth that organisations often throw away the whole thing and start from scratch.
Bad engineering orgs will snatch defeat from the jaws of victory no matter what the early architectural decisions were. The one-vs-many databases/services question is almost entirely moot.
Just FYI, you can have one big database, without running it on one big server. As an example, databases like Cassandra are designed to be scaled horizontally (i.e. scale out, instead of scale up).
https://cassandra.apache.org/_/cassandra-basics.html
There are trade-offs when you scale horizontally even if a database is designed for it. For example, DataStax's Storage Attached Indexes or Cassandra's hidden-table secondary indexing allow for indexing on columns that aren't part of the clustering/partitioning, but when you're reading you're going to have to ask all the nodes to look for something if you aren't including a clustering/partitioning criteria to narrow it down.
You've now scaled out, but you now have to ask each node when searching by secondary index. If you're asking every node for your queries, you haven't really scaled horizontally. You've just increased complexity.
Now, maybe 95% of your queries can be handled with a clustering key and you just need secondary indexes to handle 5% of your stuff. In that case, Cassandra does offer an easy way to handle that last 5%. However, it can be problematic if people take shortcuts too much and you end up putting too much load on the cluster. You're also pinning your read latency to the slowest machine in your cluster. For example, if you have 100 machines with a mean response time of 2ms and a 99th-percentile response time of 150ms, you're potentially going to be providing a bad experience to users waiting on that last box for secondary index queries.
This isn't to say that Cassandra isn't useful - Cassandra has been making some good decisions to balance the problems engineers face. However, it does come with trade-offs when you distribute the data. When you have a well-defined problem, it's a lot easier to design your data for efficient querying and partitioning. When you're trying to figure things out, the flexibility of a single machine and much cheaper secondary index queries can be important - and if you hit a massive scale, you figure out how you want to partition it then.
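To make the trade-off concrete, here is a minimal CQL sketch (the table, columns, and index names are invented for illustration): a query that supplies the partition key is routed only to the replicas that own that partition, while a query on a secondary index has to fan out to every node.

    CREATE TABLE users (
        org_id  uuid,
        user_id uuid,
        email   text,
        PRIMARY KEY (org_id, user_id)              -- org_id is the partition key
    );

    CREATE INDEX users_by_email ON users (email);  -- secondary index

    -- Routed to the replicas that own this partition:
    SELECT * FROM users WHERE org_id = 123e4567-e89b-12d3-a456-426614174000;

    -- No partition key: every node scans its local index,
    -- so the response is only as fast as the slowest replica.
    SELECT * FROM users WHERE email = 'someone@example.com';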
Cassandra may be great when you have to scale a database that you no longer develop significantly. The problem with this DB system is that you have to know all the queries before you can define the schema.
A relative worked for a hedge fund that used this idea. They were a C#/MSSQL shop, so they just bought whatever was the biggest MSSQL server at the time, updating frequently. They said it was a huge advantage, where the limit in scale was more than offset by productivity.
I think it's an underrated idea. There's a lot of people out there building a lot of complexity for datasets that in the end are less than 100 TB.
But it also has limits. Infamously Twitter delayed going to a sharded architecture a bit too long, making it more of an ugly migration.
Server hardware is so cheap and fast today that 99% of companies will never hit that limit in scale either.
>"Use One Big Database."
I do, it is running on the same big (relatively) server as my native C++ backend talking to the database. The performance smokes your standard cloudy setup big time. Serving a thousand requests per second on 16 cores without breaking a sweat. I am all for monoliths running on real, non-cloudy hardware. As long as the business scale is reasonable and does not approach FAANG (true for 90% of businesses), this solution is superior to everything else in terms of money, maintenance, and development time.
I agree with this sentiment, but it is often misunderstood as a mandate to force everything into a single database schema. More people need to learn about logically separating schemas within their database servers!
Another area for consolidation is auth. Use one giant Keycloak, with individual realms for every one of the individual apps you are running. Your Keycloak is backed by your one giant database.
I agree that 1BDB is a good idea, but having one ginormous schema has its own costs. So I still think data should be logically partitioned between applications/microservices - in PG terms, one “cluster” but multiple “databases”.
We solved the problem of collecting data from the various databases for end users by having a GraphQL layer which could integrate all the data sources. This turned out to be absolutely awesome. You could also do something similar using FDW. The effort was not significant relative to the size of the application.
The benefits of this architecture were manifold but one of the main ones is that it reduces the complexity of each individual database, which dramatically improved performance, and we knew that if we needed more performance we could pull those individual databases out into their own machine.
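For the FDW route specifically, a rough postgres_fdw sketch looks something like the following; the server, schema, table, and credential names are all made up for illustration.

    -- Run inside the "orders" database; pulls in tables from a separate "billing" database.
    CREATE EXTENSION IF NOT EXISTS postgres_fdw;

    CREATE SERVER billing_srv
        FOREIGN DATA WRAPPER postgres_fdw
        OPTIONS (host 'localhost', dbname 'billing');

    CREATE USER MAPPING FOR CURRENT_USER
        SERVER billing_srv
        OPTIONS (user 'app', password 'secret');

    CREATE SCHEMA billing;
    IMPORT FOREIGN SCHEMA public FROM SERVER billing_srv INTO billing;

    -- Cross-database joins now read like ordinary SQL:
    SELECT o.id, i.total
    FROM orders o
    JOIN billing.invoices i ON i.order_id = o.id;

The same idea works across machines by pointing the foreign server at another host, which is why it pairs well with the "pull a database out onto its own box later" plan.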
I'd say, one big database per service. Often times there are natural places to separate concerns and end up with multiple databases. If you ever want to join things for offline analysis, it's not hard to make a mapreduce pipeline of some kind that reads from all of them and gives you that boundless flexibility.
Then if/when it comes time for sharding, you probably only have to worry about one of those databases first, and you possibly shard it in a higher-level logical way that works for that kind of service (e.g. one smaller database per physical region of customers) instead of something at a lower level with a distributed database. Horizontally scaling DBs sound a lot nicer than they really are.
>>(they don't know how your distributed databases look, and oftentimes they really do not care)
Nor should they, it's the engineer's/team's job to provide the database layer to them with high levels of service without them having to know the details
>Use One Big Database.
It may be reasonable to have two databases, e.g. a class A and a class B for PCI compliance. So context still deeply matters.
Also having a dev DB with mock data and a live DB with real data is a common setup in many companies.
I'm pretty happy to pay a cloud provider to deal with managing databases and hosts. It doesn't seem to cause me much grief, and maybe I could do it better but my time is worth more than our RDS bill. I can always come back and Do It Myself if I run out of more valuable things to work on.
Similarly, paying for EKS or GKE or the higher-level container offerings seems like a much better place to spend my resources than figuring out how to run infrastructure on bare VMs.
Every time I've seen a normal-sized firm running on VMs, they have one team who is responsible for managing the VMs, and either that team is expecting a Docker image artifact or they're expecting to manage the environment in which the application runs (making sure all of the application dependencies are installed in the environment, etc) which typically implies a lot of coordination between the ops team and the application teams (especially regarding deployment). I've never seen that work as smoothly as deploying to ECS/EKS/whatever and letting the ops team work on automating things at a higher level of abstraction (automatic certificate rotation, automatic DNS, etc).
That said, I've never tried the "one big server" approach, although I wouldn't want to run fewer than 3 replicas, and I would want reproducibility so I know I can stand up the exact same thing if one of the replicas goes down, as well as for higher-fidelity testing in lower environments. And since we have that kind of reproducibility, there's no significant difference in operational work between running fewer larger servers and more smaller servers.
"Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care)."
This isn't a problem if state is properly divided along the proper business domain and the people who need to access the data have access to it. In fact, many use cases require it: publicly traded companies can't let anyone in the organization access financial info, and healthcare companies can't let anyone access patient data. And of course there are performance concerns as well if anyone in the organization can arbitrarily execute queries on any of the organization's data.
I would say YAGNI applies to data segregation as well and separations shouldn't be introduced until they are necessary.
"combine these data sources" doesn't necessarily mean data analytics. Just as an example, it could be something like "show a badge if it's the user's birthday", which if you had a separate microservice for birthdays would be much harder than joining a new table.
At my current job we have four different databases, so I concur with this assessment. I think it's okay to have some data in different DBs if they're significantly different; say, the user login data could be in its own database. But anything we do that is a combination of e-commerce and testing/certification should be in one big database so I can write reasonable queries for the information we need. This doesn't include two other databases we have on-prem: one is a Salesforce setup and the other is an internal application system that essentially marries Salesforce to the rest. It's a weird, wild environment to navigate when adding features.
> Your product asks will consistently want to combine these data sources (they don't know how your distributed databases look, and oftentimes they really do not care).
I'm not sure how to parse this. What should "asks" be?
The phrase "Your product asks will consistently " can be de-abbreviated to "product owners/product managers you work with will consistently request".
The feature requests (asks) that product wants to build - sorry for the confusion there.
Mostly agree, but you have to be very strict with the DB architecture. Have a very reasonable schema. Punish long-running queries. If some dev group starts hammering the DB, cut them off early on; don't let them get away with it, or they'll later refuse to fix their query design.
The biggest nemesis of big DB approach are dev teams who don't care about the impact of their queries.
Also move all the read-only stuff that can be a few minutes behind to a separate (smaller) server with custom views updated in batches (e.g. product listings). And run analytics outside peak hours and, if possible, on a separate server.
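In Postgres terms, those batch-updated read views can be as simple as a materialized view refreshed on a schedule; a rough sketch, with invented table and column names:

    CREATE MATERIALIZED VIEW product_listing AS
    SELECT p.id, p.name, p.price, count(o.id) AS recent_orders
    FROM products p
    LEFT JOIN orders o
           ON o.product_id = p.id
          AND o.created_at > now() - interval '30 days'
    GROUP BY p.id, p.name, p.price;

    -- A unique index lets the refresh run without blocking readers.
    CREATE UNIQUE INDEX ON product_listing (id);

    -- Run from cron or a scheduler every few minutes:
    REFRESH MATERIALIZED VIEW CONCURRENTLY product_listing;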
The rule is: keep related data together. Exceptions are: different customers (which usually don't require each other's data) can be isolated, and if the database becomes the bottleneck you can separate unrelated services.
Surely having separate DBs all sit on the One Big Server is preferable in many cases. For cases where you really need to extract large amounts of data derived from multiple DBs, there's no real harm in having some cross-DB joins defined in views somewhere. If there are sensible logical ways to break a monolithic service into component stand-alone services, and good business reasons to do so (or it's already been designed that way), then having each talk to its own DB on a shared server should be able to scale pretty well.
Not to mention, backups, restores, and disaster recovery are so much easier with One Big Database™.
How is backup restoration any easier if your whole PostgreSQL cluster goes back in time when you only wanted to rewind that one tenant?
If you get your services right, there is little or no communication between the services, since a microservice should have all the data it needs in its own store.
> they don't know how your distributed databases look, and oftentimes they really do not care
Nor should they.
How do you use one big database when some of your info is stuck in an ERP system?
This is the macro version of the von Neumann bottleneck.
Our industry summarized:
Hardware engineers are pushing the absolute physical limits of getting state (memory/storage) as close as possible to compute. A monumental accomplishment as impactful as the invention of agriculture and the industrial revolution.
Software engineers: let's completely undo all that engineering by moving everything apart as far as possible. Hmmm, still too fast. Let's next add virtualization and software stacks with shitty abstractions.
Fast and powerful browser? Let's completely ignore 20 years of performance engineering and reinvent...rendering. Hmm, sucks a bit. Let's add back server rendering. Wait, now we have to render twice. Ah well, let's just call it a "best practice".
The mouse that I'm using right now (an expensive one) has a 2GB desktop Electron app that seems to want to update itself twice a week.
The state of us, the absolute garbage that we put out, and the creative ways in which we try to justify it. It's like a mind virus.
I want my downvotes now.
Actually, those who push for these cloudy solutions do it in part to bring data close to you. I am talking mostly about CDNs; I don't think YouTube and Netflix would have been possible without them.
Google is a US company, but you don't want people in Australia to connect to the other side of the globe every time they need to access Google services, it would be an awful waste of intercontinental bandwidth. Instead, Google has data centers in Australia to serve people in Australia, and they only hit US servers when absolutely needed. And that's when you need to abstract things out. If something becomes relevant in Australia, move it in there, and move it out when it no longer matters. When something big happens, copy it everywhere, and replace the copies by something else as interest wanes.
Big companies need to split everything, they can't centralize because the world isn't centralized. The problem is when small businesses try to do the same because "if Google is so successful doing that, it must be right". Scale matters.
You're right on the CDN part, but my criticism was highly generic, for sure this doesn't mean every single distributed architecture is a bad idea.
Distributed means different things in different contexts.
CDN = good distribution.
Microservices = bad distribution.
Agreed and I think it's easier to compare tech to the movie industry. Just look at all the crappy movies they produce with IMDB ratings below 5 out of 10, that is movies that nobody's going to even watch; then there are the shitty blockbusters with expensive marketing and greatly simplified stories optimized for mindless blockbuster movie goers; then there are rare gems, true works of art that get recognized at festivals at best but usually not by the masses. The state of the movie industry is overall pathetic, and I see parallels with the tech here.
> Software engineers: let's completely undo all that engineering by moving everything apart as far as possible. Hmmm, still too fast. Let's next add virtualization and software stacks with shitty abstractions.
That's because the concept which is even more impactful than agriculture and the computer, and which makes them and everything else in our lives possible, is abstraction. It makes it possible to reason about large and difficult problems, to specialize, to have multiple people working on them.
Computer hardware is as full of abstraction and separation and specialization as software is. The person designing the logic for a multiplier unit has no more need to know how transistors are etched into silicon than a javascript programmer does.
None of that means anything.
The web is slower than ever. Desktop apps 20 years ago were faster than today's garbage. We failed.
You've more or less described Wirth's Law: https://en.wikipedia.org/wiki/Wirth%27s_law
Heh, there's a mention here of Andy and Bill's Law, "What Andy giveth, Bill taketh away," which is a reference to Andy Grove (Intel) and Bill Gates (Microsoft).
Since I have a long history with Sun Microsystems, upon seeing "Andy and Bill's Law" I immediately thought this was a reference to Andy Bechtolsheim (Sun hardware guy) and Bill Joy (Sun software guy). Sun had its own history of software bloat, with the latest software releases not fitting into contemporary hardware.
I had no idea, thanks. Consider this a broken clock being sometimes right.
> The mouse that I'm using right now (an expensive one) has a 2GB desktop Electron app that seems to want to update itself twice a week.
I'm using a Logitech MX Master 3, and it comes with the "Logi Options+" to configure the mouse. I'm super frustrated with the cranky and slow app. It updates every other day and crashes often.
The experience is much better when I can configure the mouse with an open-source driver [^0] while using Linux.
[^0] https://github.com/PixlOne/logiops
I use Logi Options too, but while it's stable for me, it still uses a bafflingly high amount of CPU. But if I don't run Logi Options, then mouse buttons 3+4 stop working :-/
It's been like that for years.
Logitech's hardware is great, so I don't know why they think it's OK to push out such shite software.
Jonathan Blow has a talk about exactly this, called Preventing the Collapse of Civilisation [0]
[0] https://www.youtube.com/watch?v=ZSRHeXYDLko
Let me add fuel to the fire. When I started my career, users were happy to select among a handful of 8x8 bitmap fonts. Nowadays, users expect to see a scalable male-doctor-skin-tone-1 emoji. The former can be implemented by blitting 8 bytes from ROM. The latter requires an SVG engine -- just to render one character.
While bloatware cannot be excluded, let's not forget that user expectations have tremendously increased.
Downvotes? But you're absolutely right. What an embarrassing industry to be a part of.
We're not a very serious industry. Despite uhm, it pretty much running the world. We're a joke. Sometimes I feel it doesn't even earn the term "engineering" at all, and rather than improving, it seems to get ever worse.
Which really is a stunning accomplishment in a backdrop of spectacular hardware advances, ever more educated people, and other favorable ingredients.
Software engineers don't want to be managing physical hardware and often need to run highly available services. When a team lacks the skill, geographic presence or bandwidth to manage physical servers but needs to deliver a highly-available service, I think the cloud offers legitimate improvements in operations with downsides such as increased cost and decreased performance per unit of cost.
Seems like a fair trade-off to make.
> Software engineers don't want to be managing physical hardware
Speak for yourself, I need to get some use out of my winter jacket ever since winters stopped being a thing.
Every new generation wants to invent a wheel until they learn it is already invented.
> However, cloud providers have often had global outages in the past, and there is no reason to assume that cloud datacenters will be down any less often than your individual servers.
A nice thing about being in a big provider is when they go down a massive portion of the internet goes down, and it makes news headlines. Users are much less likely to complain about your service being down when it's clear you're just caught up in the global outage that's affecting 10 other things they use.
This is a huge one -- value in outsourcing blame. If you're down because of a major provider outage in the news, you're viewed more as a victim of a natural disaster rather than someone to be blamed.
I hear this repeated so many times at my workplace, and it's so totally and completely uninformed.
Customers who have invested millions of dollars into making their stack multi-region, multi-cloud, or multi-datacenter aren't going to calmly accept the excuse that "AWS Went Down" when you can't deliver the services you contractually agreed to deliver. There are industries out there where having your service casually go down a few times a year is totally unacceptable (Healthcare, Government, Finance, etc). I worked adjacent to a department that did online retail a while ago and even an hour of outage would lose us $1M+ in business.
This seems like a recently popular exaggeration, I'd wager no one but a select few in the HN-bubble actually cares.
You will primarily be judged by how much of an inconvenience the outage was to every individual.
The best you can hope for is that the local ISP gets the blame, but honestly, it can't be more than a rounding error in the end.
Agreed. Recently I was discussing the same point with a non-technical friend who was explaining that his CTO had decided to move from Digital Ocean to AWS, after DO experienced some outage. Apparently the CEO is furious at him and has assumed that DO are the worst service provider because their services were down for almost an entire business day. The CTO probably knows that AWS could also fail in a similar fashion, but by moving to AWS it becomes more or less an Act of God type of situation and he can wash his hands of it.
I find this entire attitude disappointing. Engineering has moved from "provide the best reliability" to "provide the reliability we won't get blamed for the failure of". Folks who have this attitude missed out on the dang ethics course their college was teaching.
If rolling your own is faster, cheaper, and more reliable (it is), then the only justification for cloud is assigning blame. But you know what you also don't get? Accolades.
I throw a little party of one here when Office 365 or Azure or AWS or whatever Google calls its cloud products this week is down but all our staff are able to work without issue. =)
"Value in outsourcing blame"
The real reason that talented engineers secretly support all of the middle management we vocally complain about.
If you work in B2B you can put the blame on Amazon and your customers will ask "understandable, take the necessary steps to make sure it doesn't happen again". AWS going down isn't an act of God, it's something you should've planned for, especially if it happened before.
So it does not really work in B2B.
I don't really have much to do with contracts, but my company states that we have uptime of 99.xx%.
In terms of the contract, customers don't care if I have Azure/AWS or keep my server in a box under the stairs. Yes, they do due diligence and would not buy my services if I kept it in a shoebox.
But then if they lose business they come to me. I can go after Azure/AWS, but I am so small they will throw some free credits at me and tell me to go away.
Maybe if you are in the B2C area then yeah, your customers will probably shrug and accept it was M$ or Amazon if you write a sad blog post with excuses.
Users are much more sympathetic to outages when they're widespread. But, if there's a contractual SLA then their sympathy doesn't matter. You have to meet your SLA. That usually isn't a big problem as SLAs tend to account for some amount of downtime, but it's important to keep the SLA in mind.
This only holds when you are B2B. If you're serving end users, they don't care about the contract, they care about their UX.
This has given me a brilliant idea: deferring maintenance downtime until some larger user-visible service is down.
This is terrible for many reasons, but I wouldn't be surprised to hear someone has done this.
Ah yes, the 'who cut the cheese?' maintenance window.
There is also the consideration that this isn't even an argument of "other things are down too!" or "outsourcing blame" as much as, depending on what your service is of course, you are unlikely to be operating in a bubble. You likely have some form of external dependencies, or you are an external dependency, or have correlated/cross-dependency usage with another service.
Guaranteeing isolation between all of these different moving parts is very difficult. Even if you're not directly affected by a large cloud outage, it's becoming less and less common that you, or your customers, are truly isolated.
As well, if your AWS-hosted service mostly exists to service AWS-hosted customers, and AWS is down, it doesn't matter if you are down. None of your customers are operational anyways. Is this a 100% acceptable solution? Of course not. But for 95% of services/SaaS out there, it really doesn't matter.
I can't tell if this is a good thing or a bad thing though!
Imagine the clout of saying : "we stayed online while AWS died"
Depends on how technical your customer base is. Even as a developer I would tend not to ascribe too much signal to that message. All it tells me is that you don't use AWS.
"We stayed online when GCP, AWS, and Azure go down" is a different story. On the other hand, if those three go down simultaneously, I suspect the state of the world will be such that I'm not worried about the internet.
HN implicitly gets this clout - it became the real status page of most of the internet.
You also have to factor in the complexity of running thousands of servers vs running just one server. If you run just one server, it's unlikely to go down even once in its lifetime. Meanwhile, cloud providers are guaranteed to have outages due to the sheer complexity of managing thousands of servers.
Nobody ever got fired for buying IBM!
We may need to update this one, I would definitely fire someone today for buying IBM.
When migrating from [no-name CRM] to [big-name CRM] at a recent job, the manager pointed out that when [big-name CRM] goes down, it's in the Wall Street Journal, and when [no-name] goes down, it's hard to get their own Support Team to care!
No. Your users have no idea that you rely on AWS (they don't even know what it is), and they don't think of it as a valid or reasonable excuse as to why your service is down.
Another advantage is that the third-party services you depend on are also likely to be on one of the big providers, so it's one less point of failure.
If you are not maxing out or even getting above 50% utilization of 128 physical cores (256 threads), 512 GB of memory, and 50 Gbps of bandwidth for $1,318/month, I really like the approach of multiple low-end consumable computers as servers. I have been using arrays of Intel NUCs at some customer sites for years with considerable cost savings over cloud offerings. Keep an extra redundant one in the array ready to swap out a failure.
Another often overlooked option is that in several fly-over states it is quite easy and cheap to register as a public telecommunication utility. This allows you to place a powered pedestal in the public right-of-way, where you can get situated adjacent to an optical meet point and get considerable savings on installation costs of optical Internet, even from a tier 1 provider. If your server bandwidth is peak utilized during business hours and there is an apartment complex nearby you can use that utility designation and competitively provide residential Internet service to offset costs.
I uh. Providing residential Internet for an apartment complex feels like an entire business in and of itself and wildly out of scope for a small business? That's a whole extra competency and a major customer support commitment. Is there something I'm missing here?
It depends on the scale - it does not have to be a major undertaking. You are right, it is a whole extra competency and a major customer support commitment, but for a lot of the entrepreneurial folk on HN quite a rewarding and accessible learning experience.
The first time I did anything like this was in late 1984 in a small town in Iowa where GTE was the local telecommunication utility. Absolutely abysmal Internet service, nothing broadband from them at the time or from the MSO (Mediacom). I found out there was a statewide optical provider with cable going through the town. I incorporated an LLC, became a utility and built out less than 2 miles of single-mode fiber to interconnect some of my original software business customers at first. Our internal motto was "how hard can it be?" (more as a rebuke to GTE). We found out. The whole 24x7 public utility thing was very difficult for just a couple of guys. But it grew from there. I left after about 20 years and today it is a thriving provider.
Technology has made the whole process so much easier today. I am amazed more people do not do it. You can get a small rack-mount sheet metal pedestal with an AC power meter and an HVAC unit for under $2k. Being a utility will allow you to place that on a concrete pad or vault in the utility corridor (often without any monthly fee from the city or county). You place a few bollards around it so no one drives into it. You want to get quotes from some tier 1 providers [0]. They will help you identify the best locations to engineer an optical meet and those are the locations you run by the city/county/state utilities board or commission.
For a network engineer wanting to implement a fault tolerant network, you can place multiple pedestals at different locations on your provider's/peer's network to create a route diversified protected network.
After all, when you are buying expensive cloud based services that literally is all your cloud provider is doing ... just on a completely more massive scale. The barrier to entry is not as high as you might think. You have technology offerings like OpenStack [1], where multiple competitive vendors will also help you engineer a solution. The government also provides (financial) support [2].
The best perk is the number of parking spaces the requisite orange utility traffic cone opens up for you.
[0] https://en.wikipedia.org/wiki/Tier_1_network
[1] https://www.openstack.org/
[2] https://www.usda.gov/reconnect
You're missing "apartment complex" - you as the service provider contract with the apartment management company to basically cover your costs, and they handle the day-to-day along with running the apartment building.
Done right, it'll be cheaper for them (they can advertise "high speed internet included!" or whatever) and you won't have much to do assuming everything on your end just works.
The days where small ISPs provided things like email, web hosting, etc, are long gone; you're just providing a DHCP IP and potentially not even that if you roll out carrier-grade NAT.
> it is quite easy and cheap to register as a public telecommunication utility
Is North Carolina one of those states? I'm intrigued…
I have only done a few midwestern states. Call them and ask [0] - (919) 733-7328. You may want to first call your proposed county commissioner's office or city hall (if you are not rural), and ask them who to talk with about a new local business providing Internet service. If you can show the Utilities Commission that you are working with someone at the local level I have found they will treat you more seriously. In certain rural counties, you can even qualify for funding from the Rural Utilities Service of the USDA.
[0] https://www.ncuc.net/
EDIT: typos + also most states distinguish between facilities-based ISPs (i.e. with physical plant in the regulated public right-of-way) and other ISPs. Tell them you are looking to become a facilities-based ISP.
> I have been using arrays of Intel NUCs at some customer sites for years
Stares at the 3 NUCs on my desk waiting to be clustered for a local sandbox.
I don't understand the pedestal approach. Do you put your server in the pedestal, so the pedestal is in effect your data center?
I suppose a NUC or two will easily fit in there.
This is pretty devious and I love it.
I like the cut of your jib.
We have a different take on running "one big database." At ScyllaDB we prefer vertical scaling because you get better utilization of all your vCPUs, but we still will keep a replication factor of 3 to ensure that you can maintain [at least] quorum reads and writes.
So we would likely recommend running 3x big servers. For those who want to plan for failure, though, they might prefer to have 6x medium servers, because then the loss of any one means you don't take as much of a "torpedo hit" when any one server goes offline.
So it's a balance. You want to be big, but you don't want to be monolithic. You want an HA architecture so that no one node kills your entire business.
I also suggest that people planning systems create their own "torpedo test." We often benchmark to tell maximal optimum performance, presuming that everything is going to go right.
But people who are concerned about real-world outage planning may want to "torpedo" a node to see how a 2-out-of-3-nodes-up cluster operates, versus a 5-out-of-6-nodes-up cluster.
This is like planning for major jets, to see if you can work with 2 of 3 engines, or 1 of 2.
Obviously, if you have 1 engine, there is nothing you can do if you lose that single point of failure. At that point, you are updating your resume, and checking on the quality of your parachute.
> At that point, you are updating your resume, and checking on the quality of your parachute
The ordering of these events seems off but that's understandable considering we're talking about distributed systems.
I think this is the right approach, and I really admire the work you do at ScyllaDB. For something truly critical, you really do want to have multiple nodes available (at least 2, and probably 3 is better). However, you really should want to have backup copies in multiple datacenters, not just the one.
Today, if I were running something that absolutely needed to be up 24/7, I would run a 2x2 or 2x3 configuration with async replication between primary and backup sites.
Exactly. Regional distribution can be vital. Our customer Kiwi.com had a datacenter fire. 10 of their 30 nodes were turned to a slag heap of ash and metal. But 20 of 30 nodes in their cluster were in completely different datacenters so they lost zero data and kept running non-stop. This is a rare story, but you do NOT want to be one of the thousands of others that only had one datacenter, and their backups were also stored there and burned up with their main servers. Oof!
https://www.scylladb.com/2021/03/23/kiwi-com-nonstop-operati...
Well said. Caring about vertical scale doesn't mean you have to throw out a lot of the lessons learned about still being horizontally scalable or high availability.
Some comments wrongly equate bare-metal with on-premise. Bare-metal servers can be rented out, collocated, or installed on-premise.
Also, when renting, the company takes care of hardware failures. Furthermore, as hard disk failures are the most common issue, you can have hot spares and opt to let damaged disks rot, instead of replacing them.
For example, in ZFS, you can mirror disks 1 and 2 while keeping disks 3 and 4 as hot spares with a single command along the lines of the following (pool and device names are illustrative):
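    # "tank" and the da* devices are placeholders for your pool and disk names
    zpool create tank mirror da0 da1 spare da2 da3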
---
The 400Gbps are now 700Gbps
https://twitter.com/DanRayburn/status/1519077127575855104
---
About the break even point:
Disregarding the security risks of multi-tenant cloud instances, bare-metal is more cost-effective once your cloud bill exceeds $3,000 per year, which is the cost of renting two bare-metal servers.
---
Here's how you can create a two-server infrastructure:
https://blog.uidrafter.com/freebsd-jails-network-setup
720Gb/s actually. Those last 20-30Gb/s were pretty hard fought :)
Yeah. Thank you!
My favorite summary of why not to use microservices is from Grug:
"grug wonder why big brain take hardest problem, factoring system correctly, and introduce network call too
seem very confusing to grug"
https://grugbrain.dev/#grug-on-microservices
IMO microservices primarily solve organizational problems, not technical problems.
They allow a team to release independently of other teams that have or want to make different risk/velocity tradeoffs. Also smaller units being released means fewer changes and likely fewer failed releases.
> Also smaller units being released means fewer changes and likely fewer failed releases.
The interfaces are the hard part, so you may have fewer internal failures but problems between services seem more likely.
I have been doing this for two decades. Let me tell you about bare metal.
Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
But you can't just buy 900 servers. You always need more capacity, so you have to predict what your peak load will be, and buy for that. And you have to do it well in advance because it takes a long time to build and ship 900 servers and then assemble them, run burn-in, replace the duds, and prep the OS, firmware, software. And you have to do this every 3 years (minimum) because old hardware gets obsolete and slow, hardware dies, disks die, support contracts expire. But not all at once, because who knows what logistics problems you'd run into and possibly not get all the machines in time to make your projected peak load.
If back then you told me I could turn on 900 servers for 1 month and then turn them off, no planning, no 3 year capital outlay, no assembly, burn in, software configuration, hardware repair, etc etc, I'd call you crazy. Hosting providers existed but nobody could just give you 900 servers in an hour, nobody had that capacity.
And by the way: cloud prices are retail prices. Get on a savings plan or reserve some instances and the cost can be half. Spot instances are a quarter or less the price. Serverless is pennies on the dollar with no management overhead.
If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you, as it can take up to several days for some cloud vendors to get some hardware classes in some regions. And I pray you were doing daily disk snapshots, and can get your dead disks replaced quickly.
That sounds like you have burst load. Per the article, cloud away, great fit.
The point was most people don't have that and even their bursts can fit in a single server. This is my experience as well.
The thing that confuses me is, isn't every publicly accessible service bursty on a long timescale? Everything looks seasonal and predictable until you hit the front page of Reddit, and you don't know what day that will be. You don't decide how much traffic you get, the world does.
> I have been doing this for two decades. Let me tell you about bare metal.
> Back in the day we had 1,000 physical servers to run a large scale web app. 90% of that capacity was used only for two months. So we had to buy 900 servers just to make most of our money over two events in two seasons.
> We also had to have 900 servers because even one beefy machine has bandwidth and latency limits. Your network switch simply can't pump more than a set amount of traffic through its backplane or your NICs, and the OS may have piss-poor packet performance too. Lots of smaller machines allow easier scaling of network load.
I started working with real (bare metal) servers on real internet loads in 2004 and retired in 2019. While there's truth here, there's also missing information. In 2004, all my servers had 100M ethernet, but in 2019, all my new servers had 4x10G ethernet (2x public, 2x private); actually some of them had 6x, but with 2x unconnected, I dunno why. In the meantime, CPUs, NICs, and operating systems have improved such that if you're not getting line rate for full-MTU packets, it's probably because your application uses a lot of CPU, or you've hit a pathological case in the OS (which happens, but if you're running 1000 servers, you've probably got someone to debug that).
If you still need 1000 beefy 10G servers, you've got a pretty formidable load, but splitting it up into many more smaller servers is asking for problems of different kinds. Otoh, if your load really scales to 10x for a month, and you're at that scale, cloud economics are going to work for you.
My seasonal loads were maybe 50% more than normal, but usage trends (and development trends) meant that the seasonal peak would become the new normal soon enough; cloud managing the peaks would help a bit, but buying for the peak and keeping it running for the growth was fine. Daily peaks were maybe 2-3x the off-peak usage, 5 or 6 days a week; a tightly managed cloud provisioning could reduce costs here, but probably not enough to compete with having bare metal for the full day.
Let me take you back to March, 2020, when millions of Americans woke up to find out there was a pandemic and they would be working from home now. Not a problem, I'll just call up our cloud provider and request more cloud compute. You join a queue of a thousand other customers calling in that morning for the exact same thing. A few hours on hold and the CSR tells you they aren't provisioning any more compute resources. east-us is tapped out, central-europe tapped out hours ago, California got a clue and they already called to reserve so you can't have that either.
I use cloud all the time but there are also blackswan events where your IaaS can't do anymore for you.
I never had this problem on AWS though I did see some startups struggle with some more specialized instances. Are midsize companies actually running into issues with non-specialized compute on AWS?
That's a good point about cloud services being retail. My company gets a very large discount from one of the most well-known cloud providers. This is available to everybody - typically if you commit to 12 months of a minimum usage then you can get substantial discounts. What I know is so far everything we've migrated to the cloud has resulted in significantly reduced total costs, increased reliability, improved scalability, and is easier to enhance and remediate. Faster, cheaper, better - that's been a huge win for us!
The entire point of the article is that your dated example no longer applies: you can fit the vast majority of common loads on a single server now, they are this powerful.
Redundancy concerns are also addressed in the article.
> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you
You are taking this a bit too literally. The article itself says one server (and backups). So "one" here just means a small number not literally no fallback/backup etc. (obviously... even people you disagree with are usually not morons)
> If you don't want to learn new things, buy one big server. I just pray it doesn't go down for you
There's intermediate ground here. Rent one big server, reserved instance. Cloudy in the sense that you get the benefits of the cloud provider's infrastructure skills and experience, and uptime, plus easy backup provisioning; non-cloudy in that you can just treat that one server instance like your own hardware, running (more or less) your own preferred OS/distro, with "traditional" services running on it (e.g. in our case: nginx, gitea, discourse, mantis, ssh)
> Hosting providers existed but nobody could just give you 900 servers in an hour, nobody had that capacity
> it can take up to several days for some cloud vendors to get some hardware classes in some regions.
I wonder how these two can be true at the same time…
I handled an 8x increase in traffic to my website from a YouTuber reviewing our game by increasing the cache timer and stopping the wiki from creating session table entries for logged-out users on a wiki that required accounts to edit.
We were already getting multiple millions of page hits a month before this happened.
This server had 8 cores, but 5 of them were reserved for the game servers (10 TB a month in bandwidth) running on the same machine.
If you needed 1,000 physical computers to run your webapp, you fucked up somewhere along the line.
I didn't want to write a top-level comment and I'm sure few people will see this, but I scrolled down very far in this thread and didn't see this point made anywhere:
The article focuses almost entirely on technical questions, but the technical considerations are secondary; the reason so many organizations prefer cloud services, VMs, and containers is to manage the challenges of scaling organizationally, not technically.
Giving every team the tools necessary to spin up small or experimental services greases the skids of a large or quickly growing organization. It's possible to set this up on rented servers, but it's an up front cost in time.
The article makes perfect sense for a mature public facing service with a lot of predictable usage, but the sweet spot for cloud services is sprawling organizations with lots of different teams doing lots of different mostly-internally facing things.
I agree with almost everything you said; except that the article offers extremely valuable advice for small startups going the cloud / rented VM route: Yearly payments, or approaching a salesperson, can lead to much lower costs.
(I should point out that yesterday, in Azure, I added a VM in a matter of seconds and it took all of 15 minutes to boot up and start running our code. My employer is far too small to have dedicated ops; the cost of cloud VMs is much cheaper than hiring another ops / devops / whatever.)
Yep. To be clear, I thought it was a great article with lots of great advice, just too focused on the technical aspects of cloud benefits, whereas I think the real value is organizational.
Interesting write-up that acknowledges the benefits of cloud computing while starkly demonstrating the value proposition of just one powerful, on-prem server. If it's accurate, I think a lot of people are underestimating the mark-up cloud providers charge for their services.
I think one of the major issues I have with moving to the cloud is a loss of sysadmin knowledge. The more locked in you become to the cloud, the more that knowledge atrophies within your organization. Which might be worth it to be nimble, but it's a vulnerability.
Given that AWS holds up the entire Amazon company, and is a large part of Bezos's personal wealth, I think the mark-up is pretty good.
I like One Big (virtual) Server until you come to software updates. At a current project we have one server running the website in production. It runs an old version of Centos, the web server, MySQL and Elasticsearch all on the one machine.
No network RTTs when doing too many MySQL queries on each page - great! But when you want to upgrade one part of that stack... we end up cloning the server, upgrading it, testing everything, and then repeating the upgrade in-place on the production server.
I don't like that. I'd far rather have separate web, DB and Elasticsearch servers where each can be upgraded without fear of impacting the other services.
You could just run system containers (eg. lxd) for each component, but still on one server. That gets you multiple "servers" for the purposes of upgrades, but without the rest of the paradigm shift that Docker requires.
Which is great until there's a security vuln in an end-of-life piece of core software (the distro, the kernel, lxc, etc) and you need to upgrade the whole thing, and then it's a 4+ week slog of building a new server, testing the new software, fixing bugs, moving the apps, finding out you missed some stuff and moving that stuff, shutting down the old one. Better to occasionally upgrade/reinstall the whole thing with a script and get used to not making one-off changes on servers.
If I were to buy one big server, it would be as a hypervisor. Run Xen or something and that way I can spin up and down VMs as I choose, LVM+XFS for snapshots, logical disk management, RAID, etc. But at that point you're just becoming a personal cloud provider; might as well buy smaller VMs from the cloud with a savings plan, never have to deal with hardware, make complex changes with a single API call. Resizing an instance is one (maybe two?) API call. Or snapshot, create new instance, delete old instance: 3 API calls. Frickin' magic.
"the EC2 Instance Savings Plans offer up to 72% savings compared to On-Demand pricing on your Amazon EC2 Instances" - https://aws.amazon.com/savingsplans/
I use LXC a lot for our relatively small production setup. And yes, I'm treating the servers like pets, not cattle.
What's nice is that I can snapshot a container and move it to another physical machine. Handy for (manual) load balancing and upgrades to the physical infrastructure. It is also easy to run a snapshot of the entire server and then run an upgrade, then if the upgrade fails, you roll back to the old snapshot.
Doesn't the container help with versioning the software inside it, but you're still tied to the host computer's operating system, and so when you upgrade that you have to test every single container to see if anything broke?
Whereas if running a VM you have a lot more OS upgrades to do, but you can do them individually and they have no other impact?
This is the bit I've never understood with containers...
Even lxd has updates, many a times security updates.
Containers are your friend here. The sysadmin tools that have grown out of the cloud era are actually really helpful if you don't cloud too much.
Yup! Docker is probably the greatest language-agnostic tool a developer can add to their toolbox.
What about kernel updates on the host?
In the paper on Twitter’s “Who to Follow” service they mention that they designed the service around storing the entire twitter graph in the memory of a single node:
> An interesting design decision we made early in the Wtf project was to assume in-memory processing on a single server. At first, this may seem like an odd choice, running counter to the prevailing wisdom of “scaling out” on cheap, commodity clusters instead of “scaling up” with more cores and more memory. This decision was driven by two rationales: first, because the alternative (a partitioned, distributed graph processing engine) is significantly more complex and difficult to build, and, second, because we could! We elaborate on these two arguments below.
> Requiring the Twitter graph to reside completely in memory is in line with the design of other high-performance web services that have high-throughput, low-latency requirements. For example, it is well-known that Google’s web indexes are served from memory; database-backed services such as Twitter and Facebook require prodigious amounts of cache servers to operate smoothly, routinely achieving cache hit rates well above 99% and thus only occasionally require disk access to perform common operations. However, the additional limitation that the graph fits in memory on a single machine might seem excessively restrictive.
I always wondered if they still do this and if this influenced any other architectures at other companies.
Paper: https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69...
Yeah I think single machine has its place, and I once sped up a program by 10000x by just converting it to Cython and having it all fit in the CPU cache, but the cloud still does have a place! Even for non-bursty loads. Even for loads that theoretically could fit in a single big server.
Uptime.
Or are you going to go down as all your workers finish? Long connections? Etc.
It is way easier to gradually hand traffic over across multiple API servers as you do an upgrade than it is to figure out what to do with a single beefy machine.
I'm not saying it is always worth it, but I don't even think about the API servers when a deploy happens anymore.
Furthermore, if you build your whole stack this way, it will be non-distributed-by-default code. Easy to transition for some things, hell for others. Some access patterns or algorithms are fine when everything is in a CPU cache or memory but would fall over completely across multiple machines. Part of what's nice about starting cloud-first is that it is generally easier to scale to billions of people afterwards.
That said, I think the original article makes a nuanced case with several great points and I think your highlighting of the Twitter example is a good showcase for where single machine makes sense.
> 1 million IOPS on a NoSQL database
I have gone well beyond this figure by doing clever tricks in software and batching multiple transactions into IO blocks where feasible. If your average transaction is substantially smaller than the IO block size, then you are probably leaving a lot of throughput on the table.
The point I am trying to make is that even if you think "One Big Server" might have issues down the road, there are always some optimizations that can be made. Have some faith in the vertical.
This path has worked out really well for us over the last ~decade. New employees can pick things up much more quickly when you don't have to show them the equivalent of a nuclear reactor CAD drawing to get started.
> batching multiple transactions into IO blocks where feasible. If your average transaction is substantially smaller than the IO block size, then you are probably leaving a lot of throughput on the table.
Could you expand on this? A quick Google search didn't help. Link to an article or a brief explanation would be nice!
Sure. If you are using some micro-batched event processing abstraction, such as the LMAX Disruptor, you have an opportunity to take small batches of transactions and process them as a single unit to disk.
For event sourcing applications, multiple transactions can be coalesced into a single IO block & operation without much drama using this technique.
Surprisingly, this technique also lowers the amount of latency that any given user should experience, despite the fact that you are "blocking" multiple users to take advantage of small batching effects.
As per usual, don't copy Google if you don't have the same requirements. Google Search never goes down. HN goes down from time and nobody minds. Google serves tens (hundreds?) of thousands of queries per second. HN serves ten. HN is fine with one server because it's small. How big is your service going to be? Do that boring math :)
Correct. I like to ask "how much money do we lose if the site goes down for 1hr? a day?" etc.. and plan around that. If you are losing 1m an hour, or 50m if it goes down for a day, hell yeah you should spend a few million on making sure your site stays online!
But, it is amazing how often c-levels cannot answer this question!
Even Google search has gone down apparently, for five minutes in 2013:
https://www.cnet.com/tech/services-and-software/google-goes-...
There were huge availability issues as recently as December 14th, 2020, for 45 minutes.
I think Elixir/Erlang is uniquely positioned to get more traction in the inevitable microservice/kubernetes backlash and the return to single server deploys (with a hot backup). Not only does it usually sip server resources but it also scales naturally as more cores/threads are available on a server.
Going from an Erlang "monolith" to a Java/k8s cluster, I was amazed at how much more work it takes to build a "modern" microservice. Erlang still feels like the future to me.
Can you imagine if even a fraction of the effort poured in to k8s tooling had gone in to the Erlang/OTP ecosystem instead?
This is the norm. It's only weird things like Node.js and Ruby that don't have this property.
While individual Node.js processes are single-threaded, Node.js includes a standard API that distributes its load across multiple processes, and therefor cores.
- https://nodejs.org/api/cluster.html#cluster
Don't be scared of 'one big server' for reliability. I'd bet that if you hired a big server today in a datacenter, the hardware will have more uptime than something cloud-native with az-failover hosted on AWS.
Just make sure you have a tested 30 minute restoration plan in case of permanent hardware failure. You'll probably only use it once every 50 years on average, but it will be an expensive event when it happens.
The way I code now after 10 years: Use one big file. No executable I'm capable of writing on my own is complex enough to need 50 files spread across a 3-layers-deep directory tree. Doesn't matter if it's a backend, a UI, or what. There's no way your React or whatever tutorial example code needs that either. And you don't gain any meaningful organization splitting into files when there are already namespaces, classes, structs, comments, etc. I don't want to waste time reorganizing it, dealing with imports, or jumping around different files while I code.
Oh, there's some custom lib I want to share between executables, like a Postgres client? Fine, it gets its own new file. Maybe I end up with 4 files in the end.
I like simplicity but this sounds pretty awful if you work in a team - file structure can help with the domain design too.
This is sorta how our team does things, and so far it hasn't presented issues. Each service has the vast majority of its real logic in a single file. Worst case, one day this stops working, and someone takes 10 minutes to split things into a separate file.
On the other side, I've seen people spend hours preemptively deciding on a file structure. It often stops making sense a month later, and every code review has a back and forth argument about what to name a new file.
I read it as satire.
Reminds me of a company I used to work at which took a similar approach. We used a one-file-per-person policy: each developer had their own file that contained functionality developed by them, named like firstName_lastName.ext. Everyone owned their file, so we didn't have to worry about merge conflicts.
On the team at my day job, it'd be very bad for each person to strictly "own" their code like that because things get handed off all the time, but in some other situations I can see it making sense.
I am using Firebase on a project and I regret it.
There are some Firebase specific annoyances to put up with, like the local emulator is not as nice and "isomorphic" as say running postgresql locally.
But the main problem (and I think this is shared by what I call loosely "distributed databases") is you have to think really hard about how the data is structured.
You can't structure it as nicely from a logical perspective as you can in a relational DB, because you can't join without pulling data from all over the place; the data isn't in one place. It is hard to do joins, both in terms of performance and in terms of developer ergonomics.
I really miss SELECT A.X, B.Y FROM A JOIN B ON A.ID = B.AID; when using Firebase.
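For what it's worth, here's roughly what that query turns into with the Firestore client SDK. This is a hedged sketch: the collection and field names just mirror the SQL above, and it assumes the Firebase app is already initialized. You get one query plus an extra query per row instead of a single join.

    import { getFirestore, collection, getDocs, query, where } from "firebase/firestore";

    // SELECT A.X, B.Y FROM A JOIN B ON A.ID = B.AID, done by hand on the client.
    async function joinAB() {
      const db = getFirestore();
      const rows: { x: unknown; y: unknown }[] = [];
      const aDocs = await getDocs(collection(db, "A"));
      for (const a of aDocs.docs) {
        // One extra round trip per A row to find its matching B rows.
        const bDocs = await getDocs(query(collection(db, "B"), where("aid", "==", a.id)));
        for (const b of bDocs.docs) rows.push({ x: a.data().x, y: b.data().y });
      }
      return rows;
    }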
You have to make data storage decisions early on, and it is hard to change your mind later. It is hard to migrate (and may be expensive if you have a lot of existing data).
I picked Firebase for the wrong reason (I thought it would make MVP quicker to set up). But the conveniences it provides are outweighed by having to structure your data for distribution across servers.
Instead next time I would go relational, then when I hit a problem do that bit distributed. Most tables have 1000s of records. Maybe millions. The table with billions might need to go out to something distributed.
Market gap??:
Let me rent real servers, but expose it in a "serverless" "cloud-like" way, so I don't have to upgrade the OS and all that kind of stuff.
In my opinion the best argument for RDBMSs came, ironically, from Rick Houlihan, who was at that time devrel for DynamoDB. Paraphrasing from memory, he said "most data is relational, because relationships are what give data meaning, but relational databases don't scale."
Which, maybe if you're Amazon, RDBMSs don't scale. But for a pleb like me, I've never worked on a system even close to the scaling limits of an RDBMS, not even within an order of magnitude of what a beefy server can do.
DynamoDB, Firebase, etc. require me to denormalize data, shape it to conform to my access patterns, and pray that the access patterns don't change.
No. I think I'll take normalized data in an RDBMS, scaling be damned.
> Let me rent real servers, but expose it in a "serverless" "cloud-like" way, so I don't have to upgrade the OS and all that kind of stuff.
I think you're describing platform-as-a-service? It does exist, but it didn't eat cloud's lunch, rather the opposite I expect.
It's hard to sell a different service when most technical people in medium-big companies are at the mercy of non-technical people who just want things to be as normal as possible. I recently encountered this problem where even using Kubernetes wasn't enough, we had to use one of the big three, even though even sustained outages wouldn't be very harmful to our business model. What can I say, boss want cloud.
Yes, it's very hard to beat Postgres IMO. You can use Firebase without using its database, and you can certainly run a service with a Postgres database without having to rent out physical servers.
Maybe you would be interested in Supabase. It's what I moved to after having the same experience as you using Firebase
At various points in my career, I worked on Very Big Machines and on Swarms Of Tiny Machines (relative to the technology of their respective times). Both kind of sucked. Different reasons, but sucked nonetheless. I've come to believe that the best approach is generally somewhere in the middle - enough servers to ensure a sufficient level of protection against failure, but no more to minimize coordination costs and data movement. Even then there are exceptions. The key is don't run blindly toward the extremes. Your utility function is probably bell shaped, so you need to build at least a rudimentary model to explore the problem space and find the right balance.
Yes, totally.
Among those setups, the one I think is golden is: one BIG DB server and 1-4 front-end (web/API/cache) servers, with backups and the CDN handled off the box.
That is it.
Nope. Multiple small servers.
1) you need to get over the hump and build multiple servers into your architecture from the get go (the author says you need two servers minimum), so really we are talking about two big servers.
2) having multiple small servers allows us to spread our service into different availability zones
3) multiple small servers allows us to do rolling deploys without bringing down our entire service
4) once we use the multiple small servers approach it's easy to scale our compute up and down by adding or removing machines. With one big server it's difficult to scale up or down without buying more machines. Small servers we can add incrementally, but with the large-server approach scaling up requires downtime and buying a new server.
The line of thinking you follow is what is plaguing this industry with too much complexity and simultaneously throwing away incredible CPU and PCIe performance gains in favor of using the network.
Any technical decisions about how many instances to have and how they should be spread out need to start as a business decision and end in crisp numbers about recovery point/time objectives, and yet somehow that nearly never happens.
To answer your points:
1) Not necessarily. You can stream data backups to remote storage and recover from that on a new single server as long as that recovery fits your Recovery Time Objective (RTO).
2) What's the benefit of multiple AZs if the SLA of a single AZ is greater than your intended availability goals? (Have you checked your provider's single AZ SLA?)
3) You can absolutely do rolling deploys on a single server.
4) Using one large server doesn't mean you can't complement it with smaller servers on an as-needed basis. AWS even has a service for doing this.
Which is to say: there aren't any prescriptions when it comes to such decisions. Some businesses warrant your choices, the vast majority do not.
> Any technical decisions about how many instances to have and how they should be spread out need to start as a business decision and end in crisp numbers about recovery point/time objectives, and yet somehow that nearly never happens.
Nobody wants to admit that their business or their department actually has a SLA of "as soon as you can, maybe tomorrow, as long as it usually works". So everything is pretend-engineered to be fifteen nines of reliability (when in reality it sometimes explodes because of the "attempts" to make it robust).
Being honest about the actual requirements can be extremely helpful.
> simultaneously throwing away incredible CPU and PCIe performance gains
We really need to double down on this point. I worry that some developers believe they can defeat the laws of physics with clever protocols.
The amount of time it takes to round trip the network in the same datacenter is roughly 100,000 to 1,000,000 nanoseconds.
The amount of time it takes to round trip L1 cache is around half a nanosecond.
A trip down PCIe isn't much worse, relatively speaking. Maybe hundreds of nanoseconds.
Lots of assumptions and hand waving here, but L1 cache can be around 1,000,000x faster than going across the network. SIX orders of magnitude of performance are instantly sacrificed to the gods of basic physics the moment you decide to spread that SQLite instance across US-EAST-1. Sure, it might not wind up a million times slower on a relative basis, but you'll never get access to those zeroes again.
> 2) What's the benefit of multiple AZs if the SLA of a single AZ is greater than your intended availability goals? (Have you checked your provider's single AZ SLA?)
… my provider's single-AZ SLA is less than my company's intended availability goals.
(IMO our goals are also nuts, too, but it is what it is.)
Our provider, in the worst case (a VM using a managed hard disk), has an SLA of 95% within a month (I … think. Their SLA page uses incorrect units on the top line items. The examples in the legalese — examples are normative, right? — use a unit of % / mo…).
You're also assuming a provider a.) typically meets their SLAs and b.) if they don't, honors them. IME, (a) is highly service dependent, with some services being just stellar at it, and (b) is usually "they will if you can prove to them with your own metrics they had an outage, and push for a credit. Also (c.) the service doesn't fail in a way that's impactful, but not covered by SLA. (E.g., I had a cloud provider once whose SLA was over "the APIs should return 2xx", and the APIs during the outage, always returned "2xx, I'm processing your request". You then polled the API and got "2xx your request is pending". Nothing was happening, because they were having an outage, but that outage could continue indefinitely without impacting the SLA! That was a fun support call…)
There's also (d) AZs are a myth; I've seen multiple global outages. E.g., when something like the global authentication service falls over and takes basically every other service with it. (Because nothing can authenticate. What's even better is the provider then listing those services as "up" / not in an outage, because technically it's not that service that's down, it is just the authentication service. Cause God forbid you'd have to give out that credit. But the provider calling a service "up" that is failing 100% of the requests sent its way is just rich, from the customer's view.)
I agree! Our "distributed cloud database" just went down last night for a couple of HOURS. Well, not entirely down. But there were connection issues for hours.
Guess what never, never had this issue? The hardware I keep in a datacenter lol!
> The line of thinking you follow is what is plaguing this industry with too much complexity and simultaneously throwing away incredible CPU and PCIe performance gains in favor of using the network.
It will die out naturally once people realize how much the times have changed and that the old solutions based on weaker hardware are no longer optimal.
Ok, so to your points.
"It depends" is the correct answer to the question, but the least informative.
One Big Server or multiple small servers? It depends.
It always depends. There are many workloads where one big server is the perfect size. There are many workloads where many small servers are the perfect solution.
My point is that the ideas put forward in the article are flawed for the vast majority of use cases.
I'm saying that multiple small servers are a better solution on a number of different axes.
For 1) "One Server (Plus a Backup) is Usually Plenty": now I need some kind of remote storage streaming system and some kind of manual recovery. Am I going to fail over to the backup (in which case it needs to be as big as my "one server"), or will I need to manually recover from my backup?
2) Yes it depends on your availability goals, but you get this as a side effect of having more than one small instance
3) Maybe I was ambiguous here. I don't just mean rolling deploys of code. I also mean changing the server code, restarting, upgrading and swapping out the server. What happens when you migrate to a new server (when you scale up by purchasing a different box)? Now we have a manual process that doesn't get executed very often and is bound to cause downtime.
4) Now we have "Use one Big Server - and a bunch of small ones"
I'm going to add a final point on reliability. By far the biggest risk factor for reliability is me the engineer. I'm responsible for bringing down my own infra way more than any software bug or hardware issue. The probability of me messing up everything when there is one server that everything depends on is much much higher, speaking from experience.
So, like I said, I could have said "It depends", but instead I tried to give a response that was in some way illuminating and helpful, especially given the strong opinions expressed in the article.
I'll give a little color with the current setup for a site I run.
moustachecoffeeclub.com runs on ECS
I have 2 on-demand instances and 3 spot instances
- One tiny instance running my caches (Redis, memcached)
- One "permanent" small instance running my web server
- Two small spot instances running web servers
- One small spot instance running background jobs
small being about 3 GB and 1024 CPU units
And an RDS instance with backup about $67 / month
All in I'm well under $200 per month including database.
So you can do multiple small servers inexpensively.
Another aspect is that I appreciate being able to go on vacation for a couple of weeks, go camping or take a plane flight without worrying if my one server is going to fall over when I'm away and my site is going to be down for a week. In a big company maybe there is someone paid to monitor this, but with a small company I could come back to a smoking hulk of a company and that wouldn't be fun.
> you need to get over the hump and build multiple servers into your architecture from the get go (the author says you need two servers minimum), so really we are talking about two big servers.
Managing a handful of big servers can be done manually if needed - it's not pretty but it works and people have been doing it just fine before the cloud came along. If you intentionally plan on having dozens/hundreds of small servers, manual management becomes unsustainable and now you need a control plane such as Kubernetes, and all the complexity and failure modes it brings.
> having multiple small servers allows us to spread our service into different availability zones
So will 2 big servers in different AZs (whether cloud AZs or old-school hosting providers such as OVH).
> multiple small servers allows us to do rolling deploys without bringing down our entire service
Nothing prevents you from starting multiple instances of your app on one big server, nor from doing rolling deploys with big bare-metal, assuming one server can handle the peak load (so you take your first server out of the LB, upgrade it, put it back in the LB, then do the same for the second and so on).
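To make the single-server rolling deploy concrete, here's a hedged sketch (ports are made up, and a real setup would use nginx/haproxy with proper health checks): two copies of the same app on one box behind a tiny proxy, and during a deploy you restart one upstream at a time while the proxy skips whichever one is down.

    import http from "node:http";

    const upstreams = [3001, 3002]; // two copies of the same app on one box
    let next = 0;

    http.createServer((req, res) => {
      const tryOne = (attemptsLeft: number) => {
        if (attemptsLeft === 0) {
          res.statusCode = 502;
          res.end("no upstream available");
          return;
        }
        const port = upstreams[next++ % upstreams.length];
        const proxied = http.request(
          { port, path: req.url, method: req.method, headers: req.headers },
          (up) => {
            res.writeHead(up.statusCode ?? 502, up.headers);
            up.pipe(res);
          }
        );
        // An upstream being restarted mid-deploy just gets skipped.
        proxied.on("error", () => tryOne(attemptsLeft - 1));
        req.pipe(proxied);
      };
      tryOne(upstreams.length); // GET-only sketch: retried request bodies would be lost
    }).listen(8080);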
> once we use the multiple small servers approach it’s easy to scale up and down our compute by adding or removing machines. Having one server it’s difficult to scale up or down without buying more machines. Small servers we can add incrementally but with the large server approach scaling up requires downtime and buying a new server.
True but the cost premium of the cloud often offsets the savings of autoscaling. A bare-metal capable of handling peak load is often cheaper than your autoscaling stack at low load, therefore you can just overprovision to always meet peak load and still come out ahead.
I manage hundreds of servers, and use Ansible. It's simple and it gets the job done. I tried to install Kubernetes on a cluster and couldn't get it to work. I mean I know it works, obviously, but I could not figure it out and decided to stay with what works for me.
On a big server, you would probably be running VMs rather than serving directly. And then it becomes easy to do most of what you're talking about - the big server is just a pool of resources from which to make small, single purpose VMs as you need them.
Why VMs when you can use containers?
It completely depends on what you're doing. This was pointed out in the first paragraph of the article:
> By thinking about the real operational considerations of our systems, we can get some insight into whether we actually need distributed systems for most things.
I'm building an app with Cloudflare serverless and you can emulate everything locally with a single command and debug directly... It's pretty amazing.
But the way their offerings are structured means it will be quite expensive to run at scale without a multi cloud setup. You can't globally cache the results of a worker function in CDN, so any call to a semi dynamic endpoint incurs one paid invocation, and there's no mechanism to bypass this via CDN caching because the workers live in front of the CDN, not behind it.
Despite their messaging about lowering cloud costs, they have explicitly designed their products to contain people in a cost structure similar to, but different from, egress fees. And in fact it's quite easily bypassed by using a non-Cloudflare CDN in front of Cloudflare serverless.
Anyway, I reached a similar conclusion that for my app a single large server instance works best. And actually I can fit my whole dataset in RAM, so disk/JSON storage and load on startup is even simpler than trying to use multiple systems and databases.
Further, I can run this on a laptop for effectively free, and cache everything via CDN, rather than pay ~$100/month for a cloud instance.
When you're small, development time is going to be your biggest constraint, and I highly advocate all new projects start with a monolithic approach, though with a structure that's conducive to decoupling later.
As someone who has only dabbled with serverless (Azure functions), the difficulty in setting up a local dev environment was something I found really off-putting. There is no way I am hooking up my credit card to test something that is still in development. It just seems crazy to me. Glad to hear Cloudflare workers provides a better experience. Does it provide any support for mocking commonly used services?
Yes, you can run your entire serverless infrastructure locally with a single command and close to 0 config.
It's far superior to other cloud offerings in that respect.
You can even run it live in dev mode and remote debug the code. Check out miniflare/Wrangler v2
Just wish they had the ability to run persistent objects. Everything is still request-driven, yet I want to schedule things on sub-minute schedules. You can do it today, but it requires hacks.
I'm not sure if you know this, and it might not be useful to you even if you do, but workers can interact with the cache directly: https://developers.cloudflare.com/workers/runtime-apis/cache...
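A hedged sketch of what that looks like in a module worker (the 60-second s-maxage is made up, and the DurableObjectState/Request types assume @cloudflare/workers-types). Note it doesn't avoid the paid invocation the sibling comment mentions, it just avoids re-hitting the origin from that colo:

    export default {
      async fetch(request: Request): Promise<Response> {
        const cache = caches.default;      // the colo-local cache from the linked docs
        let response = await cache.match(request);
        if (!response) {
          response = await fetch(request); // fall through to the origin
          response = new Response(response.body, response); // copy so headers are mutable
          response.headers.set("Cache-Control", "s-maxage=60");
          await cache.put(request, response.clone());
        }
        return response;
      },
    };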
Yes, but the worker is in front of the cache (have to pay for an invocation even if cached), and the worker only interacts with the closest cache edge node, not the entire CDN.
But yeah, there are a few hacky ways to work around things. You could have two different URLs and have the client check if the item is stale, if so, call the worker which updates it.
I'm doing something similar with durable objects. I can get it to be persistent by having a cron that calls it every minute and then setting an alarm loop within the object.
It's just super awkward. It feels like a design decision to drive monetization. Cloudflare would be perfect if they let you have a persistent durable object instance that could update global CDN content
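For anyone curious, the alarm-loop trick described above looks roughly like this. The class name and the 15-second interval are made up; the storage.setAlarm/alarm() API is from Cloudflare's Durable Objects docs, not this commenter's code:

    // Types assume @cloudflare/workers-types.
    export class Ticker {
      constructor(private state: DurableObjectState) {}

      // The external cron "kick": make sure an alarm is pending.
      async fetch(_request: Request): Promise<Response> {
        if ((await this.state.storage.getAlarm()) === null) {
          await this.state.storage.setAlarm(Date.now() + 15_000);
        }
        return new Response("alarm scheduled");
      }

      // Fires when the alarm goes off, does the work, then re-arms itself.
      async alarm(): Promise<void> {
        await doSubMinuteWork();
        await this.state.storage.setAlarm(Date.now() + 15_000);
      }
    }

    async function doSubMinuteWork(): Promise<void> {
      // placeholder for whatever needs to run every ~15 seconds
    }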
It's still the best serverless dev experience for me. Can do everything via JS while having transactional guarantees and globally distributed data right at the edge
One of the first experiences in my professional career was a situation where the "one big server" serving the system that made the money failed on a Friday, and HP's warranty was something like one or two business days to get a replacement.
The entire situation ended up in a conference call with multiple department directors deciding which server from other systems to cannibalize (even if it was underpowered) to get the system going again.
Since then I'm quite skeptical about "one", and to me this is one of the big benefits of cloud providers: most likely there is another instance available, and stockouts are rarer.
The article is really talking about one big server plus a backup vs. cloud providers.
Science advances as RAM on a single machine increases.
For many years, genomics software was non-parallel and depended on having a lot of RAM - often a terabyte or more - to store data in big hash tables. Converting that to distributed computing was a major effort, and to this day many people still just get a Big Server With Lots of Cores, RAM, and SSD.
Personally, after many years of working with distributed systems, I absolutely enjoy working on a big fat server that I have all to myself.
> Science advances as RAM on a single machine increases.
Also as people learn that correlation does not equal causation. ;)
On the other hand in science, it sure is annoying that the size of problems that fit in a single node is always increasing. PARDISO running on a single node will always be nipping at your heels if you are designing a distributed linear system solver...
As someone who's worked in cloud sales and no longer has any skin in the game, I've seen firsthand how cloud native architectures improve developer velocity, offer enhanced reliability and availability, and actually decrease lock-in over time.
Every customer I worked with who had one of these huge servers introduced coupling and state in some unpleasant way. They were locked in to persisted state, and couldn't scale out to handle variable load even if they wanted to. Beyond that, hardware utilization became contentious at any mid-enterprise scale. Everyone views the resource pool as theirs, and organizational initiatives often push people towards consuming the same types of resources.
When it came time to scale out or do international expansion, every single one of my customers who had adopted this strategy had assumptions baked into their access patterns that made sense given their single server. When it came time to store some part of the state in a way that made sense for geographically distributed consumers, it was months, not sprints, spent figuring out how to hammer this into a model that's fundamentally at odds with it.
From a reliability and availability standpoint, I'd often see customers tell me that 'we're highly available within a single data center' or 'we're split across X data centers' without considering the shared failure modes that each of these data centers had. Would a fiber outage knock out both of your DCs? Would a natural disaster likely knock something over? How about _power grids_? People often don't realize the failure modes they've already accepted.
This is obviously not true for every workload. It's tech, there are tradeoffs you're making. But I would strongly caution any company that expects large growth against sitting on a single-server model for very long.
Could confirmation bias affect your analysis at all?
How many companies went cloud-first and then ran out of money? You wouldn't necessarily know anything about them.
Were the scaling problems your single-server customers called you to solve unpleasant enough to put their core business in danger? Or was the expense just a rounding error for them?
From this and the other comment, it looks like I wasn't clear about talking about SMB/ME rather than a seed/pre-seed startup, which I understand can be confusing given that we're on HN.
I can tell you that I've never seen a company run out of money from going cloud-first (sample size of over 200 that I worked with directly). I did see multiple businesses scale down their consumption to near-zero and ride out the pandemic.
The answer to scaling problems being unpleasant enough to put the business in danger is yes, but that was also during the pandemic when companies needed to make pivots to slightly different markets. Doing this was often unaffordable from an implementation cost perspective at the time when it had to happen. I've seen acquisitions fall through due to an inability to meet technical requirements because of stateful monstrosities. I've also seen top-line revenue get severely impacted when resource contention causes outages.
The only times I've seen 'cloud-native' truly backfire were when companies didn't have the technical experience to move forward with these initiatives in-house. There are a lot of partners in the cloud implementation ecosystem who will fleece you for everything you have. One such example was a k8s microservices shop with a single contract developer managing the infra and a partner doing the heavy lifting. The partner gave them the spiel on how cloud-native provides flexibility and allows for reduced opex and the customer was very into it. They stored images in a RDBMS. Their database costs were almost 10% of the company's operating expenses by the time the customer noticed that something was wrong.
The common element in the above is scaling and reliability. While lots of startups and companies are focused on the 1% chance that they are the next Google or Shopify, the reality is that nearly all aren't, and the overengineering and redundancy-first model that cloud pushes does cost them a lot of runway.
It's even less useful for large companies; there is no world in which Kellogg is going to increase sales by 100x, or even 10x.
But most companies aren't startups. Many companies are established, growing businesses with a need to be able to easily implement new initiatives and products.
The benefits of cloud for LE are completely different. I'm happy to break down why, but I addressed the smb and mid-enterprise space here because most large enterprises already know they shouldn't run on a single rack.
Recent team I was on used one big server.
Wound up spawning off a separate thread from our would-be stateless web api to run recurring bulk processing jobs.
Then coupled our web api to the global singleton-esque bulk processing jobs thread in a stateful manner.
Then wrapped actors on top of actors on top of everything to try to wring as much performance as possible out of the big server.
Then decided they wanted to have a failover/backup server but it was too difficult due to the coupling to the global singleton-esque bulk processing job.
[I resigned at this point.]
So yeah, color me skeptical. I know every project's needs are different, but I'm a huge fan of dumping my code into some cloud host that auto-scales horizontally, and then getting back to writing more code that provides some freaking business value.
Hybrid!
If you are at all cost sensitive, you should have some of your own infrastructure, some rented, and some cloud.
You should design your stuff to be relatively easily moved and scaled between these. Build with docker and kubernetes and that's pretty easy to do.
As your company grows, the infrastructure team can schedule which jobs run where, and get more computation done for less money than just running everything in AWS, and without the scaling headaches of on-site stuff.
Clouds potentially make this an expensive option because of their silly egress bandwidth fees.
This post raises small issues like reliability, but misses a lot of much bigger issues like testing, upgrades, reproducibility, backups and even deployments. Also, the author is comparing on-demand pricing, which to me doesn't make sense when you could be paying for the server with reserved pricing. Still, I agree there would be a difference of 2-3x (unless your price is dominated by AWS egress fees), but for most servers with a fixed workload, even for very popular but simple sites, it could be done for $1k/month in the cloud, less than 10% of one developer's salary. For non-fixed workloads like ML training, you would need some cloudy setup anyway.
One thing that has helped me grow over the last few years building startups is: microservices software architecture and microservice deployment are two different things.
You can logically break down your software into DDD bounded contexts and have each own its data, but that doesn't mean you need to do Kubernetes with Kafka and dozens of tiny database instances, communicating via JSON/gRPC. You can have each "service" live in its own thread/process, have its own database (in the "CREATE DATABASE" sense, not the instance sense), communicate via a simple in-memory message queue, and communicate through "interfaces" native to your programming language.
Of course it has its disadvantages (you need to commit to a single software stack, you still might need a distributed message queue if you want load balancing, etc.) but for the "boring business applications" I've been implementing (where DDD/logical microservices make sense) it has been very useful.
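A minimal sketch of that shape, with made-up names (OrderPlaced, BillingService) that aren't from the parent: bounded contexts behind plain interfaces, glued together by an in-memory event bus, each owning its own logical database.

    import { EventEmitter } from "node:events";

    interface OrderPlaced { orderId: string; amountCents: number }

    class Bus {
      private emitter = new EventEmitter();
      publish<T>(topic: string, event: T) { this.emitter.emit(topic, event); }
      subscribe<T>(topic: string, handler: (event: T) => void) { this.emitter.on(topic, handler); }
    }

    class OrderService {
      constructor(private bus: Bus) {}
      placeOrder(orderId: string, amountCents: number) {
        // ...write to the orders schema/database here...
        this.bus.publish<OrderPlaced>("order.placed", { orderId, amountCents });
      }
    }

    class BillingService {
      constructor(bus: Bus) {
        bus.subscribe<OrderPlaced>("order.placed", (e) => this.charge(e));
      }
      private charge(e: OrderPlaced) {
        // ...write to the billing schema/database here...
        console.log(`charging ${e.amountCents} cents for order ${e.orderId}`);
      }
    }

    const bus = new Bus();
    new BillingService(bus);
    new OrderService(bus).placeOrder("o-1", 4200);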
I didn’t see a point of cloudy services being easier to manage. If some team gets a capital budget to buy that one big server, they will put every thing on it, no matter your architectural standards. Cron jobs editing state on disk, tmux sessions shared between teams, random web servers doing who knows what, non-DBA team Postgres installs, etc. at least in cloud you can limit certain features and do charge back calculations.
Not sure if that is a net win for cloud or physical, of course, but I think it is a factor
One of our projects uses 1 big server and indeed, everyone started putting everything on it (because it's powerful): the project itself, a bunch of corporate sites, a code review tool, and god knows what else. Last week we started having issues with the projects going down because something is overloading the system and they still can't find out what exactly without stopping services/moving them to a different machine (fortunately, it's internal corporate stuff, not user-facing systems). The main problem I've found with this setup is that random stuff can accumulate with time and then one tool/process/project/service going out of control can bring down the whole machine. If it's N small machines, there's greater isolation.
I believe that the "one big server" is intended for an application rather than trying to run 500 applications.
Does your application run on a single server? If yes: don't use a distributed system for its architecture or design. Simply buy bigger hardware when necessary, because the top end of servers is insanely big and fast.
It does not mean, IMHO, throw everything on a single system without suitable organization, oversight, isolation, and recovery plans.
It sounds like you need some containers.
I don't agree with EVERYTHING in the article, such as getting 2 big servers rather than multiple smaller ones, but this is really just a cost/requirements issue.
The biggest cost I've noticed with enterprises who go full cloud is that they are locked in for the long term. I don't mean contractually, though; basically, the way they design and implement any system or service MUST follow the provider's "way". This can be very detrimental when leaving the provider, or if, god forbid, the provider decides to sunset certain service versions, etc.
That said, for enterprise it can make a lot of sense and the article covers it well by admitting some "clouds" are beneficial.
For anything I've ever done outside of large businesses the go to has always been "if it doesn't require a SRE to maintain, just host your own".
> Why Should I Pay for Peak Load? [...] someone in that supply chain is charging you based on their peak load
Oh, it's even worse than that: this someone oversubscribes your hardware a little during your peak and a lot during your trough, padding their great margins at the expense of extra cache misses/perf degradation of your software that most of the time you won't notice if they do their job well.
This is one of the reasons why large companies such as my employer (Netflix) are able to invest into their own compute platforms to reclaim some of these gains back, so that any oversubscription & collocation gains materialize into a lower cloud bill - instead of having your spare CPU cycles be funneled to a random co-tenant customer of your cloud provider, the latter capturing the extra value.
A consequence of one-big-server is decreased security. You become discouraged from applying patches because you must reboot. Also if one part of the system is compromised, every service is now compromised.
Microservices on distinct systems offer damage control.
> In comparison, buying servers takes about 8 months to break even compared to using cloud servers, and 30 months to break even compared to renting.
Can anyone help me understand why the cloud/renting is still this expensive? I'm not familiar with this area, but it seems to me that big data centers must have some pretty big cost-saving advantages (maintenance? heat management?). And there are several major providers all competing in a thriving marketplace, so I would expect that to drive the cost down. How can it still be so much cheaper to run your own on-prem server?
Several points:
- The price for on-prem conveniently omits costs for power, cooling, networking, insurance and building space, it's only the purchase price.
- The price for the cloud server includes (your share of) the costs of replacing a broken power supply or hard drive, which is not included in the list price for on-prem. You will have to make sure enough of your devs know how to do that or else hire a few sysadmin types.
- As the article already mentions, the cloud has to provision for peak usage instead of average usage. If you buy an on-prem server you always have the same amount of computing power available and can't scale up quickly if you need 5x the capacity because of a big event. That kind of flexibility costs money.
Thank you, that explains it.
Not included in the break-even calculation were the cost of colocation, the cost of hiring someone to make sure the computer is in working order, and the hassle of hardware failures.
Also, as the author even mentions in the article, a modern server basically obsoletes a 10-year-old server, so you're going to have to replace your server at least every 10 years. The break-even in the case of renting makes sense when you consider that the server depreciates really quickly.
The huge capital required to get a data center with those cost savings serves as a nice moat to let people price things high.
Renting is not very expensive. 30 months is a large share of a computer's lifetime, and you are paying for space, electricity, and internet access too.
You're paying a premium for flexibility. If you don't need that then there are far cheaper options like some managed hosting from your local datacenter.
I didn't see the COST paper linked anywhere in this thread [0].
Excerpt from abstract:
We offer a new metric for big data platforms, COST, or the Configuration that Outperforms a Single Thread. The COST of a given platform for a given problem is the hardware configuration required before the platform outperforms a competent single-threaded implementation.
[0] https://www.usenix.org/conference/hotos15/workshop-program/p...
Last year I did some consulting for a client using Google cloud services such as Spanner and cloud storage. Storing and indexing mostly timeseries data with a custom index for specific types of queries. It was difficult for them to define a schema to handle the write bandwidth needed for their ingestion. In particular it required a careful hashing scheme to balance load across shards of the various tables. (It seems to be a pattern with many databases to suck at append-often, read-very-often patterns, like logs).
We designed some custom in-memory data structures in Java, but also used some of the standard high-performance concurrent data structures. Some reader/writer locks. gRPC and some pub/sub to get updates on the order of a few hundred or thousand qps. In the end, we ended up with JVM instances that had memory requirements in the 10GB range. Replicate that 3-4x for failover, and we could serve queries at higher rates and lower latency than hitting Spanner. The main thing cloud was good for was the storage of the underlying timeseries data (600GB maybe?) for fast server startup, so that they could load the index off disk in less than a minute. We designed a custom binary disk format to make that blazingly fast, and then just threw binary files into a cloud filesystem.
If you need to serve < 100GB of data and most of it is static...IMHO, screw the cloud, use a big server and replicate it for fail-over. Unless you got really high write rates or have seriously stringent transactional requirements, then man, a couple servers will do it.
YMMV, but holy crap, servers are huge these days.
I find disk io to be a primary reason to go with bare metal. The vm abstractions just kill io performance. In a single server you can fill up the PCI lanes with flash and hit some ridiculous throughput numbers.
When you say “screw the cloud”, you mean “administer an EC2 machine yourself” or really “buy your own hardware”?
The former, mostly. You don't necessarily have to use EC2, but that's easy to do. There are many other, smaller providers if you really want to get out from under the big 3. I have no experience managing hardware, so I personally wouldn't take that on myself.
Currently using two old computers as servers in my homelab: 200 GE Athlons with 35 W TDP, ~20 GB of value RAM (can't afford ECC), a few 1TB HDDs. As CI servers and test nodes for running containers, they're pretty great, as well as nodes for pulling backups from any remote servers (apart from the ECC aspect), or even something to double as a NAS (on separate drives).
I actually did some quick maths and it would appear that a similar setup on AWS would cost over 600$ per month, Azure, GCP and others also being similarly expensive, which I just couldn't afford.
Currently running a few smaller VPSes on Time4VPS as well (though Hetzner is also great), for the stuff that needs better availability and better networking. Would I want everything on a single server? Probably not, because that would mean needing something a bit better than a homelab setup behind a residential Internet connection (even if parts of it can be exposed to the Internet through a cheap VPS as a proxy, a la Cloudflare).
Either way, I appreciate the sentiment!
One thing to keep in mind is separation. The prod environment should be completely separated from the dev ones (plural, it should be cheap/fast to spin up dev environments). Access to production data should be limited to those that need it (ideally for just the time they need it). Teams should be able to deploy their app separately and not have to share dependencies (i.e operating system libraries) and it should be possible to test OS upgrades (containers do not make you immune from this). It's kinda possible to sort of do this with 'one big server' but then you're running your own virtualized infrastructure which has it's own costs/pains.
Definitely also don't recommend one big database, as that becomes a hairball quickly - it's possible to have several logical databases for one physical 'database 'server' though.
people don't account for the cpu & wall-time cost of encode-decode. I've seen it take up 70% of cpu on a fleet. That means 700/1000 servers are just doing encode decode.
You can see that high-efficiency setups like Stack Exchange and Hacker News are orders of magnitude more efficient.
This is exactly correct. If you have a microservice running a Rest API, you are probably spending most of your CPU time on HTTP and JSON handling.
Not to be nasty, but we used to call them mainframes. A mainframe is still a perfectly good solution if you need five nines of uptime, with transparent failover of pretty much every part of the machine, the absolute fastest single-thread performance and the most transaction throughput per million dollars in the market.
I would not advise anyone to run them as single machines, however, but to have them partitioned into smaller slices (they call them LPARs) and host lots of VMs in there (you can oversubscribe like crazy on those machines).
Managing a single box is cheaper, even if you have a thousand little goldfish servers in there (remember: cattle, not pets) and this is something the article only touches lightly.
"absolute fastest single-thread performance"
Can you provide citation?
The author missed the most important factor in why the cloud is dominating the world today. It is never about the actual hardware cost; it is the cost of educating people to be able to use that big server. I can guarantee you will need to pay at least $40k a month to hire someone able to write and deploy software that can actually realize the performance he claims on that big server. And your chance of finding one within 2 months is close to 0, at least in today's job market. Also, even if you find one, he can leave you within a year for some other place, and your business will be dead.
10 years ago I had a site running on an 8GB of ram VM ($80/mo?) that ran a site serving over 200K daily active users on a completely dynamic site written in PHP running MySQL locally. Super fast and never went down!
Could you share how long you maintained this website? No problem with the db (schema updates, backups, replication, etc...)? No problem with your app updates (downtime, dependencies updates, code updates)? Did you work alone, or with a team? Did you setup a CI/CD? ...
I wrote down some questions, but in fact I just think it would be interesting to understand what was your setup in a bit more detailed fashion. You probably made some concessions and it seems they worked well for you. Would be interesting to know which ones!
Thanks
Yeah, I've been saying this for a long long time now, an early blog post of mine http://drupal4hu.com/node/305.html and this madness just got worse because of Kubernetes et al. Kubernetes is a Google solution. Are you sure Google-sized solutions are right for your organization?
Also, an equally pseudo-controversial viewpoint: it's almost always cheaper to be down than to engineer an HA architecture. Take a realistic look at downtime causes outside of your control -- for example, your DDoS shield provider going down, etc. -- and then consider how much downtime a hardware failure adds, and now think. Maybe a manual-failover master-slave setup is enough, or perhaps even that's overkill? How much money does the business lose by being down versus how much it costs to protect from it? And can you really protect from it? Are you going to have regular drills to practice the failover -- and, absurdly, will the inevitable downtime from failing a few of those be larger than a single server's downtime? I rarely see posts about weighing these, while the general advice of avoiding single points of failure -- which is very hard -- is abundant.
I'm a huge advocate of cloud services, and have been since 2007 (not sure where this guy got 2010 as the start of the "cloud revolution"). That out of the way, there is something to be said for starting off with a monolith on a single beefy server. You'll definitely iterate faster.
Where you'll get into trouble is if you get popular quickly. You may run into scaling issues early on, and then have to scramble to scale. It's just a tradeoff you have to consider when starting your project -- iterate quickly early and then scramble to scale, or start off more slowly but have a better ramping up story.
One other nitpick I had is that OP complains that even in the cloud you still have to pay for peak load, but while that's strictly true, it's amortized over so many customers that you really aren't paying for it unless you're very large. The more you take advantage of auto-scaling, the less of the peak load you're paying. The customers who aren't auto-scaling are the ones who are covering most of that cost.
You can run a pretty sizable business in the free tier on AWS and let everyone else subsidize your peak (and base!) costs.
Isn't this simplistic?
It really depends on the service, how it is used, the shape of the data generated/consumed, what type of queries are needed, etc.
I've worked for a startup that hit scaling issues with ~50 customers. And I have seen services with a million-plus users on a single machine.
And what does "quickly" and "popular" even mean? It also depends a lot on the context. We need to start discussing about mental models for developers to think of scaling in a contextual way.
> Where you'll get into trouble is if you get popular quickly. You may run into scaling issues early on
Did it ever occur to you that you can still use the cloud for on demand scaling? =)
Sure, but only if you architect it that way, which most people don't if they're using one big beefy server, because the whole reason they're doing that is to iterate quickly. It's hard to build something that can burst to the cloud while moving quickly.
Also, the biggest issue is where your data is. If you want to burst to the cloud, you'll probably need a copy of your data in the cloud. Now you aren't saving all that much money anymore and you're adding architectural overhead. If you're going to burst to the cloud, you might as well just build in the cloud. :)
It was all good until NUMA came along, and now you have to carefully rethink your process, or you get lots of performance issues in your (otherwise) well-threaded code. Speaking from first-hand experience: when our level editor ended up being used by artists on a server-class machine, the supposedly 4x faster machine was actually going 2x slower (why? lots of std::shared_ptr<> use on our side, i.e. atomic reference counting, caused slowdowns, as the cache (my understanding) had to be synchronized between the two physical CPUs, each having 12 threads).
But that's really not the only issue; I'm just pointing out that you can't expect everything to scale smoothly there unless it's well thought out, like asking your OS to allocate your threads/memory on only one of the physical CPUs (and its threads), putting some big, disconnected part of your process(es) on the other one(s), and making sure the communication between them is minimal... which actually wants a microservices design again at that level.
so why not go with micro-services instead...
> The big drawback of using a single big server is availability. Your server is going to need downtime, and it is going to break. Running a primary and a backup server is usually enough, keeping them in different datacenters.
What about replication? I assume the 70k postgres IOPS fall to the floor when needing to replicate the primary database to a backup server in a different region.
Great article overall with many good points worth considering. Nothing is one size fits all so I won't get into the crux of the article: "just get one big server". I recently posted a comment breaking down the math for my situation:
https://www.section179.org/section_179_leases/
It blows my mind people are spending $2000+ per month for a server they can get used for $4000-5000 one time only cost.
VMWare + Synology Business Backup + Synology C2 backup is our way of doing business and it has never failed us in over 7 years. Why do people spend so much money on the cloud when they can host it themselves for less than 5% of the cost (assuming 2 years of usage)?
I've tried it all except this, including renting bare metal. Nowadays I'm in the cloud but not cloudy camp. Still, I'm intrigued.
Apart from the $4-5k server, what are your running costs? Licenses? Colocation? Network?
https://www.he.net/colocation.html
They have been around forever and their $400 deal is good, but that is for 42U, 1G and only 15 amps. With beefier servers, you will need more of both (bandwidth and amperage) if you intend on filling the rack.
>Use the Cloud, but don’t be too Cloudy
The number of applications I have inherited that were messes falling apart at the seams because of misguided attempts to avoid "vendor lock-in" with the cloud cannot be overstated. There is something I find ironic about people paying to use a platform but not using it because they feel like using it too much will make them feel compelled to stay there. It's basically starving yourself so you don't get too familiar with eating regularly.
Kids, this PSA is for you. Auto Scaling Groups are just fine, as are all the other "Cloud Native" services. Most business partners will tell you a dollar of growth is worth 5x-10x the value of a dollar of savings. Building a huge tall computer will be cheaper, but if it isn't 10x cheaper (and that is Total Cost of Ownership, not the cost of the metal) and you are moving more slowly than you otherwise would, it's almost a certainty you are leaving money on the table.
Aggressively avoiding lock-in is something I've never quite understood. Unless your provider of choice is also your competitor (like Spotify with Amazon) it shouldn't really be a problem. I'm not saying I'm a die-hard cloud fan in all aspects, but if you're going with it you may as well use it. Typically, trying to avoid vendor lock-in really ends up more expensive in the long run; you start avoiding the cheaper services (Lambda for background job processing) for what may never really be a problem.
The one place I can see avoiding vendor lock-in as really useful is it often makes running things locally much easier. You're kind of screwed if you want to properly run something locally that uses SQS, DynamoDB, and Lambda. But that said, I think this is often better thought of as "keep my system simple" rather than "avoid vendor lock-in" as it focuses on the valuable side rather than the theoretical side.
> If you compare to the OVHCloud rental price for the same server, the price premium of buying your compute through AWS lambda is a factor of 25
and there is a factor-of-25 reason why OVH is not a company where you should rent servers:
https://www.google.com/search?q=ovh+fire
The whole argument comes down to bursty vs. non-bursty workloads. What type of workloads make up the fat part of the distribution? If most use cases are bursty (which I would argue they are) then the author's argument only applies for specific applications. Therefore, most people do indeed see cost benefits from the cloud.
Reading these comments make me sad. It's like everyone has forgotten the cookie cutter server architecture pattern.
https://dzone.com/articles/monoliths-cookie-cutter-or
I really don't understand microservices for most businesses. They're great if you put the effort into them, but most businesses don't have the scale required.
Big databases and big servers serve most businesses just fine. And past that NFS and other distributed filesystem approaches get you to the next phase by horizontally scaling your app servers without needing to decompose your business logic into microservices.
The best approach I've ever seen is a monorepo codebase with non-micro services built into it all running the same way across every app server with a big loadbalancer in front of it all.
No thanks. I have a few hobby sites, a personal vanity page, and some basic CPU expensive services that I use.
Moving to AWS serverless has saved me so much headache with system updates, certificate management, archival and backup, networking, and so much more. Not to mention that with my low-but-spiky load, my break-even is a long way off.
One-big-VM is another approach...
A big benefit is some providers will let you resize the VM bigger as you grow. The behind-the-scenes implementation is they migrate your VM to another machine with near-zero downtime. Pretty cool tech, and takes away a big disadvantage of bare metal which is growth pains.
That's why Let's Encrypt uses a single database on a powerful server: https://letsencrypt.org/2021/01/21/next-gen-database-servers...
I've started augmenting one big server with iCloud (CloudKit) storage, specifically syncing local Realm DBs to the user's own iCloud storage. Which means I can avoid taking custody of PII/problematic data, can include non-custodial privacy in product value/marketing, and means I can charge enough of a premium for the one big server to keep it affordable. I know how to scale servers in and out, so I feel the value of avoiding all that complexity. This is a business approach that leans into that, with a way to keep the business growing with domain complexity/scope/adoption (iCloud storage, probably other good APIs like this to work with along similar lines).
I would think that it can hold 1TB of RAM _per_socket_ (with 64GB DIMM), so _2TB_ total.
> Populated with specialized high-capacity DIMMs (which are generally slower than the smaller DIMMs), this server supports up to 8 TB of memory total.
At work we're building a measurement system for wind tunnel experiments, which should be able to sustain 500 MB/sec for minutes on end, preferably while simultaneously reading/writing from/to disk for data format conversion. We bought a server with 1TB of RAM, but I wonder how much slower these high-capacity DIMMs are. Can anyone point me to information regarding latency and throughput? More RAM for disk caching might be something to look at.
One can embrace this philosophy for one's personal computing too http://catern.com/computers.html although it's not for everyone
I am using a semi big cloud VPS to host all my live services. It's 'just' a few thousand users per day over 10+ websites.
The combination of Postgres, Nginx, Passenger & Cloudflare makes this an easy experience. The cloud (in this case Vultr) allows on-demand scaling and backups, and so far I've had zero downtime because of them.
In the past I've run a mixture of cloud and some dedicated servers and since migrating I have less downtime and way less work and no worse load times.
Being cloudy without being too cloudy, as per the article, I've gone with a full stack in containers under Docker Compose on one EC2 server, including the database. Services are still logically separated and have a robust CI/CD setup, but the cost is a third of what an ECS setup with load balancers and RDS for the database would have been. It's also simpler. I have scripted the server setup, with regular backups/snapshots, but admit I would like DB replication in there.
If you're hosting on-prem then you have a cluster to configure and manage, you have multiple data centers you need to provision, you need data backups you have to manage plus the storage required for all those backups. Data centers also require power, cooling, real estate taxes, administration - and you need at least two of them to handle systemic outages. Now you have to manage and coordinate your data between those data centers. None of this is impossible of course, companies have been doing this everyday for decades now. But let's not pretend it doesn't all have a cost - and unless your business is running a data center, none of these costs are aligned with your business' core mission.
If you're running a start-up it's pretty much a no-brainer you're going to start off in the cloud.
What's the real criteria to evaluate on-prem versus the cloud? Load consistency. As the article notes, serverless cloud architectures are perfect for bursty loads. If your traffic is highly variable then the ability to quickly scale-up and then scale-down will be of benefit to you - and there's a lot of complexity you don't have to manage to boot! Generally speaking such a solution is going to be cheaper and easier to configure and manage. That's a win-win!
If your load isn't as variable and you therefore have cloud resources always running, then it's almost always cheaper to host those applications on-prem - assuming you have on-prem hosting available to you. As I noted above, building data centers isn't cheap and it's almost always cheaper to stay in the cloud than it is to build a new data center, but if you already have data center(s) then your calculus is different.
Another thing to keep in mind at the moment is even if you decide to deploy on-prem you may not be able to get the hardware you need. A colleague of mine is working on a large project that's to be hosted on-prem. It's going to take 6-12 months to get all the required hardware. Even prior to the pandemic the backlog was 3-6 months because the major cloud providers are consuming all the hardware. Vendors would rather deal with buyers buying hardware by the tens of thousands than a shop buying a few dozen servers. You might even find your hardware delivery date getting pushed out as the "big guys" get their orders filled. It happens.
You know you can run a server in the cellar under your stairs.
You know that if you are a startup you can just keep servers in a closet and hope that no one turns on the coffee machine while the aircon is running, because that will pop the circuit breakers, which will take down your server (or maybe you at least have a UPS, so maybe not :)).
I have read horror stories about companies having such setups.
While they don't need multiple data centers, power, cooling and redundancy sound to them like some kind of STD; getting a cheap VPS should be the default for such people. That is a win as well.
Many people will respond that "one big server" is a massive single point of failure, but in doing so they miss that it is also a single point of success. If you have a distributed system, you have to test and monitor lots of different failure scenarios. With a SPOS, you only have one thing to monitor. For a lot of cases the reliability of that SPOS is plenty.
Bonus: Just move it to the cloud, because AWS is definitely not its own SPOF and it never goes down taking half the internet with it.
"In total, this server has 128 cores with 256 simultaneous threads. With all of the cores working together, this server is capable of 4 TFLOPs of peak double precision computing performance. This server would sit at the top of the top500 supercomputer list in early 2000. It would take until 2007 for this server to leave the top500 list. Each CPU core is substantially more powerful than a single core from 10 years ago, and boasts a much wider computation pipeline."
I may be misunderstanding, but it looks like the micro-services comparison here is based on very high usage. Another use for micro-services, like lambda, is exactly the opposite. If you have very low usage, you aren't paying for cycles you don't use the way you would be if you either owned the machine, or rented it from AWS or DO and left it on all the time (which you'd have to do in order to serve that randomly-arriving one hit per day!)
If you have microservices that truly need to be separate services and have very little usage, you probably should use things like serverless computing. It scales down to 0 really well.
However, if you have a microservice with very little usage, turning that service into a library is probably a good idea.
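As a sketch of what "turn it into a library" can look like (all names here are made up for illustration, not taken from the article):

    # Hypothetical example: a rarely-called "thumbnail" microservice collapsed
    # into an in-process library. Names are illustrative only.

    # Before: the caller paid a network round-trip to a separate deployment,
    # e.g. POST http://thumbnail-svc/resize {"image_id": ..., "width": 128}

    # After: the same logic ships as an importable function.
    def resize(image_id: str, width: int) -> str:
        # ...the resizing logic that used to live behind the service endpoint...
        return f"{image_id}@{width}w"

    def handle_upload(image_id: str) -> None:
        thumb = resize(image_id, width=128)  # plain function call: no network hop,
        print(thumb)                         # no separate deploy, nothing to keep warm

You lose independent deployment, but for something that barely gets called, that is usually a fine trade.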
Yes. I think that the former case is the situation we’re in. Lambdas are annoying (the whole of AWS is annoying!) but, as you say, they scale to 0 very well.
Why open yourself to random $300k bills from Amazon when the alternative is wasting a $5/month server?
I don’t understand what these numbers are referring to.
Let's be clear here: everything you can do in a "cloudy" environment, you could do on big servers yourself - but at what engineering and human resource cost? Because that's something many - if not most - hardware and 'on-prem' infra focussed people seem to miss. While cloud might seem expensive, most of the time humans will be even more expensive (unless you're in very niche markets like HPC).
You could also have those big servers in the cloud (I think this is what many are doing; I certainly have). That gives you a lot of the cloud services e.g. for monitoring, but you get to not have to scale horizontally or rebuild for serverless just yet. Works great for Kubernetes workloads, too – have a single super beefy node (i.e. single-node node pool) and target just your resource-heavy workload onto that node.
As far as costs are concerned, however, I've found that for medium+ sized orgs, cloud doesn't actually save money in the HR department. The HR spend just shifts to devops people, who tend to be expensive, and you can't really leave those roles empty, since then you'll likely get an ungovernable mess of unsecured resources that wastes a huge ton of money and may expose you to GDPR fines and all sorts of nasty breaches.
If done right, you get a ton of execution speed. Engineers have a lot of flexibility in the services they use (which they'd otherwise have to buy through processes that tend to be long and tedious), can scale as needed when needed, and can shift work to the cloud provider, while the devops/governance/security people have some pretty neat tools to make sure all of that is done in a safe and compliant manner. That tends to be worth it many times over for a lot of orgs, if done effectively with that aim, though it may not do much for companies with relatively stagnant or very simple products. If you want to reduce HR costs, cloud is probably not going to help much.
It seems like lots of companies start in the cloud due to low commitments, and then later, when they have more stability and demand and want to save costs, making bigger cloud commitments (RIs, enterprise agreements, etc.) is a turnkey way to save money but leaves you on the lower-efficiency cloud track. Has anyone had good experiences selectively offloading workloads from the cloud to bare metal servers nearby?
One advantage I didn't see in the article is the performance cost of network latency. If you're running everything on one server, your DB interactions, microservice interactions, etc. don't necessarily need to go over the network at all. I think it is safe to say that I/O is generally the biggest performance bottleneck of most web applications; minimizing or eliminating it should not be underestimated.
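To put ballpark numbers on that (typical figures, not measurements from the article):

    # Rough latency budget for 20 sequential internal calls per request.
    # All figures are typical ballparks and vary a lot in practice.
    HOP_COST_SECONDS = {
        "in-process function call": 0.000_001,   # ~1 microsecond
        "loopback on the same host": 0.000_1,    # ~0.1 ms
        "another host in the same DC": 0.000_5,  # ~0.5 ms round trip
    }
    CALLS_PER_REQUEST = 20

    for hop, cost in HOP_COST_SECONDS.items():
        print(f"{hop}: {CALLS_PER_REQUEST * cost * 1000:.2f} ms of pure hop overhead")
    # in-process: 0.02 ms, loopback: 2 ms, cross-host: 10 ms -- before any real work

Ten milliseconds of pure hop overhead per request may be tolerable, but it compounds quickly once calls fan out or cross availability zones.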
I see these debates and wish there was an approach that scaled better.
A single server (and a backup) really _is_ great. Until it's not, for whatever reason.
We need more frameworks that scale from a single box to many boxes without starting over from scratch. There are a lot of solid approaches: Erlang/Elixir and the actor model come to mind. But that approach is not perfect, and it's far from commonplace.
> We need more frameworks that scale from a single box to many boxes, without starting over from scratch.
I'm not sure I really understand what you're saying here. I suppose most applications are some kind of CRUD app these days - not all, sure, but an awful lot. If we take that as an example, how is it difficult to go from one box to multiple?
It's not something you get for free: you need to put in time to provision any new infra (be it bare metal or some kind of cloud instance), but the act of scaling out is pretty straightforward.
Perhaps you're talking about stateful applications?
You've got features to ship. Stick your stuff on Render.com and don't think about it again. Even a dummy like me can manage that.
I am really interested in scalability problems.
I recommend the whitepaper "Scalability! But at what COST?"
My experience with microservices is that they are very slow due to all the I/O. We kind of want the development and organizational scalability of decoupled services in addition to the computational and storage scalability of a disaggregated architecture.
So much of the latest tech news, and so many of the latest solutions, come from huge companies.
Let's be real here, how many of us get 10-100 million users/requests, etc.? My blog, langsoul.com, has 2: myself and a bot.
Simple dumb solutions seem best for 99% of cases. If you ever hit that 1%, well, you'll have shit tons of money to deal with it then!
One big server, one big program, and one big 10x developer. Deploy WebSphere when you need isolation. The industry truly is going in a spiral. Although, I must admit, cloud providers really overplayed their hand when it comes to performance per buck and complexity.
What holds me back from doing this is: how will I reduce latency for calls coming from the other side of the world, when OVHcloud seemingly does not have datacenters all over the world? There is a noticeable lag when it comes to multiplayer games or even web applications.
OVH has datacenters in four continents:
https://us.ovhcloud.com/about/company/data-centers
So... I guess these folks haven't heard of latency before? Fairly sure you have to have "one big server" in every country if you do this. I feel like that would get rather costly compared to geographically distributed cloud services long term.
As opposed, to "many small servers" in every country? The vast majority of startups out there run out of a single AWS region with a CDN caching read-only content. You can apply the same CDN approach to a bare-metal server.
Yeah, but if I'm a startup and running only a small server, the cloud hosting costs are minimal. I'm not sure how you think it's cheaper to host tiny servers in lots of countries and pay someone to manage that for you. You'll need IT in every one of those locations to handle the service of your "small servers".
I run services globally for my company, and there is no way we could do it that way. The fact that we just deploy containers to k8s all over the world works very well for us.
Before you give me the "oh, k8s, well you don't know bare metal" bit, please note that I'm an old hand who has done the legacy C# ASP.NET IIS workflows on bare metal for a long time. I have learned and migrated to k8s on AWS/GCloud and it is a huge improvement compared to what I used to deal with.
Lastly, as for your CDN discussion: we don't just host CDNs globally. We also host geo-located DBs + k8s pods. Our service uses WebSockets and latency is a real issue. We can't have 500 ms ping if we want to live-update our client. We choose to host locally (on what is usually NOT a small server) so we get optimal ping for the live-interaction portion of our services, which are used by millions of people every day.
>The vast majority of startups out there run out of a single AWS region with a CDN caching read-only content.
I wonder how many of them violate GDPR and similar laws in other countries in regards to personal data processing by processing everything in the US.
This is one of those problems that basically no one has. RTT from Japan to Washington, D.C. is 160 ms. There are very few applications where that amount of additional latency matters.
It adds up surprisingly quickly when you have to do a TLS handshake, download many resources on page load, etc. Setting up the connection alone costs about three round-trips (one for the TCP handshake plus two for a full TLS 1.2 handshake; TLS 1.3 trims one of those).
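Putting that together with the 160 ms figure above (standard protocol round-trip counts, ignoring session resumption and 0-RTT):

    # Time to the first HTTP response over a fresh connection at 160 ms RTT.
    RTT = 0.160            # seconds, roughly Japan <-> US East
    TCP_HANDSHAKE = 1      # round trips
    TLS12_HANDSHAKE = 2    # full TLS 1.2 handshake
    TLS13_HANDSHAKE = 1    # full TLS 1.3 handshake
    REQUEST = 1            # the HTTP request/response itself

    for name, tls in [("TLS 1.2", TLS12_HANDSHAKE), ("TLS 1.3", TLS13_HANDSHAKE)]:
        total = (TCP_HANDSHAKE + tls + REQUEST) * RTT
        print(f"{name}: {total * 1000:.0f} ms to first response")  # 640 ms / 480 ms

Session resumption, 0-RTT and terminating TLS at a nearby CDN edge all cut this down, which is a big part of why a single origin plus a CDN holds up as well as it does.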
The article explicitly mentions CDNs as something you can outsource, and also notes that the market there is competitive and the prices are low.
I once fired up an Azure instance with 4TB of RAM and hundreds of cores for a performance benchmark.
htop felt incredibly roomy, and I couldn’t help thinking how my three previous projects would fit on it with room to spare (albeit lacking redundancy, of course).
The problem with "one big server" is, you really need good IT/ops/sysadmin people who can think in non-cloud terms. (If you catch them installing docker on it, throw them into a lava pit immediately).
What's the problem with installing Docker so you can run containers of different distros, languages & flavors on the same one big server, though?
Yeah I don't get that - if anything Docker would probably make the use case for "one big server" even easier to justify?
One server is for a hobby, not a business. Maybe that's fine, but keep that in mind. Backups at that level are something that keeps you from losing all your data, not something that keeps you running and gets you back up in any acceptable timeframe for most businesses.
That doesn't mean you need to use the cloud; it just means one big piece of hardware, with all its single points of failure, is often not enough. Two servers get you so much more than one. You can make one a hot spare, or actually split services between them and have each be ready to take over specific services for the other, greatly increasing your burst handling capability and giving you time to put more resources in place to keep n+1 redundancy going if you're using more than half of a server's resources.
Let's Encrypt's database server [1] would beg to differ. For businesses at a certain scale, two servers really are overkill.
[1] https://letsencrypt.org/2021/01/21/next-gen-database-servers...
Do they actually say they don't have a slave to that database ready to take over? I seriously doubt Let's Encrypt has no spare.
Note I didn't say you shouldn't run one service (as in daemon) or set of services from one box, just that one box is not enough and you need that spare.
If Let's Encrypt actually has no spare for their database server and they're one hardware failure away from being down for what may be a large chunk of time (I highly doubt it), then I wouldn't want to use them even for free. Thankfully, I doubt your interpretation of what that article is saying.
That says they use a single database, as in a logical MySQL database. I don't see any claim that they use a single server. In fact, the title of the article you've linked suggests they use multiple.
This is exactly the OP's recommended solution:
> One Server (Plus a Backup) is Usually Plenty
Then I guess my first sentence is about equally as click-baity as the article title. ;)
How about using a combination of CouchDB + Elixir for both horizontal and vertical scaling of the backend?
What would be the pros and cons of this combo as a backend stack?
This is why I like Cloudflare's worker model. It feels like the usefulness of cloud deployments, but with a pretty restrained pricing model.
Design systems such that eventual completion/consistency is a core tenet.
When it gets too slow, improve only the parts that are currently the slowest.
> But if I use Cloud Architecture, I Don’t Have to Hire Sysadmins
> Yes you do. They are just now called “Cloud Ops” and are under a different manager. Also, their ability to read the arcane documentation that comes from cloud companies and keep up with the corresponding torrents of updates and deprecations makes them 5x more expensive than system administrators.
I don't believe "Cloud Ops" is more complex than system administration, having studied for the CCNA so being on the Valley of Despair slope of the Dunning Kruger effect. If keeping up with cloud companies updates is that much of a challenge to warrant a 5x price over a SysAdmin then that's telling you something about their DX...
If you have just two servers, how are you going to load-balance and fail over them? Generally you need at least 3 nodes for any sort of quorum?
One major selling point against One Big Server: VCs and enterprise customers prefer (sometimes demand) Cloud.
For better or for worse. (Worse, IMO)
> Generally, the burstier your workload is, the more cloudy your architecture should be.
Well, crap dude, that's the web!
I have a feeling building your own "private cloud" is gonna be the next big thing :-D
Dedicated servers are hugely undervalued / underappreciated.
I wouldn't recommend one, but at least two, for redundancy.
This is exactly what the article suggests.
Nice until your server gets hugged by HN.
oh nice, we are about to rediscover the mainframe.
I agree in spirit with much of the stuff said here.
Someone call Brahm
Use two…
I agree
is this clickbait?
Although I do like the alternate version: use servers, but don’t be too serverly.
/tg/station, the largest open source multiplayer video game on GitHub, gets cloudheads trying to help us "modernize" the game server for the cloud all the time.
Here's how that breaks down:
The servers (sorry, I mean compute) cost the same to host one game server (before bandwidth, more on that at the bottom) as we pay, amortized per game server, to host 5 game servers on a rented dedicated server ($175/month for the rented server with 64GB of RAM and a 10Gbit uplink).
They run twice as slow, because high-core-count, low-clock-speed servers aren't all they are cracked up to be and our game engine is single threaded. Even if it weren't, there is an overhead to multithreading that, combined with the slow clock speeds of most high-core-count servers, rarely works out to an actual increase in real-world performance.
You can get the high-clock-speed instances, but they are two to three times as expensive, and they still run 20% slower than Windows VMs on rented bare metal, because the sad fact is that enterprise CPUs from either Intel or AMD have lower clock speeds and single-threaded performance than their gaming CPU counterparts. Getting gaming CPUs in rented servers is piss easy, but next to impossible in the cloud.
Each game server uses 2TB of bandwidth to host 70-player high pops. This works with 5 servers on 1 machine because our hosting provider includes 15TB of bandwidth in the price of the server.
Well, now the cloud bill just got a new zero. 10 to 30x more expensive, once you remember to price in bandwidth, isn't looking too great.
"But it would make it cheaper for small downstreams to start out" works right up until another YouTuber mentions our tiny game, every game server is hitting the 120-player hard pop cap, and a bunch of downstreams get a surprise 4-digit bill for what would normally run 2 digits.
The takeaway from this is that even adding Docker or k8s deployment support to the game server is seen as creating the risk that some kid bankrupts themselves trying to host a server of their favorite game off their McDonald's paycheck, so we tell such tech "pros" to sod off with their trendy money wasters.
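For anyone who wants to sanity-check the bandwidth math above, here's a rough sketch using the numbers quoted in this thread plus an assumed ~$0.09/GB cloud egress rate (a typical published tier, not a quote from any specific provider):

    # Back-of-envelope using the figures from the comment above.
    # The $0.09/GB egress rate is an assumption; real rates vary by provider and volume.
    GAME_SERVERS = 5
    TB_PER_SERVER = 2                 # egress per game server per month
    DEDICATED_MONTHLY = 175           # rented box, bandwidth included

    egress_tb = GAME_SERVERS * TB_PER_SERVER     # 10 TB/month
    cloud_egress_usd = egress_tb * 1000 * 0.09   # ~$900/month for bandwidth alone
    print(f"cloud egress alone: ${cloud_egress_usd:,.0f}/month "
          f"vs ${DEDICATED_MONTHLY}/month for the whole dedicated box")

And that's before paying for the compute, which per the comment above already costs about as much per game server as the whole rented box does amortized.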
> $175/month for the rented server with 64GB of RAM and a 10Gbit uplink
Wow, what provider is that?
Hetzner's PX line offers 64GB ECC RAM, a Xeon CPU, and dual 1TB NVMe drives for < $100/month. A dedicated 10Gbit link (plus 10Gbit NIC) is then an extra ~$40/month on top (includes 20TB/month of traffic, with overage billed at $1/TB).
Well, of course; you can't scale SS13 servers. Cloud is for stuff that scales in parallel, like backends.
All your eggs in one basket? A single host, really? Curmudgeonly opinions about microservices, cloud, and containers? Nostalgia for the time before 2010? All here. All you are missing is a rant about how the web was better before JavaScript.
It’s sad to see this kind of engineering malpractice voted to the top of HN. It’s even sadder to see how many people agree with it.