Comment by BenoitEssiambre
3 years ago
I'm glad this is becoming conventional wisdom. I used to argue this in these pages a few years ago and would get downvoted below the posts telling people to split everything into microservices separated by queues (although I suppose I'm losing my competitive advantage now that everyone else is building lean and mean infrastructure too).
In my mind, the reasons include keeping transactional integrity, ACID compliance, better error propagation, and avoiding the hundreds of impossible-to-solve roadblocks of distributed systems (https://groups.csail.mit.edu/tds/papers/Lynch/MIT-LCS-TM-394...).
But it is also about pushing the limits of what is physically possible in computing. As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw), covering distance over network wires imposes hard latency constraints, not to mention congestion on those wires.
Physical efficiency is about keeping data close to where it's processed. Monoliths can make much better use of the L1, L2, L3, and RAM caches than distributed systems, for speedups often on the order of 100X to 1000X.
Sure, it's easier to throw more hardware at the problem with distributed systems, but the downsides are significant, so be sure you really need it.
Now there is a corollary to using monoliths. Since you only have one DB, that DB should be treated as somewhat sacred, and you want to avoid wasting resources inside it. This means being a bit more careful about how you store things: using the smallest data types, normalizing when you can, etc. This is not to save disk; disk is cheap. It is to make efficient use of L1, L2, L3, and RAM.
I've seen boolean true or false values saved as large JSON documents: {"usersetting1": true, "usersetting2": false, "setting1name": "name", etc.}, with 10 bits of data ending up as a 1k JSON document. Avoid this! Storing documents means the keys, effectively the full table schema, are repeated in every row. Documents have their uses, but if you can predefine your schema and use the smallest types needed, you gain a lot of performance, mostly through much higher cache efficiency.
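A rough sketch of the difference, with made-up table and column names (PostgreSQL syntax):

    -- Settings as a JSON document: the keys travel with every row,
    -- so ~10 bits of information can balloon into ~1 KB per row.
    CREATE TABLE user_settings_doc (
        user_id  bigint PRIMARY KEY,
        settings jsonb  -- {"usersetting1": true, "usersetting2": false, ...}
    );

    -- Same information with a predefined schema and the smallest types:
    -- a few booleans pack into a handful of bytes per row, so far more
    -- rows fit in the L1/L2/L3 and RAM caches.
    CREATE TABLE user_settings (
        user_id      bigint PRIMARY KEY,
        usersetting1 boolean NOT NULL DEFAULT false,
        usersetting2 boolean NOT NULL DEFAULT false
    );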
> I'm glad this is becoming conventional wisdom
It's not though. You're just seeing the most popular opinion on HN.
In reality it is nuanced like most real-world tech decisions are. Some use cases necessitate a distributed or sharded database, some work better with a single server and some are simply going to outsource the problem to some vendor.
> outsource the problem to some vendor
At least that way you can be certain of failure.
Exactly. The HN crowd is obsessed with minimalism and reducing "bloat".
It has become a cult, where availability and scale requirements are apparently fiction. "You are not FAANG, you don't have these requirements."
> I'm glad this is becoming conventional wisdom
My hunch is that computers caught up. Back in the early 2000s, horizontal scaling was the only way. You simply couldn't handle even reasonably mediocre loads on a single machine.
As computing becomes cheaper, horizontal scaling is starting to look more and more like unnecessary complexity for even surprisingly large/popular apps.
I mean you can buy a consumer off-the-shelf machine with 1.5TB of memory these days. 20 years ago, when microservices started gaining popularity, 1.5TB RAM in a single machine was basically unimaginable.
Honestly, from my perspective it feels like microservices surged in popularity precisely when they were becoming less necessary. In particular, the mass adoption of SSD storage massively changed the nature of the game, but awareness of that among regular developers wasn't as pervasive as it should have been.
'Over the wire' is less obvious than it used to be.
If you're in a k8s pod, those calls are really kernel calls. Sure, you're serializing and process-switching where you could be just making a method call, but we had to do something.
I'm seeing less 'balls of mud' with microservices. That's not zero balls of mud. But it's not a given for almost every code base I wander into.
To clarify, I think stateless microservices are good. It's when you have too many DBs (and sometimes too many queues) that you run into problems.
A single instance of PostgreSQL is, in most situations, almost miraculously effective at coordinating concurrent and parallel state mutations. To me that's one of the most important characteristics of an RDBMS. Storing data is a simpler, secondary problem. Managing concurrency is the hard problem I need the most help with from my DB, and having a monolithic DB enables the coordination of everything else, including stateless peripheral services, without resulting in race conditions, conflicts or data corruption.
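As a concrete example of the kind of coordination I mean (hypothetical job-queue table, plain PostgreSQL):

    -- Any number of stateless app servers can run this concurrently;
    -- row locking hands each pending job to exactly one of them, with
    -- no separate coordination service and no race conditions.
    UPDATE jobs
       SET status = 'running'
     WHERE id = (SELECT id
                   FROM jobs
                  WHERE status = 'pending'
                  ORDER BY id
                  LIMIT 1
                    FOR UPDATE SKIP LOCKED)
    RETURNING id;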
SQL is the most popular mostly-functional language. That might be because managing persistent state and keeping data organized and low-entropy is where you get the most benefit from a functional approach that doesn't add more state. This adds to the effectiveness of using a single transactional DB.
I must admit that even distributed DBs like Cockroach and Yugabyte have recognized this and use the PostgreSQL syntax and protocol. This is good, though: it means that if you really need to scale beyond PostgreSQL, you have PostgreSQL-compatible options.
> I'm seeing less 'balls of mud' with microservices.
The parallel to "balls of mud" with microservices is tiny services that seem almost devoid of any business logic and all the actual business logic is encapsulated in the calls between different services, lambda functions, and so on.
That's quite nightmarish from a maintenance perspective too, because now it's almost impossible to look at the system from the outside and understand what it's doing. It also means that conventional tooling can't help you anymore: you don't get compiler errors if your lambda function calls an endpoint that doesn't exist anymore.
Big balls of mud are horrible (I'm currently working with a big ball of mud monolith, I know what I'm talking about), but you can create a different kind of mess with microservices too. Then there are all the other problems, such as operational complexity, or "I now need to update log4j across 30 services".
In the end, a well-engineered system needs discipline and architectural skills, as well as a healthy engineering culture where tech debt can be paid off, regardless of whether it's a monolith, a microservice architecture or something in between.
> I'm seeing less 'balls of mud' with microservices. That's not zero balls of mud.
They are probably younger. Give them time :P
>"I'm glad this is becoming conventional wisdom. "
Yup, this is what I've always done and it works wonders. Since I do not have bosses, just clients, I do not give a flying fuck about the latest fashion and do what actually makes sense for me and said clients.
I've never understood this logic for webapps. If you're building a web application, congratulations, you're building a distributed system, you don't get a choice. You can't actually use transactional integrity or ACID compliance because you've got to send everything to and from your users via HTTP request/response. So you end up paying all the performance, scalability, flexibility, and especially reliability costs of an RDBMS, being careful about how much data you're storing, and getting zilch for it, because you end up building a system that's still last-write-wins and still loses user data whenever two users do anything at the same time (or you build your own transactional logic to solve that - exactly the same way as you would if you were using a distributed datastore).
Distributed systems can also make efficient use of cache; in fact they can do more of it, because more nodes means more cache. If you get your dataflow right then you'll have performance that's as good as a monolith on a tiny dataset, but keep that performance as you scale up. Not only that, but you can perform a lot better than an ACID system ever could, because you can do things like asynchronously updating secondary indices after the data is committed. But most importantly, you have easy failover from day 1, you have easy scaling from day 1, and you can just not worry about that and focus on your actual business problem.
Relational databases are largely a solution in search of a problem, at least for web systems. (They make sense as a reporting datastore to support ad-hoc exploratory queries, but there's never a good reason to use them for your live/"OLTP" data).
I really don't understand how any of what you wrote follows from the fact that you're building a web app. Why do you lose user data when two users do anything at the same time? That has never happened to me with any RDBMS.
And why would HTTP requests prevent me from using transactional logic? If a user issues a command such as "copy this data (a forum thread, or a Confluence page, or whatever) to a different place" and that copy operation might actually involve a number of different tables, I can use a transaction and make sure that the action either succeeds fully or is rolled back in case of an error; no extra logic required.
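For example, something along these lines (made-up tables and ids) runs behind one request and either fully succeeds or leaves nothing behind:

    BEGIN;
    -- Copy thread 7 into forum 42, posts and all. If anything in here
    -- fails, ROLLBACK leaves the database exactly as it was before.
    WITH new_thread AS (
        INSERT INTO threads (forum_id, title)
        SELECT 42, title FROM threads WHERE id = 7
        RETURNING id
    )
    INSERT INTO posts (thread_id, author, body)
    SELECT new_thread.id, p.author, p.body
      FROM posts p, new_thread
     WHERE p.thread_id = 7;
    COMMIT;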
I couldn't disagree more with your conclusion even if I wanted to. Relational databases are great. We should use more of them.
> I really don't understand how any of what you wrote follows from the fact that you're building a web app. Why do you lose user data when two users do anything at the same time? That has never happened to me with any RDBMS.
> And why would HTTP requests prevent me from using transactional logic? If a user issues a command such as "copy this data (a forum thread, or a Confluence page, or whatever) to a different place" and that copy operation might actually involve a number of different tables, I can use a transaction and make sure that the action either succeeds fully or is rolled back in case of an error; no extra logic required.
Sure, if you can represent what the user wants to do as a "command" like that, one that doesn't rely on a particular state of the world, then you're fine. Note that this is also exactly the case that an eventually consistent event-sourcing style system will handle fine.
The case where transactions would actually be useful is the case where a user wants to read something and modify something based on what they read. But you can't possibly do that over the web, because they read the data in one request and write it in another request that may never come. If two people try to edit the same wiki page at the same time, either one of them loses their data, or you implement some kind of "userspace" reconciliation logic - but database transactions can't help you with that. If one user tries to make a new post in a forum thread at the same time as another user deletes that thread, probably they get an error that throws away all their data, because storing it would break referential integrity.
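The usual "userspace" fix is optimistic versioning, something like the following (made-up schema); note that this is application logic you write yourself, the same way with or without an RDBMS:

    -- The editor loaded version 8 of the page in an earlier request.
    -- The save only applies if nobody else saved in between; if zero
    -- rows are updated, the application has to merge or re-prompt.
    UPDATE wiki_pages
       SET body = 'edited text', version = version + 1
     WHERE id = 123 AND version = 8;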
HTTP requests work great with relational DBs. This is not UDP. If the TCP connection is broken, an operation will either have finished or stopped and been rolled back atomically, and unless you've placed unneeded queues in there, you should know of success immediately.
When you get the HTTP response, you know the data is fully committed; data that uses it can be refreshed immediately and is accessible to all other systems immediately, so you can perform the next steps relying on those hard guarantees. Behind the HTTP request, a transaction can be opened to do a bunch of stuff, including API calls to other systems if needed, and commit the results atomically. There are tons of benefits to using it with HTTP.
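A sketch of what might run behind a single POST handler (names invented for illustration):

    BEGIN;
    -- Everything the request needs to change, in one unit.
    INSERT INTO orders (user_id, total) VALUES (123, 49.90);
    UPDATE inventory SET stock = stock - 1 WHERE sku = 'ABC-1';
    COMMIT;
    -- The handler only returns 200 after COMMIT succeeds, so any
    -- request arriving afterwards already sees the new order and the
    -- decremented stock. If the connection drops mid-flight, the open
    -- transaction is rolled back: never a half-applied state.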
But you can't do interaction between the two ends of an HTTP request. The caller makes an inert request; whatever processing happens downstream of that might as well be offline, because it's not, and can never be, interactive within a single transaction.
>As Admiral Grace Hopper would point out (https://www.youtube.com/watch?v=9eyFDBPk4Yw), covering distance over network wires imposes hard latency constraints, not to mention congestion on those wires.
Even accounting for CDNs, a distributed system is inherently more capable of bringing data closer to geographically distributed end users, thus lowering latency.