I can share some. Had a similar experience as the parent comment. I do support "one big database" but it requires a dedicated db admin team to solve the tragedy of the commons problem.
Say you have one big database. You have 300 engineers and 30-50 product managers shipping new features every day, accountable to the C-suite. They are all writing queries to retrieve the data they want. One more join, one more N+1 query. Tons of indexes to support all the different queries, to the point where your indexes exceed the size of your tables in many cases. Database maintenance is always someone else's problem, because hey, it's one big shared database. You keep scaling up the instance size because "hardware is cheap". Eventually you hit the m6g.16xlarge. You add read replicas. Congratulations, now you have an eventually consistent system. You have to start figuring out which queries can hit the replica and which ones always need the fresh data. You start getting long replication lag, but it varies and you don't know why. If you decide to try to optimize a single table, you find dozens or 100+ queries that access it. You didn't write them. The engineers who did don't work here anymore...
I could go on, and all these problems are certainly solvable and could have been avoided with a little foresight, but you don't always have good engineers at a startup doing the "right thing" before you show up.
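For what it's worth, the index-bloat part at least is cheap to check; if it happened to be Postgres, something like this (table name made up) shows it:

    -- Heap size vs. total size of all indexes on one (hypothetical) table.
    SELECT
        pg_size_pretty(pg_table_size('orders'))   AS table_size,
        pg_size_pretty(pg_indexes_size('orders')) AS index_size;

    -- Indexes that have never been scanned are the first ones to question.
    SELECT relname, indexrelname,
           pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
    FROM pg_stat_user_indexes
    WHERE idx_scan = 0
    ORDER BY pg_relation_size(indexrelid) DESC;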
I think this hits the nail right on the head, and it's the same criticism I have of the article itself: the framing is that you split up a database or use small VMs or containers for performance reasons, but that's not the primary reason these things are useful; they are useful for people scaling first and foremost, and for technical scaling only secondarily.
The tragedy of the commons with one big shared database is real and paralyzing. Teams not having the flexibility to evolve their own schemas because they have no idea who depends on them in the giant shared schema is paralyzing. Defining service boundaries and APIs with clarity around backwards compatibility is a good solution. Sometimes this is taken too far, into services that are too small, but the service boundaries and explicit APIs are nonetheless good, mostly for people scaling.
> Defining service boundaries and APIs with clarity around backwards compatibility is a good solution.
Can't you do that with one big database? Every application gets an account that only gives it access to what it needs. Treat database tables as APIs: if you want access to someone else's, you have to negotiate to get it, so it's known who uses what. You don't have to have one account with access to everything that everyone shares. You could...
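For instance, a sketch of the grants-as-API idea (Postgres syntax; the role and table names are made up):

    -- One role per application; it can only touch the tables it owns.
    CREATE ROLE billing_app LOGIN PASSWORD 'change-me';
    GRANT SELECT, INSERT, UPDATE ON invoices, payments TO billing_app;

    -- A cross-team dependency has to be negotiated, and then it is visible:
    GRANT SELECT ON users TO billing_app;

    -- "Who depends on users?" becomes a catalog query instead of archaeology.
    SELECT grantee, privilege_type
    FROM information_schema.role_table_grants
    WHERE table_name = 'users';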
These are real problems, but there can also be mitigations, particularly when it comes to people scaling. In many orgs, engineering teams are divided by feature mandate, and management calls it good enough. In the beginning, the teams are empowered by their focused mandates and feel productive - it feels good to focus on your own work and largely ignore other teams. Before long, the Tragedy of the Commons effect develops.
I've had better success when feature-focused teams have tech-domain-focused "guilds" overlaid. Guilds aren't teams per se, but they provide a level of coordination and, more importantly, permanency to communication among technical stakeholders. Teams don't make important decisions within their own bubble, and everything notable is written down. It's important for management to be bought in and to value participation in these non-team activities when it comes to career advancement (not just pushing features).
In the end, you pick your poison, but I have certainly felt more empowered and productive in an org where there was effective collaboration on a smaller set of shared applications than the typical application soup that develops with full team ownership.
In uni we learnt about federated databases, i.e. multiple autonomous, distributed, possibly heterogeneous databases joined together by some middleware to service user queries. I wonder how that would work in this situation, in place of one single large database.
Federated databases are hardly ever mentioned in these kinds of discussions involving 'web scale'. Maybe because of latency? I don't know.
> Teams not having the flexibility to evolve their own schemas because they have no idea who depends on them
This sounds like a problem of testing and organization to me, not a problem with single big databases.
Introducing an enormous amount of overhead because you can't train your software engineers to use acceptable amounts of resources - instead of just accidentally crashing a node and not caring - is a little ridiculous.
For whatever reason I've been thrown into a lot of companies at that exact moment when "hardware is cheap" and "not my problem" approaches couldn't cut it anymore...
So yes, it's super painful; it requires a lot of change in processes and mindsets, and it's hard to get everyone to understand that things will get slower from there.
On the other end, micro-services and/or multi-DB is also super hard to get right. One of the surprises I had was all the "caches" that each service started silently adding on its little island once it realized the performance penalty of fetching data from half a dozen services for the more complicated operations. Or, just as DB abuse from one group could slow down everyone, service abuse on the core parts (e.g. the "user" service) would impact most of the other services. More than a step forward, it felt a lot like a step sideways: continuing to do the same stuff, just in a different way.
My take from it was that teams that are good at split architectures are also usually good at monoliths, and vice-versa. I feel for the parent, who got stuck in the transition.
> teams that are good at split architectures are also usually good at monoliths, and vice-versa.
aka, low-competency engineers will not outperform just because of better processes or project management.
The way forward, imho, is to up-skill the team (which, unfortunately, is only really possible if it's small).
Sure, you'll get to m6g.16xlarge; but how many companies actually have OLTP requirements that exceed the limits of single servers on AWS, e.g. u-12tb1.112xlarge or u-24tb1.metal (that's 12-24 TB of memory)? I think these days the issues with high availability, cost/autoscaling/commitment, "tragedy of the commons", bureaucracy, and inter-team boundaries are much more likely to be the drawback than lack of raw power.
You do not need that many database developers, it's a myth. Facebook has 2 dedicated database engineers managing it. I work at the United Nations; there is only 1 dedicated database developer on a 1000+ person team.
If you have a well-designed database system, you do not need that many database engineers.
I do not disagree at all that what you are describing can happen. What I'm not understanding is why they're failing at multi-year attempts to fix this.
Even in your scenario you could identify schemas and tables that can be separated and moved into a different database or at maturity into a more scalable NoSQL variety.
Generally, once you get to the point that is being described, that means you have a very strong sense of the queries you are making. Once you have that, it's not strictly necessary to even use an RDBMS, or at the very least a single database server.
> Even in your scenario you could identify schemas and tables that can be separated and moved into a different database or at maturity into a more scalable NoSQL variety.
How? There's nothing tracking or reporting that (unless database management instrumentation has improved a lot recently); SQL queries aren't versioned or typechecked. Usually what happens is you move a table out and it seems fine, and then at the end of the month it turns out the billing job script was joining on that table and now your invoices aren't getting sent out.
> Generally, once you get to the point that is being described, that means you have a very strong sense of the queries you are making.
No, just the opposite; you have zillions of queries being run from all over the place and no idea what they all are, because you've taught everyone that everything's in this one big database and they can just query for whatever it is they need.
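The closest thing I know of is text-matching against statement stats, which is about as fragile as it sounds (this assumes Postgres with pg_stat_statements enabled, and a made-up table name):

    -- Statements the server has seen since the stats were last reset that
    -- mention the table; misses dynamic SQL, ad-hoc scripts, and anything else.
    SELECT calls, left(query, 80) AS query_snippet
    FROM pg_stat_statements
    WHERE query ILIKE '%billing_line_items%'
    ORDER BY calls DESC;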
I've seen this too. I'd guess 50% of the query load was jobs that got deprecated in the next quarterly baseline.
It felt like a system was needed to allocate query resources to teams - some kind of scarce, tradeable tokens maybe - to incentivise more care and consciousness of the resource from the many users.
What we did was have a few levels of priority managed by a central org. It resulted in a lot of churn and hectares of indiscriminately killed query jobs every week, many with business importance mixed in with the zombies.
Let me tell you about my new Blockchain CorpoCoin... (/s)
Do you think it would make it better to have the tables hidden behind an API of views and stored procedures? Perhaps a small team of engineers maintaining that API would be able to communicate effectively enough to avoid this "tragedy of the commons" and balance the performance (and security!) needs of various clients?
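Something like the following is what I have in mind (a rough Postgres-flavoured sketch; the schema, view, function, and role names are all made up):

    -- Clients see a narrow "api" schema instead of the raw tables.
    CREATE SCHEMA api;

    CREATE VIEW api.user_summary AS
        SELECT id, username FROM users;

    CREATE FUNCTION api.rename_user(p_id bigint, p_username text)
    RETURNS void
    LANGUAGE sql
    SECURITY DEFINER
    AS $$ UPDATE users SET username = p_username WHERE id = p_id $$;

    -- Base tables stay private; each client only gets the pieces it negotiated.
    REVOKE ALL ON users FROM PUBLIC;
    GRANT USAGE ON SCHEMA api TO reporting_app, billing_app;
    GRANT SELECT ON api.user_summary TO reporting_app;
    GRANT EXECUTE ON FUNCTION api.rename_user(bigint, text) TO billing_app;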
This is so painfully, painfully true. I’ve seen it borne out personally at three different companies so far. Premature splitting up is bad too, but I think the “just use one Postgres for everything” crowd really underestimates how bad it gets in practice at scale.
Maybe it’s all a matter of perspective? I’ve seen the ‘split things everywhere’ thing go wrong a lot more times than the ‘one big database’ thing. So I prefer the latter, but I imagine that may be different for other people.
Ultimately I think it’s mostly up to the quality of the team, not the technical choice.
I did in the original comment.
We have over 200 monolith applications, each accessing overlapping schemas of data with their own sets of stored procedures, views, and direct queries. To migrate a portion of that data out into its own database generally requires refactoring a large subset of those 200 monolith apps to no longer get all the data in one query, but rather a portion of the data with the query and the rest of the data from a new service.
Sharding the data is equally difficult because even tracing who is writing the data takes you from one side of the system to the other. We've tried to do that through an elaborate system of views, but as you can imagine, those are too slow and cover too much data for some critical applications, so they end up breaking the shard. That, in and of itself, introduces additional complexity with the evolution of the products.
Couple that with the fact that, even with these solutions, a large portion of the organization is not on board (why can't we JUST buy more hardware? Get JUST bigger databases?), and these efforts end up being sabotaged from the beginning because not everyone thinks they're a good idea. (And if you think you are different, I suggest just looking at the rest of the comments here on HN that provide 20 different solutions to the problem, some of which are "why can't you just buy more hardware?")
But, to add to all of this, we also just have organizational deficiencies that have really harmed these efforts. Including things like a bunch of random scripts, checked into who knows where, that are apparently mission critical and reading/writing across the entire database. Generally for things like "the application isn't doing the right thing, so this cron job run every Wednesday will go in and fix things up". Quite literally 1000s of those scripts have been written.
This isn't to say we've been 100% unsuccessful at splitting some of the data into its own server. But it's a long and hard slog.
> Including things like a bunch of random scripts checked into who knows where that are apparently mission critical and reading/writing across the entire database.
This hits pretty hard right now, after reading this whole discussion.
When there is a galaxy with countless star systems of data, it's good to have local owners of the data who, as domain leaders, publish it for others' usage, and to build a system that makes subscriptions and access grants frictionless.
100% agreed, and that's what I've been trying to promote within the company. It's simply hard to get the momentum up to really effect this change. Nobody likes the idea that things have to get a little slower (because you add a new layer in front of the data) before they can get faster.
fwiw, hacking hundreds of apps and literally making them worse by fragmenting their source of record doesn't sound like a good plan. it's no surprise you have saboteurs; your company probably wants to survive and your plan is to shatter its brain.
outside view: you should be trying to debottleneck your sql server if that's the plan the whole org can get behind. when they all want you to succeed you'll find a way.
> fwiw, hacking hundreds of apps and literally making them worse by fragmenting their source of record doesn't sound like a good plan. it's no surprise you have saboteurs; your company probably wants to survive and your plan is to shatter its brain.
The brain is already shattered. This wouldn't "literally make them worse"; instead, it would mean that "now, instead of everyone in the world hitting the users table directly and adding or removing data from that table, we have one service in charge of managing users".
Far too often we have queries like
    SELECT b.*, u.username
    FROM Bar b
    JOIN users u ON b.userId = u.id
And why is this query doing that? To get a human readable username that isn't needed but at one point years ago made it nicer to debug the application.
> you should be trying to debottleneck your sql server if that's the plan the whole org can get behind.
Did you read my post? We absolutely HAVE been working, for years now, at "debottlenecking our sql server". We have a fairly large team of DBAs (about 30) whose whole job is "debottlenecking our sql server". What I'm saying is that we are, and have been, at the edge (and more often than not over the edge) of tipping over. We CAN'T buy our way out of this with new hardware because we already have the best available hardware. We already have read-only replicas. We have already tried (and failed at) sharding the data.
The problem is data doesn't have stewards. As a result, we've spent years developing application code with nobody stepping in to say "Maybe you shouldn't join these two domains together? Maybe there's another way to do this?"
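Concretely, at the database level the end state would look something like this (a sketch; the role names are made up):

    -- Only the user service's role keeps direct access to the users table;
    -- every other application role goes through that service from now on.
    REVOKE ALL ON users FROM billing_app, reporting_app, legacy_batch_jobs;
    GRANT SELECT, INSERT, UPDATE, DELETE ON users TO user_service;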
Just trust me, I can refute anything with vague explanations.