Comment by belak
3 years ago
This is absolutely true - when I was at Bitbucket (ages ago at this point) and we were having issues with our DB server (mostly due to scaling), almost everyone we talked to said "buy a bigger box until you can't any more" because of how complex (and indirectly expensive) the alternatives are - sharding and microservices both have a ton more failure points than a single large box.
I'm sure they eventually moved off that single primary box, but for many years Bitbucket was run off 1 primary in each datacenter (with a failover), and a few read-only copies. If you're getting to the point where one database isn't enough, you're either doing something pretty weird, are working on a specific problem which needs a more complicated setup, or have grown to the point where investing in a microservice architecture starts to make sense.
One issue I've seen with this is that if you have a single, very large database, it can take a very, very long time to restore from backups. Or for that matter just taking backups.
I'd be interested to know if anyone has a good solution for that.
Here's the way it works for, say, Postgresql:
- you rsync or zfs send the database files from machine A to machine B. You would like the database to be off during this process, which will make it consistent. The big advantage of ZFS is that you can stop PG, snapshot the filesystem, and turn PG on again immediately, then send the snapshot. Machine B is now a cold backup replica of A. Your loss potential is limited to the time between backups.
- after the previous step is completed, you arrange for machine A to send WAL files to machine B. It's well documented. You could use rsync or scp here. It happens automatically and frequently. Machine B is now a warm replica of A -- if you need to turn it on in an emergency, you will only have lost one WAL file's worth of changes.
- after that step is completed, you give machine B credentials to login to A for live replication. Machine B is now a live, very slightly delayed read-only replica of A. Anything that A processes will be updated on B as soon as it is received.
You can go further and arrange to load balance requests between read-only replicas, while sending the write requests to the primary; you can look at Citus (now open source) to add multi-primary clustering.
This isn't really a backup, it's redundancy which is good thing but not the same as a backup solution. You can't get out of a drop table production type event this way.
11 replies →
Do you even have to stop Postgres if using ZFS snapshots? ZFS snapshots are atomic, so I’d expect that to be fine. If it wasn’t fine, that would also mean Postgres couldn’t handle power failure or other sudden failures.
1 reply →
Presumably it doesn't matter if you break your DB up into smaller DBs, you still have the same amount of data to back up no matter what. However, now you also have the problem of snapshot consistency to worry about.
If you need to backup/restore just one set of tables, you can do that with a single DB server without taking the rest offline.
> you still have the same amount of data to back up no matter what
But you can restore/back up the databases in parallel.
> If you need to backup/restore just one set of tables, you can do that with a single DB server without taking the rest offline.
I'm not aware of a good way to restore just a few tables from a full db backup. At least that doesn't require copying over all the data (because the backup is stored over the network, not on a local disk). And that may be desirable to recover from say a bug corrupting or deleting a customer's data.
Try out pg_probackup. It works on database files directly. Restore is as fast as you can write on your ssd.
I've setup a pgsql server with timescaledb recently. Continuing backup based on WAL takes seconds each hour and a complete restore takes 15 minutes for almost 300 GB of data because the 1 GBit connection to the backup server is the bottleneck.
For MySQL there is xtrabackup - https://www.percona.com/software/mysql-database/percona-xtra....
On mariadb you can tell the replica to enter into a snapshotable state[1] and take a simple lvm snapshot, tell the the database it's over, backup your snapshot somewhere else and finally delete the snapshot.
1) https://mariadb.com/kb/en/storage-snapshots-and-backup-stage...
I found this approach pretty cool in that regard: https://github.com/pgbackrest/pgbackrest
Not a solution but using event sourcing would have prevented this.
What if your product simply stores a lot of data (ie a search engine) How is that weird?
That's fair - I added "are working on a specific problem which needs a more complicated setup" to my original comment as a nicer way of referring to edge cases like search engines. I still believe that 99% of applications would function perfectly fine with a single primary DB.
Depends what you mean by a database I guess. I take it to mean an RDBMS.
RDBMSs provide guarantees that web searching doesn't need. You can afford to lose a pieces of data, provide not-quite-perfect results for web stuff. It's just wrong for an RDBMS.
What if you are using the database as a system of record to index into a real search engine like Elasticsearch? For a product where you have tons of data to search from (ie text from web pages)
3 replies →
This is not typically going to be stored in an ACID-compliant RDBMS, which is where the most common scaling problem occurs. Search engines, document stores, adtech, eventing, etc. are likely going to have a different storage mechanism where consistency isn't as important.
a search engine won't need joins, but other things (ie text indexing) that can be split in a relatively easier way.