Comment by wmanley
3 years ago
Disclosure, I'm the co-author of the blog post.
> My preferred approach to database migrations is the one implemented by Django: migrations are a sequence of transformations stored in files on disk, and the database includes a table that says which of those migrations have been applied. This keeps everything in version control and means there's no chance of the wrong migration being applied in the wrong way.
Do the transformations have to be applied in sequence? If so how does this handle resolving git conflicts? With our system mostly conflicts in our database.sql script are resolved automatically with git merging. And CI works well because we downgrade the database to the version of database.sql found in $(git merge-base main HEAD) before running integration tests.
I don't know about Django but in Rails† the migrations are numbered with a timestamp (in the filename itself), instead of a sequential number, precisely to avoid (git) merge conflicts. Then they're run in alphabetical order. There's a dedicated table to record which migrations have been run already.
† 10 years ago, anyway.
Using a timestamp in a migration system doesn't really solve the problem at edge cases though: the time-based ordering of migration file creation may differ from the time-based ordering of migration application, especially accounting for different git branches, and different DB environments (prod, staging, dev DBs for each developer).
The canonical problem is two engineers with branches that each add a new column to the end of the same table. It's easy to end up with schema drift between environments, or cases where executing the migrations in alphabetical order yields a different result than executing them in git history order. You can add various guard rails and band-aids to catch this, but it's ugly.
In contrast, declarative schema management systems inherently avoid this situation. For example, if two engineers try to add a new column to the end of the same table, one of them will get a git merge conflict -- exactly like if two branches in an application code repo tried to edit the same line of a function.
> Using a timestamp in a migration system doesn't really solve the problem at edge cases though: the time-based ordering of migration file creation may differ from the time-based ordering of migration application, especially accounting for different git branches, and different DB environments (prod, staging, dev DBs for each developer).
In many cases this might be "good enough", depending on change frequency.
For schema drift, I've given up on persisted databases outside of production. we have a docker image that gets made on each commit to our migrations. Staging environments have stopped making sense to me. It may or may not be representative of production. At that point, what does it do to serve me?
I'm sure there's some team scale where this doesn't hold up, especially around things like load testing. If that were a concern of mine, I'd still want to treat the environments as ephemeral, and would try my best to automatically setup/teardown the environment.
1 reply →
With Django migrations each migration needs to specify the migrations that have to be executed first - its dependencies.
Usually this is a single ID, but multiple IDs are supported.
Here's an example: https://github.com/CAVaccineInventory/vial/blob/0359ec1b44d0...
This means that you can actually have two migrations with the same number and they won't break, provided they have correctly specified their dependencies.
Django has a mechanism whereby you can resolve conflicts caused by two people working in branches: https://docs.djangoproject.com/en/4.0/topics/migrations/#ver...
In practice though I've found it best to avoid these - as a developer I'll reconstruct my migration if I find someone else has landed one before I did, to keep things simple and linear.