Comment by simonw

3 years ago

This is an interesting approach to this problem.

I like the idea of building the new schema as an in-memory database. You could even go a step further and compare them using a SQL query that joins across the two databases (SQLite supports this, it's pretty neat).
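
For example, the in-memory database can ATTACH the on-disk one and then query both schemas in a single statement. A minimal sketch in Python, where "pristine.sql" (the desired schema) and "app.db" (the live database) are placeholder names:

    import sqlite3

    # Build the desired schema in memory, attach the live database, and
    # diff the two sqlite_master tables. File names are placeholders.
    with open("pristine.sql") as f:
        pristine_sql = f.read()

    db = sqlite3.connect(":memory:")
    db.executescript(pristine_sql)
    db.execute("ATTACH DATABASE 'app.db' AS actual")

    # Tables in the desired schema that the live database is missing
    missing = db.execute("""
        SELECT name FROM main.sqlite_master
        WHERE type = 'table'
          AND name NOT IN (SELECT name FROM actual.sqlite_master
                           WHERE type = 'table')
    """).fetchall()
    print("missing tables:", [name for (name,) in missing])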

I'm a bit nervous about how edge cases might screw things up, but a nice thing about SQLite is that backups are really cheap, so I guess you can protect against any risks by creating a backup before running this script.

My preferred approach to database migrations is the one implemented by Django: migrations are a sequence of transformations stored in files on disk, and the database includes a table that says which of those migrations have been applied. This keeps everything in version control and means there's no chance of the wrong migration being applied in the wrong way.

It's quite a bit more work to set up though. I don't have my own Django-style migration system for SQLite yet and I really need one.
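
The core of such a system is tiny, for what it's worth. A bare-bones sketch (not simonw's tooling; the file and table names are invented): numbered .sql files applied in sorted order, with a table recording which have run:

    import os
    import sqlite3

    def migrate(db_path, migrations_dir="migrations"):
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS _migrations (name TEXT PRIMARY KEY)")
        applied = {name for (name,) in db.execute("SELECT name FROM _migrations")}
        for fname in sorted(os.listdir(migrations_dir)):  # e.g. 0001_init.sql
            if fname.endswith(".sql") and fname not in applied:
                with open(os.path.join(migrations_dir, fname)) as f:
                    db.executescript(f.read())  # run the migration
                db.execute("INSERT INTO _migrations (name) VALUES (?)", (fname,))
                db.commit()  # record it as applied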

I really love how clean and short the implementation of this is! https://david.rothlis.net/declarative-schema-migration-for-s...

I worked with Django at a startup for <1y on a team of ~10 engineers. Django migrations are easy and better than a lot of what's out there, but I thought they could be better for the following reasons:

1. Merge migrations are annoying. Especially once you get some engineering velocity going, it's very common to hit a conflict. And almost every single time, the supposedly conflicting migrations didn't actually conflict at all.

2. Migrations made checking out someone else's code to run on your local machine difficult. You have to remember to unmigrate before checking out your own branch again, otherwise some future migration may fail.

3. Migrating/unmigrating can be hard to reason about. There have been a lot of cases where a junior engineer pulls the latest code, then realizes their DB is borked due to migrations (maybe a phantom migration that is no longer in their codebase). In some of those cases, we just tell them to start from scratch, since it's easier.

The solution outlined by OP is nice because it eliminates the overhead of those migration files and looks only at the intended state (the SQL file) and the actual state (the database). While the migration files do provide some necessary ceremony for dealing with user data in prod, I'm not sure we ever got anything out of that ceremony that justified the overhead of maintaining it.

  • Yeah, I think what you're describing, switching between branches, is an issue for all "traditional" database migrations.

    I think the best thing to do during development is to recreate and re-populate the database automatically on every branch change. I have that set up in my current project, although I have to remove the database file manually for now. My test data is hardcoded into the codebase and saved on startup through the repository access code whenever a "load test data" flag is set (sketched at the end of this comment).

    Of course, using an in-memory database would also resolve it, but then any changes I make to the data at runtime would be lost every time.
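
    Roughly what that setup looks like, as a sketch (DB_PATH, schema.sql, and the sample rows are all placeholders):

        import os
        import sqlite3

        DB_PATH = "dev.db"

        def reset_dev_db():
            # Throw away the old database entirely, then rebuild it
            if os.path.exists(DB_PATH):
                os.remove(DB_PATH)
            db = sqlite3.connect(DB_PATH)
            with open("schema.sql") as f:
                db.executescript(f.read())
            # Test data is hardcoded in the codebase; the real setup goes
            # through the repository layer, direct SQL here for brevity
            if os.environ.get("LOAD_TEST_DATA") == "1":
                db.executemany("INSERT INTO users (name) VALUES (?)",
                               [("alice",), ("bob",)])
                db.commit()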

Disclosure: I'm the co-author of the blog post.

> My preferred approach to database migrations is the one implemented by Django: migrations are a sequence of transformations stored in files on disk, and the database includes a table that says which of those migrations have been applied. This keeps everything in version control and means there's no chance of the wrong migration being applied in the wrong way.

Do the transformations have to be applied in sequence? If so, how does this handle resolving git conflicts? With our system, conflicts in our database.sql script are mostly resolved automatically by git's merging. And CI works well because we downgrade the database to the version of database.sql found at $(git merge-base main HEAD) before running integration tests.
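
For reference, that CI step is roughly the following; apply_schema() is a hypothetical stand-in for the declarative migration routine from the article:

    import sqlite3
    import subprocess

    def downgrade_to_merge_base(db_path):
        # Find database.sql as it existed where this branch diverged from main
        base = subprocess.check_output(
            ["git", "merge-base", "main", "HEAD"], text=True).strip()
        old_schema = subprocess.check_output(
            ["git", "show", f"{base}:database.sql"], text=True)
        # apply_schema() is hypothetical: it would diff the live database
        # against old_schema and apply the necessary DDL, per the article
        apply_schema(sqlite3.connect(db_path), old_schema)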

  • I don't know about Django but in Rails† the migrations are numbered with a timestamp (in the filename itself), instead of a sequential number, precisely to avoid (git) merge conflicts. Then they're run in alphabetical order. There's a dedicated table to record which migrations have been run already.

    † 10 years ago, anyway.

    • Using a timestamp in a migration system doesn't really solve the problem in edge cases, though: the time-based ordering of migration file creation may differ from the time-based ordering of migration application, especially accounting for different git branches and different DB environments (prod, staging, dev DBs for each developer).

      The canonical problem is two engineers with branches that each add a new column to the end of the same table. It's easy to end up with schema drift between environments, or cases where executing the migrations in alphabetical order yields a different result than executing them in git history order. You can add various guard rails and band-aids to catch this, but it's ugly.

      In contrast, declarative schema management systems inherently avoid this situation. For example, if two engineers try to add a new column to the end of the same table, one of them will get a git merge conflict -- exactly like if two branches in an application code repo tried to edit the same line of a function.


  • With Django migrations, each migration needs to specify the migrations that have to be executed first: its dependencies.

    Usually this is a single ID, but multiple IDs are supported.

    Here's an example: https://github.com/CAVaccineInventory/vial/blob/0359ec1b44d0...

    This means that you can actually have two migrations with the same number and they won't break, provided they have correctly specified their dependencies (there's a sketch of what that looks like at the end of this comment).

    Django has a mechanism whereby you can resolve conflicts caused by two people working in branches: https://docs.djangoproject.com/en/4.0/topics/migrations/#ver...

    In practice though I've found it best to avoid these - as a developer I'll reconstruct my migration if I find someone else has landed one before I did, to keep things simple and linear.
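
    To make the dependency mechanism above concrete, here's roughly what a migration file looks like (the app, model, and migration names are invented):

        from django.db import migrations, models

        class Migration(migrations.Migration):

            # Migrations that must run before this one; these IDs, not the
            # filename order, form Django's dependency graph
            dependencies = [
                ("myapp", "0002_add_email_field"),
            ]

            operations = [
                migrations.AddField(
                    model_name="user",
                    name="nickname",
                    field=models.CharField(max_length=100, blank=True, default=""),
                ),
            ]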

Disclosure: I'm the creator of Skeema (https://www.skeema.io), a declarative schema management system for MySQL/MariaDB mentioned at the bottom of the original article.

> migrations are a sequence of transformations stored in files on disk [...] This keeps everything in version control

While traditional migration systems do give you something to put in version control, I'd argue that what they give you is the wrong abstraction, at least for schema management / DDL.

The migration files need to be ordered in some way based on their filenames, and their contents contain imperative DDL that must be executed in that order. This is conceptually equivalent to storing a bunch of .patch files in a git repo, and then using a file naming scheme to indicate the order in which the patches are meant to be applied.

But version control systems already have a first-class notion of ordering and diffing -- after all, that's one of the main purposes of a version control system! So at best this scheme is redundant, and at worst it's a second source of truth which may conflict.

And then, separate from the repo ordering/source-of-truth problem, you also have a third source of truth which may conflict: the actual state on any given database server, which could also include further drift. Sometimes things get run out-of-order in the course of development, sometimes people run things manually as part of an emergency hot-fix, etc.

Finally, there's the additional problem that these migration files aren't truly immutable. If a migration has already been executed on at least one database environment and someone then accidentally edits, deletes, or renames its file, the result is basically undefined. Some migration systems store hashes of the migration contents, which helps catch the edited-after-execution case, but not the other cases.

> and means there's no chance of the wrong migration being applied in the wrong way.

Assuming that your commit order, filename order, and execution order are always all aligned, and your migration files are immutable, this is true. And some engineers really value having that history of ALTER TABLE statements.

But of course there's another wrinkle, at least in MySQL/MariaDB as well as SQLite: there are many situations where you're not really using ALTER TABLE, but instead are using a separate external tool to mimic the desired effects of the ALTER, due to various peculiarities and shortcomings of the database's ALTER support.

So if you're not really running an ALTER anyway, is a repo with a bunch of name-ordered ALTER statements even valuable? In contrast, a declarative system gives you a repo which expresses the intended state of the database, which may be used as an easy reference for engineers, and also contains useful `git blame` results to see why/when/who made a change.

That all said, I do concede that imperative migrations are useful for data migrations (DML), as well as renames (renaming a table or renaming a column within a table). The declarative model doesn't capture imperative changes. But nothing prevents you from using a declarative system for schema changes and pairing it with a traditional migration system for imperative changes.