
Comment by AlisdairO

11 hours ago

Regarding schema changes and timeouts - while having timeouts in place is good advice, you can go further. While running the schema rollout, run a script alongside it that kills any workload conflicting with the aggressive locks the schema change is trying to take. This will greatly reduce the pain caused by lock contention, and prevent you from needing to repeatedly rerun statements on high-throughput tables.

This would be a particularly nice-to-have feature for Postgres - the option to have heavyweight locks just proactively cancel any conflicting workload. For any case where you have a high-throughput table, the damage of the heavyweight lock sitting there waiting (and blocking all new traffic) is generally much larger than just cancelling some running transactions.

Doesn't Postgres support transactional schema changes already? Why would you want to proactively kill work that's just going to complete after the schema change is done? Load balancing, throttling etc. is a different matter that has little to do with what you're proposing.

  • It supports transactional schema changes, but that's not what I'm talking about. Most schema changes require a strong table lock (ACCESS EXCLUSIVE, in Postgres terms) on the tables they're altering. The lock might be short-lived (for example, just a catalogue update to add a column), but it nevertheless both blocks and is blocked by almost all other work on the table.

    SELECT or DML operations take a weak table lock (ACCESS SHARE or ROW EXCLUSIVE) that doesn't conflict with most other work, but it does conflict with these schema changes. While the schema change is waiting to acquire its lock, all new operations on the table (new SELECTs, for example) queue behind it until the schema change completes.

    So the following scenario can be pretty disastrous:

    * Start a long-running SELECT operation on table

    * Attempt to apply schema change to the table

    * All new work on the table is blocked until the SELECT completes and the schema change can apply.

    * Production outage
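
    The queue-up above can be reproduced in three psql sessions (table and query names are illustrative):

    ```sql
    -- Session 1: a long-running read takes ACCESS SHARE on the table
    SELECT count(*) FROM big_table;            -- runs for minutes

    -- Session 2: the schema change queues behind it, waiting for ACCESS EXCLUSIVE
    ALTER TABLE big_table ADD COLUMN note text;  -- blocks until session 1 finishes

    -- Session 3: every new query now queues behind the waiting ALTER TABLE
    SELECT * FROM big_table WHERE id = 42;     -- also blocks: effective outage
    ```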

    What the ChatGPT folks do is set a lock timeout when applying the schema change, so that it 'gives up' after a few seconds. This avoids truly excessive impact (in their case, micro-outages of up to 5s while trying to apply the change), but it has two problems: first, they then need to retry, which may cause further micro-outages; second, on a system with a mixed workload there's no guarantee they'll ever be able to force the change through - the schema change can simply get starved out.
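
    The timeout approach is just a session setting in Postgres; a minimal sketch (table name illustrative):

    ```sql
    -- Give up if the ACCESS EXCLUSIVE lock isn't acquired within 5 seconds
    SET lock_timeout = '5s';
    ALTER TABLE big_table ADD COLUMN note text;
    -- On failure: ERROR: canceling statement due to lock timeout
    -- The change must then be retried, and each attempt risks another micro-outage.
    ```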

    A better alternative for most workloads is to build a system that detects which sessions are blocking your schema change and kills them, allowing the schema change to go through quickly and unblock all the work queued behind it. You'd still set a lock timeout alongside this to be on the safe side, but it shouldn't fire in most cases.
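
    A minimal sketch of the detect-and-kill side of this, assuming you know the backend PID of the session running the DDL (`:ddl_pid` is an illustrative psql variable; `pg_blocking_pids` and `pg_terminate_backend` are standard Postgres functions):

    ```sql
    -- Run alongside the migration: terminate whatever is blocking the DDL backend.
    SELECT pg_terminate_backend(blocker)
    FROM unnest(pg_blocking_pids(:ddl_pid)) AS blocker;
    ```

    A real killer script would loop on this while the migration runs, and would typically filter out sessions you can't afford to kill (replication workers, other migrations).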

    Side note on transactional DDL - on high-throughput Postgres systems, most people just use autocommit for schema changes. Table-level locks taken to perform a schema change are held until the enclosing transaction commits, and you generally want to minimize the amount of time you hold them for.
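
    To illustrate the side note, the same statements hold their locks very differently depending on transaction wrapping (table name illustrative):

    ```sql
    -- Wrapped in a transaction: the ACCESS EXCLUSIVE lock from the ALTER
    -- is held across the slow backfill, until COMMIT.
    BEGIN;
    ALTER TABLE big_table ADD COLUMN note text;
    UPDATE big_table SET note = '';   -- long-running, lock held the whole time
    COMMIT;

    -- Autocommit: each statement releases its locks as it completes, so the
    -- exclusive lock is held only for the brief catalogue update.
    ALTER TABLE big_table ADD COLUMN note text;
    UPDATE big_table SET note = '';
    ```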