Comment by bostik
13 hours ago
If you use streaming replication (ie. WAL shipping over the replication connection), a single replica getting really far behind can eventually cause the primary to block writes. Some time back I commented on the behaviour: https://news.ycombinator.com/item?id=45758543
You could use asynchronous WAL shipping, where the WAL files are uploaded to an object store (S3 / Azure Blob) and the streaming connections are only used to signal the position of WAL head to the replicas. The replicas will then fetch the WAL files from the object store and replay them independently. This is what wall-g does, for a real life example.
The tradeoffs when using that mechanism are pretty funky, though. For one, the strategy imposes a hard lower bound to replication delay because even the happy path is now "primary writes WAL file; primary updates WAL head position; primary uploads WAL file to object store; replica downloads WAL file from object store; replica replays WAL file". In case of unhappy write bursts the delay can go up significantly. You are also subject to any object store and/or API rate limits. The setup makes replication delays slightly more complex to monitor for, but for a competent engineering team that shouldn't be an issue.
But it is rather hilarious (in retrospect only) when an object store performance degdaration takes all your replicas effectively offline and the readers fail over to getting their up-to-date data from the single primary.
There is no backpressure from replication and streaming replication is asynchronous by default. Replicas can ask the primary to hold back garbage collection (off by default), which will eventually cause a slow down, but not blocking. Lagging replicas can also ask the primary to hold onto WAL needed to catch up (again, off by default), which will eventually cause disk to fill up, which I guess is blocking if you squint hard enough. Both will take considerable amount of time and are easily averted by monitoring and kicking out unhealthy replicas.
Yeah, you'll definitely want to set things like `max_standby_streaming_delay` and friends to ensure things are bound correctly.
> If you use streaming replication (ie. WAL shipping over the replication connection), a single replica getting really far behind can eventually cause the primary to block writes. Some time back I commented on the behaviour: https://news.ycombinator.com/item?id=45758543
I'd like to know more, since I don't understand how this could happen. When you say "block", what do you mean exactly?
I have to run part of this by guesswork, because it's based on what I could observe at the time. Never had the courage to dive in to the actual postgres source code, but my educated guess is that it's a side effect of the MVCC model.
Combination of: streaming replication; long-running reads on a replica; lots[þ] of writes to the primary. While the read in the replica is going it will generate a temporary table under the hood (because the read "holds the table open by point in time"). Something in this scenario leaked the state from replica to primary, because after several hours the primary would error out, and the logs showed that it failed to write because the old table was held in place in the replica and the two tables had deviated too far apart in time / versions.
It has seared to my memory because the thing just did not make any sense, and even figuring out WHY the writes had stopped at the primary took quite a bit of digging. I do remember that when the read at the replica was forcefully terminated, the primary was eventually released.
þ: The ballpark would have been tens of millions of rows.