← Back to context

Comment by n_u

13 hours ago

Cool! I'd love to know a bit more about the replication setup. I'm guessing they are doing async replication.

> We added nearly 50 read replicas, while keeping replication lag near zero

I wonder what those replication lag numbers are exactly and how they deal with stragglers. It seems likely that at any given moment at least one of the 50 read replicas may be lagging cuz CPU/mem usage spike. Then presumably that would slow down the primary since it has to wait for the TCP acks before sending more of the WAL.

> would slow down the primary since it has to wait for the TCP acks

Other than keeping around more WAL segments not sure why it would slow down the primary?

  • If you use streaming replication (ie. WAL shipping over the replication connection), a single replica getting really far behind can eventually cause the primary to block writes. Some time back I commented on the behaviour: https://news.ycombinator.com/item?id=45758543

    You could use asynchronous WAL shipping, where the WAL files are uploaded to an object store (S3 / Azure Blob) and the streaming connections are only used to signal the position of WAL head to the replicas. The replicas will then fetch the WAL files from the object store and replay them independently. This is what wall-g does, for a real life example.

    The tradeoffs when using that mechanism are pretty funky, though. For one, the strategy imposes a hard lower bound to replication delay because even the happy path is now "primary writes WAL file; primary updates WAL head position; primary uploads WAL file to object store; replica downloads WAL file from object store; replica replays WAL file". In case of unhappy write bursts the delay can go up significantly. You are also subject to any object store and/or API rate limits. The setup makes replication delays slightly more complex to monitor for, but for a competent engineering team that shouldn't be an issue.

    But it is rather hilarious (in retrospect only) when an object store performance degdaration takes all your replicas effectively offline and the readers fail over to getting their up-to-date data from the single primary.

    • Yeah, you'll definitely want to set things like `max_standby_streaming_delay` and friends to ensure things are bound correctly.

    • There is no backpressure from replication and streaming replication is asynchronous by default. Replicas can ask the primary to hold back garbage collection (off by default), which will eventually cause a slow down, but not blocking. Lagging replicas can also ask the primary to hold onto WAL needed to catch up (again, off by default), which will eventually cause disk to fill up, which I guess is blocking if you squint hard enough. Both will take considerable amount of time and are easily averted by monitoring and kicking out unhealthy replicas.

    • > If you use streaming replication (ie. WAL shipping over the replication connection), a single replica getting really far behind can eventually cause the primary to block writes. Some time back I commented on the behaviour: https://news.ycombinator.com/item?id=45758543

      I'd like to know more, since I don't understand how this could happen. When you say "block", what do you mean exactly?