Comment by leontrolski
2 days ago
I'd be interested in how dumb-ol' polling would compare here (the FOR UPDATE SKIP LOCKED method: https://leontrolski.github.io/postgres-as-queue.html). One day I will set up some benchmarks, as this is the kind of thing people argue about a lot without much evidence either way.
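For anyone who hasn't seen the pattern, the core of it is something like this (a minimal sketch with an invented `jobs` table, not the exact code from the post):

```python
import psycopg2

# Claim one queued job. SKIP LOCKED means concurrent workers never block
# on rows another worker has already claimed in an open transaction.
CLAIM = """
UPDATE jobs SET status = 'processing'
WHERE id = (
    SELECT id FROM jobs
    WHERE status = 'queued'
    ORDER BY id
    FOR UPDATE SKIP LOCKED
    LIMIT 1
)
RETURNING id, payload;
"""

def poll_once(conn) -> bool:
    with conn, conn.cursor() as cur:  # the with-block commits the claim
        cur.execute(CLAIM)
        return cur.fetchone() is not None
```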
Wasn't aware of this AccessExclusiveLock behaviour - a reminder (and shameless plug 2) of how Postgres locks interact: https://leontrolski.github.io/pglockpy.html
My colleague did some internal benchmarking and found that LISTEN/NOTIFY performs well under low to moderate load, but doesn't scale well with a large number of listeners. Our findings were pretty consistent with this blog post.
(Shameless plug [1]) I'm working on DBOS, where we implemented durable workflows and queues on top of Postgres. For queues, we use FOR UPDATE SKIP LOCKED for task dispatch, combined with exponential backoff and jitter to reduce contention under high load when many workers are polling the same table.
Would love to hear feedback from you and others building similar systems.
[1] https://github.com/dbos-inc/dbos-transact-py
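The dispatch loop is conceptually something like this (a hand-waved sketch, not DBOS's actual code; `poll_once` stands in for a SKIP LOCKED dequeue like the one upthread):

```python
import random
import time

def run_worker(poll_once, base=0.25, cap=30.0):
    """Poll for work; on an empty poll, back off exponentially with full
    jitter so idle workers don't hammer the table in lockstep."""
    delay = base
    while True:
        if poll_once():            # found a task: reset and poll again
            delay = base
        else:
            time.sleep(random.uniform(0, delay))  # full jitter
            delay = min(cap, delay * 2)           # exponential backoff
```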
Nice! I'm using DBOS and am a little active on the Discord. I was just wondering how y'all handled this under the hood. Glad to hear I don't have to worry much about this issue.
Why not read the WAL?
We considered using WAL for change tracking in DBOS, but it requires careful setup and maintenance of replication slots, which may lead to unbounded disk growth if misconfigured. Since DBOS is designed to bolt onto users' existing Postgres instances (we don't manage their data), we chose a simpler, less intrusive approach that doesn't require a replication setup.
Plus, for queues, it's so much easier to leverage database constraints and transactions to implement global concurrency limits, rate limits, and deduplication.
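For example, deduplication falls out of a plain unique constraint (an illustrative schema, not DBOS's actual one):

```python
DDL = """
CREATE TABLE IF NOT EXISTS queue_tasks (
    id        bigserial PRIMARY KEY,
    dedup_key text UNIQUE,
    payload   jsonb NOT NULL
);
"""

# A duplicate enqueue with the same dedup_key is simply a no-op.
ENQUEUE = """
INSERT INTO queue_tasks (dedup_key, payload)
VALUES (%s, %s)
ON CONFLICT (dedup_key) DO NOTHING;
"""
```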
Polling is the way to go, but it's also very tricky to get right. In particular, it's non-trivial to make a reliable queue that's also fast when transactions are held open and vacuum isn't able to clean tuples. E.g. "get the first available tuple" might have to skip over 1000s of dead tuples.
Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds an open transaction part of the time too. It's also nice if your database doesn't melt whenever this happens by accident.
An approach that has worked for me is to hash partition the table and have each worker look for work in one partition at a time. There are a number of strategies depending on how you manage workers. This allows you to only consider 1/Nth of the dead tuples, where N is the number of partitions, when looking for work. It does come at the cost of strict ordering, but there are many use cases where strict ordering is not required. The largest scale implementation of this strategy that I have done had 128 partitions with a worker per partition pumping through ~100 million tasks per day.
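In Postgres terms the setup looks roughly like this (names invented; the partition count is fixed up front):

```python
# Hash-partition the queue so each worker scans only its own partition,
# and therefore only 1/Nth of the dead tuples (here N = 8).
SETUP = """
CREATE TABLE tasks (
    id      bigserial,
    payload jsonb NOT NULL,
    status  text NOT NULL DEFAULT 'queued'
) PARTITION BY HASH (id);

CREATE TABLE tasks_0 PARTITION OF tasks
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ...and likewise tasks_1 through tasks_7
"""

def claim_sql(partition: int) -> str:
    # Each worker polls its own partition directly.
    return f"""
        UPDATE tasks_{partition} SET status = 'running'
        WHERE id = (
            SELECT id FROM tasks_{partition}
            WHERE status = 'queued'
            FOR UPDATE SKIP LOCKED
            LIMIT 1
        )
        RETURNING id, payload;
    """
```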
I also found LISTEN/NOTIFY to not work well at this scale and used a polling based approach with a back off when no work was found.
Quite an interesting problem and a bit challenging to get right at scale.
Can't change the number of partitions dynamically.
Additional challenge if jobs come in funny sizes.
If there were a toy or other public implementation of this, I would love to see it.
This is how Kafka does it. Kafka has spent years working on the rough edges (e.g. partition resizing), though I haven't used it recently.
Dead tuples are a real and significant problem, not just because queries have to skip over them, but because the statistics that drive the planner don't account for them.
I found this out the hard way when I had a simple query that suddenly got very, very slow on a table where the application would constantly do a `SELECT ... FOR UPDATE SKIP LOCKED` and then immediately delete the rows after a tiny bit of processing.
It turned out that with a nearly empty table containing 10-20k dead tuples, the planner switched to a different index scan, and would overfetch tons of pages just to discard them, as they only contained dead tuples. What I didn't realize is that the planner statistics don't account for dead tuples, and ANALYZE doesn't count them either. So the planner kept thinking the table was much smaller than it physically was.
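If you suspect this is happening, the stats views make it easy to check (assuming a queue table called `jobs`):

```python
# Dead-vs-live tuple counts for the queue table; a dead count that dwarfs
# the live count is the symptom described above.
CHECK = """
SELECT n_live_tup, n_dead_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'jobs';
"""
```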
It's really important for these use cases to tweak the autovacuum settings (which can be set on a per-table basis) to be much more aggressive, so that under high load the vacuum runs pretty much continuously.
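Something along these lines (illustrative values, tune for your own workload):

```python
# Per-table autovacuum overrides for a high-churn queue table.
TUNE = """
ALTER TABLE jobs SET (
    autovacuum_vacuum_scale_factor = 0.0,   -- don't scale with table size
    autovacuum_vacuum_threshold    = 1000,  -- kick in after ~1000 dead tuples
    autovacuum_vacuum_cost_delay   = 0      -- let vacuum run at full speed
);
"""
```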
Another option is to avoid deleting rows, but instead use a column to mark rows as complete, which together with a partial index can avoid dead tuples. There are both pros and cons; it requires doing the cleanup (and VACUUM) as a separate job.
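Sketched out (invented names; the separate cleanup job is where the pros and cons come in):

```python
# Workers only ever scan the small partial index of pending rows.
DDL = """
CREATE INDEX jobs_pending_idx ON jobs (id) WHERE status = 'queued';
"""

# Mark complete instead of deleting...
COMPLETE = "UPDATE jobs SET status = 'done' WHERE id = %s;"

# ...and purge completed rows (plus VACUUM) in a separate periodic job.
CLEANUP = "DELETE FROM jobs WHERE status = 'done';"
```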
Unfortunately, updating the row also creates dead tuples. It's very tricky!
> also fast when transactions are held open
In my linked example, on getting the item from the queue, you immediately set the status to something that you're not polling for - does Postgres still have to skip past these tuples (even in an index) until they're vacuumed up?
I have implemented polling against a cluster of mixed mariadb/mysql databases which do not offer listen/notify. It was a pain in the neck to get right.
- The batch size needs to be adaptive for performance, latency, and recovering smoothly after downtime.
- The polling timeouts, frequency, etc. need the same treatment.
- You need to avoid hysteresis.
- You want to be super careful about not disturbing the main application by placing heavy load on the database or accidentally locking tables/rows
- You likely want multiple distributed workers so that events keep being handled during a network partition
It's hard to get right, especially since the databases at the time did not support SKIP LOCKED.
In retrospect I wish I had listened to the WAL. Much easier.
Have you played with pgmq? It's pretty neat: https://github.com/pgmq/pgmq
Another thing for @leontrolski to add to the benchmarks - which I cannot wait to read.
There's a pretty cool solution built on pgmq called pgflow:
https://www.pgflow.dev/concepts/how-pgflow-works
I use polling with backoff, capped at one minute. When a workload finishes, the worker immediately polls for more work. If nothing is found, it waits 5 seconds, then 10, and so on up to one minute; from then on it polls every minute until it finds work again, at which point the backoff timer resets to zero.
Instead of LISTEN/NOTIFY you could listen to the WAL / logical replication stream.
Or you could have a worker whose only job is to listen to the WAL / logical replication stream and then NOTIFY. With a single process issuing the NOTIFY, it wouldn't burden other transactions.
Or you could have a worker whose only job is to listen to the WAL / logical replication stream and then publish to some non-PG pubsub system.
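A crude sketch of the middle option — one relay that drains a logical slot and fans out a NOTIFY — might look like this (assumes wal_level = logical and the built-in test_decoding plugin; real code would use the streaming replication protocol rather than polling the slot):

```python
import time
import psycopg2

# One-time setup, run as SQL:
#   SELECT pg_create_logical_replication_slot('queue_relay', 'test_decoding');

def relay(dsn):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    with conn.cursor() as cur:
        while True:
            # Consuming the changes also advances the slot, which bounds
            # the WAL retention mentioned upthread.
            cur.execute(
                "SELECT data FROM pg_logical_slot_get_changes"
                "('queue_relay', NULL, NULL);"
            )
            if cur.fetchall():
                cur.execute("NOTIFY new_work;")  # one NOTIFY per batch
            else:
                time.sleep(0.5)
```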
With that experience behind you, would you have feedback for Chancy[1]? It aims to be a batteries-included offering for Postgres+Python, targeting hundreds of millions of jobs a day rather than massive horizontal worker scaling.
It both polls (configurable per queue) and supports LISTEN/NOTIFY, which is used only to tell workers they can wake up early and trigger a poll; this can be turned off globally with a notifications=false flag.
[1]: https://github.com/tktech/chancy
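That "poll, but let NOTIFY wake you early" shape is roughly the following (my sketch, not Chancy's actual code):

```python
import select
import psycopg2

CLAIM = """
UPDATE jobs SET status = 'running'
WHERE id = (SELECT id FROM jobs WHERE status = 'queued'
            FOR UPDATE SKIP LOCKED LIMIT 1)
RETURNING id, payload;
"""

def worker(dsn, poll_interval=5.0):
    conn = psycopg2.connect(dsn)
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("LISTEN wake;")
        while True:
            while True:                    # drain everything that's ready
                cur.execute(CLAIM)
                row = cur.fetchone()
                if row is None:
                    break
                ...                        # process the job here
            # Sleep until the next poll tick, or earlier if a NOTIFY lands.
            if select.select([conn], [], [], poll_interval)[0]:
                conn.poll()
                del conn.notifies[:]       # payload irrelevant; just wake up
```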
I'll take the shameless plug. Thank you for putting this together! Very helpful overview of pg locks.
It's funny how "shameless plug" actually means "excuse the self-promotion" and implies at least a little bit of shame even when the reference is appropriate and on-topic.
Polling requires something persistent to check. That requires creating tuples, and most likely deleting them after they’ve been consumed. That puts pressure on the database and requires vacuuming in ways that pubsub doesn’t, because pubsub is entirely ephemeral.
Not to mention that pubsub allows multiple consumers for a single message, whereas FOR UPDATE is single consumer by design.