Comment by singron

7 months ago

Polling is the way to go, but it's also very tricky to get right. In particular, it's non-trivial to make a reliable queue that's also fast when transactions are held open and vacuum isn't able to clean tuples. E.g. "get the first available tuple" might have to skip over 1000s of dead tuples.

Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds a transaction open part of the time. It's also nice if your database doesn't melt whenever this happens by accident.


time0ut 7 months ago

An approach that has worked for me is to hash partition the table and have each worker look for work in one partition at a time. There are a number of strategies depending on how you manage workers. This allows you to only consider 1/Nth of the dead tuples, where N is the number of partitions, when looking for work. It does come at the cost of strict ordering, but there are many use cases where strict ordering is not required. The largest scale implementation of this strategy that I have done had 128 partitions with a worker per partition pumping through ~100 million tasks per day.

I also found that LISTEN/NOTIFY did not work well at this scale, and used a polling-based approach with a backoff when no work was found.

Quite an interesting problem and a bit challenging to get right at scale.
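
A minimal sketch of the polling-with-backoff idea described above, not the commenter's actual implementation. The names `next_delay`, `run_worker`, `fetch_task` and `handle`, and the 50 ms base and 5 s cap, are assumptions made up for illustration; `fetch_task` stands in for whatever query claims a row from the worker's partition.

```python
import random
import time
from typing import Callable, Optional


def next_delay(current: float, found_work: bool,
               base: float = 0.05, cap: float = 5.0) -> float:
    """Return the delay to use before the next poll.

    Reset to `base` when work was found; otherwise double, up to `cap`.
    """
    if found_work:
        return base
    return min(cap, max(base, current * 2))


def run_worker(fetch_task: Callable[[], Optional[object]],
               handle: Callable[[object], None],
               sleep: Callable[[float], None] = time.sleep,
               max_polls: Optional[int] = None) -> None:
    """Poll for tasks, backing off while the queue is empty.

    `fetch_task` is a hypothetical callable that returns one task or None.
    `max_polls` exists only so the loop can be stopped in tests.
    """
    delay = 0.05
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        task = fetch_task()
        if task is not None:
            handle(task)
            delay = next_delay(delay, True)
            continue  # more work is likely, so poll again right away
        delay = next_delay(delay, False)
        # Jitter keeps many idle workers from polling in lockstep.
        sleep(delay * random.uniform(0.5, 1.0))
```

Resetting the delay as soon as work shows up keeps latency low under load, while the cap bounds how stale an idle worker can get.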

j16sdiz 7 months ago
You can't change the number of partitions dynamically.
There's an additional challenge if jobs come in funny sizes.
- AlisdairO 7 months ago
  
  Depending on exactly what you need, you can often fake this with a functional index on mod(queue_value_id, 5000). You then query for mod(queue_value_id, 5000) between m and n. You can then dynamically adjust the gap between m and n based on how many partitions you want.
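
A rough sketch of how the m and n bounds above could be derived per worker. The function `bucket_range`, the table name `queue`, and the psycopg-style `%(m)s` placeholders are assumptions for illustration; only the `mod(queue_value_id, 5000)` expression comes from the comment.

```python
from typing import Tuple

BUCKETS = 5000  # modulus used in the expression index


def bucket_range(worker: int, workers: int,
                 buckets: int = BUCKETS) -> Tuple[int, int]:
    """Inclusive (m, n) bucket range for 0-based `worker` out of `workers`.

    The ranges of all workers are contiguous and together cover
    0 .. buckets - 1 exactly once.
    """
    if not 0 <= worker < workers <= buckets:
        raise ValueError("need 0 <= worker < workers <= buckets")
    m = worker * buckets // workers
    n = (worker + 1) * buckets // workers - 1
    return m, n


# Hypothetical query; the WHERE clause repeats the indexed expression
# verbatim so that the expression index can be considered by the planner.
CLAIM_QUERY = """
SELECT queue_value_id
FROM queue
WHERE mod(queue_value_id, 5000) BETWEEN %(m)s AND %(n)s
LIMIT 1
FOR UPDATE SKIP LOCKED
"""
```

Changing the number of "partitions" then only means calling `bucket_range` with a different `workers` value; the index itself stays the same.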
dfsegoat 7 months ago

If there were a toy or other public implementation of this, I would love to see it.
CBLT 7 months ago

This is how Kafka does it. Kafka has spent years working on the rough edges (e.g. partition resizing), though I haven't used it recently.

atombender 7 months ago

Dead tuples are a real and significant problem, not just because scans have to skip them, but because the statistics that drive the planner don't account for them.

I found this out the hard way when I had a simple query that suddenly got very, very slow on a table where the application would constantly do a `SELECT ... FOR UPDATE SKIP LOCKED` and then immediately delete the rows after a tiny bit of processing.

It turned out that with a nearly empty table of about 10-20k dead tuples, the planner switched to using a different index scan, and would overfetch tons of pages just to discard them, as they only contained dead tuples. What I didn't realize is that the planner statistics don't care about dead tuples, and ANALYZE doesn't take them into account. So the planner started to think the table was much bigger than it actually was.

It's really important for these use cases to tweak the autovacuum settings (which can be set on a per-table basis) to be much more aggressive, so that under high load, the vacuum runs pretty much continuously.
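
To make "more aggressive" concrete, here is a small sketch of the trigger arithmetic as described in the PostgreSQL documentation: autovacuum considers a table once its dead tuples exceed `autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples` (defaults 50 and 0.2). The helper name `vacuum_trigger`, the table name `queue`, and the specific override values are assumptions for illustration, not recommendations.

```python
def vacuum_trigger(reltuples: float,
                   threshold: int = 50,
                   scale_factor: float = 0.2) -> float:
    """Dead-tuple count above which autovacuum considers the table.

    Defaults mirror PostgreSQL's autovacuum_vacuum_threshold and
    autovacuum_vacuum_scale_factor defaults.
    """
    return threshold + scale_factor * reltuples


# With defaults, a table estimated at 1,000,000 rows waits for roughly
# 200,050 dead tuples before autovacuum kicks in.
default_trigger = vacuum_trigger(1_000_000)

# A fixed, small threshold makes the trigger independent of table size.
aggressive_trigger = vacuum_trigger(1_000_000, threshold=1000, scale_factor=0.0)

# Hypothetical per-table override; the values are illustrative only.
ALTER_SQL = """
ALTER TABLE queue SET (
    autovacuum_vacuum_scale_factor = 0,
    autovacuum_vacuum_threshold = 1000,
    autovacuum_vacuum_cost_delay = 0
)
"""
```

Note that none of this helps while a long-running transaction is holding back cleanup, which is the situation the top comment describes.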

Another option is to avoid deleting rows, but instead use a column to mark rows as complete, which together with a partial index can avoid dead tuples. There are both pros and cons; it requires doing the cleanup (and VACUUM) as a separate job.

singron 7 months ago
Unfortunately, updating the row also creates dead tuples. It's very tricky!
- atombender 7 months ago
  
  It does, but because of how indexes work, I believe it won't be skewed by the presence of dead tuples (though the bloat can cause the live data to be spread across a lot more blocks and therefore generate more I/O) as long as you run autoanalyze semi-regularly.
  

leontrolski 7 months ago

> also fast when transactions are held open

In my linked example, on getting the item from the queue, you immediately set the status to something that you're not polling for - does Postgres still have to skip past these tuples (even in an index) until they're vacuumed up?