Comment by jashmatthews

1 year ago

Putting low-throughput queues in the same DB is great both for simplicity and for getting exactly-once processing.

Putting high-throughput queues in Postgres sucks because...

- No O(1) guarantee for fetching the latest job; the query planner can go haywire.
- Heavily updated tables bloat like crazy; fixing that properly needs a whole new storage engine (zheap).
- Write amplification, since every update has to touch every index.
- LISTEN/NOTIFY doesn't work through connection pooling.

Update-related throughput and index problems only arise if you actually update the tables. You can use an append-only structure to mitigate some of that: insert new rows with the updated statuses instead. You also gain history as a side benefit. You can even coax the index into carrying non-key columns for speed by adding INCLUDE to CREATE INDEX.

You can then delete the older rows as needed.
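
A minimal sketch of that append-only shape (table, column, and index names are made up for illustration, not taken from any particular library):

    -- Append-only status log: job state changes are INSERTed, never UPDATEd.
    CREATE TABLE job_events (
        id         bigserial PRIMARY KEY,
        job_id     bigint      NOT NULL,
        status     text        NOT NULL,          -- e.g. 'queued', 'running', 'done'
        payload    jsonb,
        created_at timestamptz NOT NULL DEFAULT now()
    );

    -- Covering index: INCLUDE lets the index carry non-key columns so a poll
    -- on status can avoid heap lookups.
    CREATE INDEX job_events_status_created_idx
        ON job_events (status, created_at)
        INCLUDE (job_id, payload);

    -- Enqueue and every later state change are plain inserts.
    INSERT INTO job_events (job_id, status, payload)
    VALUES (42, 'queued', '{"task": "send_email"}');

    INSERT INTO job_events (job_id, status) VALUES (42, 'done');

    -- Old history is removed in occasional bulk deletes rather than per-row updates.
    DELETE FROM job_events WHERE created_at < now() - interval '30 days';

With the covering index, the poll can often be satisfied by an index-only scan.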

Query planner issues are a general problem in Postgres and aren't unique to this use case. Not sure what O(1) means in this context. I'm not sure Postgres has ever been able to promise constant-time access to anything; indeed, with an index, access would never be asymptotically bounded by a constant anyway?

  • By the time you need append-only job statuses, it's better to move to a dedicated queue. Append-only statuses help, but they also make the polling query a lot more expensive.

    Deleting older rows is a nightmare at scale. It leaves holes in the earlier parts of the table and nerfs half the advantage of using append-only in the first place. You end up paying the IO cost of a full 8kB page to read a single job.

    Dedicated queues have constant-time enqueue and dequeue operations that don't blow up at random times.
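
    To make that concrete, against an append-only log the poll becomes a "latest event per job" query rather than a simple indexed status lookup (roughly, reusing the hypothetical job_events sketch above):

        -- Jobs whose most recent event is still 'queued'. The inner step has
        -- to consider every job's newest event, not just the pending ones,
        -- which is what makes this poll expensive compared to a status column.
        SELECT job_id
        FROM (
            SELECT DISTINCT ON (job_id) job_id, status, created_at
            FROM job_events
            ORDER BY job_id, created_at DESC
        ) latest
        WHERE status = 'queued'
        ORDER BY created_at
        LIMIT 10;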

    • With a partitioned table you can painlessly remove old rows. Of course, you then have to maintain your partitions, but that's trivial.

    • Partitions are often used to drop old data in constant time.

      They can also help mitigate IO issues if you use your insertion timestamp as the partition key and include it in your main queries.
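
      For example (a sketch reusing the hypothetical job_events table from above, without its surrogate key, since keys on a partitioned table must include the partition column), declarative range partitioning on the insert timestamp makes expiring old data a metadata-only operation:

          CREATE TABLE job_events (
              job_id     bigint      NOT NULL,
              status     text        NOT NULL,
              created_at timestamptz NOT NULL DEFAULT now()
          ) PARTITION BY RANGE (created_at);

          -- One partition per day (only one shown here).
          CREATE TABLE job_events_2024_06_01 PARTITION OF job_events
              FOR VALUES FROM ('2024-06-01') TO ('2024-06-02');

          -- Queries that filter on created_at prune to the matching partitions,
          -- and expiring a day of jobs is a single cheap DDL statement instead
          -- of a bulk DELETE that leaves dead tuples behind.
          DROP TABLE job_events_2024_05_31;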

Indeed, that's my experience too. We used partitions like others have mentioned, but Postgres had issues with moving rows across tables atomically, and we had to implement our own fairly complex queries to work around it. Job expiration was also dynamic, so we had to rely on background cleaning. The bigger problem was the planner not being able to pick up sudden changes in volume, so we had to run ANALYZE from a cron job. Then there was managing retries with backoffs, etc. At some point we stopped fighting it and just moved to SQS; we've had zero problems since, with no maintenance needed, and it's still free, so we saved on storage costs, time, and ongoing developer effort.

We still use Postgres for simple queues, but those don't really require a library since they're usually quite simple; with some advisory locks we can handle unlocking crashed jobs fairly well too.
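
A rough sketch of that advisory-lock pattern (table and column names are made up, and a real worker needs a bit more care, since the planner can evaluate the lock call for rows the LIMIT later discards):

    -- Claim one pending job. pg_try_advisory_lock() takes a session-level
    -- lock, so if the worker crashes and its connection drops, the lock is
    -- released automatically and another worker can pick the job up.
    SELECT id, payload
    FROM jobs
    WHERE status = 'pending'
      AND pg_try_advisory_lock(id)   -- skip jobs other workers currently hold
    ORDER BY id
    LIMIT 1;

    -- After finishing the work (using the id returned above, here 42):
    UPDATE jobs SET status = 'done' WHERE id = 42;
    SELECT pg_advisory_unlock(42);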

> LISTEN/NOTIFY doesn't work through connection pooling

What's the problem with using it with connection pooling?