Comment by mickeyp

1 year ago

Update-related throughput and index problems only arise if you update rows in place. You can use an append-only structure to mitigate some of that: insert new rows with the updated statuses instead of updating existing ones. You also gain a history for free. You can even coax the index into holding non-key values for speed with the INCLUDE clause of CREATE INDEX.
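A minimal sketch of this pattern (the table and column names here are illustrative, not from the original comment):

```sql
-- Hypothetical append-only status table: rows are inserted, never updated.
CREATE TABLE job_status (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    job_id     bigint      NOT NULL,
    status     text        NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
);

-- A status change is a new row, not an UPDATE:
INSERT INTO job_status (job_id, status) VALUES (42, 'completed');

-- Covering index: INCLUDE stores the non-key status column in the
-- index leaf pages, so reads can be served by an index-only scan.
CREATE INDEX job_status_job_id_idx
    ON job_status (job_id, created_at DESC)
    INCLUDE (status);
```

Because INCLUDE columns are not part of the key, they add no ordering overhead; they only make index-only scans possible for queries that select them.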

You can then delete the older rows when needed or as required.

Query planner issues are a general problem in Postgres and are not unique to this use case. Not sure what O(1) means in this context. I am not sure Postgres has ever been able to promise constant-time access to anything; indeed, with an index, access would never be asymptotically upper-bounded as constant time at all.

By the time you need append-only job statuses, it's better to move to a dedicated queue. Append-only statuses help, but they also make the polling query a lot more expensive.
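To illustrate why polling gets more expensive: with append-only statuses, the poller can no longer filter on `status` directly; it first has to find the latest row per job. A sketch, assuming a hypothetical `job_status` table where each status change is a new row:

```sql
-- Each job may have many status rows; the current status is the newest one.
-- Finding jobs whose *latest* status is 'pending' requires walking each
-- job's history, not a single index probe on status.
SELECT job_id
FROM (
    SELECT DISTINCT ON (job_id) job_id, status
    FROM job_status
    ORDER BY job_id, created_at DESC
) latest
WHERE status = 'pending';
```

With an in-place-update table, the same poll is a plain `WHERE status = 'pending'` against an index.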

Deleting older rows is a nightmare at scale. It leaves holes in the earlier parts of the table and negates half the advantage of using append-only in the first place. You end up paying the 8 KB page I/O cost for a single job.

Dedicated queues have constant-time enqueue and dequeue operations that don't blow up at random times.

  • With a partitioned table you can painlessly remove old rows. Of course, you then have to maintain your partitions, but that's trivial.

    • It's far from trivial. Autoanalyze doesn't work on partitioned tables, only on the individual partitions. Partitioning a busy job-queue table is a nightmare in itself.

  • Partitions are often used to drop old data in constant time.

    They can also help to mitigate io issues if you use your insertion timestamp as the partition key and include it in your main queries.

    • Yeah, the ULID/UUIDs that can be partitioned by time in this way are AWESOME for these use cases.
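The partitioning scheme described in this thread might be sketched like this (table names and the monthly granularity are illustrative assumptions):

```sql
-- Range-partition the status table by insertion timestamp.
CREATE TABLE job_status (
    job_id     bigint      NOT NULL,
    status     text        NOT NULL,
    created_at timestamptz NOT NULL DEFAULT now()
) PARTITION BY RANGE (created_at);

CREATE TABLE job_status_2024_01 PARTITION OF job_status
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');
CREATE TABLE job_status_2024_02 PARTITION OF job_status
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

-- Including the partition key in queries lets the planner prune
-- partitions, so I/O stays confined to the relevant time range:
SELECT status FROM job_status
WHERE job_id = 42 AND created_at >= '2024-02-01';

-- Removing old data is a metadata operation, not a row-by-row DELETE:
DROP TABLE job_status_2024_01;
-- or, to keep the data around:
-- ALTER TABLE job_status DETACH PARTITION job_status_2024_01;
```

Dropping or detaching a partition avoids the dead-tuple and page-hole problems that a mass DELETE causes on a single large table.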