Depends what else you’re running on it; it’s a little expensive, but not prohibitively so.
Long-running transactions interfere with vacuuming and increase lock contention. Everything depends on your workload, but a long-running transaction holding an important lock is an easy way to bring down production.
If the system is already using SQS, DynamoDB has this locking library, which is lighter weight for this use case:
https://github.com/awslabs/amazon-dynamodb-lock-client
> The AmazonDynamoDBLockClient is a general purpose distributed locking library built on top of DynamoDB. It supports both coarse-grained and fine-grained locking.
I've read too many "use Postgres as your queue (pgkitchensink is in beta)" posts; now I'm learning LISTEN/NOTIFY is a strain, and so are long transactions. Is there a happy medium?
Just stop worrying and use it. If and when you actually bump into the limitations, then it's time to sit down and think and find a supplement or replacement for the offending part.
Excellent advice across many domains/techs here.
t1: select for update where status=pending, set status=processing
t2: update, set status=completed|error
These are two independent, very short transactions? Or am I misunderstanding something here?
--
Edit:
I think I'm not seeing what the 'transaction at start of processor' logic is; I'm thinking more of a polling approach.
This obviously has the drawback of knowing how long to sleep for, and tasks not getting "instantly" picked up, but eh, tradeoffs.
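A minimal sketch of those two transactions, assuming an illustrative jobs table with id, payload, status, and timestamp columns (not the OP's exact schema). FOR UPDATE SKIP LOCKED lets concurrent workers claim different rows without blocking on each other:

    -- t1: claim one pending job and mark it as being processed
    UPDATE jobs
       SET status = 'processing', started_at = now()
     WHERE id = (SELECT id
                   FROM jobs
                  WHERE status = 'pending'
                  ORDER BY id
                  LIMIT 1
                    FOR UPDATE SKIP LOCKED)
    RETURNING id, payload;

    -- ... do the actual work outside any transaction ...

    -- t2: record the outcome ('error' instead of 'completed' on failure)
    UPDATE jobs
       SET status = 'completed', finished_at = now()
     WHERE id = $claimed_id;   -- $claimed_id: placeholder for the id returned by t1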
Your version makes sense. I understood the OP's approach as being different.
Two short transactions at start and end (very short, if indexed properly) are a good solution. One caveat is that the worker can die after t1 but before t2, so jobs need a timeout concept and should be idempotent for safe retrying.
This gets you "at least once" processing.
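A sketch of the timeout idea, with illustrative started_at and attempts columns: a periodic sweep pushes stuck jobs back to 'pending'. Because the original worker might still be alive and merely slow, this is exactly why the job body has to be idempotent:

    -- Requeue anything stuck in 'processing' for more than 10 minutes.
    -- The old worker may still finish later, so the work itself must be
    -- safe to perform twice ("at least once" semantics).
    UPDATE jobs
       SET status = 'pending',
           attempts = attempts + 1
     WHERE status = 'processing'
       AND started_at < now() - interval '10 minutes'
       AND attempts < 5;   -- after a few tries, stop requeuing and leave it for inspection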
> This obviously has the drawback of knowing how long to sleep for, and tasks not getting "instantly" picked up, but eh, tradeoffs.
Right. I've had success with exponential backoff sleep. In a busy system, that means sleeps remain either zero or very short.
Another solution is Postgres LISTEN/NOTIFY: workers listen for events and Postgres wakes them up. On the happy path, this gets you instant job pickup. It should be allowed to fail open and be understood as a happy-path optimization.
As delivery can fail, this gets you "at most once" processing (which is why this approach by itself is not enough to drive a persistent job queue).
A caveat with LISTEN/NOTIFY is that it doesn't scale, because NOTIFY takes a global lock at commit time [1].
[1]: https://www.recall.ai/blog/postgres-listen-notify-does-not-s...
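A sketch of NOTIFY as a pure wake-up hint (channel name and payload are illustrative): the enqueuer notifies after inserting, listeners wake up and run the normal claim query, and a lost notification only means the next poll picks the job up instead:

    -- Producer, right after inserting the job (same transaction is fine;
    -- the notification is only delivered on commit):
    INSERT INTO jobs (status, payload) VALUES ('pending', '{"kind": "email"}');
    NOTIFY job_ready;

    -- Each worker's session:
    LISTEN job_ready;
    -- Block on the connection until a notification or a poll-interval
    -- timeout, then run the usual "claim one pending job" query either way.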
What are your thoughts on using Redis Streams, or on using a table instead of LISTEN/NOTIFY (either a table per topic, or a table with a compound primary key that includes the topic, possibly a temporary table)?
They're proposing doing it in one transaction as a heartbeat.
> - If you find an unlocked task in 'executing', you know the processor died for sure. No heuristic needed
Yes, and that cannot work: if a task is unlocked but in 'executing' state, how was it unlocked but its state not updated?
If a worker/processor dies abruptly, it will neither unlock nor set the state appropriately. It won't have the opportunity. Conceptually, this failure mode can always occur (think, power loss).
If such a disruption happened, yet you later find tasks unlocked, they must have been unlocked by another system. Perhaps Postgres itself, with a killer daemon to kill long-running transactions/locks. At which point we are back to square one: the job scheduling should be robust against this in the first place.
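One way to get that robustness without relying on locks at all, sketched with an illustrative heartbeat_at column (not something spelled out verbatim in the thread): the worker touches its row in a tiny transaction every so often, and a sweeper requeues rows whose heartbeat has gone stale, which covers abrupt death including power loss:

    -- Worker, every ~30 seconds while the job runs:
    UPDATE jobs SET heartbeat_at = now() WHERE id = $claimed_id;   -- placeholder id from the claim step

    -- Sweeper: a worker that stopped heartbeating is presumed dead,
    -- so hand its job back out (the job must be idempotent, as above).
    UPDATE jobs
       SET status = 'pending', attempts = attempts + 1
     WHERE status = 'processing'
       AND heartbeat_at < now() - interval '2 minutes';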