Comment by Felk
2 days ago
I see that the author took a heuristic approach to retrying tasks (having a predetermined amount of time a task is expected to take, and considering it failed if it wasn't updated in time) and uses SQS. If the solution is homemade anyway, I can only recommend leveraging your database's transactionality for this, which is a common pattern I have often seen recommended and have also successfully used myself:
- At processing start, open a new transaction, claim the schedule entry while skipping already locked tasks (`SELECT ... FOR UPDATE SKIP LOCKED`), and update it to 'executing'. Keep the transaction (and the lock) open while processing.
- At the end of processing, set it to 'COMPLETED' and commit. This also releases the lock.
This has the following nice characteristics (see the sketch after this list):
- You can have parallel processors polling tasks directly from the database without another queueing mechanism like SQS, and have no risk of them picking the same task.
- If you find an unlocked task in 'executing', you know the processor died for sure. No heuristic needed
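A minimal sketch of what I mean, assuming a `tasks` table with `id` and `status` columns; the names and the `:claimed_id` placeholder are only illustrative:

```sql
-- Claim a task; the row lock is held for the whole transaction, so other
-- workers skip it, and the lock vanishes automatically if this worker dies.
BEGIN;

SELECT id
FROM tasks
WHERE status IN ('pending', 'executing')  -- 'executing' without a lock = dead worker
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

UPDATE tasks SET status = 'executing' WHERE id = :claimed_id;

-- ... do the actual work here, inside the still-open transaction ...

UPDATE tasks SET status = 'COMPLETED' WHERE id = :claimed_id;
COMMIT;  -- releases the row lock
```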
This introduces long-running transactions, which at least in Postgres should be avoided.
Depends what else you’re running on it; it’s a little expensive, but not prohibitively so.
Long-running transactions interfere with vacuuming and increase lock contention. Everything depends on your workload, but a long-running transaction holding an important lock is an easy way to bring down production.
I've read too many "use Postgres as your queue (pgkitchensink is in beta)" posts; now I'm learning LISTEN/NOTIFY is a strain, and so are long transactions. Is there a happy medium?
Just stop worrying and use it. If and when you actually bump into the limitations, then it's time to sit down and think and find a supplement or replacement for the offending part.
t1: select for update where status=pending, set status=processing
t2: update, set status=completed|error
these are two independent, very short transactions (rough sketch below)? or am i misunderstanding something here?
--
edit:
i think i'm not seeing what the 'transaction at start of processor' logic is; i'm thinking more of a polling logic
this obviously has the drawback of knowing how long to sleep for; and tasks not getting "instantly" picked up, but eh, tradeoffs.
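A rough sketch of the two-short-transactions polling idea above, assuming a `tasks` table with `id`, `status`, and `started_at` columns (all names illustrative):

```sql
-- t1: claim the next pending task and mark it 'processing' in one short transaction.
UPDATE tasks
SET status = 'processing', started_at = now()
WHERE id = (
    SELECT id FROM tasks
    WHERE status = 'pending'
    ORDER BY id
    LIMIT 1
    FOR UPDATE SKIP LOCKED
)
RETURNING id;

-- ... do the work, outside any open transaction ...

-- t2: record the outcome in a second short transaction.
UPDATE tasks
SET status = 'completed'  -- or 'error'
WHERE id = :claimed_id;
```

In autocommit mode each statement above is its own transaction, so nothing stays open while the work runs.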
Your version makes sense. I understood the OP's approach as being different.
Two very short (if indexed properly) transactions at the start and end are a good solution. One caveat is that the worker can die after t1 but before t2 - hence jobs need a timeout concept and should be idempotent for safe retrying.
This gets you "at least once" processing.
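The timeout concept can be as simple as a periodic requeue query, assuming a `started_at` timestamp is set when the job is claimed (a sketch, not the OP's implementation):

```sql
-- Requeue jobs whose worker presumably died: still 'processing' past the timeout.
-- Only safe if handlers are idempotent, since the job may run again ("at least once").
UPDATE tasks
SET status = 'pending'
WHERE status = 'processing'
  AND started_at < now() - interval '10 minutes';  -- timeout is workload-specific
```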
> this obviously has the drawback of knowing how long to sleep for; and tasks not getting "instantly" picked up, but eh, tradeoffs.
Right. I've had success with exponential backoff sleep. In a busy system, that means sleeps remain either zero or very short.
Another solution is Postgres LISTEN/NOTIFY: workers listen for events and PG wakes them up. On the happy path, this gets you instant job pickup (see the sketch below). It should be allowed to fail open and understood as a happy-path optimization.
As delivery can fail, this gets you "at most once" processing (which is why this approach by itself is not enough to drive a persistent job queue).
A caveat with LISTEN/NOTIFY is that it doesn't scale due to locking [1].
[1]: https://www.recall.ai/blog/postgres-listen-notify-does-not-s...
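A sketch of the LISTEN/NOTIFY wake-up mentioned above, using a trigger on insert; the channel and function names are made up, and workers should still poll as a fallback since notifications are not persistent:

```sql
-- In each worker's session: subscribe to the wake-up channel.
LISTEN task_ready;

-- Fire a notification whenever a new task row is inserted.
CREATE FUNCTION notify_task_ready() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('task_ready', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER task_ready_notify
AFTER INSERT ON tasks
FOR EACH ROW EXECUTE FUNCTION notify_task_ready();
```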
They're proposing doing it in one transaction as a heartbeat.
> - If you find an unlocked task in 'executing', you know the processor died for sure. No heuristic needed
This is exactly what we're doing. Works like a charm.
You don't have to keep the transaction open. What I do is (see the sketch after this list):
1. Select next job
2. Update status to executing where jobId = thatJob and status is pending
3. If previous affected 0 rows, you didn't get the job, go back to select next job
If you have "time to select" <<< "time to do" this works great. But if you have closer relationship you can see how this is mostly going to have contention and you shouldn't do it.