Comment by Felk
6 months ago
I see that the author took a 'heuristical' approach to retrying tasks (having a predetermined amount of time a task is expected to take, and considering it failed if it wasn't updated in time) and uses SQS. If the solution is homemade anyway, I can only recommend leveraging your database's transactionality for this, which is a common pattern I have often seen recommended and have also successfully used myself:
- At processing start, update the schedule entry to 'executing', then open a new transaction and lock the row, skipping tasks other workers have already locked (`SELECT FOR UPDATE ... SKIP LOCKED`).
- At the end of processing, set it to 'COMPLETED' and commit. This also releases the lock.
This has the following nice characteristics:
- You can have parallel processors polling tasks directly from the database without another queueing mechanism like SQS, and have no risk of them picking the same task.
- If you find an unlocked task in 'executing', you know the processor died for sure. No heuristic needed.
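A minimal SQL sketch of that flow, assuming a `tasks` table with `id` and `status` columns (`:claimed_id` stands for the id returned by the SELECT):

```sql
-- Claim the next available task; SKIP LOCKED means concurrent workers
-- never block on, or double-claim, a row another worker holds.
BEGIN;
SELECT id FROM tasks
WHERE status = 'pending'
ORDER BY id
LIMIT 1
FOR UPDATE SKIP LOCKED;

UPDATE tasks SET status = 'executing' WHERE id = :claimed_id;

-- ... do the actual work while the transaction (and row lock) stay open ...

UPDATE tasks SET status = 'COMPLETED' WHERE id = :claimed_id;
COMMIT;  -- releases the lock
```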
This introduces long-running transactions, which at least in Postgres should be avoided.
Depends what else you’re running on it; it’s a little expensive, but not prohibitively so.
Long-running transactions interfere with vacuuming and increase contention for locks. Everything depends on your workload, but a long-running transaction holding an important lock is an easy way to bring down production.
1 reply →
I read too many "use Postgres as your queue (pgkitchensink is in beta)", now I'm learning listen/notify is a strain, and so are long transactions. Is there a happy medium?
Just stop worrying and use it. If and when you actually bump into the limitations, then it's time to sit down and think and find a supplement or replacement for the offending part.
1 reply →
t1: select for update where status=pending, set status=processing
t2: update, set status=completed|error
these are two independent, very short transactions? or am i misunderstanding something here?
--
edit:
i think i'm not seeing what the 'transaction at start of processor' logic is; i'm thinking more of a polling logic
this obviously has the drawback of knowing how long to sleep for; and tasks not getting "instantly" picked up, but eh, tradeoffs.
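In Postgres syntax, t1 and t2 might look like this (table and column names are assumptions):

```sql
-- t1: claim a job in one short transaction.
BEGIN;
UPDATE tasks SET status = 'processing'
WHERE id = (
  SELECT id FROM tasks
  WHERE status = 'pending'
  ORDER BY id
  LIMIT 1
  FOR UPDATE SKIP LOCKED
)
RETURNING id;
COMMIT;

-- ... do the work, with no transaction open ...

-- t2: record the outcome in a second short transaction.
BEGIN;
UPDATE tasks SET status = 'completed' WHERE id = :id;
COMMIT;
```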
Your version makes sense. I understood the OP's approach as being different.
Two short transactions at start and end (very short, if indexed properly) are a good solution. One caveat is that the worker can die after t1 but before t2 - hence jobs need a timeout concept and should be idempotent for safe retrying.
This gets you "at least once" processing.
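One way to implement that timeout in SQL, assuming t1 also records a `claimed_at` timestamp (the 5-minute cutoff is arbitrary):

```sql
-- Reaper: requeue jobs whose worker presumably died between t1 and t2.
UPDATE tasks
SET status = 'pending', claimed_at = NULL
WHERE status = 'processing'
  AND claimed_at < now() - interval '5 minutes';
```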
> this obviously has the drawback of knowing how long to sleep for; and tasks not getting "instantly" picked up, but eh, tradeoffs.
Right. I've had success with exponential backoff on the sleep. In a busy system, that means sleeps stay at zero or very short.
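A minimal sketch of that backoff polling; the names (`backoff_delays`, `poll_loop`, `fetch_job`) and the default durations are made up for illustration:

```python
import time

def backoff_delays(initial=0.05, factor=2.0, cap=5.0):
    """Yield sleep durations that grow exponentially up to a cap."""
    delay = initial
    while True:
        yield delay
        delay = min(delay * factor, cap)

def poll_loop(fetch_job, handle, max_idle_polls=10):
    """Poll for jobs; back off exponentially while idle, and reset the
    backoff as soon as a job shows up (busy system => near-zero sleeps)."""
    delays = backoff_delays()
    idle = 0
    while idle < max_idle_polls:
        job = fetch_job()
        if job is not None:
            handle(job)
            delays = backoff_delays()  # reset: queue is busy again
            idle = 0
        else:
            idle += 1
            time.sleep(next(delays))
```

Resetting the generator whenever a job is found is what keeps a busy worker polling at full speed while an idle one backs off toward the cap.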
Another solution is Postgres LISTEN/NOTIFY: workers listen for events and PG wakes them up. On the happy path, this gets instant job pickup. This should be allowed to fail open and understood as a happy path optimization.
As delivery can fail, this gets you "at most once" processing (which is why this approach by itself is not enough to drive a persistent job queue).
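A sketch of the happy-path wakeup, assuming a channel named `job_queue`:

```sql
-- Worker: subscribe once per connection, then wait for wakeups
-- (falling back to polling, since delivery isn't guaranteed).
LISTEN job_queue;

-- Producer: after inserting a job, nudge the listeners.
-- Inside a transaction, the notification is delivered on COMMIT.
NOTIFY job_queue;
-- or, with a payload:
SELECT pg_notify('job_queue', 'new-job');
```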
A caveat with LISTEN/NOTIFY is that it doesn't scale due to locking [1].
[1]: https://www.recall.ai/blog/postgres-listen-notify-does-not-s...
2 replies →
They're proposing doing it in one transaction as a heartbeat.
> - If you find an unlocked task in 'executing', you know the processor died for sure. No heuristic needed
1 reply →
You don't have to keep a transaction open. What I do is:
1. Select next job
2. Update status to executing where jobId = thatJob and status is pending
3. If previous affected 0 rows, you didn't get the job, go back to select next job
If "time to select" <<< "time to do", this works great. But if the two are closer, you can see how this will mostly generate contention, and then you shouldn't do it.
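The three steps above, sketched in Python against SQLite for illustration (table and column names are assumptions; in Postgres the UPDATE's affected-row count serves the same purpose):

```python
import sqlite3

def claim_next_job(conn):
    """Steps 1-3: select a candidate, then atomically flip it to
    'executing'; if the UPDATE affected 0 rows, another worker won
    the race, so go back and select the next candidate."""
    while True:
        row = conn.execute(
            "SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None  # queue is drained
        claimed = conn.execute(
            "UPDATE jobs SET status = 'executing' "
            "WHERE id = ? AND status = 'pending'",
            (row[0],),
        ).rowcount
        conn.commit()
        if claimed == 1:
            return row[0]  # we won the race for this job
        # lost the race between SELECT and UPDATE; loop and retry
```

The `AND status = 'pending'` guard in the UPDATE is what makes the claim atomic: two workers can select the same row, but only one UPDATE will report an affected row.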
This is exactly what we're doing. Works like a charm.