Comment by parthdesai
2 days ago
What happens when the transaction succeeds but the NOTIFY fails, if it's executed outside of the transaction, in its own separate connection?
For reliability, you can make the recipient poll the table(s) of record for relevant state and use the out-of-band notification channel as a latency-reducer. So, the poller is eventually consistent at some configured polling interval, but opportunistically can respond much sooner when told to check again ahead of the next scheduled poll time.
In my experience, this means you make sure the polling solution is complete and correct, and the notifier gets reduced to a wake-up signal. This signal doesn't even need to carry the actionable change content, if the poller can already pose efficient queries for whatever "new stuff" it needs.
This approach also allows the poller to keep its own persistent cursor state if there is some stateful sequence to how it consumes the DB content. It automatically resynchronizes and the notification channel does not need to be kept in lock-step with the consumption.
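A minimal sketch of that shape, assuming a hypothetical `events` table with a monotonically increasing `id` and using SQLite plus a `threading.Event` purely as stand-ins for the real database and the notification channel. The function names are illustrative, not from any library. The point is only that the poll is the source of truth and the signal just shortens the wait:

```python
import sqlite3
import threading

# Hypothetical table of record; the schema is an assumption for illustration.
SCHEMA = "CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)"


def poll_once(conn, cursor_id):
    """Fetch rows newer than the saved cursor; return (rows, new_cursor)."""
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id",
        (cursor_id,),
    ).fetchall()
    new_cursor = rows[-1][0] if rows else cursor_id
    return rows, new_cursor


def wait_for_next_poll(wake, interval_s):
    """Sleep until the next scheduled poll, or return early on a wake-up.

    Returns True if the wake-up signal cut the wait short.
    """
    woken = wake.wait(timeout=interval_s)
    wake.clear()  # the signal carries no data, so it is safe to drop
    return woken


def run_poller(conn, wake, handle, interval_s=30.0, max_rounds=None):
    """Poll on a fixed interval; a lost wake-up only costs latency."""
    cursor_id = 0  # in practice this would be persisted by the consumer
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        rows, cursor_id = poll_once(conn, cursor_id)
        for row in rows:
            handle(row)
        rounds += 1
        wait_for_next_poll(wake, interval_s)
    return cursor_id
```

If the notifier never fires, `run_poller` still sees every row at the next interval. Whatever listens on the real channel only has to call `wake.set()`.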
> you can make the recipient poll the table(s) of record for relevant state
That is tricky due to transactions and visibility. How do you write the poller so it doesn't miss events that were written by a long-running or blocked transaction? You'd have to give the poller scan a long lookback window (e.g. "process events that were written since now minus 5 minutes") and then make sure transactions are cancelled hard before those 5 minutes are up.
I'd say that the most reliable way is to use some mutable lifecycle metadata other than times to identify work. An indexed query will find the "new and unclaimed" work items and process them, regardless of their potentially backdated temporal metadata.
Updates of the lifecycle properties can also help coordinate multiple pollers so that they never work on the same item, but they can have overlapping query terms so that each poller is capable of picking up a particular item in the absence of others getting there first.
You also need some kind of lease/timeout policy to recognize orphaned items, i.e. items that are claimed in the DB but not making progress. Workers can and should have exception handling and compensating updates to report failures and put items "back in the queue", but in the worst case this update may be missing. Some process, or even some human operator, needs to eventually compensate on behalf of the AWOL worker.
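Here is a rough sketch of the claim-and-lease idea, again using SQLite as a stand-in. The `jobs` table, its `state` / `claimed_by` / `lease_expires_at` columns, and the function names are all assumptions made up for illustration. The query selects on lifecycle state rather than on timestamps, and the guarded `UPDATE` re-checks the condition so that a racing poller loses the claim instead of duplicating it:

```python
import sqlite3

# Hypothetical work table; column names are assumptions for illustration.
SCHEMA = """
CREATE TABLE jobs (
    id INTEGER PRIMARY KEY,
    state TEXT NOT NULL DEFAULT 'new',   -- 'new' | 'claimed' | 'done'
    claimed_by TEXT,
    lease_expires_at REAL
)
"""

# An item is claimable if it is new, or if its lease has run out (orphaned).
CLAIMABLE = "(state = 'new' OR (state = 'claimed' AND lease_expires_at <= ?))"


def claim_next(conn, worker, now, lease_s=60.0):
    """Claim the oldest claimable job; return its id, or None if none."""
    with conn:  # commit on success, roll back on error
        row = conn.execute(
            f"SELECT id FROM jobs WHERE {CLAIMABLE} ORDER BY id LIMIT 1",
            (now,),
        ).fetchone()
        if row is None:
            return None
        # Re-check the condition in the UPDATE so a concurrent claimer
        # that got there first makes this statement match zero rows.
        cur = conn.execute(
            "UPDATE jobs SET state = 'claimed', claimed_by = ?, "
            f"lease_expires_at = ? WHERE id = ? AND {CLAIMABLE}",
            (worker, now + lease_s, row[0], now),
        )
        return row[0] if cur.rowcount == 1 else None


def complete(conn, job_id, worker):
    """Mark a job done, but only if this worker still holds the claim."""
    with conn:
        cur = conn.execute(
            "UPDATE jobs SET state = 'done' "
            "WHERE id = ? AND claimed_by = ? AND state = 'claimed'",
            (job_id, worker),
        )
        return cur.rowcount == 1
```

A worker that dies mid-job simply never calls `complete`, and once the lease passes, the next `claim_next` from any poller picks the item back up. A worker that finishes after its lease was taken over gets `False` back from `complete` and knows its result may be a duplicate.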
In my view, you always need this kind of table-scanning logic, even if using something like AMQP for work dispatch. You get in trouble when you fool yourself into imagining "exactly once" semantics actually exist. The message-passing layer could opportunistically scale out the workload, but a relational backstop can make sure that the real system of record is coherent and reflects the business goals. Sometimes, you can just run this relational layer as the main work scheduler and skip the whole message-passing build-out.
fwiw - that's what Oban did for the most part. It sent a signal to a worker that there was a new job to pick up and work on. At scale, even that was an issue.
The same thing that happens if the notified process dies suddenly.
If you're not handling that, then whatever you're doing is unreliable either way.