
Comment by saltcured

1 day ago

I'd say that the most reliable way is to use some mutable lifecycle metadata other than times to identify work. An indexed query will find the "new and unclaimed" work items and process them, regardless of their potentially backdated temporal metadata.

Updates of the lifecycle properties can also help coordinate multiple pollers so that they never work on the same item, but they can have overlapping query terms so that each poller is capable of picking up a particular item in the absence of others getting there first.
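A minimal sketch of that claiming pattern, using Python's stdlib `sqlite3` as a stand-in for a real RDBMS (the schema, column names, and status values are all hypothetical). The conditional UPDATE is what keeps overlapping pollers from ever working on the same item:

```python
import sqlite3
import time

# Hypothetical schema: a status column (not timestamps) identifies work.
# In a production RDBMS the status column would be indexed so the
# "new and unclaimed" query stays cheap.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE work_items (
        id         INTEGER PRIMARY KEY,
        payload    TEXT,
        status     TEXT NOT NULL DEFAULT 'new',   -- new | claimed | done
        claimed_by TEXT,
        claimed_at REAL
    )""")

def claim_next(conn, worker_id):
    """Atomically claim one 'new' item; return its id, or None.

    The UPDATE re-checks status = 'new', so if another poller claimed
    the row between our SELECT and our UPDATE, rowcount is 0 and this
    poller simply comes away empty-handed.
    """
    row = conn.execute(
        "SELECT id FROM work_items WHERE status = 'new' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "UPDATE work_items SET status = 'claimed', claimed_by = ?, claimed_at = ? "
        "WHERE id = ? AND status = 'new'",
        (worker_id, time.time(), row[0]))
    conn.commit()
    return row[0] if cur.rowcount == 1 else None

conn.executemany("INSERT INTO work_items (payload) VALUES (?)",
                 [("a",), ("b",)])
conn.commit()
first = claim_next(conn, "poller-1")
second = claim_next(conn, "poller-2")
```

On PostgreSQL the same idea is usually written with `SELECT ... FOR UPDATE SKIP LOCKED` inside one transaction, but the lifecycle-state logic is the same.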

You also need some kind of lease/timeout policy to recognize orphaned items. I.e. claimed in the DB but not making progress. Workers can and should have exception handling and compensating updates to report failures and put items "back in the queue", but worst case this update may be missing. Some process, or even some human operator, needs to eventually compensate on behalf of the AWOL worker.
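The lease/timeout compensation can be a single periodic UPDATE. A self-contained sketch, again with `sqlite3` and an illustrative schema and lease length:

```python
import sqlite3
import time

# Hypothetical lease policy: a claim older than LEASE_SECONDS with no
# progress report is presumed orphaned and goes back in the queue.
LEASE_SECONDS = 300

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE work_items (
        id         INTEGER PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'new',
        claimed_by TEXT,
        claimed_at REAL
    )""")

def reclaim_orphans(conn, now=None):
    """Compensating update on behalf of AWOL workers: expire stale claims."""
    now = time.time() if now is None else now
    cur = conn.execute(
        "UPDATE work_items "
        "SET status = 'new', claimed_by = NULL, claimed_at = NULL "
        "WHERE status = 'claimed' AND claimed_at < ?",
        (now - LEASE_SECONDS,))
    conn.commit()
    return cur.rowcount  # how many items went back in the queue

# One stale claim and one fresh claim:
conn.execute("INSERT INTO work_items (status, claimed_by, claimed_at) "
             "VALUES ('claimed', 'w1', ?)", (time.time() - 9999,))
conn.execute("INSERT INTO work_items (status, claimed_by, claimed_at) "
             "VALUES ('claimed', 'w2', ?)", (time.time(),))
conn.commit()
reclaimed = reclaim_orphans(conn)
```

A human operator can run the same UPDATE by hand; the point is that the compensation lives in the system of record, not in any one worker.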

In my view, you always need this kind of table-scanning logic, even if using something like AMQP for work dispatch. You get in trouble when you fool yourself into imagining that "exactly once" semantics actually exist. The message-passing layer can opportunistically scale out the workload, but a relational backstop makes sure that the real system of record stays coherent and reflects the business goals. Sometimes you can just run this relational layer as the main work scheduler and skip the whole message-passing build-out.

The problem is that you now have to poll based on an index (though maybe BRIN isn't too bad) and you have to overwrite the row afterwards and update the index. That means you are creating a dead tuple for every row (and one more if you mark it as "completed").

  • Yes, everything is tradeoffs.

    When trying to make good use of RDBMS transactional semantics, I think an important mental shift is to think of there being multiple async processing domains rather than a single magical transaction space. DB transactions are just communication events, not actual business work. This is how the relational DB can become the message broker.

    The agents need to do something akin to 2-phase commit protocols to record their "intent" and their "result" across different business resources. But, for a failure-prone, web-style network of agents, I would not expose actual DB 2-phase commit protocols. Instead, the relational model reifies the 2-phase-like state ambiguity of particular business resources as tuples, and the agents communicate important phases of their work process with simpler state update transactions.

    It's basically the same pattern as with safe use of AMQP, just replacing one queue primitive with another. Both approaches require delayed acknowledgement patterns, so tasks can be routed to an agent but not removed from the system until after the agent reports the work complete. Either approach has a lost- or orphaned-task hazard if naively written to dequeue tasks earlier in the work process. An advantage of the RDBMS-based message broker is that you can also use SQL to supervise all the lifecycle state, or even intervene to clean up after agent failures.

    In this approach, don't scale up a central RDBMS by disabling all its useful features in a mad dash for speed. Instead, think of the network of async agents (human or machine) and the RDBMS message broker(s) as sized for their respective traffic. This agent network and communication workload can often be partitioned to reach scaling goals. E.g. specific business resources might go into different "home" zones with distinct queues and agent pools. Their different lifecycle states do not need to exist under a single, common transaction control.
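The delayed-acknowledgement and SQL-supervision points above can be sketched in a few lines. This is a standalone illustration with `sqlite3` and made-up state names: the task stays `claimed` until the worker reports a terminal state, and one GROUP BY query supervises the whole lifecycle:

```python
import sqlite3

# Illustrative schema; 'claimed' rows are visible until acknowledged,
# so a crashed worker leaves evidence rather than a silently lost task.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE work_items (
        id         INTEGER PRIMARY KEY,
        status     TEXT NOT NULL DEFAULT 'new',  -- new|claimed|done|failed
        claimed_by TEXT
    )""")

def acknowledge(conn, item_id, worker_id, outcome):
    """Terminal update ('done' or 'failed'), only by the claiming worker.

    Because the row is not dequeued until now, a crash before this call
    leaves a visible 'claimed' row for a supervisor (or the lease sweep)
    to compensate for.
    """
    cur = conn.execute(
        "UPDATE work_items SET status = ? "
        "WHERE id = ? AND status = 'claimed' AND claimed_by = ?",
        (outcome, item_id, worker_id))
    conn.commit()
    return cur.rowcount == 1

def supervise(conn):
    """The relational backstop: one query sees every lifecycle state."""
    return dict(conn.execute(
        "SELECT status, COUNT(*) FROM work_items GROUP BY status"))

conn.executemany(
    "INSERT INTO work_items (status, claimed_by) VALUES (?, ?)",
    [("claimed", "w1"), ("claimed", "w2"), ("new", None)])
conn.commit()
ok = acknowledge(conn, 1, "w1", "done")
counts = supervise(conn)
```

The `claimed_by` check in `acknowledge` also means a worker whose lease was revoked and handed to someone else cannot clobber the new owner's state.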