Absurd Workflows: Durable Execution with Just Postgres

3 months ago (lucumr.pocoo.org)

I've been keeping an eye on this space for a while as it matures a bit further. There have been a number of startups popping up around this - apart from Temporal and DBOS, Hatchet.run looked interesting.

I've been using BullMQ for a while with distributed workers across K8s and have hacked together what I need, but a lightweight DAG of some sort on Postgres would be great.

I took a brief look at your docs. What would you say is the main difference between yours and some of the other options? Just the simplicity of it being a single SQL file and an SDK wrapper? Sorry if the docs answer this already - trying to take a quick look between work.

  • > I took a brief look at your docs. What would you say is the main difference between yours and some of the other options? Just the simplicity of it being a single SQL file and an SDK wrapper? Sorry if the docs answer this already - trying to take a quick look between work.

    It's really just trying to be as simple as possible. I was motivated to do the simplest thing I could come up with, after not finding the other solutions to be something I wanted to build on.

    I'm sure they are great, but I want to leave the window open to having people self-host what we are building / enable us to deploy a cellular architecture later, and thus I want to stick to a manageable number of services until I no longer can. Postgres is a known quantity in my stack, and the only Postgres-only solution was DBOS, which unfortunately did not look ready for prime time yet when I tried it. That said, I noticed that DBOS is making quite a bit of progress, so I'm somewhat confident that it will eventually get there.

    • Could you provide some more specifics as to why DBOS isn’t “ready for prime time”? Would love to know what you think is missing!

      FWIW DBOS is already in production at multiple Fortune 500 companies.


Wow. Everything old is new again. I built a business state machine for a bespoke application using Oracle 8i and its stateful queues back in 2005. I had re-architected a batch-driven application (which couldn't scale temporally, i.e. we had a bunch of CPU sitting near idle a lot of the time) and turned it into an event-driven solution. CPU usage became almost a horizontal line, saving us lots of money as we scaled (for the record, "scale" for this solution was writing 5M records a day into a partitioned table where we kept 13 months of data online, and then billed on it). Durable execution was just one of the many benefits we got out of this architecture. Love it.

  • It's quite funny in a way for me, because even back in the Cadence days I thought it was the hottest shit ever, but it was just too complex to run for a small company, and Cadence was not the first (SWF and others came before). It felt like unless you had really large workflows you would ignore these systems entirely. And now, due to the problems that agents pose, we're all in need of that.

    I'm happy it's slowly moving towards mass appeal, but I hope we find some simple solutions like Absurd too.

Somebody said this the other day on HN, but we really are living in the golden age of Postgres.

Armin, I managed to review absurd.sql and the migrations. I am so impressed that I am rewriting the state management of my workflow engine with Absurd. Just wanted to thank you for sharing it with us. I'll keep you posted on the outcome.

This is pretty great! The main things you need for durable execution are 1) retries (Absurd does this) and 2) idempotency (Absurd does this via steps, but it would be better handled by the APIs themselves being idempotent, then not using steps; Absurd would certainly _help_ mitigate some APIs not being idempotent, though not completely).

  • > idempotency (Absurd does this via steps, but it would be better handled by the APIs themselves being idempotent, then not using steps

    That is very hard to do with agents, which are just all probabilistic. However, if you do have an API that is either idempotent or uses idempotency keys, you can derive an idempotency key from the task: const idempotencyKey = `${ctx.taskID}:payment`;

    That said: many APIs that support the idempotency-key header only support replays within an hour to 24 hours, so for long-running workflows you need to capture the state output anyway.
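
    A minimal sketch of combining those two ideas, assuming an Absurd-style ctx.step(name, fn) helper that persists the step's return value (paymentApi/createCharge are made-up stand-ins for any client that accepts idempotency keys):

      // Inside a task handler, where ctx is the task context.
      // Derive a stable idempotency key from the task, so retries of the
      // same task reuse the same key instead of minting a new one.
      const idempotencyKey = `${ctx.taskID}:payment`;

      // Wrap the call in a step so the response is checkpointed. Even if
      // the provider's idempotency window (often 1-24 hours) has expired
      // by the time the task retries, the stored step output is replayed
      // instead of hitting the API again.
      const charge = await ctx.step("charge-customer", () =>
        paymentApi.createCharge(
          { amount: 4200, currency: "usd" },
          { idempotencyKey },
        ),
      );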

    • I was not thinking of the agent case specifically. But yes, you have to make the APIs idempotent, either with these step checkpoints or by wrapping the underlying API. It's not hard to build a Postgres-transaction-based idempotency wrapper (a sketch follows at the end of this comment), and then you can have a much longer idempotency TTL.

      > so for long-running workflows you need to capture the state output anyway.

      That would be a _very_ long-running workflow. Probably worth breaking it up into different subtasks or, I guess as Absurd does it, step checkpoints.
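
      A minimal sketch of such a wrapper, assuming a made-up idempotency_keys(key text primary key, response jsonb) table and node-postgres:

        import { Pool } from "pg";

        const pool = new Pool();

        // First caller for a key runs fn and stores the result; replays
        // (and concurrent duplicates, which block on the row lock until
        // COMMIT) get the cached response instead of re-running the call.
        async function idempotent<T>(key: string, fn: () => Promise<T>): Promise<T> {
          const client = await pool.connect();
          try {
            await client.query("BEGIN");
            // Upsert so RETURNING also yields the row on conflict; the
            // row lock serializes concurrent callers with the same key.
            const { rows } = await client.query(
              `INSERT INTO idempotency_keys (key) VALUES ($1)
               ON CONFLICT (key) DO UPDATE SET key = EXCLUDED.key
               RETURNING response`,
              [key],
            );
            if (rows[0].response !== null) {
              await client.query("COMMIT");
              return rows[0].response as T; // replay: cached result
            }
            const result = await fn(); // note: transaction stays open here
            await client.query(
              "UPDATE idempotency_keys SET response = $2 WHERE key = $1",
              [key, JSON.stringify(result)],
            );
            await client.query("COMMIT");
            return result;
          } catch (err) {
            await client.query("ROLLBACK");
            throw err;
          } finally {
            client.release();
          }
        }

      Rows can then be pruned on whatever TTL fits, which is where the much longer window comes from.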

Does anyone have a new approach to this kind of transactional workflow? I've heard that Saga patterns also define invertibility (compensating actions), but I want a more general framework that does all of this in one.

Also, I noticed that durable execution actually has a lot to do with continuation-passing style. Is my intuition correct?

In what sense are these durable, given that they restart from the beginning if the server process crashes?

  • Because the finished steps have their state stored, so they don't repeat.
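
    A minimal sketch of that pattern, against a made-up checkpoints(task_id, step, result) table (not Absurd's actual internals):

      import { Pool } from "pg";

      const db = new Pool();

      // Run fn only if no checkpoint exists for (taskId, name); otherwise
      // return the stored result. After a crash the task re-runs from the
      // top, but completed steps are skipped via their saved state.
      async function step<T>(taskId: string, name: string, fn: () => Promise<T>): Promise<T> {
        const saved = await db.query(
          "SELECT result FROM checkpoints WHERE task_id = $1 AND step = $2",
          [taskId, name],
        );
        if (saved.rows.length > 0) return saved.rows[0].result as T;
        const result = await fn();
        await db.query(
          "INSERT INTO checkpoints (task_id, step, result) VALUES ($1, $2, $3)",
          [taskId, name, JSON.stringify(result)],
        );
        return result;
      }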

    • Except they don't? Look at the example that runs an AI conversation. It always starts from the system prompt. At no point does it load an old conversation from a database.

Restate was built for agents before agents were cool.

Surprisingly, it hasn't taken off yet, when agents are all we are looking for now.

Reminder that Postgres does not have a monopoly on SKIP LOCKED.

You can do that in Oracle, SQL Server, and MySQL too.

In fact, you might be able to replicate what Armin is doing with SQLite, because it too works just fine as a queue, though not via SKIP LOCKED.
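
For reference, a small sketch of the SKIP LOCKED claim against a made-up jobs(id, payload, state) table, using node-postgres:

  import { Pool } from "pg";

  const pool = new Pool();

  // Claim one pending job; SKIP LOCKED makes competing workers skip
  // rows already locked by another worker instead of blocking on them.
  async function claimJob() {
    const client = await pool.connect();
    try {
      await client.query("BEGIN");
      const { rows } = await client.query(
        `SELECT id, payload FROM jobs
         WHERE state = 'pending'
         ORDER BY id
         LIMIT 1
         FOR UPDATE SKIP LOCKED`,
      );
      if (rows.length === 0) {
        await client.query("COMMIT");
        return null; // nothing pending (or everything already claimed)
      }
      await client.query(
        "UPDATE jobs SET state = 'running' WHERE id = $1",
        [rows[0].id],
      );
      await client.query("COMMIT");
      return rows[0];
    } catch (err) {
      await client.query("ROLLBACK");
      throw err;
    } finally {
      client.release();
    }
  }

In SQLite (3.35+), a single UPDATE ... RETURNING that flips the state of one pending row should achieve the same claim without SKIP LOCKED, since writes are serialized anyway.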

Other question: why implement your own framework rather than using an existing agent framework like Claude + MCP, or OpenAI + tool calling? Is it because you're using your own LLMs, or just because you wanted more control over retries, etc.?

  • There are not that many agent frameworks around at the moment. If you want to be provider-independent, you most likely use either Pydantic AI or the Vercel AI SDK, would be my guess. Neither has a built-in solution for durable execution, so you end up driving the loop yourself. So it's not that I don't use these SDKs; it's just that I need to drive the loop myself.
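
    A rough sketch of what "driving the loop yourself" means here; the llm client, callTool, and message shapes are made-up, provider-agnostic stand-ins:

      type ToolCall = { id: string; name: string; args: unknown };
      type Message = {
        role: "system" | "user" | "assistant" | "tool";
        content: string;
        toolCalls?: ToolCall[];
        toolCallId?: string;
      };

      // Assumed stand-ins for whatever SDK is actually in use.
      declare const llm: { complete(req: { messages: Message[] }): Promise<Message> };
      declare function callTool(name: string, args: unknown): Promise<string>;

      // The hand-driven agent loop: call the model, run any requested
      // tools, append their outputs, and go around again. A durable
      // execution layer would checkpoint each model call and tool call
      // so a crash resumes mid-conversation instead of starting over.
      async function runAgent(prompt: string): Promise<string> {
        const messages: Message[] = [{ role: "user", content: prompt }];
        while (true) {
          const reply = await llm.complete({ messages });
          messages.push(reply);
          if (!reply.toolCalls?.length) return reply.content; // model is done
          for (const call of reply.toolCalls) {
            const output = await callTool(call.name, call.args);
            messages.push({ role: "tool", toolCallId: call.id, content: output });
          }
        }
      }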

    • Okay, very clear! I was saying that because your post's example is just a kind of basic "tool use" example, which is already implemented by MCP / OpenAI tool use, but obviously I guess your code can be suited to more complex scenarios.

      Two small questions:

      1. In your README you give this example for durable execution:

      const shipment = await ctx.awaitEvent(`shipment.packed:${params.orderId}`);

      I was just wondering, how does it work? I was more expecting a generator with a `yield` statement to run "long-running tasks" in the background... otherwise, is the Node runtime keeping the thread running with the await? Doesn't this "pile up"?

      2. Would your framework be suited to long-running jobs with multiple steps? I sometimes have big jobs running in the background on all of my IoT devices, e.g.:

      for (const d of devices) doSomeWork(d);

      and I'd like to run the big outer loop each hour (say), but only if the previous one is complete (e.g. max number of workers per task = 1), and have the inner loop be "steps" that can be cached but retried if they fail.

      would your framework be suited for that? Or is that just a simpler use case for pgmq, where I don't need the Absurd framework?
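
      A rough sketch of how that shape could look with step checkpoints, assuming an Absurd-style ctx.step helper (Ctx and doSomeWork are stand-ins; the run-at-most-one-instance constraint is left to scheduler config or a lock):

        type Ctx = { step<T>(name: string, fn: () => Promise<T>): Promise<T> };
        declare function doSomeWork(device: string): Promise<void>;

        // Hourly outer task: each device becomes a named step, so a retry
        // of the task replays finished devices from their checkpoints and
        // re-runs only the ones that failed.
        async function syncAllDevices(ctx: Ctx, devices: string[]) {
          for (const d of devices) {
            await ctx.step(`sync:${d}`, () => doSomeWork(d));
          }
        }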
