Running Durable Workflows in Postgres Using DBOS

1 month ago (supabase.com)

It is important to note that employing any workflow engine needs careful examination of the benefits vs. drawbacks.

You will need full and complete control over workflow orchestration, restarts, deletes, logging, versioning etc.

Your steps will get stuck, they will error out unexpectedly and you will need complete transparency and operational tools to deal with that.

In many cases you will be better off doing the book-keeping yourself. A process should be defined in terms of your domain data. Typically you have statuses (not the best choice, but I digress) that change over time, so you can report on the whole lifecycle of your shopping cart.

Now, and this is crucial, how you implement state changes - any state change - should be exactly the same for a "workflow" as for a non-workflow! It needs to be. A shipment is either not ready yet or done - this information should not be in a "workflow state".

Let's say you shut down your system and start it back up: do you have a mechanism in place that can "continue where it left off"? If so, you likely don't need a workflow engine.

In our case, on startup, we need to query the system for carts that are waiting for shipments and that are not being dealt with yet. Then fan out tasks for those.

That is robust in the face of changed data. If you employ a workflow engine, any data change has to account for two worlds: your own domain data and any book-keeping that is potentially done in any workflow.
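A minimal sketch of the "continue where it left off" startup query described above, using SQLite in place of the real database. The `carts` table, its `status` and `claimed_by` columns, and the function names are all assumptions made for illustration, not anything from the article.

```python
import sqlite3


def find_pending_carts(conn):
    """Return ids of carts waiting for shipment that nobody is handling yet."""
    rows = conn.execute(
        "SELECT id FROM carts "
        "WHERE status = 'awaiting_shipment' AND claimed_by IS NULL "
        "ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


def recover_on_startup(conn, worker_id, dispatch):
    """Claim every unhandled cart and fan out one task per cart."""
    recovered = []
    for cart_id in find_pending_carts(conn):
        # Conditional UPDATE: the claim only succeeds if no other worker got there first.
        cur = conn.execute(
            "UPDATE carts SET claimed_by = ? WHERE id = ? AND claimed_by IS NULL",
            (worker_id, cart_id),
        )
        conn.commit()
        if cur.rowcount == 1:
            dispatch(cart_id)
            recovered.append(cart_id)
    return recovered


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE carts (id INTEGER PRIMARY KEY, status TEXT, claimed_by TEXT)"
    )
    conn.executemany(
        "INSERT INTO carts VALUES (?, ?, ?)",
        [
            (1, "awaiting_shipment", None),
            (2, "shipped", None),
            (3, "awaiting_shipment", "worker-a"),
            (4, "awaiting_shipment", None),
        ],
    )
    conn.commit()
    dispatched = []
    recovered = recover_on_startup(conn, "worker-b", dispatched.append)
    return recovered, dispatched
```

Because the query reads only domain data, a cart whose status was changed by hand while the system was down is simply picked up (or skipped) on the next start; there is no second set of records to reconcile.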

  • Building a complex stateful system will always be hard, but workflows as an abstraction have two big benefits:

    1. Automatic handling of any transient failures or service interruptions/crashes/restarts. Transient failures in steps are automatically retried, and service interruptions are automatically recovered from. Even if you're doing your own bookkeeping, doing _recovery_ from that bookkeeping isn't easy, and workflows do it automatically.

    2. Built-in observability. Workflows naturally support built-in observability and tooling that isn't easy to build yourself. DBOS integrates with OpenTelemetry, automatically generating complete traces of your workflows and giving you a dashboard to view them from (and because it's all OTel, you can also feed the traces into your existing obs infrastructure). So it's easier to spot a step getting stuck or failing unexpectedly.

    Another advantage of DBOS specifically, versus other workflow engines, is that all its bookkeeping is in Postgres and well-documented (https://docs.dbos.dev/explanations/system-tables), so you have full and complete control over your workflows if you need it.
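To make the first benefit concrete, here is a rough sketch of retrying transient failures and checkpointing step outputs so that recovery skips steps that already finished. This is not DBOS's implementation: the `step_outputs` table, `TransientError`, and `run_step` are hypothetical names, and SQLite stands in for Postgres.

```python
import json
import sqlite3
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are worth retrying."""


def run_step(conn, workflow_id, step_name, fn, max_attempts=3, backoff_s=0.0):
    """Return the recorded output of a step, running `fn` only if none exists.

    Note: a crash between `fn()` returning and the INSERT committing would
    re-run `fn` on recovery, so `fn` is executed at least once, not exactly once.
    """
    row = conn.execute(
        "SELECT output FROM step_outputs WHERE workflow_id = ? AND step_name = ?",
        (workflow_id, step_name),
    ).fetchone()
    if row is not None:
        return json.loads(row[0])  # step already completed: skip it

    for attempt in range(1, max_attempts + 1):
        try:
            result = fn()
            break
        except TransientError:
            if attempt == max_attempts:
                raise
            time.sleep(backoff_s * 2 ** (attempt - 1))  # exponential backoff

    conn.execute(
        "INSERT INTO step_outputs (workflow_id, step_name, output) VALUES (?, ?, ?)",
        (workflow_id, step_name, json.dumps(result)),
    )
    conn.commit()
    return result


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE step_outputs (workflow_id TEXT, step_name TEXT, output TEXT, "
        "PRIMARY KEY (workflow_id, step_name))"
    )
    calls = {"n": 0}

    def flaky_payment():
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientError("timeout")
        return {"charged": True}

    first = run_step(conn, "wf-1", "validate_payment", flaky_payment)
    # Simulated recovery: the same step is requested again after a restart.
    second = run_step(conn, "wf-1", "validate_payment", flaky_payment)
    return first, second, calls["n"]
```

In the demo the payment function fails twice, succeeds on the third attempt, and is not called again when the step is replayed.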

  • This a thousand times over.

    Workflow engines are often complex beasts and rasterizing your business logic into them is nearly always a mistake unless your logic is very simple, in which case why do you need a workflow engine?

    They certainly do often make sense when the alternative is building your own generic workflow engine.

    Usually I see folks grasp for a workflow system because they don't want to think about or really understand their business logic. They're looking for a silver bullet.

  • Grounding to reality is key. There’s often a tendency to trust the map over the terrain. This architecture seems to promote relying on the database to always have a perfect representation of state. But consider a scenario where a company restores a 12-hour-old backup, losing hours of state. States that can’t be externally revalidated in such cases are a serious concern.

    • It's an excellent point. In such a scenario, if you're bound to a rigid workflow system, you will probably have a hard time recreating all the intermediary steps required to get the system back into a consistent state with the external world.

      Idempotency is key and the choice of the idempotency key as well ;)

> DBOS has a special @DBOS.Transaction decorator. This runs the entire step inside a Postgres transaction. This guarantees exactly-once execution for databases transactional steps.

I stopped here. I know, authors really want to chase the exactly-once dragon, but this won't scale. If the step takes a long time, it'll keep the transaction open for that whole time. The primary has to do the bookkeeping for that. That state has to be replicated. That will also affect MVCC further on. As your scale grows, you'll see disk usage growing and eventually swapping, replica lag, AND vacuum halting for surges. I hope your on-call engineers have a steady supply of coffee.

Disclaimer: I'm a co-founder of Hatchet (https://github.com/hatchet-dev/hatchet), which is a Postgres-backed task queue that supports durable execution.

> Because a step transition is just a Postgres write (~1ms) versus an async dispatch from an external orchestrator (~100ms), it means DBOS is 25x faster than AWS Step Functions

Durable execution engines deployed as an external orchestrator will always be slower than direct DB writes, but the 1ms delay versus ~100ms doesn't seem inherent to the orchestrator being external. In the case of Hatchet, pushing work takes ~15ms and invoking the work takes ~1ms if deployed in the same VPC, and 90% of that execution time is on the database. In the best case, the external orchestrator should take 2x as long to write a step transition (round-trip network call to the orchestrator + database write), so an ideal external orchestrator would be ~2ms of latency here.

There are also some tradeoffs to a library-only mode that aren't discussed. How would work that requires global coordination between workers behave in this model? Let's say, for example, a global rate limit -- you'd ideally want to avoid contention on rate limit rows, assuming they're stored in Postgres, but each worker attempting to acquire a rate limit simultaneously would slow down start time significantly (and place additional load on the DB). Whereas with a single external orchestrator (or leader election), you can significantly increase throughput by acquiring rate limits as part of a push-based assignment process.

The same problem of coordination arises if many workers are competing for the same work -- for example if a machine crashes while doing work, as described in the article. I'm assuming there's some kind of polling happening which uses FOR UPDATE SKIP LOCKED, which concerns me as you start to scale up the number of workers.
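For readers unfamiliar with the pattern, this is roughly what polling with FOR UPDATE SKIP LOCKED tends to look like. It is a sketch of the general technique, not a confirmed description of what DBOS does: the `tasks` table and its columns are hypothetical, and the `%s` placeholders assume a psycopg-style DB-API connection to Postgres.

```python
# Each worker locks a batch of queued rows, skipping rows other workers
# have already locked, and marks the batch as running in the same statement.
CLAIM_SQL = """
WITH next_tasks AS (
    SELECT id
    FROM tasks
    WHERE status = 'queued'
    ORDER BY id
    LIMIT %s
    FOR UPDATE SKIP LOCKED
)
UPDATE tasks
SET status = 'running', worker_id = %s
FROM next_tasks
WHERE tasks.id = next_tasks.id
RETURNING tasks.id
"""


def claim_tasks(conn, worker_id, batch_size=10):
    """Claim up to `batch_size` queued tasks and return their ids."""
    cur = conn.cursor()
    try:
        cur.execute(CLAIM_SQL, (batch_size, worker_id))
        ids = [row[0] for row in cur.fetchall()]
    finally:
        cur.close()
    conn.commit()  # releases the row locks; the rows are now marked 'running'
    return ids
```

Workers never block on each other's rows, but every poll is still a query against the same table, which is where the concern about scaling the number of workers comes from.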

  • You have a great point that scaling a library model requires careful design, but we think that's worth it to provide a superior developer experience.

    For example, the DBOS library provides an API to instruct a worker to recover specific tasks. In our hosted platform (DBOS Cloud), when a worker crashes, a central server uses this API to tell the workers what to recover. We like this design because it provides the best of both worlds--the coordination decision is centralized, so it's performant/scalable, but the actual recovery and workflow execution is done in-process, so DBOS doesn't turn your program into a distributed system the way Step Functions/Temporal do (I haven't used Hatchet).

    • Definitely agree that the dev experience is better with a library, particularly for lightweight and low-volume tasks (Hatchet is also moving in the same direction, we'll be releasing library-only mode this month). And I really like the transactional safety built into DBOS!

      My concern is as you start to see higher volume, more workers, or load patterns that don't correspond to your primary API. At that point, a dedicated database and service to orchestrate tasks starts to become more necessary.

  • Great points. Besides performance, centralized coordination with a distributed data plane is better for operability of schedulers as well. Some examples: being able to roll out new features in the scheduler, tracing scheduling behavior and decisions, deploying configuration changes.

    Even with a centralized scheduler it should be possible to create a DevEx that makes use of decorators to author workflows easily.

    We are doing that with Indexify (https://github.com/tensorlakeai/indexify) for authoring data intensive workflows to process unstructured data (documents, videos, etc) - it’s like Spark but uses Python instead of Scala/SQL/UDFs. Indexify’s scheduler is centralized and it uses RocksDB under the hood for persistence, and long term we are moving to a hybrid storage system - S3 for less frequently updated data, and SSD for read cache and frequently updated data (ongoing tasks).

    The scheduler’s latency for scheduling new tasks is consistently under 800 microseconds on SSDs.

    This is how schedulers with a proven track record of working in production have traditionally been designed - Borg, Hashicorp Nomad, etc. There are many ways a centralized scheduler can scale out beyond a single machine - scheduling in parallel by sharding jobs or node pools, and then linearizing and deduplicating conflicts during writes, is one such approach.

    Love DBOS and Hatchet! Cheering for you @jedberg and @abelanger :-)

    Disclaimer - I am the founder of Tensorlake, and worked on Nomad and Apache Mesos in the past.

> # Exactly-once execution

> DBOS has a special @DBOS.Transaction decorator. This runs the entire step inside a Postgres transaction. This guarantees exactly-once execution for databases transactional steps.

Totally awesome, great work, just a small note... IME a lot of (most?) pg deployments have synchronous replication turned off because it is very tricky to get it to perform well[1]. If you have it turned off, pg could journal the step, formally acknowledge it, and then (as I understand DBOS) totally lose that journal when the primary fails, causing you to re-run the step.

When I was on call for pg last, failover with some data loss happened to me twice. So it does happen. I think this is worth noting because if you plan for this to be a hard requirement, (unless I'm mistaken) you need to set up sync replication or you need to plan for this to possibly fail.

Lastly, note that the pg docs[1] have this to say about sync replication:

> Synchronous replication usually requires carefully planned and placed standby servers to ensure applications perform acceptably. Waiting doesn't utilize system resources, but transaction locks continue to be held until the transfer is confirmed. As a result, incautious use of synchronous replication will reduce performance for database applications because of increased response times and higher contention.

I see the DBOS author around here somewhere so if the state of the art for DBOS has changed please do let me know and I'll correct the comment.

[1] https://www.postgresql.org/docs/current/warm-standby.html#SY...

  • Yeah, that's totally fair--DBOS is totally built on Postgres, so it can't provide stronger durability guarantees than your Postgres does. If Postgres loses data, then DBOS can lose data too. There's no way around that if you're using Postgres for data storage, no matter how you architect the system.

    • That’s my intuition as well, but it does raise a question in my mind.

      We have storage solutions that are far more robust than the individual hard drives that they’re built upon.

      One example that comes to mind is Microsoft Exchange databases. Traditionally these were run on servers that had redundant storage (RAID), and at some point Microsoft said you could run it without RAID, and let their Database Availability Groups handle the redundancy.

      With Postgres that would look like, say, during an HTTP request, you write the change to multiple Postgres instances, before acknowledging the update to the requesting client.

This is very cool. How does this compare to tools like Celery and Dagster?

Like we might use Celery when we need to run specific tasks on specific workers (e.g. accessing some database that sits inside a corporate network, not accessible from the cloud). It seems like we can do something like that using DBOS queues or events, but is there a concept of running multiple workers in different locations?

Compared to Dagster, a DBOS workflow seems like a de facto DAG, running the steps in the right order, and being resumable if a step fails, and the difference here would be that with Dagster the steps are more granular, in the sense that we can re-run a specific step in isolation? In other words, a DBOS workflow will retry failed steps, but can we submit a request to “only re-run step 2”? This comes up often in working with ETL-style tasks, where steps 1 and 2 complete, but we need to resubmit steps 3 and 4 of a pipeline.

Dagster also provides nice visual feedback, where even a non-technical user can see that step 3 failed, and right-click it to materialize it. Maybe I need to play with the OpenTelemetry piece to see how that compares.

Celery and Dagster have their own drawbacks (heavier, complexity of more infrastructure, learning curve), so just trying to see where the edges are, and how far we could take a tool like DBOS. From an initial look, it seems like it could address a lot of these scenarios with less complexity.

  • Compared to Celery, DBOS provides a similar queuing abstraction (docs: https://docs.dbos.dev/python/tutorials/queue-tutorial). DBOS tries to spread out queued tasks among all workers (on DBOS Cloud, the workers autoscale with load), but there isn't yet support for running specific tasks on specific workers. Would love to learn more about that use case!

    Compared to Dagster (or Prefect or Airflow), exactly like you said, a DBOS workflow is basically a more flexible and lightweight DAG. The visualization piece is something we're actively developing leveraging OpenTelemetry--look for some cool new viz features by the end of the month! I'm interested in the "retry step 2 only" or "retry from step 2" use cases--would love to learn more about them--we don't currently support them but easily could (because it's all just Postgres tables under the hood). If you're building in this space please reach out, would love to chat!

Does DBOS offer a way to get messages in or data out of running workflows? (Similar to signals/queries in Temporal) Interested in long-running workflows, and this particular area seemed to be lacking last time I looked into it.

I don’t want to just sleep; I want a workflow to be able to respond to an event.

Could someone clarify how it can achieve exactly-once processing with an idempotency key?

Using the example they provided:

1. Validate payment

2. Check inventory

3. Ship order

4. Notify customer

I'm curious about the case when one of the operations times out. The workflow engine then either needs to time out, or it may even crash before receiving a response.

In this scenario, the only option for the workflow is to retry with the same idempotency key it used, but this may re-execute the failed operation, which may actually have succeeded in the prior run because the workflow did not receive the response. The succeeded operations would be skipped because the workflow has a completion record for the same idempotency key. Is that correct?

  • > The succeeded operations would be skipped because the workflow has a completion record for the same idempotency key. Is that correct?

    This sounds about right. But you need to make sure the service being called in that step is indeed idempotent, and will return the same response it couldn't deliver in time earlier.
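A small sketch of what "the service is indeed idempotent" means in the timeout case. `ShippingService` and `step_key` are hypothetical names; the point is only that the callee stores its response under the idempotency key and replays it on a retry instead of doing the work twice.

```python
class ShippingService:
    """Hypothetical downstream service that deduplicates by idempotency key."""

    def __init__(self):
        self._responses = {}
        self.shipments_created = 0

    def ship_order(self, idempotency_key, order_id):
        # A retried request gets the stored response; no second shipment is created.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.shipments_created += 1
        response = {
            "order_id": order_id,
            "tracking": f"TRK-{self.shipments_created:04d}",
        }
        self._responses[idempotency_key] = response
        return response


def step_key(workflow_id, step_name):
    """Derive a key that stays the same across retries of one workflow step."""
    return f"{workflow_id}:{step_name}"


def demo():
    service = ShippingService()
    key = step_key("order-42", "ship_order")
    first = service.ship_order(key, 42)  # imagine this response was lost to a timeout
    retry = service.ship_order(key, 42)  # the workflow retries after recovery
    return first, retry, service.shipments_created
```

The key has to be derived from something stable (workflow id plus step), not from anything generated per attempt, otherwise the retry looks like a new request.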

Temporal can use postgres as a backend. What makes this product different?

  • DBOS Co-founder here. Great question! The big difference is that DBOS runs as a library inside your program, whereas the Temporal architecture is an external workflow server managing tasks on distributed workers. The advantages of DBOS are:

    1. Simpler architecturally. Just a normal Python process versus a workflow server coordinating multiple workers.

    2. Easier dev/debugging. A Temporal program is essentially distributed microservices, with control flow split between the workflow server and workers. That complicates any testing/debugging because you have to work through multiple components. Whereas DBOS is just your process.

    3. Performance. A state transition in DBOS requires only a database write (~1 ms) whereas in Temporal it requires an async dispatch from the workflow server (tens-hundreds of ms).

    • Having my app with direct write access to the scheduling backend could be... problematic from a separation-of-concerns standpoint. We'll have to deal with connection pooling etc and we typically prefer to separate DB logic (interaction with postgres) from application logic.

      Ignoring my setup, it's not clear to me from the website whether I can run this as a library. It seems that I need to use your SaaS backend to get started? Or did I misunderstand the flow from my quick look?

    • How does workflow orchestration work across multiple running processes of my application? Say I have 3 processes, and process-1 that has initiated a workflow crashes. How and where does this workflow resume?

Maybe I'm not seeing it, but why do none of these "postgres durable packages" ever integrate with existing transactions? That's the biggest flaw of Temporal, and it seems so obvious that they could hook into transactions to ensure that starting a workflow is transactionally secure with other operations, so you don't have to manage idempotency yourself (or worry about handling "already started" errors and holding up DB connections bc you launched in transaction).

  • (DBOS co-founder here) DBOS does exactly this! From the post:

    DBOS has a special @DBOS.Transaction decorator. This runs the entire step inside a Postgres transaction. This guarantees exactly-once execution for databases transactional steps.
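As a rough illustration of why running the whole step inside one transaction gives exactly-once behavior for database writes: the step's write and its completion record commit together or not at all. This is a sketch of the idea only, with SQLite standing in for Postgres; `completed_steps`, `transactional_step`, and the inventory example are hypothetical and are not the DBOS API.

```python
import sqlite3


def transactional_step(conn, workflow_id, step_name, apply_write):
    """Run `apply_write(conn)` and record completion in one transaction.

    Returns True if the step ran, False if it had already been recorded.
    """
    with conn:  # commits on success, rolls back if an exception escapes
        done = conn.execute(
            "SELECT 1 FROM completed_steps WHERE workflow_id = ? AND step_name = ?",
            (workflow_id, step_name),
        ).fetchone()
        if done:
            return False
        apply_write(conn)
        conn.execute(
            "INSERT INTO completed_steps (workflow_id, step_name) VALUES (?, ?)",
            (workflow_id, step_name),
        )
    return True


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE inventory (item TEXT PRIMARY KEY, qty INTEGER)")
    conn.execute(
        "CREATE TABLE completed_steps (workflow_id TEXT, step_name TEXT, "
        "PRIMARY KEY (workflow_id, step_name))"
    )
    conn.execute("INSERT INTO inventory VALUES ('widget', 10)")
    conn.commit()

    def reserve(c):
        c.execute("UPDATE inventory SET qty = qty - 1 WHERE item = 'widget'")

    def reserve_then_crash(c):
        reserve(c)
        raise RuntimeError("simulated crash before commit")

    try:
        transactional_step(conn, "wf-1", "reserve", reserve_then_crash)
    except RuntimeError:
        pass  # the UPDATE was rolled back together with the missing record

    ran_first = transactional_step(conn, "wf-1", "reserve", reserve)
    ran_again = transactional_step(conn, "wf-1", "reserve", reserve)
    qty = conn.execute("SELECT qty FROM inventory WHERE item = 'widget'").fetchone()[0]
    return ran_first, ran_again, qty
```

The guarantee only covers writes made through the same transaction; a call to an external service inside the step would still need its own idempotency handling.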

This appears to be an open source MIT library that anyone can just use. The docs say it can be run locally against PG, but then they say to use DBOS Cloud in production without mentioning that it can also be run in production against any PG database. Just checking that I am not missing something. I would like to support this project while using it but adding a vendor is not always possible (for example I don’t see SOC2 mentioned).

  • You're correct. You can just use the open source library if you want to. You will get the durability and some observability, but will lose out on the efficiency, reliability, and additional observability that the commercial platform provides. And of course you have to manage all the infrastructure yourself.

    FWIW DBOS is currently in the process of getting HIPAA and SOC compliance.

I'm having trouble in my head understanding how these workflows are actually executed in the open-source version. Are there workers somewhere? How do we provision those and scale those up and down?

  • You run it as a daemon. The example in the readme is a fastapi app, for example. You would scale them the same way as any other long running app -- either behind a load balancer like haproxy or nginx or some other app runner.

How is it different from Conductor? https://github.com/conductor-oss/conductor

  • I haven't used Conductor, but quickly looking at the README, Conductor lets you define JSON workflows to orchestrate existing microservices. By contrast, DBOS helps you build highly reliable applications--it runs as a library inside your program and helps you build durable workflows that run alongside your code and are written in the same language.

Putting your data and workflows close to the database sounds good on paper, but often the performance "gains" are not significant when considering the dangers of vendor or tech stack lock-in. TLDR; try not to do everything on one platform.

Bit disappointed, looked for .net core support but no.

Languages that are supported: Typescript and Python. I know programming languages as a topic is as inflammable as religion, but boy do I feel sad that these two are considered the most important these days. For server applications.

Anyways, can people here recommend alternatives with bindings for .net core?

  • Interesting, I don't think we've ever gotten a request for .net support. Our two most popular asks after PY and TS are Golang and Java, which basically tracks with the StackOverflow language survey:

    https://survey.stackoverflow.co/2024/technology#most-popular...

    • That I was disappointed does not mean you made the wrong choice.

      .net core is a silent workhorse in midlevel to larger companies. Java is strong in the large enterprise (banking, gov, multinationals) scene.

      Your choice might be rational though:

        1. Both TS and PY have a relatively large following of beginners. I have a hunch that those people are less likely to explore other ecosystems, and for them a tool like yours might be really appealing because it allows them to stay there.

        2. The bar to be competitive in .net and Java is on a higher level.

      OTOH: Popularity on StackOverflow might be misleading.

        1. Python is a popular choice for "gluing" tasks, think data mangling and LLMs. Those people would be less inclined to build applications; that is not their job.

        2. TS is a heroic effort to fix JS, and that being the only choice for frontend means that this demographic is big. Some might want to bring this frontend fix to the server side to build server applications, but I don't know how big that demographic is. It might grow due to the push for SSR.

        3. Both are beginner languages, resulting in more StackOverflow engagement.