Comment by KraftyOne

10 months ago

You have a great point that scaling a library model requires careful design, but we think that's worth it to provide a superior developer experience.

For example, the DBOS library provides an API to instruct a worker to recover specific tasks. In our hosted platform (DBOS Cloud), when a worker crashes, a central server uses this API to tell the workers what to recover. We like this design because it provides the best of both worlds--the coordination decision is centralized, so it's performant/scalable, but the actual recovery and workflow execution is done in-process, so DBOS doesn't turn your program into a distributed system the way Step Functions/Temporal do (I haven't used Hatchet).
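To illustrate the shape of that design (not the actual DBOS API), here is a minimal sketch in which every name is hypothetical: `Worker.recover` stands in for the library call, and `Coordinator.handle_crash` stands in for the central server. The coordinator only decides who recovers what; the workflows themselves would still run inside the worker processes.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    alive: bool = True
    pending: list = field(default_factory=list)  # workflow ids this worker owns

    def recover(self, workflow_ids):
        """Hypothetical stand-in for the library's recovery call:
        take ownership of these workflows and resume them in-process."""
        self.pending.extend(workflow_ids)


class Coordinator:
    """Hypothetical central server: it makes the assignment decision only."""

    def __init__(self, workers):
        self.workers = workers

    def handle_crash(self, crashed):
        live = [w for w in self.workers if w.alive and w is not crashed]
        if not live:
            raise RuntimeError("no live workers available for recovery")
        crashed.alive = False
        orphaned, crashed.pending = crashed.pending, []
        # Spread the orphaned workflows round-robin over the live workers.
        assignments = {w.name: [] for w in live}
        for i, workflow_id in enumerate(orphaned):
            assignments[live[i % len(live)].name].append(workflow_id)
        for w in live:
            if assignments[w.name]:
                w.recover(assignments[w.name])
        return assignments
```

A real coordinator would also need crash detection (heartbeats or similar) and durable workflow state, both of which are left out here.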

Definitely agree that the dev experience is better with a library, particularly for lightweight and low-volume tasks (Hatchet is also moving in the same direction, we'll be releasing library-only mode this month). And I really like the transactional safety built into DBOS!

My concern is what happens as you start to see higher volume, more workers, or load patterns that don't correspond to your primary API. At that point, a dedicated database and service to orchestrate tasks starts to become more necessary.

  • I disagree. Once you get to the scale that breaks a library based system like DBOS, you need to move away from a central coordinator. At that point your software has to adjust and you have to build your application to work without those types of centralized systems.

    Centralized systems don't scale to the biggest problems no matter how clever you are.

    And usually the best way to scale is to decentralize. Make idempotent updates that can be done more locally and use eventual consistency to keep things aligned.
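    As a rough sketch of what an idempotent update can look like, assume each operation carries a unique id and the store remembers which ids it has already applied. The table names, the `apply_once` helper, and the use of SQLite are all illustrative assumptions, not anything from DBOS or Hatchet.

```python
import sqlite3


def apply_once(conn, op_id, account, delta):
    """Apply a balance change at most once per op_id; return True if applied."""
    with conn:  # the dedupe record and the update commit (or roll back) together
        cur = conn.execute(
            "INSERT OR IGNORE INTO applied_ops (op_id) VALUES (?)", (op_id,)
        )
        if cur.rowcount == 0:
            return False  # this op_id was already applied; a retry is a no-op
        conn.execute(
            "UPDATE balances SET amount = amount + ? WHERE account = ?",
            (delta, account),
        )
        return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE applied_ops (op_id TEXT PRIMARY KEY)")
conn.execute(
    "CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER NOT NULL)"
)
conn.execute("INSERT INTO balances VALUES ('alice', 100)")
conn.commit()
```

    Because a retried or duplicated delivery of the same `op_id` changes nothing, senders can retry freely without a coordinator deciding whether the first attempt landed.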

    • You can get around some of this coordination by doing more work locally, in-process -- but you're risking the availability of your primary API. The work that you're doing as a background job or workflow may:

      1. Be unbounded -- 1 API call may correspond to thousands or hundreds of thousands of reads or writes

      2. Be resource intensive, either on the database side, by eating up connections which are needed by your API, or in-process, by blocking your event loop, particularly in Python and TypeScript

      > Once you get to the scale that breaks a library based system like DBOS, you need to move away from a central coordinator

      There are different levels of coordination here. At some point, workers are always going to have to coordinate "who is working on what" -- whether that's by placing locks and using status updates in the database, using a traditional queue, or using a central server like you described with DBOS Cloud. The same goes for a dedicated orchestrator -- those can also partition by "who is working on which workers." In other words, a dedicated service can also be decentralized; the two don't seem mutually exclusive.

      To make this more concrete: say you have hundreds of workers on a library-based system like DBOS, each with its own connections and transactions to read/write data. In a system like Temporal or Hatchet, by contrast, you might have 10s of nodes to support 1000s of workers, with the ability to enqueue and dequeue work in bulk, which increases throughput and reduces DB load rather significantly. You'd lose (most of) the benefits of bulk writes in a library-based system.
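      A small sketch of the bulk-dequeue point, using an in-memory SQLite table as a stand-in for the task store. The schema and the `claim`/`drain` helpers are hypothetical, and the numbers only count transactions; they are not a benchmark of any real system.

```python
import sqlite3


def make_queue(n_tasks):
    """Create an in-memory task table with n_tasks unclaimed rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, owner TEXT)")
    conn.executemany(
        "INSERT INTO tasks (id) VALUES (?)", [(i,) for i in range(1, n_tasks + 1)]
    )
    conn.commit()
    return conn


def claim(conn, owner, limit):
    """Claim up to `limit` unowned tasks in one transaction; return their ids.

    Single-connection sketch only: with concurrent workers, a real store
    would need row locking or an atomic claim statement.
    """
    with conn:
        ids = [
            row[0]
            for row in conn.execute(
                "SELECT id FROM tasks WHERE owner IS NULL ORDER BY id LIMIT ?",
                (limit,),
            )
        ]
        conn.executemany(
            "UPDATE tasks SET owner = ? WHERE id = ?", [(owner, i) for i in ids]
        )
    return ids


def drain(conn, owner, batch_size):
    """Claim every task; return (tasks claimed, non-empty claim transactions)."""
    claimed = transactions = 0
    while True:
        ids = claim(conn, owner, batch_size)
        if not ids:
            break
        claimed += len(ids)
        transactions += 1
    return claimed, transactions


one_at_a_time = drain(make_queue(1000), "worker", batch_size=1)
in_batches = drain(make_queue(1000), "node", batch_size=100)
print(one_at_a_time, in_batches)  # (1000, 1000) (1000, 10)
```

      The same 1000 tasks take 1000 claim transactions one at a time and 10 in batches of 100, which is the kind of DB-load difference described above.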


    • I agree, and the problem, as ever, is that folks are unwilling to grasp that. Making workloads idempotent is apparently too difficult.