
Comment by jedberg

1 month ago

I disagree. Once you get to the scale that breaks a library-based system like DBOS, you need to move away from a central coordinator. At that point your software has to adjust, and you have to build your application to work without those kinds of centralized systems.

Centralized systems don't scale to the biggest problems no matter how clever you are.

And usually the best way to scale is to decentralize. Make idempotent updates that can be done more locally and use eventual consistency to keep things aligned.
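A minimal sketch of what an idempotent, locally-applied update could look like, using an assumed ledger-style schema and SQLite as a stand-in database: each event carries an ID, and recording that ID in the same transaction as the update makes replays no-ops, so workers can apply events any number of times and still converge.

```python
import sqlite3

# Hypothetical schema: a balances table plus a dedup table of applied event IDs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE balances (account TEXT PRIMARY KEY, amount INTEGER)")
conn.execute("CREATE TABLE applied_events (event_id TEXT PRIMARY KEY)")

def apply_credit(event_id: str, account: str, delta: int) -> bool:
    """Apply a credit exactly once; duplicate deliveries are ignored."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO applied_events (event_id) VALUES (?)", (event_id,)
    )
    if cur.rowcount == 0:  # event already applied -- replay is a safe no-op
        return False
    conn.execute(
        "INSERT INTO balances (account, amount) VALUES (?, ?) "
        "ON CONFLICT(account) DO UPDATE SET amount = amount + excluded.amount",
        (account, delta),
    )
    conn.commit()
    return True

apply_credit("evt-1", "alice", 100)
apply_credit("evt-1", "alice", 100)  # duplicate delivery: no double-credit
```

Because replays are harmless, nodes can apply events locally and reconcile later, which is what makes the eventual-consistency approach workable.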

You can get around some of this coordination by doing more work locally, in-process -- but you're risking the availability of your primary API. The work that you're doing as a background job or workflow may:

1. Be unbounded -- 1 API call may correspond to thousands or hundreds of thousands of reads or writes

2. Be resource intensive -- either on the database side, by eating up connections needed by your API, or on the application side, by blocking your event loop/process, particularly in Python and TypeScript
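A sketch of the event-loop risk in point 2, assuming an asyncio-based API server and using a made-up `heavy_fanout` function as the background work: running the work inline would stall every other request, so one mitigation is offloading it to an executor (threads here for simplicity; processes or a real external queue for CPU-bound or unbounded fan-out).

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

def heavy_fanout(n: int) -> int:
    # stand-in for "one API call -> thousands of reads or writes"
    return sum(i * i for i in range(n))

async def handle_request(pool: ThreadPoolExecutor, n: int) -> int:
    loop = asyncio.get_running_loop()
    # offloaded: the event loop stays free to serve other API calls
    return await loop.run_in_executor(pool, heavy_fanout, n)

async def main() -> list[int]:
    with ThreadPoolExecutor(max_workers=4) as pool:
        return await asyncio.gather(
            *(handle_request(pool, 10_000) for _ in range(4))
        )

results = asyncio.run(main())
```

Even offloaded like this, the work still competes for the process's resources and database connections, which is the argument for moving it out of the API process entirely.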

> Once you get to the scale that breaks a library based system like DBOS, you need to move away from a central coordinator

There are different levels of coordination here. At some point, workers are always going to have to coordinate "who is working on what" -- whether that's by placing locks and using status updates in the database, using a traditional queue, or using a central server like you described with DBOS Cloud. The same goes for a dedicated orchestrator -- those can also partition by "who is working on which workers." In other words, a dedicated service can also be decentralized, these don't seem mutually exclusive.
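The "locks and status updates in the database" style of coordination could look roughly like this sketch, with an assumed `tasks` table and SQLite standing in for Postgres: a worker claims work by flipping a status column inside a single transaction, so two workers can never grab the same task.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE tasks (id INTEGER PRIMARY KEY, status TEXT, worker TEXT)")
conn.executemany("INSERT INTO tasks (status) VALUES (?)", [("pending",)] * 3)

def claim_one(worker: str):
    """Atomically claim the oldest pending task, or return None if drained."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    row = conn.execute(
        "SELECT id FROM tasks WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        conn.execute("COMMIT")
        return None
    conn.execute(
        "UPDATE tasks SET status = 'running', worker = ? WHERE id = ?",
        (worker, row[0]),
    )
    conn.execute("COMMIT")
    return row[0]

claim_one("worker-a")  # claims the first pending task
claim_one("worker-b")  # claims the next one
```

On Postgres, the usual idiom for the same pattern is `SELECT ... FOR UPDATE SKIP LOCKED`, which lets many workers claim concurrently without blocking each other.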

To make this more concrete -- let's say, for example, you have hundreds of workers on a library-based system like DBOS, each corresponding to individual connections and transactions to read/write data. Whereas in a system like Temporal or Hatchet, you might have 10s of nodes supporting 1000s of workers, with the ability to bulk enqueue and dequeue more work, which increases throughput and reduces DB load rather significantly. You'd lose (most of) the benefits of bulk writes in a library-based system.
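The bulk-dequeue point above can be sketched like this, with an assumed `queue` table and SQLite standing in for Postgres: a node claims a whole batch of work in one transaction, instead of paying one connection and one transaction per task.

```python
import sqlite3

conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE queue (id INTEGER PRIMARY KEY, status TEXT)")
conn.executemany("INSERT INTO queue (status) VALUES (?)", [("pending",)] * 100)

def dequeue_batch(n: int) -> list[int]:
    """Claim up to n pending tasks in a single transaction."""
    conn.execute("BEGIN IMMEDIATE")
    ids = [row[0] for row in conn.execute(
        "SELECT id FROM queue WHERE status = 'pending' ORDER BY id LIMIT ?", (n,)
    )]
    conn.executemany(
        "UPDATE queue SET status = 'running' WHERE id = ?", [(i,) for i in ids]
    )
    conn.execute("COMMIT")
    return ids

dequeue_batch(25)  # one node, one transaction, 25 tasks claimed
```

On Postgres, the equivalent claiming query would typically use `SELECT ... FOR UPDATE SKIP LOCKED LIMIT n`, so many nodes can batch-dequeue concurrently; this is the kind of amortization that lets a handful of orchestrator nodes front thousands of workers.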

  • To be clear, a DBOS worker isn't just one connection, but an entire application server that can handle many requests concurrently and batch/bulk write database operations. Past that, I don't really think we're disagreeing: there are clever optimizations (batching, offloading coordination where possible) you can do in both a library-based and workflow-server-based system to scale better, but at truly massive scale you'll bottleneck in Postgres.

I agree, and yet the problem is that folks are unwilling to grasp that. Making workloads idempotent is apparently too difficult.