Comment by dns_snek

1 month ago

Facing issues with 83 jobs per second (5k/min) sounds like an extreme misconfiguration. That's not high throughput at all and it shouldn't create any appreciable load on any database.

This comes up every time this conversation occurs.

Yes, PG can theoretically handle just about anything with the right configuration, schema, architecture, etc.

Finding that right configuration is not trivial. Even dedicated frameworks like Graphile struggle with it.

My startup had the exact same struggles with PG and made the same migration to BullMQ because we were sick of fiddling with it instead of solving business problems. We're very glad we migrated off of PG for our work queues.

  • The issue is that "83 per second" is multiple orders of magnitude off the expected level of performance on any RDBMS running on anything resembling modern hardware.

    I haven't worked with Graphile but this just doesn't pass the sniff test unless those 83 jobs per second are somehow translating into thousands of write transactions per second.

    As an indication, their documentation has a performance section with a benchmark that claims to process 10k jobs per second on a pretty modest machine.

    • > The issue is that "83 per second" is multiple orders of magnitude off the expected level of performance on any RDBMS running on anything resembling modern hardware.

      This is just not true; there are plenty of scenarios where 83/sec would be the limit. That number by itself is almost meaningless, much like benchmarks, which also make a bunch of assumptions about workloads and runtime environments.

      As a simple example: if your queue has a large backlog, you have a large worker fleet aggressively pulling work to minimize latency, your payloads are large, you haven't optimized your indexes, and/or you have many jobs scheduled for the future, every acquire can turn into an expensive table scan (a sketch of what such an acquire looks like follows this comment).

      (This is a specific example because it's one of many failure scenarios I've encountered with Graphile that can cause your DB to melt down. The same workload barely causes a blip in Redis CPU, without any fiddling with indexes, autovacuum, or worker backoffs.)
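
      For readers unfamiliar with the pattern being described, here is a minimal sketch (not Graphile Worker's actual schema or API) of the usual SKIP LOCKED acquire, using a hypothetical jobs table and node-postgres. Without an index that matches the predicate and the sort, every acquire can end up walking a large share of the table, especially when most rows are already locked or scheduled for the future.

          // Hypothetical schema: jobs(id, payload, priority, run_at, locked_at).
          // A sketch of the acquire pattern discussed above, not any library's real code.
          import { Pool } from "pg";

          const pool = new Pool(); // connection details come from the usual PG* environment variables

          // Claim one runnable job. If the planner can't satisfy the WHERE clause and
          // ORDER BY from an index, the inner SELECT scans and sorts candidate rows
          // on every single acquire, across the whole worker fleet.
          async function acquireJob() {
            const { rows } = await pool.query(
              `UPDATE jobs
                  SET locked_at = now()
                WHERE id = (
                  SELECT id FROM jobs
                   WHERE locked_at IS NULL
                     AND run_at <= now()
                   ORDER BY priority DESC, run_at
                   LIMIT 1
                   FOR UPDATE SKIP LOCKED
                )
                RETURNING id, payload`
            );
            return rows[0] ?? null;
          }

          // One assumed mitigation: a partial index covering only unclaimed jobs, so a
          // large backlog of future-scheduled rows doesn't get scanned on each acquire:
          //   CREATE INDEX jobs_runnable_idx
          //     ON jobs (priority DESC, run_at)
          //     WHERE locked_at IS NULL;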

They probably did not batch. It's realistic that they'd have issues if code is written to handle 1 job at a time and needs to make several roundtrips to the same db inside the same locking transaction.

Leases exist for a reason.
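
To make the batching and lease point concrete, here is a minimal sketch of a lease-based batch claim, assuming a hypothetical jobs table with locked_by and locked_until columns (not any particular library's schema), again using node-postgres. The claim is a single short statement; no transaction or row lock stays open while the jobs actually run, which is the contrast with making several round trips per job inside one long-lived locking transaction.

    // Hypothetical schema: jobs(id, payload, priority, run_at, locked_by, locked_until).
    import { Pool } from "pg";

    const pool = new Pool();

    // Claim up to batchSize runnable jobs under a 30-second lease. Workers process
    // the returned rows outside any transaction, then delete them (or extend the
    // lease); jobs whose lease expires simply become claimable again.
    async function claimBatch(workerId: string, batchSize = 50) {
      const { rows } = await pool.query(
        `UPDATE jobs
            SET locked_by = $1,
                locked_until = now() + interval '30 seconds'
          WHERE id IN (
            SELECT id FROM jobs
             WHERE run_at <= now()
               AND (locked_until IS NULL OR locked_until < now())
             ORDER BY priority DESC, run_at
             LIMIT $2
             FOR UPDATE SKIP LOCKED
          )
          RETURNING id, payload`,
        [workerId, batchSize]
      );
      return rows;
    }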

  • > if code is written to handle 1 job at a time and needs to make several roundtrips to the same db inside the same locking transaction.

    Do you mean the application code? The worker itself causing the bottleneck is definitely one possibility, but if that were the case, the issue wouldn't have resolved itself when they switched to a different job queue.

    • Well, you would no longer have thousands of open transactions maintaining the locks.

      Of course, if the queue on top of Postgres is reasonably implemented (with leases), what they describe makes no sense.