Comment by victorbjorklund
13 hours ago
For people who don't think it scales: a similar implementation in Elixir is Oban. Their benchmark shows a million jobs per minute on a single node (and I'm sure it could be increased further with more optimizations). I bet 99.99999% of apps have fewer than a million background jobs per minute.
https://oban.pro/articles/one-million-jobs-a-minute-with-oba...
Funny you mention Oban, we use it at work as well, and the first thing Oban tells you about scaling is to either use Redis as a notifier or resort to polling for jobs and skip notifications entirely.
https://hexdocs.pm/oban/scaling.html
Not quite, I used it at work too - the first thing that page suggests is using `Oban.Notifiers.PG`, which uses distributed Erlang's process-group implementation, not Redis. You only really need Redis if you're not running with Erlang clustering, and forgoing clustering rules out several other great Elixir features.
I don't think Oban is telling you to always use Redis. What they're saying is: if you reach a scale where you feel the pain of the default notifier, you can use Oban.Notifiers.PG, as long as your application runs as a cluster. If you don't run it as a cluster, then you might have to reach for Redis. But at that point it's more about not running a cluster.
> For people that does not think it scales
You started your comment with that
This is largely because Postgres implements LISTEN/NOTIFY with a global lock. At high volume this obviously breaks down: https://www.recall.ai/blog/postgres-listen-notify-does-not-s...
None of that means Oban or similar queues don't/can't scale—it just means a high volume of NOTIFY doesn't scale, hence the alternative notifiers and the fact that most of its job processing doesn't depend on notifications at all.
There are other reasons Oban recommends a different notifier per the doc link above:
> That keeps notifications out of the db, reduces total queries, and allows larger messages, with the tradeoff that notifications from within a database transaction may be sent even if the transaction is rolled back
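The notification-free path mentioned above is easy to picture: a worker simply polls the jobs table and claims work with a guarded UPDATE, no LISTEN/NOTIFY involved. A minimal sketch using SQLite (table and column names here are illustrative, not Oban's actual schema):

```python
import sqlite3

# In-memory stand-in for the jobs table a queue would poll.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, "
    "state TEXT DEFAULT 'available')"
)
conn.executemany("INSERT INTO jobs (payload) VALUES (?)", [("a",), ("b",)])
conn.commit()

def fetch_job(conn):
    """Claim the oldest available job, or return None.

    On None the caller sleeps and polls again; the guarded UPDATE
    (state = 'available' in the WHERE clause) prevents two workers
    from claiming the same job.
    """
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE state = 'available' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, payload = row
    cur = conn.execute(
        "UPDATE jobs SET state = 'executing' "
        "WHERE id = ? AND state = 'available'",
        (job_id,),
    )
    conn.commit()
    return (job_id, payload) if cur.rowcount == 1 else None

print(fetch_job(conn))  # (1, 'a')
print(fetch_job(conn))  # (2, 'b')
print(fetch_job(conn))  # None - nothing left, poll again later
```

The same shape works against Postgres, where `FOR UPDATE SKIP LOCKED` makes the claim step cheaper under contention.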
> None of that means Oban or similar queues don't/can't scale—it just means a high volume of NOTIFY doesn't scale
Given the context of this post, it really does mean the same thing though?
This benchmark is probably as far removed from how applications use task queues as it could possibly be. The headline is "1 million jobs per minute", which is true. However...
- this is achieved by queuing jobs in batches of 5000, so on the queue side this is actually not 1 million insert transactions per minute, but rather 200. I've never seen any significant batching of background job creation.
- the dispatch is also batched to a few hundred TPS (5ms ... 2ms).
- acknowledgements are also batched.
So instead of the ~50-100k TPS you would expect to need to reach 17k jobs/sec, this is probably performing just a few hundred transactions per second on the SQL side. Correspondingly, if you don't batch everything (job submission and acking; batched dispatch is reasonable), throughput likely drops to that level, which is much more in line with expectations.
Semantically this benchmark is much closer to queuing and running 200 invocations of a "for i in range(5000)" loop in under a minute, which most would expect virtually any DB to handle (even SQLite).
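The arithmetic above can be checked in a few lines. The 5000-job batch size and the ~5 ms dispatch interval are taken from the description above; treat them as rough assumptions, not measured values:

```python
# Back-of-envelope numbers behind the "1 million jobs a minute" headline.
jobs_per_minute = 1_000_000
batch_size = 5_000                         # jobs per batched INSERT (assumed)

insert_txns_per_minute = jobs_per_minute / batch_size
print(insert_txns_per_minute)              # 200.0 insert txns/minute (~3/s)

jobs_per_second = jobs_per_minute / 60
print(round(jobs_per_second))              # ~16667 jobs/s, the "17k" figure

dispatch_interval_s = 0.005                # one dispatch query every ~5 ms
dispatch_txns_per_second = 1 / dispatch_interval_s
print(dispatch_txns_per_second)            # 200.0 dispatch queries/s
```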
Also worth noting that it’s often not single-node performance that caps throughput… it’s replication.
Databases are pretty good at quickly adding and removing lots of rows. But even if you can keep up with churning through 1000 rows/second, with batching or whatever, you still need to replicate those 1000 rows/second to your failover nodes.
That’s the big win for queues over a relational db here: queues have ways to efficiently replicate without copying the entire work queue across instances.
Yes, all benchmarks lie. It's just like seeing a benchmark of how many inserts Postgres can do: it's usually not grounded in reality, because that's never what a real application looks like; it's pointing out the maximum performance under perfect conditions, which you would of course never have in practice. But again, I don't think it matters whether you reach 20k or 50k or 100k jobs per second, because at that scale you should probably look at other solutions anyway. Most applications probably have fewer than a thousand jobs per second.
This isn't my area, but wouldn't this still be quite effective if it automatically grouped and batched those jobs for you? At low throughput levels, it doesn't need giant batches, and could just timeout after a very short time, and submit smaller batches. At high throughput, they would be full batches. Either way, it seems like this would still serve the purpose, wouldn't it?
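The kind of automatic grouping described above is usually done with a micro-batching loop: flush when the batch fills up or when a short deadline passes, whichever comes first. A generic sketch (this is the general pattern, not Oban's implementation; names and defaults are illustrative):

```python
import queue
import time

def collect_batch(q, max_batch=100, max_wait=0.01):
    """Drain up to max_batch items, waiting at most max_wait seconds.

    At low throughput the deadline fires and a small batch is flushed;
    at high throughput batches fill up before the deadline.
    """
    batch = []
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # deadline hit while waiting: flush what we have
    return batch

q = queue.Queue()
for i in range(250):          # simulate a burst of 250 incoming jobs
    q.put(i)
print(len(collect_batch(q)))  # 100: a full batch under high load
```

Successive calls then drain the remaining 150 items as one full batch and one partial one, which is exactly the behavior you describe: full batches when busy, small timely batches when idle.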
The 5k batching is done when inserting the jobs into the database. It's not like they exert some special control over the performance of the database engine, and this isn't what they're trying to measure in the article.
They spend some time explaining how to tune the job runners to double throughput to the 17k jobs/s figure. The article is kind of old (Elixir 1.14 was a while ago), and it's basically a write-up of how they got a bit of a performance increase by using new features of that language version.