Comment by victorbjorklund
13 hours ago
For people who don't think it scales: a similar implementation in Elixir is Oban. Their benchmark shows a million jobs per minute on a single node (and I'm sure it could be increased further with more optimizations). I bet 99.99999% of apps have fewer than a million background jobs per minute.
https://oban.pro/articles/one-million-jobs-a-minute-with-oba...
Funny you mention Oban, we use it at work as well, and the first thing Oban tells you about scaling is to either use Redis as a notifier or resort to polling for jobs and skip notifications entirely.
https://hexdocs.pm/oban/scaling.html
Not quite, I used it at work too - the first thing that page suggests is using `Oban.Notifiers.PG`, which uses distributed Erlang's process-group implementation, not Redis. You only really need Redis if you're not running with Erlang clustering, and forgoing clustering rules out several other great Elixir features.
I don't think Oban is telling you to always use Redis. What they're saying is: if you reach a scale where you feel the pain of the default notifier, you can use Oban.Notifiers.PG, as long as your application runs as a cluster. If you don't run it as a cluster, then you might have to reach for Redis. But at that point it's more about not running a cluster.
> For people that does not think it scales
You started your comment with that
This is largely because Postgres implements LISTEN/NOTIFY with a global lock. At high volume this obviously breaks down: https://www.recall.ai/blog/postgres-listen-notify-does-not-s...
None of that means Oban or similar queues don't/can't scale—it just means a high volume of NOTIFY doesn't scale, hence the alternative notifiers and the fact that most of its job processing doesn't depend on notifications at all.
There are other reasons Oban recommends a different notifier per the doc link above:
> That keeps notifications out of the db, reduces total queries, and allows larger messages, with the tradeoff that notifications from within a database transaction may be sent even if the transaction is rolled back
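The notification-free path mentioned above is easy to picture: a worker simply polls the jobs table and claims work with a guarded UPDATE, no LISTEN/NOTIFY involved. A minimal sketch using SQLite (table and column names here are illustrative, not Oban's actual schema):

```python
import sqlite3

# In-memory stand-in for the jobs table a queue would poll.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobs (id INTEGER PRIMARY KEY, payload TEXT, "
    "state TEXT DEFAULT 'available')"
)
conn.executemany("INSERT INTO jobs (payload) VALUES (?)", [("a",), ("b",)])
conn.commit()

def fetch_job(conn):
    """Claim the oldest available job, or return None.

    On None the caller sleeps and polls again; the guarded UPDATE
    (state = 'available' in the WHERE clause) prevents two workers
    from claiming the same job.
    """
    row = conn.execute(
        "SELECT id, payload FROM jobs WHERE state = 'available' "
        "ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    job_id, payload = row
    cur = conn.execute(
        "UPDATE jobs SET state = 'executing' "
        "WHERE id = ? AND state = 'available'",
        (job_id,),
    )
    conn.commit()
    return (job_id, payload) if cur.rowcount == 1 else None

print(fetch_job(conn))  # (1, 'a')
print(fetch_job(conn))  # (2, 'b')
print(fetch_job(conn))  # None - nothing left, poll again later
```

The same shape works against Postgres, where `FOR UPDATE SKIP LOCKED` makes the claim step cheaper under contention.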
> None of that means Oban or similar queues don't/can't scale—it just means a high volume of NOTIFY doesn't scale
Given the context of this post, it really does mean the same thing though?
This benchmark is probably as far removed from how applications use task queues as it could possibly be. The headline is "1 million jobs per minute", which is true. However...
- this is achieved by queuing jobs in batches of 5000, so on the queue side this is actually not 1 million insert transactions per minute, but rather 200. I've never seen any significant batching of background job creation.
- the dispatch is also batched to a few hundred TPS (5ms ... 2ms).
- acknowledgements are also batched.
So instead of the ~50-100k TPS you would expect to need to reach 17k jobs/sec, this is probably performing just a few hundred transactions per second on the SQL side. Correspondingly, if you don't batch everything (job submission and acking; batched dispatch is reasonable), throughput likely drops to that level, which is much more in line with expectations.
Semantically this benchmark is much closer to queuing and running 200 invocations of a "for i in range(5000)" loop in under a minute, which most would expect virtually any DB to handle (even SQLite).
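The arithmetic above can be checked in a few lines. The 5000-job batch size and the ~5 ms dispatch interval are taken from the description above; treat them as rough assumptions, not measured values:

```python
# Back-of-envelope numbers behind the "1 million jobs a minute" headline.
jobs_per_minute = 1_000_000
batch_size = 5_000                         # jobs per batched INSERT (assumed)

insert_txns_per_minute = jobs_per_minute / batch_size
print(insert_txns_per_minute)              # 200.0 insert txns/minute (~3/s)

jobs_per_second = jobs_per_minute / 60
print(round(jobs_per_second))              # ~16667 jobs/s, the "17k" figure

dispatch_interval_s = 0.005                # one dispatch query every ~5 ms
dispatch_txns_per_second = 1 / dispatch_interval_s
print(dispatch_txns_per_second)            # 200.0 dispatch queries/s
```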
Also worth noting that it’s often not single-node performance that caps throughput… it’s replication.
Databases are pretty good at quickly adding and removing lots of rows. But even if you can keep up with churning through 1000 rows/second, with batching or whatever, you still need to replicate those 1000 rows/second to your failover nodes.
That’s the big win for queues over a relational db here: queues have ways to efficiently replicate without copying the entire work queue across instances.
Yes, all benchmarks lie. It's just like seeing a benchmark of how many inserts Postgres can do: it's usually not grounded in reality, because that's never what a real application looks like; it's pointing out the maximum performance under perfect conditions, which you would of course never have in practice. But again, I don't think it matters whether you reach 20k or 50k or 100k jobs per second, because at that scale you should probably look at other solutions anyway. Most applications probably have fewer than a thousand jobs per second.
This isn't my area, but wouldn't this still be quite effective if it automatically grouped and batched those jobs for you? At low throughput levels, it doesn't need giant batches, and could just timeout after a very short time, and submit smaller batches. At high throughput, they would be full batches. Either way, it seems like this would still serve the purpose, wouldn't it?
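The kind of automatic grouping described above is usually done with a micro-batching loop: flush when the batch fills up or when a short deadline passes, whichever comes first. A generic sketch (this is the general pattern, not Oban's implementation; names and defaults are illustrative):

```python
import queue
import time

def collect_batch(q, max_batch=100, max_wait=0.01):
    """Drain up to max_batch items, waiting at most max_wait seconds.

    At low throughput the deadline fires and a small batch is flushed;
    at high throughput batches fill up before the deadline.
    """
    batch = []
    deadline = time.monotonic() + max_wait
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # deadline hit while waiting: flush what we have
    return batch

q = queue.Queue()
for i in range(250):          # simulate a burst of 250 incoming jobs
    q.put(i)
print(len(collect_batch(q)))  # 100: a full batch under high load
```

Successive calls then drain the remaining 150 items as one full batch and one partial one, which is exactly the behavior you describe: full batches when busy, small timely batches when idle.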
The 5k batching is done when inserting the jobs into the database. It's not like they exert some special control over the performance of the database engine, and this isn't what they're trying to measure in the article.
They spend some time explaining how to tune the job runners to double throughput to the 17k jobs/s figure. The article is kind of old (Elixir 1.14 was a while ago), and it's basically a write-up of how they got a bit of a performance increase by using new features of that language version.