Comment by rajaravivarma_r
1 month ago
The one use case where a DB-backed queue will fail for sure is when the payload is large. For example, if you queue a large JSON payload to be picked up and processed by a worker, the DB write overhead itself makes the background worker useless.
I've benchmarked Redis (Sidekiq), Postgres (using GoodJob) and SQLite (SolidQueue), and Redis beats everything else for the above use case.
SolidQueue backed by SQLite may be fine when you are just passing around primary keys. I still wonder whether you can have a lot of workers polling the same database and updating the queue with job status. I've done something similar in the past with SQLite for some personal work, and it is easy to hit the wall even with 10 or so workers.
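(For reference, a minimal sketch of what each polling worker ends up doing against SQLite, using the sqlite3 gem and an invented `jobs` table; this is not how SolidQueue actually implements it. Claiming a job is a write, and SQLite allows only one writer at a time, which is exactly where the wall is.)

    require "sqlite3"

    db = SQLite3::Database.new("queue.db")
    db.busy_timeout = 5_000                  # wait up to 5s for the write lock
    db.execute("PRAGMA journal_mode = WAL")  # readers no longer block the writer

    # Each worker polls with something like this. Every claim is a write, and
    # SQLite serializes all writers, so extra workers mostly queue up behind
    # the lock instead of adding throughput. RETURNING needs SQLite 3.35+.
    def claim_job(db)
      db.execute(<<~SQL).first
        UPDATE jobs
        SET status = 'running', claimed_at = datetime('now')
        WHERE id = (SELECT id FROM jobs WHERE status = 'pending' ORDER BY id LIMIT 1)
        RETURNING id, payload
      SQL
    end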
In my experience you want job parameters to be one, maybe two IDs. Do you have a real-world example where that is not the case?
I'm guessing that with that approach you're adding indirection for whatever you're actually processing? So I guess the counter-case would be when you don't want or need that indirection.
If I understand what you're saying, instead of doing:
- Create job with payload (maybe big) > Put in queue > Let worker take from queue > Done
You're suggesting:
- Create job with ID of payload (stored elsewhere) > Put in queue > Let worker take from queue, then resolve ID to the data needed for processing > Done
Is that more or less what you mean? I can definitely see use cases for both; it heavily depends on the situation, but more indirection isn't always better, nor are big payloads always OK.
If we take webhooks, for example:
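Roughly, in Sidekiq-flavoured Ruby (the job classes, the `Payload` model, and `Processor` are all invented for illustration):

    require "sidekiq"

    # Flow 1: the payload itself rides through the queue.
    class ProcessPayloadJob
      include Sidekiq::Job

      def perform(payload)       # payload is a (possibly large) JSON-serializable hash
        Processor.run(payload)   # hypothetical processing step
      end
    end
    # ProcessPayloadJob.perform_async(big_payload_hash)

    # Flow 2: the payload lives elsewhere; only an ID rides through the queue.
    class ProcessRecordJob
      include Sidekiq::Job

      def perform(record_id)
        record = Payload.find(record_id)  # resolve the ID back into the data
        Processor.run(record.data)
      end
    end
    # ProcessRecordJob.perform_async(payload_record.id)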
- Persist payload in db > Queue with id > Process via worker.
Pushing the payload directly to the queue can be tricky. Any queue system will usually have limits on payload size, for good reasons. Plus, if you have already committed to the DB, you can guarantee the data is not lost and can be processed again however you want later. But if your queue is having issues, or the enqueue fails, you might lose it forever.
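A minimal sketch of that ordering (the controller code, `WebhookEvent` model, and job name are invented); the important part is that the row is committed before anything touches the queue:

    # Inside the webhook controller action: persist first, so the raw payload
    # survives even if the queue is down or the enqueue fails.
    event = WebhookEvent.create!(payload: request.raw_post)

    # Then hand only the primary key to the queue; the worker re-reads the row
    # and can be retried any number of times.
    ProcessWebhookJob.perform_async(event.id)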
> I can definitely see use cases for both
Me too; I was just wondering if you have any real-world examples of a project with a large payload.
I have been doing this for at least a decade now and it is a great pattern, but think of an ETL pipeline where you fetch a huge JSON payload, store it in the database, then transform it and load it into another model. I had a use case where I wanted to process the JSON payload and pass it down the pipeline before storing it in the useful model; I didn't want to store the intermediate JSON anywhere. I benchmarked it for this specific use case.
...well, that's good for scaling the queue, but it means the worker needs to load all relevant state/context from some DB (which might be sped up with a cache, but then things get really complex).
Ideally you pass the context that's required for the job (let's say it's less than 100 KB), but I don't think that counts as large JSON; on the other hand, request rate (load) can make even 512 bytes too much, therefore "it depends".
But in general, passing around large JSONs over the network or in memory is not really slow compared to writing them to a DB (WAL + fsync + MVCC management).
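Presumably something like this, where the intermediate JSON is handed straight to the next stage as job arguments instead of getting its own table (the stage jobs, `ReportRow`, and the transform step are invented):

    require "sidekiq"
    require "net/http"
    require "json"

    class TransformStageJob
      include Sidekiq::Job

      def perform(raw)
        rows = Transform.call(raw)   # hypothetical transform of the raw JSON
        ReportRow.insert_all(rows)   # load into the final, useful model
      end
    end

    class FetchStageJob
      include Sidekiq::Job

      def perform(source_url)
        raw = JSON.parse(Net::HTTP.get(URI(source_url)))  # large intermediate JSON
        TransformStageJob.perform_async(raw)              # passed via the queue, never stored
      end
    end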
> Redis beats everything else for the above use case.
Reminds me of antirez's blog post showing that when Redis is configured for durability it becomes about as slow as (or slower than) PostgreSQL: http://oldblog.antirez.com/post/redis-persistence-demystifie...
Maybe, but over 6 years of using Redis with a bare-minimum setup I have never lost any data, and my use case happens to be queuing intermediate results, so durability isn't an issue.
There have been 6 major releases and countless improvements to Redis since then; I don't think we can say whether that's still relevant.
Also, antirez has been very opinionated for a decade about not comparing or benchmarking Redis against other DBs.
> The one use case where a DB-backed queue will fail for sure is when the payload is large. For example, if you queue a large JSON payload to be picked up and processed by a worker, the DB write overhead itself makes the background worker useless.
Redis would suffer from the same issue, possibly even more severely due to being memory-constrained?
I'd probably just stuff the "large data" in S3 or something like that, and include the reference/location of the data in the actual job itself, if it were big enough to cause problems.
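Something along these lines with the aws-sdk-s3 gem (the bucket name, job name, `Processor`, and `big_payload` are placeholders):

    require "aws-sdk-s3"
    require "securerandom"
    require "json"
    require "sidekiq"

    class ProcessLargePayloadJob
      include Sidekiq::Job

      def perform(bucket, key)
        body = Aws::S3::Client.new.get_object(bucket: bucket, key: key).body.read
        Processor.run(JSON.parse(body))  # hypothetical processing step
        # Optionally delete_object here if the payload is only needed once.
      end
    end

    # Enqueue side: park the big payload in object storage and queue the reference.
    big_payload = { "example" => "data" }  # stands in for the real large JSON
    key = "job-payloads/#{SecureRandom.uuid}.json"
    Aws::S3::Client.new.put_object(bucket: "my-job-payloads", key: key,
                                   body: big_payload.to_json)
    ProcessLargePayloadJob.perform_async("my-job-payloads", key)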
Interesting, as a self-contained minimalistic setup.
Shouldn't one be using a storage system such as S3/Garage with ephemeral settings and/or clean-up triggers after job end? I get the appeal of using one system for everything, but won't you need a storage system anyway for other parts of your system?
Have you written up your benchmarks anywhere, and where the cutoffs are (payload size / throughput / latency)?
FWIW, the Sidekiq docs strongly suggest only passing primary keys or identifiers to jobs.
Using Redis to store large queue payloads is usually a bad practice. Redis memory is finite.
this!! 100%.
pass around IDs