Comment by jorge-d
2 years ago
Well, Sidekiq is free to use. It's only the Pro version that he charges for, and the free version's code is open source.
I don't see the problem with that kind of business model. It still allows the community to thrive and offers enterprises a way to get premium support.
Plus it allows him to invest more time in maintaining the free version.
I have no problem paying for the Pro version, but one of its marketing pitches is "enhanced reliability", which is a wild marketing spin on "the free version will lose jobs in fairly common scenarios".
In Sidekiq without super_fetch (a paid feature), any jobs in progress when a worker crashes are lost forever. If a worker merely encounters an exception, the job will be put back on the queue and retried, but a crash means the job is lost.
Again, no problem paying for Pro, but I would prefer a little more transparency on how big a gap that is.
I wish this was prominently documented. Most people new to Sidekiq have no idea that a job will be lost forever if you simply hard kill the worker. I have seen a couple of instances where the team had Sidekiq Pro but had not enabled reliable fetch because they were unaware of this problem.
The free version acts exactly like Resque, the previous market leader in Ruby background jobs. If it was good enough reliability for GitHub and Shopify to use for years, it was good enough for Sidekiq OSS too.
Here's Resque literally using `lpop`, which is destructive and will lose jobs.
https://github.com/resque/resque/blob/7623b8dfbdd0a07eb04b19...
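To make the difference concrete, here is a minimal sketch of a destructive pop versus the general "move to a working list" reliable-queue pattern. Everything in it is an assumption for illustration: `FakeRedis` is an in-memory stand-in (no real Redis calls are made), the key names and helper methods are hypothetical, and it is not Resque's code or a description of how Sidekiq Pro's super_fetch is implemented.

```ruby
# In-memory stand-in for a few Redis list operations (hypothetical helper).
class FakeRedis
  def initialize
    @lists = Hash.new { |hash, key| hash[key] = [] }
  end

  # Append a value to the tail of a list (like RPUSH).
  def rpush(key, value)
    @lists[key].push(value)
  end

  # Remove and return the head of a list (like LPOP). After this call the
  # value exists only in the caller's memory.
  def lpop(key)
    @lists[key].shift
  end

  # Move the head of `source` to the tail of `destination` in one step
  # (modeled on LMOVE source destination LEFT RIGHT). The value never
  # leaves the store.
  def move(source, destination)
    value = @lists[source].shift
    @lists[destination].push(value) unless value.nil?
    value
  end

  # Return a copy of a list for inspection.
  def list(key)
    @lists[key].dup
  end
end

# Destructive fetch: the job is popped, then the worker is hard killed
# before finishing. Nothing is left in the store to retry.
def destructive_fetch_then_crash(redis)
  redis.lpop("queue")
  nil # simulated crash: the in-memory copy of the job is gone
end

# Reliable fetch: the job is moved to a per-worker working list, so a hard
# kill leaves a copy behind in the store.
def reliable_fetch_then_crash(redis, worker_id)
  redis.move("queue", "working:#{worker_id}")
  nil # simulated crash: the job is still in working:<worker_id>
end

# A separate process notices the dead worker and puts its jobs back.
# Returns the number of jobs recovered.
def recover_orphans(redis, dead_worker_id)
  recovered = 0
  recovered += 1 while redis.move("working:#{dead_worker_id}", "queue")
  recovered
end

lossy = FakeRedis.new
lossy.rpush("queue", "job-1")
destructive_fetch_then_crash(lossy)
puts "destructive: queue=#{lossy.list('queue').inspect}"

safe = FakeRedis.new
safe.rpush("queue", "job-1")
reliable_fetch_then_crash(safe, "w1")
recover_orphans(safe, "w1")
puts "reliable:    queue=#{safe.list('queue').inspect}"
```

The trade-off is that the reliable variant needs something to detect dead workers and sweep their working lists, and a recovered job may run twice, which is why idempotent jobs still matter.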
> If it was good enough reliability for GitHub and Shopify to use for years, it was good enough for Sidekiq OSS too.
Great point, and thanks for chiming in. I wonder if containerization has made this more painful (due to cgroups and OOMs). The comments here are basically some people saying it's never been a problem for them and some people saying they encounter it a lot (in containerized environments) and have had to add mitigations.
Either way, my observation is a lot of people not paying for Sidekiq Pro should. I hope you can agree with that.
When we used Sidekiq in production, not only did I never see crashes that lost us jobs, but there are also ways to protect yourself from that. I highly recommend writing your jobs to be idempotent.
Idempotence doesn't solve this problem. The jobs are all idempotent. The problem is that jobs will never be retried if a crash occurs.
This doesn't happen at a high rate, but it happens more than zero times per week for us. We pay for Sidekiq Pro and have superfetch enabled so we are protected. If we didn't do so we'd need to create some additional infra to detect jobs that were never properly run and re-run them.
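As a rough idea of what that "additional infra" could look like, here is a minimal sketch of a ledger plus a periodic sweep that re-enqueues jobs which were enqueued but never recorded as complete. All names (`JobLedger`, `sweep`, the `enqueue` callable) are hypothetical, the ledger is in memory where a real one would be a database table, and timestamps are plain integers in seconds to keep the example self-contained.

```ruby
# One row per job: when it was enqueued and when (if ever) it completed.
JobRecord = Struct.new(:id, :enqueued_at, :completed_at)

# Hypothetical in-memory ledger; a real one would be durable storage.
class JobLedger
  def initialize
    @records = {}
  end

  # Record (or re-record) that a job was pushed onto the queue.
  def record_enqueue(id, now)
    @records[id] = JobRecord.new(id, now, nil)
  end

  # Record that a job finished; the job itself would call this last.
  def record_completion(id, now)
    @records.fetch(id).completed_at = now
  end

  # Jobs enqueued more than `deadline_seconds` ago that never completed.
  def overdue(now, deadline_seconds)
    @records.values.select do |record|
      record.completed_at.nil? && (now - record.enqueued_at) > deadline_seconds
    end
  end
end

# Re-enqueue every overdue job and restart its clock. `enqueue` is any
# callable that pushes a job id back onto the real queue.
# Returns the ids that were re-enqueued.
def sweep(ledger, now, deadline_seconds, enqueue)
  overdue = ledger.overdue(now, deadline_seconds)
  overdue.each do |record|
    enqueue.call(record.id)
    ledger.record_enqueue(record.id, now)
  end
  overdue.map(&:id)
end

ledger = JobLedger.new
ledger.record_enqueue("invoice-1", 0)
ledger.record_enqueue("invoice-2", 0)
ledger.record_completion("invoice-1", 5) # invoice-2 was lost in a crash

requeued = sweep(ledger, 100, 60, ->(id) { puts "re-enqueueing #{id}" })
puts "requeued: #{requeued.inspect}"
```

A sweep like this cannot tell a lost job from a slow one, so it can run a job twice. That is where idempotent jobs come in: idempotence makes the re-run safe, while the sweep is what makes the re-run happen at all.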
Jobs may crash due to VM issues or OOM problems. The more common cause of "orphans" is when the VM restarts and jobs can't finish during the shutdown period.
How often do your workers crash? I rely heavily on Sidekiq and don't think I see this very often, if ever.
We process around 50M Sidekiq jobs a day across a few hundred workers on heavily autoscaled infrastructure.
Over the past week there were 2 jobs that would have been lost if not for superfetch.
It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.
Edit for additional color: One of the most common crashes we'll see is OutOfMemory. We run in a containerized environment, and if a rogue job uses too much memory (or a deploy drastically changes our memory footprint) the container will be killed. In that scenario, the job is not placed back into the queue. SuperFetch is able to recover them, albeit with really loose guarantees around "when".
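For a sense of scale, here is the back-of-the-envelope arithmetic, taking the figures quoted above at face value (around 50M jobs a day, 2 at-risk jobs over one week). The inputs are approximate, so the result is only an order-of-magnitude estimate.

```ruby
jobs_per_day = 50_000_000 # "around 50M" per the comment above
days = 7
at_risk_jobs = 2          # jobs that needed recovery that week

total_jobs = jobs_per_day * days      # 350_000_000 jobs in the week
rate = at_risk_jobs.to_f / total_jobs # fraction of jobs at risk
one_in = total_jobs / at_risk_jobs    # 175_000_000

puts "total jobs in the week: #{total_jobs}"
puts "at-risk fraction: #{format('%.2e', rate)}"
puts "roughly 1 in #{one_in} jobs"
```

So the rate is on the order of one job in 175 million: tiny as a percentage, but still a nonzero number of jobs every week at this volume.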
It’s not uncommon to lose jobs in Sidekiq if you rely on it heavily and have a lot of jobs running. When using the free version for mission-critical jobs, I usually also run that task as a cron job so that it gets retried if the job is lost.
I have in the past monitored how many jobs were lost and, although it was a small percentage, it was still a recurring thing.
In containerized environments it may happen more often, either due to OOM kills or because of autoscaling: if a long-running Sidekiq job's runtime exceeds the grace period configured for shutting down a container during a downscale, the process is eventually terminated without prejudice.
OOM kills are particularly pernicious as they can get into a vicious retry-killed-retry cycle. The individual job causing the OOM isn't that important (we will identify it, log it and noop it); it's the blast radius effect on other Sidekiq threads (we use up to 20 threads on some of our workers). You want to be able to recover and re-run any jobs that are innocent victims of a misbehaving job.
Exactly why we refuse to use Sidekiq. “Hey, you have to pay to guarantee your jobs won’t just vanish”.
No thanks.