
Comment by phamilton

2 years ago

I have no problem paying for the Pro version, but one of its marketing pitches is "enhanced reliability", which is a wild marketing spin on "the free version will lose jobs in fairly common scenarios".

In Sidekiq without super_fetch (a paid feature), any jobs in progress when a worker crashes are lost forever. If a worker merely encounters an exception, the job will be put back on the queue and retried, but a crash means the job is lost.
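For anyone who hasn't looked at the internals: the reason a hard crash is unrecoverable is that the basic fetch is a destructive pop, so the only copy of an in-flight job lives in the worker's memory. A rough redis-rb sketch of the failure mode (illustrative only, not Sidekiq's actual source; the queue name and payload shape are assumptions):

  require "json"
  require "redis"

  redis = Redis.new

  def perform(job)
    # Placeholder for the real job handler.
    puts "working on #{job["class"]} with args #{job["args"].inspect}"
  end

  # Basic fetch: BRPOP atomically removes the job from Redis, so after this
  # call the only copy of the payload lives in this process's memory.
  _queue, payload = redis.brpop("queue:default", timeout: 2)

  if payload
    job = JSON.parse(payload)
    # An ordinary exception raised in the handler can be rescued and the job
    # re-enqueued for retry. But if the process is SIGKILLed or OOM-killed
    # right here, nothing is left in Redis to retry: the job is simply gone.
    perform(job)
  end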

Again, no problem paying for Pro, but I would prefer a little more transparency on how big a gap that is.

I wish this were prominently documented. Most people new to Sidekiq have no idea that the job will be lost forever if you simply hard-kill the worker. I have seen a couple of instances where the team had Sidekiq Pro but had not enabled reliable fetch because they were unaware of this problem.

The free version acts exactly like Resque, the previous market leader in Ruby background jobs. If it was good enough reliability for GitHub and Shopify to use for years, it was good enough for Sidekiq OSS too.

Here's Resque literally using `lpop`, which is destructive and will lose jobs.

https://github.com/resque/resque/blob/7623b8dfbdd0a07eb04b19...
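For contrast, the pattern that closes this gap at the Redis level is the reliable queue: atomically move each job onto a per-process working list instead of popping it outright, then sweep stale working lists back onto the queue after a crash. A hypothetical sketch of that general technique (this is not Sidekiq Pro's actual super_fetch code, and the key names are made up):

  require "json"
  require "redis"
  require "socket"

  redis   = Redis.new
  queue   = "queue:default"
  working = "working:#{Socket.gethostname}:#{Process.pid}" # per-process backup list

  def handle(job)
    # Placeholder for the real job handler.
    puts "working on #{job["class"]}"
  end

  # Resque-style destructive fetch would be: payload = redis.lpop(queue)
  # Reliable variant: move the job onto the working list instead.
  # (On Redis >= 6.2 the equivalent command is LMOVE; RPOPLPUSH still works.)
  payload = redis.rpoplpush(queue, working)

  if payload
    handle(JSON.parse(payload))
    # Acknowledge only after the work succeeded; until then Redis keeps a copy.
    redis.lrem(working, 1, payload)
  end

  # Recovery, at boot or from a janitor process: anything still sitting in a
  # dead worker's working list was orphaned by a crash and can be re-enqueued:
  #   redis.lrange(stale_working_list, 0, -1).each { |p| redis.rpush(queue, p) }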

  • > If it was good enough reliability for GitHub and Shopify to use for years, it was good enough for Sidekiq OSS too.

    Great point, and thanks for chiming in. I wonder if containerization has made this more painful (due to cgroups and OOMs). The comments here are basically split between people saying it's never been a problem for them and people saying they hit it a lot (in containerized environments) and have had to add mitigations.

    Either way, my observation is that a lot of people who aren't paying for Sidekiq Pro probably should be. I hope you can agree with that.

When we used Sidekiq in production, not only did I never see crashes that lost us jobs, but there are also ways to protect yourself from that. I highly recommend writing your jobs to be idempotent.
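To make that concrete, since idempotency is what makes a retry (or any replay mechanism) safe: a minimal sketch, assuming a Redis key as the dedupe guard. The job class, key name, and email helper are made up for illustration, and none of this is a built-in Sidekiq feature.

  require "redis"
  require "sidekiq"

  DEDUPE = Redis.new # separate connection used only for the dedupe key

  class SendWelcomeEmailJob
    include Sidekiq::Job # Sidekiq::Worker on older Sidekiq versions

    def perform(user_id)
      # Claim a key derived from the job's arguments. SET NX succeeds only the
      # first time, so a retry or a replayed duplicate becomes a no-op instead
      # of a second email.
      return unless DEDUPE.set("done:welcome_email:#{user_id}", 1, nx: true, ex: 7 * 86_400)

      deliver_welcome_email(user_id)
    end

    private

    def deliver_welcome_email(user_id)
      puts "emailing user #{user_id}" # placeholder side effect
    end
  end

A real version would also decide what to do if the handler fails after the key has been claimed (e.g. delete the key in a rescue), but the shape is the same.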

  • Idempotence doesn't solve this problem. The jobs are all idempotent. The problem is that jobs will never be retried if a crash occurs.

    This doesn't happen at a high rate, but it happens more than zero times per week for us. We pay for Sidekiq Pro and have superfetch enabled so we are protected. If we didn't do so we'd need to create some additional infra to detect jobs that were never properly run and re-run them.

    • Fair enough about idempotence.

      I'm still confused about what you're saying, though. You're saying that the language of "enhanced reliability" doesn't reflect losing 2 jobs out of roughly 350 million (50M a day over a week, from your other comment)?

      And that if you didn't pay for the service, you'd have to add some checks to make up for this?

      That all seems incredibly reasonable to me.

    • Crashes are under your control though. They’re not caused by sidekiq. And you could always add your own crash recovery logic, as you say. To me that makes it a reasonable candidate for a pro feature.

      It’s hard to get this right though. No matter where the line gets drawn, free users will complain that they don’t get everything for free.


  • Jobs may crash due to VM issues or OOM problems. The more common cause of "orphans" is when the VM restarts and jobs can't finish during the shutdown period.

How often do your workers crash? I rely heavily on Sidekiq and don't think I see this very often, if ever.

  • We process around 50M Sidekiq jobs a day across a few hundred workers on heavily autoscaled infrastructure.

    Over the past week there were 2 jobs that would have been lost if not for superfetch.

    It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

    Edit for additional color: One of the most common crashes we'll see is OutOfMemory. We run in a containerized environment, and if a rogue job uses too much memory (or a deploy drastically changes our memory footprint) the container will be killed. In that scenario, the job is not placed back into the queue. SuperFetch is able to recover such jobs, albeit with really loose guarantees around "when".

    • Let me get this straight, you're complaining about eight 9s of reliability?

      50,000,000 * 7 = 350,000,000

      2 / 350,000,000 = 0.000000005714286

      1 - (2 / 350,000,000) = 0.999999994285714 = 99.999999%

      > It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

      If your system isn't resilient to 2 in 350,000,000 jobs failing I think there is something wrong with your system.


  • It's not uncommon to lose jobs in Sidekiq if you rely on it heavily and have a lot of jobs running. If using the free version for mission-critical jobs, I usually run the task from a cron job as well, to ensure it will be retried if the job is lost (a rough sketch of that backstop pattern is below, after this thread of replies).

    I have in the past monitored how many jobs were lost and, although it was a small percentage, it was still a recurring thing.

  • In containerized environments it may happen more often, due to OOM kills, or if you leverage autoscalers and have long-running Sidekiq jobs whose runtime exceeds your configured grace period for shutting down a container during a downscale, in which case the process is eventually terminated without prejudice.

    OOM kills are particularly pernicious, as they can get into a vicious cycle of retry-killed-retry loops. The individual job causing the OOM isn't that important (we will identify it, log it, and noop it); the real problem is the blast-radius effect on the other Sidekiq threads (we use up to 20 threads on some of our workers), so you want to be able to recover and re-run any jobs that are innocent victims of a misbehaving one.
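To make the cron-backstop approach mentioned above concrete: the usual shape is a periodic sweeper that finds work that should have happened but didn't and simply enqueues it again, which is safe as long as the job itself is idempotent. A hypothetical sketch with a made-up Invoice model; none of this is a Sidekiq feature:

  require "sidekiq"

  class SendInvoiceJob
    include Sidekiq::Job

    def perform(invoice_id)
      invoice = Invoice.find(invoice_id) # Invoice is a made-up ActiveRecord-style model
      return if invoice.sent_at          # idempotent: skip work that already happened

      deliver_invoice(invoice)
      invoice.update!(sent_at: Time.now)
    end

    private

    def deliver_invoice(invoice)
      puts "sending invoice #{invoice.id}" # placeholder side effect
    end
  end

  # Scheduled externally, e.g. from crontab:
  #   */15 * * * * cd /app && bundle exec rails runner "InvoiceSweeperJob.perform_async"
  class InvoiceSweeperJob
    include Sidekiq::Job

    def perform
      # Anything still unsent well past its enqueue window was probably orphaned
      # by a crashed or OOM-killed worker; just enqueue it again.
      Invoice.where(sent_at: nil)
             .where("created_at < ?", Time.now - 3600)
             .find_each { |invoice| SendInvoiceJob.perform_async(invoice.id) }
    end
  end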

Exactly why we refuse to use Sidekiq. “Hey, you have to pay to guarantee your jobs won’t just vanish”.

No thanks.