Comment by aqme28
2 years ago
When we used Sidekiq in production, not only did I never see crashes that lost us jobs, but there are also ways to protect yourself from that. I highly recommend writing your jobs to be idempotent.
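A minimal sketch of what "idempotent" means for a job: running it twice has the same effect as running it once, typically by keying the work on a unique token chosen at enqueue time. The names here (`ChargeJob`, `LEDGER`, `idempotency_key`) are illustrative, and a plain hash stands in for a real datastore:

```ruby
# Stand-in for a datastore keyed by a unique operation id.
LEDGER = {}

class ChargeJob
  # The enqueuer picks an idempotency key, so a retried or duplicated
  # delivery of the same job is a no-op instead of a double charge.
  def self.perform(idempotency_key, amount)
    return LEDGER[idempotency_key] if LEDGER.key?(idempotency_key) # already done
    LEDGER[idempotency_key] = amount # the real side effect happens at most once
  end
end
```

With this shape, retrying a job after a crash is always safe; as the reply below points out, the remaining problem is making sure the retry happens at all.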
Idempotence doesn't solve this problem. The jobs are all idempotent. The problem is that jobs will never be retried if a crash occurs.
This doesn't happen at a high rate, but it happens more than zero times per week for us. We pay for Sidekiq Pro and have superfetch enabled so we are protected. If we didn't do so we'd need to create some additional infra to detect jobs that were never properly run and re-run them.
Or install an open-source gem[1] that recreates the functionality using the same Redis RPOPLPUSH[2] command.
[1] https://gitlab.com/gitlab-org/ruby/gems/sidekiq-reliable-fet...
[2] https://redis.io/commands/rpoplpush/#pattern-reliable-queue
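The reliable-queue pattern referenced in [2] atomically moves a job from the main queue into a per-worker "processing" list, so a crash between fetching and finishing leaves the job recoverable instead of lost. A rough sketch of the logic, using plain arrays as stand-ins for Redis lists (with real Redis, `fetch` would be a single `RPOPLPUSH queue queue:processing`, and `ack` an `LREM`):

```ruby
class ReliableQueue
  attr_reader :queue, :processing

  def initialize(jobs)
    @queue = jobs
    @processing = []
  end

  # Move the next job onto the processing list before handing it to the
  # worker. In Redis this move is one atomic RPOPLPUSH, so there is no
  # window where the job exists in neither list.
  def fetch
    job = @queue.shift
    @processing.push(job) if job
    job
  end

  # On successful completion, drop the job from the processing list.
  def ack(job)
    @processing.delete(job)
  end

  # On startup, re-enqueue anything a dead worker left behind --
  # the recovery step that superfetch-style implementations automate.
  def requeue_orphans
    @queue.concat(@processing)
    @processing.clear
  end
end
```

This is only a model of the pattern's bookkeeping, not of the real gems, which also have to handle per-process processing lists and detecting that the owning worker is actually dead.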
Fair enough about idempotence.
I'm still confused about what you're saying though. You're saying that the language of "enhanced reliability" doesn't reflect losing 2 jobs over about 50*7 million (from your other comment)?
And that if you didn't pay for the service, you'd have to add some checks to make up for this?
That all seems incredibly reasonable to me.
Crashes are under your control though. They’re not caused by sidekiq. And you could always add your own crash recovery logic, as you say. To me that makes it a reasonable candidate for a pro feature.
It’s hard to get this right though. No matter where the line gets drawn, free users will complain that they don’t get everything for free.
How are crashes under your control? Again, they aren't talking about uncaught exceptions, but crashes. So maybe the server gets unplugged, the network disconnects, etc.
Jobs may crash due to VM issues or OOM problems. The more common cause of "orphans" is when the VM restarts and jobs can't finish during the shutdown period.