Comment by durkie

2 years ago

How often do your workers crash? I rely heavily on Sidekiq and don't think I see this very often, if ever.

We process around 50M Sidekiq jobs a day across a few hundred workers on a heavily autoscaled infrastructure.

Over the past week there were 2 jobs that would have been lost if not for SuperFetch.

It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

Edit for additional color: One of the most common crashes we'll see is OutOfMemory. We run in a containerized environment, and if a rogue job uses too much memory (or a deploy drastically changes our memory footprint) the container gets killed. In that scenario, the job is not placed back into the queue. SuperFetch is able to recover such jobs, albeit with really loose guarantees around "when".
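
For anyone curious, enabling it is basically a one-liner in the Sidekiq Pro server config; from memory it looks roughly like this (check the Pro wiki for the exact current API):

    # config/initializers/sidekiq.rb
    Sidekiq.configure_server do |config|
      # super_fetch! routes job fetches through a per-process working queue
      # instead of a plain pop, so jobs in flight when the process dies can
      # be recovered later.
      config.super_fetch!
    end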

  • Let me get this straight: you're complaining about eight 9s of reliability?

    50,000,000 * 7 = 350,000,000

    2 / 350,000,000 = 0.000000005714286

    1 - (2 / 350,000,000) = 0.999999994285714 ≈ 99.9999994%

    > It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.

    If your system isn't resilient to 2 in 350,000,000 jobs failing, I think there is something wrong with your system.

    • This isn't about 2 in 350,000,000 jobs failing. It's about 2 jobs disappearing entirely.

      It's not reliability we're talking about, it's durability. For reference, S3 is designed for eleven 9s of durability.

      Every major queuing system solves this problem. RabbitMQ uses unacknowledged messages, which are pinned to a TCP connection, so if that connection drops before they're acknowledged they get picked up by another worker. SQS uses visibility timeouts: if a message hasn't been successfully processed within a time window, it's made available to other workers again. Sidekiq's free edition chooses not to solve it, and that's a fine stance for a free product, just one I wish were made clearer.
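
      The underlying pattern is small enough to sketch in a few lines of Ruby against Redis. This is only an illustration of the reliable-fetch idea (atomically park the job in a per-process working list, "ack" by removing it, and have a reaper re-queue anything left behind by dead processes); the names are made up, not Sidekiq or SuperFetch internals:

          require "socket"
          require "redis"

          redis   = Redis.new
          working = "queue:default:working:#{Socket.gethostname}:#{Process.pid}"

          # Atomically move the job into this process's private working list rather
          # than plain-popping it, so a crash strands it there instead of losing it.
          if (job = redis.brpoplpush("queue:default", working, timeout: 2))
            handle(job)                   # hypothetical job handler
            redis.lrem(working, 1, job)   # the "ack": remove only after success
          end

          # Elsewhere, a reaper periodically scans working lists whose owning
          # process is gone and pushes any stranded jobs back onto queue:default.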

It's not uncommon to lose jobs in Sidekiq if you rely on it heavily and have a lot of jobs running. If using the free version for mission-critical jobs, I usually also run the task from cron to ensure it will be retried if the queued job is lost.
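
Concretely, that belt-and-braces setup can be as small as a cron-scheduled sweeper that re-enqueues anything still pending after a deadline. A rough sketch, assuming a Rails app with Sidekiq (Export and ExportWorker are placeholder names for whatever the critical work is):

    # lib/tasks/sweeper.rake -- scheduled from cron, e.g. */10 * * * *
    task sweep_stuck_exports: :environment do
      Export.where(state: "pending")
            .where("created_at < ?", 10.minutes.ago)
            .find_each do |export|
        # Safe to enqueue repeatedly: the worker no-ops if the export already finished.
        ExportWorker.perform_async(export.id)
      end
    end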

I have in the past monitored how many jobs were lost and, although it was a small percentage, it was still a recurring thing.

In containerized environments it may happen more often, due to OOM kills, or because you leverage autoscalers and have long-running Sidekiq jobs whose runtime exceeds the grace period you've configured for shutting down a container during a downscale, so the process is eventually terminated without prejudice.

OOM kills are particularly pernicious, as they can get into a vicious retry-kill-retry cycle. The individual job causing the OOM isn't that important (we will identify it, log it and noop it); the real problem is the blast-radius effect on the other Sidekiq threads (we run up to 20 threads on some of our workers), so you want to be able to recover and re-run any jobs that are innocent victims of a misbehaving job.
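
The "log it and noop it" part fits naturally in a small piece of Sidekiq server middleware plus a denylist in Redis. A sketch under those assumptions (the denylist key and class name here are invented, not a standard Sidekiq feature):

    # Skip (noop) job classes that have been flagged as OOM offenders, so they
    # stop dragging down the rest of the process's threads on every retry.
    class NoopDenylistedJobs
      def call(worker, job, queue)
        if Sidekiq.redis { |r| r.sismember("jobs:denylist", job["class"]) }
          Sidekiq.logger.warn("nooping denylisted job #{job['class']} (jid=#{job['jid']})")
          return  # treated as done; never yields to the real worker
        end
        yield
      end
    end

    Sidekiq.configure_server do |config|
      config.server_middleware { |chain| chain.add NoopDenylistedJobs }
    end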