← Back to context

Comment by kubectl_h

2 years ago

In containerized environments it may happen more often due to OOM kills or if you leverage autoscalers and have long running sidekiq jobs that have a runtime that exceeds your configured grace period for shutting down a container during a downscale and the process is eventually terminated without prejudice.

OOM kills are particularly pernicious as they can get into a vicious cycle of retry-killed-retry loops. The individual job causing the OOM isn't that important (we will identify it, log it and noop it), it's the blast radius effect on other sidekiq threads (we use up to 20 threads on some of our workers), so you want to be able to recover and re-run any jobs that are innocent victims of a misbehaving job.