Comment by ZephyrBlu
2 years ago
Let me get this straight, you're complaining about eight 9s of reliability?
50,000,000 * 7 = 350,000,000
2 / 350,000,000 = 0.000000005714286
1 - (2 / 350,000,000) = 0.999999994285714 = 99.999999%
> It's not a ton, but it's not zero. And when it comes to data durability the difference between zero and not zero is usually all that matters.
If your system isn't resilient to 2 in 350,000,000 jobs failing I think there is something wrong with your system.
This isn't about 2 in 350,000,000 jobs failing. It's about 2 jobs disappearing entirely.
It's not reliability we're talking about, it's durability. For reference, S3 has eleven 9s of durability.
Every major queuing system solves this problem. RabbitMQ uses unacknowledged messages which are pinned to a TCP connection, so when that connection drops before they are acknowledged, they get picked up by another worker. SQS uses visibility timeouts, where if the message hasn't been successfully processed within a time frame, it's made available to other workers. Sidekiq free edition chooses not to solve it. And that's a fine stance for a free product, but one I wish was made clearer.
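To make the visibility timeout idea concrete, here is a minimal in-memory sketch in Python. It is not the SQS API: the class name, method names, and the injectable clock are all made up for illustration, and it assumes each message is a unique hashable value. The point is only that a received message stays stored until a worker explicitly deletes it, so a crashed worker cannot make it disappear.

```python
import time
from collections import deque


class VisibilityTimeoutQueue:
    """Toy queue (hypothetical, not a real library): messages survive until deleted."""

    def __init__(self, visibility_timeout, clock=time.monotonic):
        self.visibility_timeout = visibility_timeout
        self.clock = clock
        self.ready = deque()   # messages any worker may receive
        self.in_flight = {}    # message -> time at which it becomes visible again

    def send(self, message):
        self.ready.append(message)

    def _requeue_expired(self):
        # Anything received but not deleted before its deadline goes back to ready.
        now = self.clock()
        for message, deadline in list(self.in_flight.items()):
            if deadline <= now:
                del self.in_flight[message]
                self.ready.append(message)

    def receive(self):
        self._requeue_expired()
        if not self.ready:
            return None
        message = self.ready.popleft()
        # Hide the message from other workers instead of removing it.
        self.in_flight[message] = self.clock() + self.visibility_timeout
        return message

    def delete(self, message):
        # A worker calls this only after it has finished the job.
        self.in_flight.pop(message, None)


# Fake clock so the example runs instantly.
now = [0.0]
queue = VisibilityTimeoutQueue(visibility_timeout=30, clock=lambda: now[0])
queue.send("job-1")

first = queue.receive()         # worker A takes the job, then crashes
hidden = queue.receive()        # nothing is visible while the job is in flight
now[0] += 31                    # the timeout passes without a delete
retry = queue.receive()         # worker B receives the same job
queue.delete(retry)             # worker B finishes and acknowledges
now[0] += 31
after_delete = queue.receive()  # the job is gone only after the explicit delete
```

A plain pop from a list, by contrast, removes the job the moment a worker fetches it, so a crash between the fetch and the end of the job loses it for good.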
If you want to focus on durability then I think your complaint makes even less sense. Somehow I doubt S3 is primarily backed by Redis.
I think it's fair to assume that something backed by Redis is not durable by default because that's not what Redis is known for, whereas the other options you listed are known for their resiliency and durability. I wouldn't view Sidekiq as a similar product to RabbitMQ and SQS.
Also, Sidekiq Pro uses more advanced Redis features to enable super_fetch, which lends weight to the assumption that by default Redis is not durable: https://www.bigbinary.com/blog/increase-reliability-of-backg....