I don't disagree with that callout. However, we've been through these discussions many times over the years. The Solid Queue of yesteryear was delayed_job, which was originally created by Shopify's CEO.
https://github.com/tobi/delayed_job
Shopify, however, grew (as did many others), and we saw a host of blog posts and talks about moving away from DB queues to Redis, RabbitMQ, Kafka, etc. We saw posts about moving from Resque to Sidekiq, etc. All this to say: storing a task queue in the DB has always been the naive approach. Engineers absolutely shouldn't be shocked when that approach isn't viable at higher workloads.
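For anyone who hasn't lived through it, the pattern in question is roughly this: every worker polls a jobs table and locks rows as it claims them. A minimal sketch, assuming Postgres and a made-up jobs(id, payload, run_at) table; none of these names come from delayed_job or Solid Queue:

    #!/usr/bin/env ruby
    # Rough sketch of the "task queue in the DB" pattern; assumes Postgres and
    # an illustrative jobs(id, payload, run_at) table. Not any particular gem's code.
    require "pg"
    require "json"

    def perform(payload)
      puts "running job: #{payload.inspect}"   # stand-in for real job code
    end

    conn = PG.connect(dbname: "app")

    loop do
      claimed = false
      conn.transaction do |tx|
        # Claim one due job; SKIP LOCKED stops concurrent workers from queueing
        # up behind the same row lock.
        result = tx.exec(<<~SQL)
          SELECT id, payload
          FROM jobs
          WHERE run_at <= now()
          ORDER BY run_at
          FOR UPDATE SKIP LOCKED
          LIMIT 1
        SQL
        next if result.ntuples.zero?

        job = result[0]
        perform(JSON.parse(job["payload"]))
        tx.exec_params("DELETE FROM jobs WHERE id = $1", [job["id"]])
        claimed = true
      end
      sleep 1 unless claimed   # idle poll; every worker polls the same table
    end

Every worker polls and locks that one hot table right next to the application's own queries, which is exactly the part that stops holding up at higher workloads.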
It's not like I'll get a choice between the task database going down and not going down. If my task database goes down, I'm either losing jobs or duplicating jobs, and I have to pick which one I want. Whether the downtime is at the same time as the production database or not is irrelevant.
In fact, I'd rather it did happen at the same time as production, so I don't have to reconcile a bunch of data on top of the tasks.
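To make that fork concrete, here's a toy sketch of the two acknowledgement orders; the in-memory array stands in for the task database and every name is made up:

    # Toy sketch of the choice above: acknowledge before or after the work.
    QUEUE = [
      { id: 1, kind: "send_email" },
      { id: 2, kind: "resize_image" },
    ]

    def perform(job)
      puts "working on #{job[:kind]}"
    end

    # At-most-once: drop the job first. A crash inside perform loses it for good.
    job = QUEUE.shift
    perform(job)

    # At-least-once: drop the job only after it finishes. A crash after perform
    # but before the shift means the job runs again on recovery, a duplicate.
    job = QUEUE.first
    perform(job)
    QUEUE.shift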
Here's an example from the CircleCI incident:
https://status.circleci.com/incidents/hr0mm9xmm3x6
And a good analysis by a Flickr engineer who ran into similar issues:
https://blog.mihasya.com/2015/07/19/thoughts-evoked-by-circl...
CircleCI and Flickr are both pretty big systems. There are tons of businesses that will never operate at that scale.
If you need to restore the production database, do you also want to restore the task database?
If your task is to send an email, do you want to send it again? Probably not.
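The usual hedge is an idempotency key checked before delivery, so a replayed or restored job turns into a no-op. A rough sketch; SENT_KEYS and deliver_email are hypothetical stand-ins, and a real key store would need to be durable:

    # Sketch of an idempotent email job: skip delivery if this key was already sent.
    require "set"

    SENT_KEYS = Set.new   # in practice a durable store, not process memory

    def deliver_email(to:, body:)
      puts "delivering to #{to}"
    end

    def send_email_job(to:, body:, idempotency_key:)
      return if SENT_KEYS.include?(idempotency_key)   # replayed/restored job: no-op
      deliver_email(to: to, body: body)
      SENT_KEYS << idempotency_key
    end

    send_email_job(to: "user@example.com", body: "hi", idempotency_key: "welcome-42")
    send_email_job(to: "user@example.com", body: "hi", idempotency_key: "welcome-42")   # second call does nothing

Even then there's a window between delivering and recording the key, so a crash at exactly the wrong moment can still double-send; it narrows the problem rather than removing it.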