Comment by crote
2 hours ago
Is a roll back even possible at Cloudflare's size?
With small deployments it usually isn't too difficult to re-deploy a previous commit. But once you get big enough, you have so many developers that half a dozen PRs will have been merged between the start of the incident and now. How viable is it to stop the world, undo everything, and start from scratch any time a deployment causes the tiniest issue?
Realistically the best you're going to get is merging a revert of the problematic changeset - but with the intervening merges that's still going to bring the system into a novel state. You're rolling forwards, not backwards.
Disclosure: Former Cloudflare SRE.
The short answer is "yes", because of the way the configuration management works. Other infrastructure changes or service upgrades might get undone in the process, but it's possible. The alternative is to revert the commit that introduced the package bump with the new code and force that to roll out everywhere rather than waiting for the progressive rollout.
There shouldn't be much chance of bringing the system to a novel state because configuration management will largely put things into the correct state. (The exception is files that CM previously created: it won't delete them unless explicitly told to do so.)
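To make the leftover-file caveat concrete, here is a minimal sketch of a convergent CM run. It is a toy model, not Cloudflare's actual tooling: the `converge` function, the `absent` parameter, and the file paths are all hypothetical names made up for illustration.

```python
def converge(actual, desired, absent=()):
    """Return the file state after one (toy) config-management run.

    `actual` and `desired` map path -> content. Paths listed in `absent`
    are removed explicitly; anything else that is missing from `desired`
    is left untouched, which is the caveat described above.
    """
    result = dict(actual)
    result.update(desired)          # create or overwrite managed files
    for path in absent:
        result.pop(path, None)      # only explicit removals delete anything
    return result


# Hypothetical releases: the new one adds a file the old one never had.
old_release = {"/etc/svc/main.conf": "v1"}
new_release = {"/etc/svc/main.conf": "v2", "/etc/svc/extra.conf": "v2"}

host = converge({}, old_release)
host = converge(host, new_release)   # roll forward to the bad release
host = converge(host, old_release)   # "roll back" by re-applying old config

# main.conf is back at v1, but extra.conf from the bad release lingers.
leftover = sorted(set(host) - set(old_release))
```

In this model the rollback only becomes exact once the old config also declares the extra file absent, e.g. `converge(host, old_release, absent=["/etc/svc/extra.conf"])`.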
That depends on how you structure your deployments. At some large tech companies, thousands of little changes are made every hour, but deployments are made in n-day cycles. A cut-off point in time is chosen, and the first 'green' commit after it is picked for the current deployment. If that fails in an unexpected way, you just deploy the last binary back, fix (and test) whatever broke, and either try again or abandon the release if the next cut is already close.
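A rough sketch of that cut-and-rollback flow, under stated assumptions: the `Commit` type, `pick_release_commit`, `deploy`, and the health check are hypothetical names for illustration, and timestamps are plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Commit:
    sha: str
    timestamp: int
    green: bool  # True if CI passed on this commit


def pick_release_commit(history, cutoff):
    """Return the first green commit strictly after `cutoff`, or None."""
    for commit in sorted(history, key=lambda c: c.timestamp):
        if commit.timestamp > cutoff and commit.green:
            return commit
    return None


def deploy(candidate_sha, last_good_sha, is_healthy):
    """Return the sha left running: the candidate, or the last good binary."""
    if is_healthy(candidate_sha):
        return candidate_sha
    return last_good_sha  # redeploy the previous artifact, no revert needed


history = [
    Commit("a", 1, True),
    Commit("b", 5, False),
    Commit("c", 6, True),
    Commit("d", 7, True),
]
release = pick_release_commit(history, cutoff=4)
running = deploy(release.sha, last_good_sha="a", is_healthy=lambda sha: False)
```

The point of the design is that rollback swaps a stored artifact back in, so the merges that landed after the cut never enter the picture.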
If companies like Cloudflare haven't figured out how to do reliable rollbacks, there seems little hope for any of us.
I'd presume they have the ability to deploy a previous artifact vs only tip-of-master.