Comment by liampulles
6 hours ago
Rollback is a reliable strategy when the rollback process is well understood. If the process is not well understood and regularly rehearsed, then it is a risk in itself.
I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.
Rollback presumes complete atomicity; otherwise it's only slightly better than a yeet. It's like a backup that has never been tested.
Complete atomicity carries with it the idea that the world is frozen, and that data only changes when you allow it to.
That's to say, it's an incredibly good idea when you can physically implement it. It's not something that everybody can do.
No, complete atomicity doesn't require a frozen state; it requires common sense and fail-proof, fool-proof guarantees backed by testing.
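To make "guarantees from testing" a bit more concrete, here is a minimal sketch of a rollback drill: activate a change, roll it back, and check that the earlier state comes back exactly. Everything in it (`VersionedConfig`, `rollback_drill`, the `rule_limit` key) is hypothetical and in-memory only; it is not a description of how Cloudflare or any real deployment system works.

```python
import threading


class VersionedConfig:
    """Hypothetical in-memory config store with atomic activate and rollback."""

    def __init__(self, initial):
        self._lock = threading.Lock()
        self._history = [dict(initial)]  # one snapshot per activated version

    def current(self):
        with self._lock:
            return dict(self._history[-1])

    def activate(self, new_config, validate):
        # Validate a private copy first; a rejected candidate changes nothing.
        candidate = dict(new_config)
        if not validate(candidate):
            raise ValueError("candidate config failed validation; nothing changed")
        with self._lock:
            self._history.append(candidate)

    def rollback(self):
        # Drop the newest snapshot and return the one that is now active.
        with self._lock:
            if len(self._history) < 2:
                raise RuntimeError("no earlier version to roll back to")
            self._history.pop()
            return dict(self._history[-1])


def rollback_drill(store, candidate, validate):
    """Activate, roll back, and report whether the prior state was restored."""
    before = store.current()
    store.activate(candidate, validate)
    store.rollback()
    return store.current() == before
```

Running something like `rollback_drill` routinely, rather than only during an incident, is the point: the rollback path gets exercised the same way a backup gets exercised by a test restore. A real system also has to deal with state that changed outside the store in the meantime, which this sketch deliberately ignores.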
There is another name for rolling forward: tripping up.
Global rollout of security code on a timeframe of seconds is part of Cloudflare's value proposition.
In this case they got unlucky with an incident before they finished work on planned changes from the last incident.