Comment by echelon
4 hours ago
You want to build a world where roll back is 95% the right thing to do. So that it almost always works and you don't even have to think about it.
During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.
Certain well-understood migrations are the only cases where roll back might not be acceptable.
Always keep your services in "roll back able", "graceful fail", "fail open" state.
This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.
Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.
I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.
"Fail open" state would have been improper here, as the system being impacted was a security-critical system: firewall rules.
It is absolutely the wrong approach to "fail open" when you can't run security-critical operations.
Cloudflare is supposed to protect me from occasional ddos, not take my business offline entirely.
This can be architected in such a way that if one rules engine crashes, other systems are not impacted and other rules, cached rules, heuristics, global policies, etc. continue to function and provide shielding.
You can't ask for Cloudflare to turn on a dime and implement this in this manner. Their infra is probably very sensibly architected by great engineers. But there are always holes, especially when moving fast, migrating systems, etc. And there's probably room for more resiliency.