Comment by jeffrallen
2 days ago
Whatever this "red-button" technology is, it's pants. If you know at incident + 10 mins that you want to turn something off, it should be off within a minute. Not "Preparing a change to trigger the red-button", but "the stop flag was set by an operator within a minute and was synced globally within seconds".
I mean, it's not like they don't have that technology: the worldwide sync was exactly what caused the outage.
At $WORK we use Consul for this job.
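To illustrate the kind of stop flag meant here, a minimal sketch follows. It assumes a Consul agent on the default local address and reads the flag through Consul's KV HTTP endpoint (`/v1/kv/<key>?raw`). The key name `flags/feature-x`, the `"stop"` value convention, and the `KillSwitch` class are hypothetical, made up for this example, and not anything from the incident report.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Optional


def fetch_from_consul(key: str, base_url: str = "http://127.0.0.1:8500") -> Optional[str]:
    """Read a raw value from Consul's KV HTTP API; return None if the key is absent."""
    url = f"{base_url}/v1/kv/{key}?raw"
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.read().decode("utf-8").strip()
    except urllib.error.HTTPError as err:
        if err.code == 404:  # key not set: treat as "no stop requested"
            return None
        raise


class KillSwitch:
    """Hypothetical stop flag: re-reads the key at most every max_age_s seconds."""

    def __init__(
        self,
        key: str,
        fetch: Callable[[str], Optional[str]] = fetch_from_consul,
        max_age_s: float = 5.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._key = key
        self._fetch = fetch
        self._max_age_s = max_age_s
        self._clock = clock
        self._stopped = False
        self._checked_at: Optional[float] = None

    def is_stopped(self) -> bool:
        now = self._clock()
        stale = self._checked_at is None or now - self._checked_at >= self._max_age_s
        if stale:
            try:
                self._stopped = self._fetch(self._key) == "stop"
            except OSError:
                # Store unreachable: keep the last known state (an assumption;
                # some teams would prefer to fail closed instead).
                pass
            self._checked_at = now
        return self._stopped
```

With something like this, the service checks `is_stopped()` before running the risky code path, and the operator's whole "change" is a single write such as `consul kv put flags/feature-x stop`. How fast it takes effect is bounded by `max_age_s` plus however long the store takes to replicate the write.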
Generally, even these emergency changes are not applied entirely immediately, to prevent a fix from making things worse. This is an operational choice though, not a technical limitation. My guess, having been involved in similar issues in the past, is that the ~15 minute delay in preparing the change came either from it not being a normally used big red button, so it wasn't clear how to use it, or from some other friction in preparing the change.
What is the difference between a red button and a feature flag, anyway? The report says there was no feature flagging, yet they had this "red button".
It sounds to me like something needed to be recompiled and redeployed.