Comment by jeffrallen

2 days ago

Whatever this "red-button" technology is, it's pants. If you know at incident + 10 mins that you want to turn something off, it should be off within a minute. Not "Preparing a change to trigger the red-button", but "the stop flag was set by an operator within a minute and was synced globally within seconds".

I mean, it's not like they don't have that technology: the worldwide sync was exactly what caused the outage.

At $WORK we use Consul for this job.
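To make the "stop flag" idea concrete, here is a minimal sketch of a runtime kill switch. It assumes the flag lives in some replicated key-value store; the `KillSwitch` class, the `store` dict, and the key name are all hypothetical stand-ins, and no real Consul API is used. In practice `fetch_flag` would be replaced by a read against whatever store you run.

```python
import threading


class KillSwitch:
    """Polls a flag source and caches the last known value locally."""

    def __init__(self, fetch_flag, poll_interval=1.0):
        # fetch_flag: callable returning True when the feature must be off.
        self._fetch_flag = fetch_flag
        self._poll_interval = poll_interval
        self._stopped = threading.Event()
        self._shutdown = threading.Event()

    def refresh(self):
        """Fetch the flag once; if the store is unreachable, keep the last value."""
        try:
            stop = bool(self._fetch_flag())
        except Exception:
            return
        if stop:
            self._stopped.set()
        else:
            self._stopped.clear()

    def start(self):
        """Poll in a background thread so request handlers never block on the store."""
        def loop():
            while not self._shutdown.is_set():
                self.refresh()
                self._shutdown.wait(self._poll_interval)

        thread = threading.Thread(target=loop, daemon=True)
        thread.start()
        return thread

    def close(self):
        self._shutdown.set()

    def is_stopped(self):
        return self._stopped.is_set()


# Hypothetical in-memory stand-in for the globally replicated store.
store = {"features/new-path/stop": False}
switch = KillSwitch(lambda: store["features/new-path/stop"])


def handle_request(payload):
    # The check is a local read, so flipping the flag needs no deploy.
    if switch.is_stopped():
        return "legacy:" + payload
    return "new:" + payload
```

The point of the design is that the operator only writes one key, and every process picks it up on its next poll. The code path that honours the flag is already deployed, so nothing has to be prepared during the incident.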

Generally, even these emergency changes are not rolled out entirely immediately, to prevent a fix from making things worse. This is an operational choice, though, not a technical limitation. My guess, having been involved in similar issues in the past, is that the ~15 minute delay in preparing the change was either because it wasn't a normally used big red button, so it wasn't clear how to use it, or because there was some other friction in preparing the change.

What is the difference between a red button and a feature flag, anyway? The report says there was no feature flagging, yet they had this "red button".

It sounds to me like something needed to be recompiled and redeployed.