← Back to context

Comment by charcircuit

2 days ago

>We will enforce all changes to critical binaries to be feature flag protected and disabled by default.

Not only if this was true it would kill developer productivity, but how can this even be done. When the compiler versions bumps that's a lot changes that need to be gated. Every team that works on a dependency will have to add feature flags for any change and bug fix they make.

Yeah, some of their takeaways seem overly stifling. This sounds like it wasn't a case of a broken process or missing ingredients. All the tools to prevent it were there (feature flags, null handling, basic static analysis tools). Someone just didn't know to use them.

This also got a laugh:

We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage. For some customers, the monitoring infrastructure they had running on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact to their business and/or infrastructure.

You should always have at least some kind of basic monitoring that's on completely separate infrastructure, ideally from a different vendor. (And maybe Google should too)

  • this is the worst GCP outage I can remember

    > It took up to ~2h 40 mins to fully resolve in us-central-1

    this would have cost their customers tens of millions, maybe north of $100M.

    not surprised they'd have an extreme write up like this.

I've actually seen this happen before. Most changes get deployed with a gradually enabled(% of users/cities/...) feature flag and then cleaned up later on. There is a slack notification from the central service which manages them that tells you how many rollouts are complete reminding you to clean them up. It escalates to SRE if you don't pay heed for long enough.

In the rollout duration, the combinatorial explosion of code paths is rather annoying to deal with. On the other hand, it does encourage not having too many things going on at once. If some of the changes affect business metrics, it will be hard to glean any insights if you have too many of them going on at once.

And yet it's not even close to the productivity hurting rules and procedures followed for aerospace software.

Rules in aerospace are written in blood. Rules of internet software are written in inconvenience, so productivity is usually given much higher priority than reducing risk of catastrophic failure.