
Comment by rkagerer

2 days ago

Yeah, some of their takeaways seem overly stifling. This doesn't sound like a case of a broken process or missing ingredients: all the tools to prevent it were there (feature flags, null handling, basic static analysis), and someone just didn't know to use them.
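
To make that concrete, here's a rough Go sketch of the kind of guard rails I mean. It's not the actual Service Control code; the flag name and policy type are made up. The point is just that the new path stays dark behind a flag and treats a blank field as an error instead of dereferencing a nil pointer:

```go
package main

import "fmt"

// Hypothetical feature-flag client; stands in for whatever flag system
// the real service uses.
type Flags struct{ enabled map[string]bool }

func (f *Flags) Enabled(name string) bool { return f.enabled[name] }

// Hypothetical policy type with an optional field, loosely modeled on the
// "blank fields in a policy" failure described in the postmortem.
type QuotaPolicy struct {
	Limit *int // may be nil if the policy arrives with blank fields
}

func applyPolicy(flags *Flags, p *QuotaPolicy) error {
	// New behavior stays dark until the flag is turned on, so a bad
	// policy can't take down every region at once.
	if !flags.Enabled("new_quota_checks") {
		return nil
	}
	// Defensive nil handling instead of crashing on a missing field.
	if p == nil || p.Limit == nil {
		return fmt.Errorf("quota policy missing limit; skipping enforcement")
	}
	fmt.Printf("enforcing limit %d\n", *p.Limit)
	return nil
}

func main() {
	flags := &Flags{enabled: map[string]bool{"new_quota_checks": true}}
	// A policy with a blank field is rejected gracefully.
	if err := applyPolicy(flags, &QuotaPolicy{}); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Either guard on its own turns a malformed policy into a logged, skippable error rather than a crash loop.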

This also got a laugh:

> We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage. For some customers, the monitoring infrastructure they had running on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact to their business and/or infrastructure.

You should always have at least some kind of basic monitoring that's on completely separate infrastructure, ideally from a different vendor. (And maybe Google should too)
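
Even something as simple as the probe below, run from a box outside the cloud it watches and alerting through a second vendor, keeps a pulse on your endpoints when the monitored platform itself is down. The URLs are placeholders, and a real setup would obviously dedupe and page properly; this is just a sketch of the shape:

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

func main() {
	// Hypothetical placeholders: the endpoint you care about, and an alert
	// hook hosted by a *different* vendor than the one being monitored.
	const target = "https://status.example.com/healthz"
	const alertWebhook = "https://alerts.other-vendor.example/hook"

	client := &http.Client{Timeout: 10 * time.Second}
	for {
		status := "ok"
		resp, err := client.Get(target)
		switch {
		case err != nil:
			status = fmt.Sprintf("unreachable: %v", err)
		case resp.StatusCode != http.StatusOK:
			status = fmt.Sprintf("bad status: %d", resp.StatusCode)
		}
		if resp != nil {
			resp.Body.Close()
		}
		if status != "ok" {
			// Alerting goes out through a separate provider on purpose, so
			// it still works when the monitored cloud is the thing that's down.
			body := fmt.Sprintf(`{"text":"health check failed: %s"}`, status)
			if r, perr := http.Post(alertWebhook, "application/json", strings.NewReader(body)); perr == nil {
				r.Body.Close()
			}
		}
		time.Sleep(60 * time.Second)
	}
}
```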

this is the worst GCP outage I can remember

> It took up to ~2h 40 mins to fully resolve in us-central-1

this would have cost their customers tens of millions, maybe north of $100M.

not surprised they'd have an extreme write-up like this.