← Back to context

Comment by gfodor

6 years ago

Wow, I'm amazed two people could read that writeup (yourself and myself) and come to two totally different conclusions.

Pushing out a brand new regex engine surely will go through the usual process. This doesn't seem like it will take a lot of time unless there are surprises. Cloudflare clearly has the infrastructure in place already to do a proper integration test for correctness test and rampup infrastructure to ensure it doesn't cause a global outage. The global nature of this outage was because the rampup infrastructure was explicitly not used as per the protocol.

I have no idea what you read where a single engineer was singled out. At several points in this post mortem the author identifies that the regex being written by the individual involved was far from the only cause of the outage. This is a very textbook blameless post mortem doc afaict.

The narrative about the actions taken and meetings which were in is also par for the course for a good post mortem since these variables are real, and should be addressed by remediation items if they contributed to the outage. (For example, is it sane that the entire engineering team was synchronously in a meeting? Probably not.)