Comment by woliveirajr
6 years ago
> Unfortunately, last Tuesday’s update contained a regular expression that backtracked enormously and exhausted CPU used for HTTP/HTTPS serving.
One of those cases where they had one problem, used a regular expression, and ended up with two problems?
Edit: I really like how much information Cloudflare gives. The 11 points in the "what went wrong" analysis are how every root-cause analysis should be done.
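For anyone wondering what "backtracked enormously" looks like in practice, here is a rough sketch. It is a toy model, not the regex from the incident: the pattern `(a+)+b` and the function name `count_attempts` are my own illustrative choices, and real engines differ in detail. It counts how many ways a naive backtracking matcher could split a run of n "a" characters into groups before giving up because there is no "b".

```python
def count_attempts(n: int) -> int:
    """Count the group splits a naive backtracker explores for the
    hypothetical pattern (a+)+b on a string of n 'a's with no 'b'."""
    attempts = 0

    def match_groups(pos: int) -> None:
        nonlocal attempts
        # The next (a+) group may end anywhere from pos + 1 up to n.
        for end in range(pos + 1, n + 1):
            attempts += 1
            # After this group the matcher looks for 'b', fails,
            # and tries starting yet another group from `end`.
            match_groups(end)

    match_groups(0)
    return attempts


for n in (4, 8, 16):
    print(n, count_attempts(n))
```

In this model the count is 2**n - 1, so every extra character doubles the work. That is the general shape of the problem: a short input can keep a CPU busy for a very long time.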
Somewhat humorous, as someone [1] (congrats /u/fossuser!) mentioned this failure scenario in the thread about Twitter being down yesterday.
"Pushing bad regex to production, chaos monkey code causing cascading network failure, etc.", in response to a comment from someone who previously worked at Cloudflare.
[1] https://news.ycombinator.com/item?id=20415608
They mentioned it was a regular expression in the original post[0] on the day of the incident, so that part isn't news (discussion here[1]).
[0]: https://news.ycombinator.com/item?id=20336332
I agree this is an awesome post and a really great example of how every Root Cause Analysis needs to be done. I am also impressed by their incident response.