Comment by _wmd
6 years ago
So in response to a catastrophic failure due to testing in prod, they're going to push out a brand new regex engine with an ETA of 2 weeks. Can anyone say testing in prod?
The constant use of 'I' and 'me' (19 occurrences in total) deeply tarnishes this report, and repeatedly singling out a responsible engineer, nameless or not, is a failure in its own right. This was a collective failure; any individual identity is totally irrelevant. We're not looking for an account of your superman-like heroism, sprinting from meeting rooms or otherwise. We want to know whether anything has been learned in the 2 years since Cloudflare leaked heap all across the Internet without noticing, and the answer to that seems fantastically clear.
This report is written by me, the CTO of Cloudflare. I say "I" throughout because organizational failings are my responsibility. If I'd said "we" I imagine you'd be criticizing me for NOT taking responsibility.
If you read the report you'd see I do not blame the engineer responsible at all. Not once. I made that perfectly clear.
I wonder if you are able to talk a bit about the development of the Lua-based WAF. I imagine the possible unbounded performance of feeding requests into PCRE must have occurred to you or others at the time - or at least, long before this outage.
I don't mean this as some sort of lame 'lol shoulda known better' dunk - stories about technical organizations' decision-making and tradeoff-handling are just more interesting than the details of how regexes typed in a control panel grow up to become Jira tickets.
I did a talk about this years ago: https://www.youtube.com/watch?v=nlt4XKhucS4
Wow, I'm amazed two people could read that writeup (yourself and myself) and come to two totally different conclusions.
Pushing out a brand new regex engine will surely go through the usual process. This doesn't seem like it will take a lot of time unless there are surprises. Cloudflare clearly already has the infrastructure in place to do proper integration testing for correctness, plus rampup infrastructure to ensure a change doesn't cause a global outage. The outage was global because the rampup infrastructure was explicitly not used, as per the protocol.
I have no idea what you read where a single engineer was singled out. At several points in this post mortem the author identifies that the regex being written by the individual involved was far from the only cause of the outage. This is a very textbook blameless post mortem doc afaict.
The narrative about the actions taken and the meetings people were in is also par for the course for a good post mortem, since these variables are real and should be addressed by remediation items if they contributed to the outage. (For example, is it sane that the entire engineering team was synchronously in a meeting? Probably not.)
It seems we're reading different blog posts. Under the "What went wrong" section there are 11 points, all with differing levels of responsibility and ownership. He did well to identify the collective nature of this failure.
I don't see why switching to a new regex implementation would be so scary. 2 weeks to test that your regexes don't break seems fine? Seems like a long time tbh.
On top of that they're switching to more constrained regex engines. Rust's regex engine makes guarantees about its running time, something that would have directly mitigated a portion of the issue. And it isn't as if RE2/Rust regex aren't in use anywhere; Rust's regex engine is integrated into vscode, for example.
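To make the running-time point concrete, here is a minimal toy sketch in Python. It is not PCRE, RE2, or Rust's regex crate, and the function names (`backtrack_steps`, `stateset_steps`, `pathological`) are made up for illustration. It only supports patterns built from required and optional literal characters, and it assumes the classic pathological case of n optional 'a's followed by n required 'a's, matched against n 'a's. The backtracking matcher explores an exponential number of paths, while the matcher that tracks a set of automaton states does work bounded by text length times pattern length.

```python
# Toy comparison of two matching strategies. A pattern is a list of
# (char, optional) tokens: ("a", True) behaves like "a?", ("a", False) like "a".


def backtrack_steps(pattern, text):
    """Match by recursive backtracking; return (matched, recursive calls made)."""
    steps = 0

    def go(pi, ti):
        nonlocal steps
        steps += 1
        if pi == len(pattern):
            return ti == len(text)
        ch, optional = pattern[pi]
        # Greedy choice first: try to consume one character.
        if ti < len(text) and text[ti] == ch and go(pi + 1, ti + 1):
            return True
        # Otherwise skip the token, which is only legal if it is optional.
        return optional and go(pi + 1, ti)

    return go(0, 0), steps


def stateset_steps(pattern, text):
    """Match by tracking every reachable pattern position at once.

    Returns (matched, number of (state, character) checks performed).
    """
    steps = 0

    def closure(states):
        # Add positions reachable by skipping optional tokens.
        seen = set()
        stack = list(states)
        while stack:
            s = stack.pop()
            if s in seen:
                continue
            seen.add(s)
            if s < len(pattern) and pattern[s][1]:
                stack.append(s + 1)
        return seen

    current = closure({0})
    for c in text:
        nxt = set()
        for s in current:
            steps += 1
            if s < len(pattern) and pattern[s][0] == c:
                nxt.add(s + 1)
        current = closure(nxt)
    return len(pattern) in current, steps


def pathological(n):
    """Build the n-optional-then-n-required pattern and its n-character input."""
    pattern = [("a", True)] * n + [("a", False)] * n
    return pattern, "a" * n


if __name__ == "__main__":
    for n in (4, 8, 12, 16):
        pattern, text = pathological(n)
        _, bt = backtrack_steps(pattern, text)
        _, ss = stateset_steps(pattern, text)
        print(f"n={n}: backtracking steps={bt}, state-set steps={ss}")
```

Both matchers agree on whether the input matches; only the amount of work differs. The trade-off is that engines built this way typically give up features such as backreferences in exchange for the bound.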
Personal attacks aren't allowed on HN, and please don't post in the flamewar style here generally.
https://news.ycombinator.com/newsguidelines.html
You are overreacting and protecting your preferred people. What is HN running on again?
If this is a personal attack, there are literally 10-50 of these per day in arbitrary threads.
That comment was breaking the site guidelines, quite badly in fact. We moderate comments like that the same way regardless of who or what they're about.
> there are literally 10-50 of these per day in arbitrary threads
If you can find cases of this where moderators didn't respond, I'd like to see links. The likeliest explanation is simply that we didn't see it. We don't come close to seeing everything that gets posted here, so we depend on users, via flagging (https://news.ycombinator.com/newsfaq.html) or by emailing hn@ycombinator.com.
> What is HN running on again?
I suppose I have to answer this or someone will concoct a sinister reason why I didn't. HN doesn't run on Cloudflare.
You can easily duplicate traffic into a test infrastructure that wouldn't affect the production environment, and you're acting as if RE2 et al. haven't had plenty of testing too. 2 weeks with the level of traffic (test data) that Cloudflare gets seems pretty realistic.