Comment by throwaway66666

6 years ago

I am wondering why you are being downvoted. This outage could have been prevented with better deployment procedures too.

For example, my company (nowhere near the scale of Cloudflare) does progressive deployments. New code is deployed to only a handful of machines first, and then, as the hours pass and checks remain green, it propagates to the rest of the server fleet. A full deployment takes 24 hours. We have not had a breaking code change reach production in the past 3 years. Before that, breaking things ourselves was the most common cause of production issues. Of course, that's not the only thing we do: there are also good test practices, code reviews, etc.
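To make the idea concrete, here is a minimal sketch of that kind of wave-based rollout. It is not our actual tooling: the wave fractions and the `deploy` and `healthy` callbacks are hypothetical placeholders, and a real system would wait for a soak period between waves rather than checking health immediately.

```python
def rollout_waves(hosts, fractions=(0.01, 0.1, 0.5, 1.0)):
    """Split hosts into successive waves; each fraction is a cumulative target."""
    waves = []
    done = 0
    for fraction in fractions:
        # Always advance by at least one host, never past the end of the fleet.
        target = min(len(hosts), max(done + 1, int(len(hosts) * fraction)))
        if target > done:
            waves.append(hosts[done:target])
            done = target
    return waves


def progressive_deploy(hosts, deploy, healthy, fractions=(0.01, 0.1, 0.5, 1.0)):
    """Deploy wave by wave and stop as soon as any deployed host is unhealthy.

    `deploy` and `healthy` are caller-supplied callbacks (assumptions for this
    sketch). Returns (succeeded, hosts_deployed_so_far).
    """
    deployed = []
    for wave in rollout_waves(hosts, fractions):
        for host in wave:
            deploy(host)
            deployed.append(host)
        # In practice you would sleep here for the soak period before checking.
        if not all(healthy(host) for host in deployed):
            return False, deployed
    return True, deployed
```

The point of the design is that a bad build is stopped while it is on a small slice of the fleet, so the blast radius is bounded by the size of the wave that failed.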

The second thing is separation of monitoring and production. If production going down takes the monitoring systems down with it, you will have a very hard time figuring out what's wrong. Cloudflare says "We had difficulty accessing our own systems because of the outage". That sounds very bad.

I'd wager there are many things wrong here other than "regex is hard". But I guess HN loves Cloudflare way too much to ask the hard questions.

Yeah, they get some points for admitting WAF rule updates bypass canary deployments so that they can be applied ASAP. But still.

Backtracking attacks against regex engines (ReDoS) are extremely well known. The only reason I can fathom for not having an execution-time watchdog is that the Nginx Lua runtime doesn't allow it. I assume the scripts run during a single request cycle on one thread due to Nginx async IO (one thread per core only).
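For illustration, here is a rough sketch of what an execution-time watchdog around a regex match can look like. It is written in Python rather than Lua and has nothing to do with Cloudflare's actual setup. The helper name `search_with_watchdog`, the exit codes, and the time budget are all made up for this example. It runs the match in a child process so that a runaway match can be killed from outside, which is far too expensive per request but shows the principle.

```python
import subprocess
import sys

# Child script: exit code 0 means "matched", 10 means "no match".
# Any other exit code (e.g. an invalid pattern) is treated as an error.
_CHILD_SCRIPT = (
    "import re, sys\n"
    "sys.exit(0 if re.search(sys.argv[1], sys.argv[2]) else 10)\n"
)


def search_with_watchdog(pattern, text, timeout_s=1.0):
    """Return True/False for match/no match, or None if the time budget is exceeded."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", _CHILD_SCRIPT, pattern, text],
            timeout=timeout_s,  # subprocess.run kills the child on timeout
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return None
    if result.returncode == 0:
        return True
    if result.returncode == 10:
        return False
    raise ValueError("regex evaluation failed (invalid pattern?)")


if __name__ == "__main__":
    # A classic pathological case: nested quantifiers plus a non-matching tail.
    print(search_with_watchdog(r"(a+)+$", "a" * 40 + "b", timeout_s=0.5))
```

In an embedded runtime the same effect is usually achieved with a step or instruction budget inside the matcher, or by using an engine that guarantees linear-time matching, rather than with a separate process.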

That's still no excuse. They admit to running THOUSANDS of Regex rules in custom Lua scripts embedded in Nginx. This sounds like a bad idea to anyone that knows anything about software because it is.

My previous employer embedded way too much Lua script inside Nginx plugins for the same reason (it's easy). Even at our "scale" (50 requests/second) we had constant issues. To think they run ~10% of internet traffic on such a Rube Goldberg machine is proof you can use just about anything in prod (until it inevitably explodes, at least).

I'm confused about what this response is trying to say. Did you read the whole post? They addressed exactly those two things and explained how they're fixing them. You're essentially just repeating part of the blog post, which is why I wonder whether you finished reading it.