Comment by blr246
6 years ago
Appreciate the detail here. It's a great writeup. Wondering what folks think about one of the changes:
> 5. Changing the SOP to do staged rollouts of rules in the same manner used for other software at Cloudflare while retaining the ability to do emergency global deployment for active attacks.
One concern I'd have is whether or not I'm exercising the global rollout procedure often enough to be confident it works when it's needed. Of the hundreds of WAF rule changes rolled out every month, how many are global emergencies?
It's a fact of managing process that branches are a liability, and the hot path is the one that ends up most reliable. I wonder if anyone there has concerns about diluting the rapid response path (the one carrying the highest associated risk) by making this process change.
Yep, that's the exact bullet point I was writing a response to. Security and abuse are of course special little snowflakes, with configs that need to be pushed very fast, contrary to all best practices for safe deployments of globally distributed systems. An anti-abuse rule that takes three days to roll out might as well not exist.
The only way this makes sense is if they mean that there'll be a staged rollout of some sort, but it won't be the same process as for the rest of their software. I.e. for this purpose you need much faster staging just due to the problem domain, but even a 10 minute canary should provide meaningful push safety against this kind of catastrophic meltdown. And the emergency process is something you'll use once every five years.
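To make that concrete, here's a minimal Python sketch of "one pipeline, two sets of thresholds". It's purely illustrative; the stage names, fleet fractions, and soak times are all invented:

    from dataclasses import dataclass

    # Hypothetical staged-rollout policy: emergency rules ride the same
    # pipeline as everything else, just with much shorter soak times.
    @dataclass(frozen=True)
    class Stage:
        name: str
        fraction: float    # share of the fleet that gets the change
        soak_seconds: int  # how long to watch health checks before promoting

    NORMAL = [
        Stage("canary", 0.001, 3600),
        Stage("one-dc", 0.05, 6 * 3600),
        Stage("global", 1.0, 0),
    ]

    EMERGENCY = [
        Stage("canary", 0.001, 600),  # even a 10-minute canary catches a hard crash
        Stage("global", 1.0, 0),
    ]

    def plan(emergency: bool) -> list[Stage]:
        # One code path, two thresholds: the emergency track can't bit-rot,
        # because it *is* the normal track with smaller numbers.
        return EMERGENCY if emergency else NORMAL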
Your response suggests a good way to mitigate the risk I was trying to highlight in mine.
They want to have a rapid response path (little to no delay in staging environments) to respond to emergencies. The old SOP allowed all releases to use the emergency path. If the SOP no longer exercises that path, I'd be concerned it could break silently from some other refactor or change.
Your notion is to maintain the emergency rollout as a relaxation of the new SOP, such that the time in staging is reduced to almost nothing. That sounds like a good idea, since it avoids maintaining two processes and the greater risk of breakage that comes with it. So: same logic with different thresholds, rather than two independent processes.
Right. The emergency path is either something you end up using always, or something you use so rarely that it gets eaten by bit-rot before it ever gets used[0]. So I think we're in full agreement on your original point. This was just an attempt to parse a working policy out of that bullet point.
[0] My favorite example of this had somebody accidentally trigger an ancient emergency config push procedure. It worked: it made a (pre-canned) global configuration change that broke everything. Since the change was made via this non-standard and obsolete method, rolling it back took ages. In theory the rollback should have been trivial, but in practice, in the years since the functionality had been written (and never used), somehow all humans had lost the rights to override the emergency system.
> Security and abuse are of course special little snowflakes, with configs that need to be pushed very fast, contrary to all best practices for safe deployments of globally distributed systems.
Once upon a time, I worked on a system where many values that would otherwise be statically defined were instead put into a database table. This particular system didn't have a proper testing and deployment pipeline set up, so whereas a normal system would just change the static value at some hard-coded point in the code and quickly roll it out, this system kept values in the database so they could be changed between manual deployments (months or even years apart). The ability to change a user-facing value by editing the database inflated the time it took to test a release, which in turn stretched out the release cycle, but well, it worked.
My point is that if security and abuse rules need to be rolled out quickly, then the entire range of possible security and abuse configurations (i.e. their types) needs to be a testable part of the original pipeline. Then configurations can safely be changed on the fly, so long as the changes type-check.
It's easy to understand why it's never been built though - you'd need both a security background and a Haskell-ish/type-theory kind of background. Best of luck finding people like that.
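For what it's worth, the idea doesn't need full type theory to sketch. Here's a toy Python illustration (every name here is hypothetical): a rule config has to pass validation, i.e. stay within the range of values the pipeline was tested against, before it can be hot-reloaded:

    import re
    from dataclasses import dataclass

    # Toy illustration: a rule config that must pass validation before it
    # can go live. All names and values here are hypothetical.
    @dataclass(frozen=True)
    class WafRule:
        rule_id: str
        pattern: str
        action: str  # must be one of VALID_ACTIONS

    VALID_ACTIONS = {"block", "log"}

    def validate(rule: WafRule) -> None:
        """Reject configs outside the range the pipeline was tested against."""
        if rule.action not in VALID_ACTIONS:
            raise ValueError(f"{rule.rule_id}: unknown action {rule.action!r}")
        # The pattern must at least compile; a real gate would also run it
        # against a test corpus under a time budget.
        re.compile(rule.pattern)

    validate(WafRule("100015", r"(?:select|union)\s", "block"))  # passes
    # validate(WafRule("100016", r"(unbalanced", "block"))       # raises re.error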
The main problem is that their regex library doesn't have a recursion limit. I'm honestly amazed they've been able to scale Lua scripts to the point of running a global WAF on them. Knowing this, it may be easy to create attacks against their filters.
My takeaway is that it's time to move to a custom solution in a more flexible language. A simple async watchdog on total rule execution time would have prevented this. When running tons of regex rules, I'm amazed they didn't have one.
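For illustration, here's roughly what such a watchdog could look like in Python. Since a runaway match can't be interrupted from within the matching thread, this sketch runs the regex in a child process and kills it on timeout; the function names and the budget are made up:

    import multiprocessing as mp
    import re

    def _match(pattern: str, text: str, q) -> None:
        q.put(bool(re.search(pattern, text)))

    def search_with_deadline(pattern: str, text: str, timeout_s: float):
        """Run a regex with a hard wall-clock budget; None means it blew it."""
        q = mp.Queue()
        p = mp.Process(target=_match, args=(pattern, text, q), daemon=True)
        p.start()
        p.join(timeout_s)
        if p.is_alive():
            p.terminate()   # catastrophic backtracking: give up on this rule
            p.join()
            return None     # caller decides whether to fail open or closed
        return q.get()

    if __name__ == "__main__":
        # A textbook pathological pattern against a non-matching input.
        print(search_with_deadline(r"(a+)+$", "a" * 40 + "!", 0.1))  # None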
I am wondering why you are being downvoted. This outage could have been prevented with better deployment procedures too.
For example, my company (nowhere near the scale of Cloudflare) does progressive deployments. New code is deployed to only a handful of machines first, and then, as the hours pass and checks remain green, it propagates to the rest of the server fleet. Full deployment takes 24 hours. We haven't had code-breaking changes in production in the past 3 years; before that, us breaking things was the most common cause of production issues. Of course that's not the only thing we do: good test practices, code reviews, etc.
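Concretely, the gating loop is only a few lines. This is a sketch, with hypothetical `deploy_to`, `checks_green`, and `abort` hooks into your deploy and monitoring systems, and illustrative stage sizes and soak times:

    import time

    # Illustrative stages: a handful of machines first, ~24h end to end.
    STAGES = [
        ("canary", 0.01, 1 * 3600),
        ("partial", 0.25, 8 * 3600),
        ("full", 1.00, 15 * 3600),
    ]

    def rollout(version, deploy_to, checks_green, abort) -> bool:
        for name, fraction, soak_s in STAGES:
            deploy_to(version, fraction)
            deadline = time.monotonic() + soak_s
            while time.monotonic() < deadline:
                if not checks_green():
                    abort(version)   # stop and roll back; never promote
                    return False
                time.sleep(60)       # poll the health checks
        return True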
The second thing is separation of monitoring and production. If production going down takes down the monitoring systems too, you will have a very hard time figuring out what's wrong. Cloudflare says "We had difficulty accessing our own systems because of the outage". That sounds very bad.
I'd wager there are many things wrong at play here other than "regex is hard". But I guess HN loves Cloudflare way too much to ask the hard questions.
Yeah, they get some points for admitting WAF rule updates bypass canary deployments so that they can be applied ASAP. But still.
Backtracking attacks against regexes are extremely well known. The only reason I can fathom for not having an execution-time watchdog is that the Nginx Lua runtime doesn't allow it. I assume the scripts run during a single request cycle on one thread, due to Nginx's async IO (one thread per core only).
That's still no excuse. They admit to running THOUSANDS of regex rules in custom Lua scripts embedded in Nginx. This sounds like a bad idea to anyone who knows anything about software, because it is.
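To see how well known the failure mode is, here's the textbook demonstration in Python: nested quantifiers over the same character, forced to fail by a trailing '!', where the match time roughly doubles with every extra character:

    import re
    import time

    # Catastrophic backtracking: (a+)+ has exponentially many ways to
    # partition a run of 'a's, all of which get tried before '$' fails.
    for n in range(18, 26):
        text = "a" * n + "!"
        start = time.perf_counter()
        re.search(r"(a+)+$", text)
        print(f"n={n:2d}  {time.perf_counter() - start:.3f}s")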
My previous employer embedded way too much Lua script inside Nginx plugins for the same reason (it's easy). Even at our "scale" (50 requests/second) we had constant issues. To think they run ~10% of internet traffic on such a Rube Goldberg machine is proof you can use just about anything in prod (until it inevitably explodes, at least).
I'm confused about what this response is trying to say. Did you read the whole post? They addressed exactly those two things and explained how they're fixing them. You're essentially just repeating part of the blog post, which is why I wonder if you finished reading it.
I'm interested in why they wouldn't use LPeg instead. LPeg grammars seem a lot easier to compose, reason about, and debug; plus they have restricted backtracking.
They still retain the global rollout for the other use cases detailed in the writeup, so it's generally tested, though not for this one use case, as you point out. I suspect the tradeoff is reasonable; however, a short pre-stage deploy before going global in all cases would be a more conservative option, and would prevent an emergency push from becoming an even bigger emergency!
One way of dealing with this is regular drills. My employer has a cross-cutting rotation that exercises stuff like this weekly.