Comment by aavshr
3 months ago
> In short, a latent bug in a service underpinning our bot mitigation capability started to crash after a routine configuration change we made. That cascaded into a broad degradation to our network and other services. This was not an attack.
From the CTO, Source: https://x.com/dok2001/status/1990791419653484646
It still astounds me that the big dogs do not phase config rollouts. Code is data, configs are data; they are one and the same. It was the same issue with the giant CrowdStrike outage last year: they were rawdogging configs globally, a bad config made it out there, and everything went kaboom.
You NEED to phase config rollouts like you phase code rollouts.
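To make "phase it" concrete, here is a minimal sketch of a staged config rollout with a bake period and an automatic halt, written in Python. The stage percentages and the helpers (apply_config, healthy_fraction, last_good_config) are invented for illustration; this is a sketch of the general technique, not any particular vendor's tooling.

```python
import time

# Hypothetical helpers, stand-ins for whatever your deploy tooling provides:
#   apply_config(cfg, hosts)   -> push cfg to those hosts
#   healthy_fraction(hosts)    -> fraction of hosts currently passing health checks

STAGES = [0.01, 0.05, 0.25, 1.0]   # 1% canary first, then widen
BAKE_SECONDS = 15 * 60             # let each stage soak before expanding
MIN_HEALTHY = 0.99                 # halt threshold

def phased_rollout(new_config, last_good_config, hosts, apply_config, healthy_fraction):
    done = 0
    for stage in STAGES:
        # Always make some progress, but never past the end of the fleet.
        target = min(len(hosts), max(done + 1, int(len(hosts) * stage)))
        apply_config(new_config, hosts[done:target])   # config treated like a code deploy
        done = target

        time.sleep(BAKE_SECONDS)                       # give crashes a chance to show up
        if healthy_fraction(hosts[:done]) < MIN_HEALTHY:
            # Halt and roll back instead of pushing the bad config any further.
            apply_config(last_good_config, hosts[:done])
            raise RuntimeError(f"rollout halted at {stage:.0%}: health check failed")
    return "rolled out to 100%"
```

The point is simply that the config goes through the same canary-then-widen gate as a binary would, and a crash during the bake period stops the push at 1% instead of 100%.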
The big dogs absolutely do phase config rollouts as a general rule.
There are still two weaknesses:
1) Some configs are inherently global and cannot be phased; there's only one place to set them. For example, if you run a webapp, this would be the load balancer's config as opposed to the config on each web server.
2) Some configs have a cascading effect: even though a config is applied to only 1% of servers, it affects the other servers they interact with, and a bad change spreads across the entire network (see the sketch below).
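A contrived sketch of weakness 2, just to show the shape of the failure. The names and the shared store below are made up, but the point is that a 1% canary can still have a 100% blast radius if it writes into state that everyone else reads.

```python
# Contrived illustration of weakness 2: the new config only runs on a 1%
# canary, but that canary writes into a shared store every server reads,
# so the blast radius is still 100%. All names here are invented.

shared_routing_table = {"default": "/healthy-backend"}  # state the whole fleet consumes

def apply_config_to_canary(bad_config):
    # Only 1% of servers run the new config...
    entry = bad_config["route_override"]
    shared_routing_table["default"] = entry   # ...but what it computes is published fleet-wide.

def handle_request_on_any_server(path):
    # Every server, canaried or not, consults the shared table.
    route = shared_routing_table.get("default")
    if route is None:
        raise RuntimeError("no route: request dropped on every server")
    return route

apply_config_to_canary({"route_override": None})  # bad value from the 1% canary
# handle_request_on_any_server("/") now fails on 100% of servers.
```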
> Some configs are inherently global and cannot be phased
This is also why "it is always DNS". It's not that DNS itself is particularly unreliable, but rather that it is the one area where you can really screw up a whole system by running a single command, even if everything else is insanely redundant.
I think it's uncharitable to jump to the conclusion that, just because there was a config-based outage, they don't do phased config rollouts. And even more uncharitable to compare them to CrowdStrike.
I have read several Cloudflare postmortems, and my confidence in their systems is pretty low. They used to run their entire control plane out of a single datacenter, which is amateur hour for a tech company with over $60 billion in market cap.
I also don't understand how it is uncharitable to compare them to CrowdStrike, as both companies run critical systems that affect a large number of people's lives, and both seem to have outages at a similar rate (if anything, Cloudflare breaks more often than CrowdStrike).
https://blog.cloudflare.com/18-november-2025-outage/
> The larger-than-expected feature file was then propagated to all the machines that make up our network
> As a result, every five minutes there was a chance of either a good or a bad set of configuration files being generated and rapidly propagated across the network.
I was right. Global config rollout with bad data. Basically the same failure mode as CrowdStrike.
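The natural guardrail for this shape of failure is to treat each regenerated file as untrusted until it passes sanity checks, and to keep serving the last-known-good version when it doesn't. A rough sketch of that idea follows; the limits and names (MAX_FEATURES, validate_feature_file, and so on) are made up for illustration and are not taken from the postmortem.

```python
# Sketch of a pre-propagation guard for a periodically regenerated config
# artifact. All names and limits are illustrative, not Cloudflare's pipeline.

MAX_FEATURES = 200           # hard cap on entries; reject anything bigger
MAX_BYTES = 5 * 1024 * 1024  # hard cap on file size

def validate_feature_file(features: list, raw_bytes: bytes) -> None:
    if len(raw_bytes) > MAX_BYTES:
        raise ValueError(f"feature file too large: {len(raw_bytes)} bytes")
    if len(features) > MAX_FEATURES:
        raise ValueError(f"too many features: {len(features)} > {MAX_FEATURES}")
    for f in features:
        if "name" not in f:
            raise ValueError("feature entry missing 'name'")

def publish(generate, validate, propagate, last_known_good):
    """Regenerate the artifact, but only propagate it if it validates."""
    features, raw = generate()
    try:
        validate(features, raw)
    except ValueError:
        # Fail closed: keep the last-known-good version instead of pushing
        # a bad artifact to the whole network.
        return last_known_good
    propagate(features, raw)      # only now does it go out to the fleet
    return (features, raw)        # becomes the new last-known-good
```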
It seems fairly logical to me? If a config change causes services to crash, then the rollout stops … at least in every phased rollout system I've ever built…
At a company I am no longer with, I argued much the same when we rolled out "global CI/CD" on IaC. You made one change, committed and pushed, and wham, it was on 40+ server clusters globally. I hated it. The principal was enamored with it, "cattle not pets" and all that, but the result was that things slowed down considerably because anyone working with it became terrified of making big changes.
Then you get customer-visible delays.
Because adversaries adapt quickly, they have a system that deploys their counter-adversary bits quickly without phasing, no matter whether they call them code or configs. See also: CrowdStrike.
You can't protect against _latent bugs_ with phased rollouts.
Wish this could rocket to the top of the comment thread; digging through hundreds of comments speculating about a cyberattack to find this felt silly.
Configuration changes seem to be dangerous for CF, and this one knocked $NET down almost 4% today. I wonder what the industry-wide impact is for each of these outages?
Pre-market was red for all tech stocks today, before the outage even happened.
Yes, if anything it's bullish for Cloudflare, because many investors don't realize how pervasive it is.
> Configuration changes seem to be dangerous for CF, and this one knocked $NET down almost 4% today. I wonder what the industry-wide impact is for each of these outages?
This is becoming the "new normal." It seems like every few months, there's another "outage" that takes down vast swathes of internet properties, since they're all dependent on a few platforms and those platforms are, clearly, poorly run.
This isn't rocket surgery here. Strong change management, QA processes, and active business continuity planning and infrastructure would likely have caught this (or not), as is clear from other large platforms that we don't even think about because their outages are so rare.
Like airline reservations systems[0], credit card authorization systems from VISA/MasterCard, American Express, etc.
Those systems (and others) have outages in the "once a decade" or even much, much longer range. Are the folks over at SABRE and American Express that much smarter and better than Cloudflare/AWS/Google Cloud/etc.? No. Not even close. What they are is careful, because they know their business depends on making sure their customers can use their services anytime, anywhere, without issue.
It amazes me how much "Stockholm Syndrome"[1] is on display in this thread, with many expressing relief that it wasn't "an attack" and essentially blaming themselves for not having the right tools (API keys, etc.) to recover from the gross incompetence of, this time at least, Cloudflare.
I don't doubt that I'll get lots of pushback from folks claiming "it's hard to do things at scale" and/or "there are way too many moving parts," and the like.
Other organizations, like the ones I mention above, don't screw their customers every 4-6 months with (clearly) insufficiently tested configuration and infrastructure changes.
Yet many here seem to think that's fine, even though such outages are often crushing to their businesses. But if the customers of these huge providers don't demand better, they'll only get worse. And that's not (at least in my experience) a very deep or profound idea.
[0] https://en.wikipedia.org/wiki/Airline_reservations_system
[1] https://en.wikipedia.org/wiki/Stockholm_syndrome