Cloudflare outage on December 5, 2025

2 months ago (blog.cloudflare.com)

This is an architectural problem. The Lua bug, the longer global outage last week, and a long list of earlier such outages only uncover the problem with the architecture underneath. The original distributed, decentralized web architecture, with heterogeneous endpoints managed by a myriad of organisations, is much more resistant to this kind of global outage. Homogeneous systems like Cloudflare will continue to cause global outages. Rust won't help; people will always make mistakes, in Rust too. A robust architecture addresses this by not allowing a single mistake to bring down a myriad of unrelated services at once.

  • I’m not sure I share this sentiment.

    First, let’s set aside the separate question of whether monopolies are bad. They are not good but that’s not the issue here.

    As to architecture:

    Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

    But there’s a more interesting argument in favour of the status quo.

    Assuming Cloudflare’s uptime is above average, outages that affect everything at once are actually better for the average internet user.

    It might not be intuitive but think about it.

    How many Internet services does someone depend on to accomplish something such as their work over a given hour? Maybe 10 directly, and another 100 indirectly? (Make up your own answer, but it’s probably quite a few).

    If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

    On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.

    It’s not really a bad end-user experience that every service uses Cloudflare. It’s more a question of why Cloudflare’s stability seems to be going downhill.

    And that’s a fair question. Because if their reliability is below average, then the value prop evaporates.
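
    To make the hour-versus-100-hours arithmetic above concrete, here is a minimal back-of-envelope sketch. The numbers are assumptions chosen for illustration (100 independent services, each down one hour per year, "blocked" meaning at least one dependency is down at a given moment), not anyone's measured uptime:

      // Back-of-envelope comparison of correlated vs. independent downtime.
      // All figures below are assumptions for illustration only.
      fn main() {
          let hours_per_year = 8760.0_f64;
          let services = 100.0_f64;
          let downtime_per_service = 1.0_f64; // hours per year, per service

          // Perfectly correlated: every service is down during the same hour.
          let blocked_correlated = downtime_per_service;

          // Independent: chance a given service is down at a random moment...
          let p_down = downtime_per_service / hours_per_year;
          // ...and chance that at least one of the 100 services is down right now.
          let p_any_down = 1.0 - (1.0 - p_down).powf(services);
          let blocked_independent = p_any_down * hours_per_year;

          println!("blocked, correlated outages:  {:.1} h/year", blocked_correlated);
          println!("blocked, independent outages: {:.1} h/year", blocked_independent); // ~99.4
      }

    Roughly 1 hour versus roughly 99 hours, assuming you genuinely need all 100 dependencies up to get anything done.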

    • > If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

      The point is that it doesn’t matter. A single site going down has a very small chance of impacting a large number of users. Cloudflare going down breaks an appreciable portion of the internet.

      If Jim’s Big Blog only maintains 95% uptime, most people won’t care. If BofA were at 95%... actually, same. Most of the world aren’t BofA customers.

      If Cloudflare is at 99.95%, then the world suffers.

      18 replies →

    • > On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.

      I think the parent post made a different argument:

      - Centralizing most of the dependency on Cloudflare results in a major outage whenever something happens at Cloudflare; it is fragile because Cloudflare becomes a single point of failure. Like: oh, Cloudflare is down... oh, none of my SaaS services work anymore.

      - In a world where this is not the case, we might see more outages, but they would be smaller and more contained. Like: oh, Figma is down? Fine, let me pick up another task and come back to Figma once it's back up. It's also easier to work around by having alternative providers as a fallback, as they are less likely to share the same failure point.

      As a result, I don't think you'll be blocked 100 hours a year in scenario 2. You may observe 100 non-blocking inconveniences per year, vs a completely blocking Cloudflare outage.

      And in observed uptime, I'm not even sure these providers ever won. We're running all our auxiliary services on a decent Hetzner box with an LB. Say what you want, but that uptime is looking pretty good compared to any services relying on AWS (Oct 20, 15 hours), Cloudflare (Dec 5 (half hour), Nov 18 (3 hours)). Easier to reason about as well. Our clients are much more forgiving when we go down due to Azure/GCP/AWS/Cloudflare vs our own setup, though...

    • > If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

      > On the other hand, if each service experiences the same hour per year of downtime but at different times, then the person is likely to be blocked for closer to 100 hours per year.

      Putting Cloudflare in front of a site doesn't mean that site's backend suddenly never goes down. Availability will now be worse - you'll have Cloudflare outages* affecting all the sites they proxy for, along with individual site back-end failures which will of course still happen.

      * which are still pretty rare
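
      To put rough numbers on that chaining effect (both availability figures below are assumptions for illustration, not published SLAs): a request has to make it through the proxy and the origin, so the availabilities multiply.

        // Serial availability: the request needs both the CDN/proxy and the
        // site's own backend to be up. Both figures are illustrative assumptions.
        fn main() {
            let proxy_availability = 0.9995_f64;  // assumed proxy/CDN availability
            let origin_availability = 0.999_f64;  // assumed origin/backend availability
            let end_to_end = proxy_availability * origin_availability;
            println!("end-to-end availability: {:.3}%", end_to_end * 100.0); // ~99.85%
        }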

    • > If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

      I’m tired of this sentiment. Imagine if people said: why develop your own cloud offering? Can you really do better than VMware...?

      Innovation in technology has only happened because people dared to do better, rather than giving up before they started…

    • "My architecture depends upon a single point of failure" is a great way to get laughed out of a design meeting. Outsourcing that single point of failure doesn't cure my design of that flaw, especially when that architecture's intended use-case is to provide redundancy and fault-tolerance.

      The problem with pursuing efficiency as the primary value prop is that you will necessarily end up with a brittle result.

      8 replies →

    • That's an interesting point, but in many (most?) cases productivity doesn't depend on all services being available at the same time. If one service goes down, you can usually be productive by using an alternative (e.g. if HN is down you go to Reddit, if email isn't working you catch up with Slack).

      3 replies →

    • On the other hand, if one site is down you might have alternatives. Or, you can do something different until the site you needed is up again. Your argument that simultaneous downtime is more efficient than uncoordinated downtime because tasks usually rely on multiple sites being online simultaneously is an interesting one. Whether or not that's true is an empirical question, but I lean toward thinking it's not true. Things failing simultaneously tends to have worse consequences.

    • Paraphrasing: We are setting aside the actual issue and looking for a different angle.

      To me this reads as a form of misdirection, intentional or not. A monopolist has little reason to care about downstream effects, since customers have nowhere else to turn. Framing this as roll your own versus Cloudflare rather than as a monoculture CDN environment versus a diverse CDN ecosystem feels off.

      That said, the core problem is not the monopoly itself but its enablers: the collective impulse to align with whatever the group is already doing, the desire to belong and to appear to act the "right way", meaning the way everyone else behaves. There are a gazillion ways of doing CDNs; why are we not using them? Why the focus on one single dominant player?

      2 replies →

    • That’s fine if it’s just some random office workers. What if every airline goes down at the same time because they all rely on the same backend providers? What if every power generator shuts off? “Everything goes down simultaneously” is not, in general, something to aim for.

      2 replies →

    • CloudFlare doesn’t have a good track record. It’s the third party that caused more outages for us than any other third party service in the last four years.

    • > If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

      I disagree; most people need only a subset of Cloudflare's features. Operating just that subset avoids the risk of the other moving parts (that you don't need anyway) ruining your day.

      Cloudflare is also a business and has its own priorities like releasing new features; this is detrimental to you because you won't benefit from said feature if you don't need it, yet still incur the risk of the deployment going wrong like we saw today. Operating your own stack would minimize such changes and allow you to schedule them to a maintenance window to limit the impact should it go wrong.

      The only feature Cloudflare (or its competitors) offers that can't be done cost-effectively yourself is volumetric DDoS protection where an attacker just fills your pipe with junk traffic - there's no way out of this beyond just having a bigger pipe, which isn't reasonable for any business short of an ISP or infrastructure provider.

      1 reply →

    • > If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

      That's the wrong way of looking at it, though. For 99.99% of individual sites, I wouldn't care if they were down for weeks. Even if I use this site, there are very few sites that I need to use daily. For the rest of them, if one randomly goes down I would probably never know or notice, because I didn't need it then. However, when a single-point-of-failure provider like Cloudflare goes down, you bet I notice. I must notice, because my work will be affected, my CI/CD pipelines will start failing, my newsfeeds will stop; I will notice it in dozens of places, because everybody uses it. The aggregated failures per unit of time may be fewer, but the impact of each failure is way, way bigger, and the probability of it impacting me approaches certainty.

      So for me, as an average internet user, it would be much better if all the world wouldn't go down at once, even if the instances of particular things going down would be more frequent - provided they are randomly distributed in time and not concentrated. If just one thing goes down, I could do another thing. If everything goes down, I can only sit and twiddle my thumbs until it's back up.

    • > If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

      This doesn’t guarantee availability of those N services themselves though, surely? N services with a slightly lower availability target than N+1 with a slightly higher value?

      More importantly, I’d say that this only works for non-critical infrastructure, and also assumes that the cost of bringing that same infrastructure back is constant or at least linear or less.

      The 2025 Iberian Peninsula outage seems to show that’s not always the case.

    • If you’re using 10 services and 1 goes down, there’s a 9/10 chance you’re not using it and you can switch to work on something else. If all 10 go down, you are actually blocked for an hour. Even 5 years ago, I can’t recall ever being actually impacted by an outage to the extent that I was like “well, might as well just go get something to eat because everything is down”.

    • > If everything goes offline for one hour per year at the same time, then a person is blocked and unproductive for an hour per year.

      The consequence of some services being offline is much, much worse than a person (or a billion) being bored in front of a screen.

      Sure, it’s arguably not Cloudflare’s fault that these services are cloud-dependent in the first place, but even if service just degrades somewhat gracefully in an ideal case, that’s a lot of global clustering of a lot of exceptional system behavior.

      Or another analogy: every person probably passes out for a few minutes at some point in their life. Yet I wouldn’t want to imagine what happens if everybody got that over with at the very same time, without warning…

    • > Cloudflare has had some outages recently. However, what’s their uptime over the longer term? If an individual site took on the infra challenges themselves, would they achieve better? I don’t think so.

      Why is that the only option? Cloudflare could offer solutions that let people run its software themselves, after paying some license fee. Or there could be many companies people use, instead of everyone flocking to one because of the cargo-cult advice that "you need a CDN like Cloudflare before you launch your startup, bro".

      3 replies →

    • When I’m working from home and the internet goes down, I don’t care. My poor private-equity owned corporation, think of the lost productivity!!

      But if I was trying to buy insulin at 11 pm before benefits expire, or translate something at a busy train station in a foreign country, or submit my take-home exam, I would be freeeaaaking out.

      The cloudflare-supported internet does a whole lot of important, time-critical stuff.

    • All of my company's hosted web sites have way better uptimes and availability than CF but we are utterly tiny in comparison.

      With only some mild blushing, you could describe us as "artisanal" compared to the industrial monstrosities, such as Cloudflare.

      Time and time again we get these sorts of issues with the massive cloudy chonks and they are largely due to the sort of tribalism that used to be enshrined in the phrase: "no one ever got fired for buying IBM".

      We see the dash to the cloud and the shoddy state of in-house corporate IT as a result. "We don't need in-house knowledge, we have the 'MS Copilot 365 office thing' that looks after itself, and now it's intelligent - yay \o/"

      Until I can't, I'm keeping it as artisanal as I can for me and my customers.

      2 replies →

  • In other words, the consolidation on Cloudflare and AWS makes the web less stable. I agree.

    • Usually I am allergic to pithy, vaguely dogmatic summaries like this but you're right. We have traded "some sites are down some of the time" for "most sites are down some of the time". Sure the "some" is eliding an order of magnitude or two, but this framing remains directionally correct.

      7 replies →

  • Would you rather be attacked by 1,000 wasps or 1 dog? A thousand paper cuts or one light stabbing? Global outages are bad but the choice isn’t global pain vs local pleasure. Local and global both bring pain, with different, complicated tradeoffs.

    Cloudflare is down and hundreds of well paid engineers spring into action to resolve the issue. Your server goes down and you can’t get ahold of your Server Person because they’re at a cabin deep in the woods.

    • It's not "1,000 wasps or 1 dog", it's "1,000 dogs at once, or "1 dog at once, 1,000 different times". Rare but huge and coordinated siege, or a steady and predictable background radiation of small issues.

      The latter is easier to handle, easier to fix, and much more suvivable if you do fuck it up a bit. It gives you some leeway to learn from mistakes.

      If you make a mistake during the 1000 dog siege, or if you don't have enough guards on standby and ready to go just in case of this rare event, you're just cooked.

      3 replies →

    • If you've allowed your Server Person to be a single point of failure out innawoods, that's an organizational problem, not a technological one.

      Two is one and one is none.

    • Why would there be a centralized outage of decentralized services? The proper comparison seems to be attacked by a dog or a single wasp.

    • In most cases we actually get both local and global pain, since most people are running servers behind Cloudflare.

  • What you've identified here is a core part of what the banking sector calls the "risk based approach". Risk in that case is defined as the product of the chance of something happening and the impact of it happening. With this understanding we can make the same argument you're making, a little more clearly.

    Cloudflare is really good at what they do, they employ good engineering talent, and they understand the problem. That lowers the chance of anything bad happening. On the other hand, they achieve that by unifying the infrastructure for a large part of the internet, raising the impact.

    The website operator herself might be worse at implementing and maintaining the system, which would raise the chance of an outage. Conversely, it would also only affect her website, lowering the impact.

    I don't think there's anything to dispute in that description. The discussion, then, is whether Cloudflare's good engineering lowers the chance of an outage happening more than it raises the impact. In other words, the thing we can disagree about is the scaling factors; the core of the argument seems reasonable to me.
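
    As a rough illustration of that framing, here is a minimal sketch with invented numbers (the outage frequencies, durations, and site counts below are assumptions chosen purely to show the trade-off, not measurements):

      // Risk = chance of an outage x impact of that outage.
      // Here "impact" is counted in site-hours of downtime per year.
      fn risk(outages_per_year: f64, sites_affected: f64, hours_per_outage: f64) -> f64 {
          outages_per_year * sites_affected * hours_per_outage
      }

      fn main() {
          // Hypothetical self-hoster: more frequent outages, but only one site affected.
          let self_hosted = risk(6.0, 1.0, 2.0);
          // Hypothetical shared CDN: rare outages, but millions of sites go down together.
          let shared_cdn = risk(0.5, 10_000_000.0, 0.5);

          println!("self-hosted: {:>12.0} site-hours lost per year", self_hosted);
          println!("shared CDN:  {:>12.0} site-hours lost per year", shared_cdn);
      }

    Which number matters more depends entirely on those scaling factors, which is exactly where the disagreement lies.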

  • > Homogeneous systems like Cloudflare will continue to cause global outages

    But the distributed system is vulnerable to DDoS.

    Is there an architecture that maintains the advantages of both systems? (Distributed resilience with a high-volume failsafe.)

  • You should really check Cloudflare's record.

    There is not a single company that makes its infrastructure as globally available as Cloudflare does.

    Additionally, Cloudflare's downtime seems to be objectively less than the others'.

    This time, it took 25 minutes and affected 28% of the network, while they were the only ones fixing a global vulnerability.

    There is a reason other clouds can't touch the responsiveness and innovation that Cloudflare brings.

  • Robust architecture that is serving 80M requests/second worldwide?

    My answer would be that no one product should get this big.

  • On the other hand, as long as the entire internet goes down when Cloudflare goes down, I'll be able to host everything there without ever getting flack from anyone.

  • Actually, maybe 1 hour downtime for ~ the whole internet every month is a public good provided by Cloudflare. For everyone that doesn’t get paged, that is.

  • It's not as simple as that. What will result in more downtime, dependency on a single centralized service or not being behind Cloudflare? Clearly it's the latter or companies wouldn't be behind Cloudflare. Sure, the outages are more widespread now than they used to be, but for any given service the total downtime is typically much lower than before centralization towards major cloud providers and CDNs.

  • > Rust won't help, people will always make mistakes, also in Rust.

    They don't just use Rust for "protection"; they use it first and foremost for performance. They get ballpark-to-matching C++ performance with a realistic ability to avoid a myriad of bugs by default. This isn't new.

    You're playing armchair quarterback with nothing to really offer.

  • I find this sentiment amusing when I consider the vast outages of the "good ol' days".

    What's changed is a) our second-by-second dependency on the Internet and b) news/coverage.

  • Notwithstanding that most people using Cloudflare aren't even benefiting from what it actually provides. They just use it...because reasons.

  • Not too long ago, critical avionics were programmed by different software developers and the software was run on different hardware architectures, produced by different manufacturers. These heterogeneous systems produced combined control outputs via a quorum architecture – all in a single airplane.

    Now half of the global economy seems to run on the same service provider…

  • Reductionist, but it's a backup problem.

    Data matters? Have multiple copies, not all in the same place.

    This is really no different, yet we don't have those redundancies in play.

    Hosts, and paths.

    Every other take is ultimately just shuffling justifications around the "least bad for everyone" lack of backups, accepted for cost savings.

  • Obviously Rust is the answer to these kinds of problems. But if you are Cloudflare and run an important company at global scale, you need to set high standards for your Rust code. Developers should dance and celebrate at the end of the day if their code compiles in Rust.

  • You're not wrong, but where's the robust architecture you're referring to? The reality of providing reliable services on the internet is far beyond the capabilities of most organizations.

    • I think it might be the organizational architecture that needs to change.

      > However, we have never before applied a killswitch to a rule with an action of “execute”.

      > This is a straightforward error in the code, which had existed undetected for many years

      So they shipped an untested configuration change that triggered untested code straight to production. This is "tell me you have no tests without telling me you have no tests" level of facepalm. I work on safety-critical software where if we had this type of quality escape both internal auditors and external regulators would be breathing down our necks wondering how our engineering process failed and let this through. They need to rearchitect their org to put greater emphasis on verification and software quality assurance.

  • Yeah, redundancy and efficiency are opposites. As engineers, we always chase efficiency, but resilience and redundancy are related.

  • You have a heterogeneous, fault-free architecture for the Cloudflare problem set? Interesting! Tell us more.

Kudos to Cloudflare for clarity and diligence.

When talking of their earlier Lua code:

> we have never before applied a killswitch to a rule with an action of “execute”.

I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?

It tracks what I've seen elsewhere: quality engineering can't keep up with the production engineering. It's just that I think of CloudFlare as an infrastructure place, where that shouldn't be true.

I had a manager who came from defense electronics in the 1980's. He said in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.

  • "Kudos"? This is like the South Park episode in which the oil company guy just excuses himself while the company just continues to fuck up over and over again. There's nothing to praise, this shouldn't happen twice in a month. Its inexcusable.

  • It's weird reading these reports because they don't seem to test anything at all (or at least there's very little mention of testing).

    Canary deployment, testing environments, unit tests, integration tests, anything really?

    It sounds like they test by merging directly to production, but surely they don't.

    • The problem is that Cloudflare do incremental rollouts and loads of testing for _code_. But they don't do the same thing for configuration - they globally push out changes because they want rapid response.

      It's still a bit silly, though; their claimed reasoning probably doesn't really stack up for most of their config changes. I don't find it likely that a 0.1% -> 1% -> 10% -> 100% rollout over the period of 10 minutes would be a catastrophically bad idea for them for _most_ changes.

      And to their credit, it does seem they want to change that.
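
      As a sketch of what such a staged rollout could look like (the stage percentages, the health_check placeholder, and the error budget below are assumptions for illustration, not Cloudflare's actual tooling):

        use std::{thread, time::Duration};

        // Placeholder: a real system would query error-rate metrics scoped to
        // the portion of the fleet already running the new configuration.
        fn health_check() -> f64 {
            0.001
        }

        // Push a config to a growing percentage of the fleet, watch an error
        // signal between stages, and abort (and roll back) on regression.
        fn rollout(stages: &[f64], error_budget: f64) -> Result<(), String> {
            for pct in stages {
                println!("deploying config to {pct}% of the fleet");
                thread::sleep(Duration::from_secs(2)); // let metrics settle (shortened for the sketch)
                let error_rate = health_check();
                if error_rate > error_budget {
                    return Err(format!("error rate {error_rate} over budget at {pct}%, rolling back"));
                }
            }
            Ok(())
        }

        fn main() {
            match rollout(&[0.1, 1.0, 10.0, 100.0], 0.01) {
                Ok(()) => println!("rollout complete"),
                Err(e) => eprintln!("{e}"),
            }
        }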

      1 reply →

    • In the post they described that they observed errors happening in their testing environment, but decided to ignore them because they were rolling out a security fix. I am sure there is more nuance to this, but I don’t know whether that makes it better or worse.

      1 reply →

  • This is funny, considering that someone who worked in the defense industry (guided missile systems) found a memory leak in one of their products at that time. They told him that they knew about it, but that it was timed just right for the range the system would be used at, so it didn't matter.

    • Having observed an average of two management rotations at most of the clients our company is working for, this comes as absolutely no surprise to me. Engineering is acting perfectly reasonably, optimizing for cost and time within the constraints they were given. Constraints are updated on a whim (at marketing's or investors' pleasure) without consulting engineering; cue disaster. Not even surprising to me anymore...

    • My hunch is that we do the same with memory leaks or other bugs in web applications where the time of a request is short.

I've noticed that in recent months, even apart from these outages, cloudflare has been contributing to a general degradation and shittification of the internet. I'm seeing a lot more "prove you're human", "checking to make sure you're human", and there is normally at the very least a delay of a few seconds before the site loads.

I don't think this is really helping the site owners. I suspect it's mainly about AI extortion:

https://blog.cloudflare.com/introducing-pay-per-crawl/

  • You call it extortion of the AI companies, but isn’t stealing/crawling/hammering a site to scrape its content for resale just as nefarious? I would say Cloudflare is giving these site owners an option to protect their content and, as a byproduct, reduce their own costs of subsidizing their thieves. They can choose to turn off the crawl protection. If they aren’t turning it off, that tells you that they want it, doesn’t it?

  • I've been seeing more of those "prove you're human" pages as well, but I generally assume they are there to combat a DDoS or other type of attack (or maybe AI/bot traffic). I remember how annoying it was combating DDoS attacks or hacked sites before Cloudflare existed. I also remember how annoying CAPTCHAs were, everywhere. Cloudflare is not perfect, but on net I think it’s been a great improvement.

  • There are more and more sites I can't even visit because of this "prove you're human" check, because it's not compatible with older web browsers even though the website it's blocking is.

  • the two things are unrelated...

    The pay-per-crawl thing is about them thinking ahead about post-AI business/revenue models.

    The way AI happened, it removed a big chunk of revenue from news companies, blogs, etc. Because lots of people go to AI instead of reaching the actual 3rd party website.

    AI currently gets the content for free from the 3rd-party websites, but the AI companies have revenue from their own users.

    So Cloudflare is proposing that AI companies should be paying for their crawling. Cloudflare's solution would give the lost revenue back where it belongs, just through a different mechanism.

    The ugly side of the story is that an open-source solution for this already existed, called L402.org.

    Cloudflare wants to be the first to take a piece of the pie, but instead of using the open-source version, they forked it internally and published it as their own, Cloudflare-specific service.

    To be completely fair, L402 requires you to solve the payment mechanism yourself, which for Cloudflare is easy because they already deal with payments.

  • > I've noticed that in recent months, even apart from these outages, cloudflare has been contributing to a general degradation and shittification of the internet. I'm seeing a lot more "prove you're human", "checking to make sure you're human", and there is normally at the very least a delay of a few seconds before the site loads.

    Good to know I'm not the only one

  • Feel like that’s the fault of LLMs, not cloudflare

    • Looking into this more, it does indeed seem to be a cloudflare problem. It looks like cloudflare made a significant error in their bot fingerprinting, and Perplexity wasn't actually bypassing robots.txt.

      https://www.perplexity.ai/hub/blog/agents-or-bots-making-sen...

      To be honest I find cloudflare a much more scammy company than Perplexity. I had a DDoS attack a few years ago which originated from their network, and they had zero interest in it.

I noticed this outage last night (Cloudflare 500s on a few unrelated websites). As usual, when I went to Cloudflare's status page, nothing about the outage was present; the only thing there was a notice about the pre-planned maintenance work they were doing for the security issue, reporting that everything was being routed around it successfully.

  • This is the case with just about every status page I’ve ever seen. It takes them a while to realize there’s really a problem and then to update the page. One day these things will be automated, but until then, I wouldn’t expect more of Cloudflare than any other provider.

    What’s more concerning to me is that now we’ve had AWS, Azure, and Cloudflare (and Cloudflare twice) go down recently. My gut says:

    1. developers and IT are using LLMs in some part of the process, which will not be 100% reliable.

    2. Current culture of I have (some personal activity or problem) or we don’t have staff, AI will replace me, f-this.

    3. Pandemic after effects.

    4. Political climate / war / drugs; all are intermingled.

    • >It takes them a while to realize there’s really a problem and then to update the page.

      Not really, they're just lying. I mean, yes, of course they aren't oracles who discover complex problems in the instant of the first failure, but naw, they know full well when there are problems and significantly underreport them, to the extent that they are less "smoke alarms" and more "your house has burned down and the ashes are still smoldering" alarms. Incidents are intentionally underreported. It's bad enough that there ought to be legislation and civil penalties for the large providers who fail to report known issues promptly.

    • Those are complex and tenuous explanations for events that have occurred since long before all of your reasons came into existence.

  • Only way to change that it to shame them for it: "Cloudflare is so incompetent at detecting and managing outages that even their simple status page is unable to be accurate"

    If enough high-ranked customers report this feedback...

  • The status page was updated 6 minutes after the first internal alert was triggered (08:50 -> 08:56:26 UTC); I wouldn't say this is too long.

> Disabling this was done using our global configuration system. This system does not use gradual rollouts but rather propagates changes within seconds to the entire network and is under review following the outage we recently experienced on November 18.

> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules

Warning signs like this are how you know that something might be wrong!

  • > They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.

    This is what jumped out at me as the biggest problem. A wild west deployment process is a valid (but questionable) business decision, but if you do that then you need smart people in place to troubleshoot and make quick rollback decisions.

    Their timeline:

    > 08:47: Configuration change deployed and propagated to the network

    > 08:48: Change fully propagated

    > 08:50: Automated alerts

    > 09:11: Configuration change reverted and propagation start

    > 09:12: Revert fully propagated, all traffic restored

    2 minutes for their automated alerts to fire is terrible. For a system that is expected to have no downtime, they should have been alerted to the spike in 500 errors within seconds before the changes even fully propagated. Ideally the rollback would have been automated, but even if it is manual, the dude pressing the deploy button should have had realtime metrics on a second display with his finger hovering over the rollback button.

    Ok, so they want to take the approach of roll forward instead of immediate rollback. Again, that's a valid approach, but you need to be prepared. At 08:48, they would have had tens of millions of "init.lua:314: attempt to index field 'execute'" messages being logged per second. Exact line of code. Not a complex issue. They should have had engineers reading that code and piecing this together by 08:49. The change you just deployed was to disable an "execute" rule. Put two and two together. Initiate rollback by 08:50.

    How disconnected are the teams that do deployments vs the teams that understand the code? How many minutes were they scratching their butts wondering "what is init.lua"? Are they deploying while their best engineers are sleeping?

    • > 2 minutes for their automated alerts to fire is terrible

      I take exception to that, to be honest. It's not desirable or ideal, but calling it "terrible" is a bit ... well, sorry to use the word ... entitled. For context, I have experience running a betting exchange. A system where it's common for a notable fraction of transactions in a medium-volume event to take place within a window of less than 30 seconds.

      The vast majority of current monitoring systems are built on Prometheus. (Well, okay, these days it's more likely something Prom-compatible but more reliable.) That implies collection via recurring scrapes. A supposedly "high" frequency online service monitoring system does a scrape every 30 seconds. Well-known reliability engineering practice states that you need a minimum of two consecutive telemetry points to detect any given event, because we're talking about a distributed system and the network is not a reliable transport. That in turn means that, with near-perfect reliability, the maximum time window before you can detect something failing is the time it takes to perform three scrapes: thing A might have failed a second after the last scrape, so two consecutive failures will show up only after a delay of just a hair shy of three scraping-cycle windows.

      At Cloudflare's scale, I would not be surprised if they require three consecutive events to trigger an alert.

      As for my history? The betting exchange monitoring was tuned to run scrapes at 10-second intervals. That still meant that the first alert for a failure could have fired effectively 30 seconds after the failures manifested.

      Two minutes for something that does not run primarily financial transactions is a pretty decent alerting window.

      7 replies →

    • I see lots of people complaining about this downtime, but in actuality, is it really that big a deal to have 30 minutes of downtime or whatever? It's not like anything behind Cloudflare is "mission critical" in the sense that lives are at stake, or even that a huge amount of money is. In many developed countries the electric power service has local downtime on occasion; that's more important than not being able to load a website. I agree that if CF is offering a certain standard of reliability and not meeting it, then they should offer prorated refunds for the unexpected downtime, but otherwise I am not seeing what the big deal is here.

      7 replies →

  • > Warning signs like this are how you know that something might be wrong!

    Yes, as they explain, it was the rollback triggered by seeing these errors that broke stuff.

  • “ Uh...it's probably not a problem...probably...but I'm showing a small discrepancy in...well, no, it's well within acceptable bounds again. Sustaining sequence. Nothing you need to worry about, Gordon. Go ahead.“

  • "Hey, this change is making the 'check engine' light turn on all the time. No problem; I just grabbed some pliers and crushed the bulb."

  • They aren't a panacea though; internal tools like that can be super noisy on errors, and be broken more often than they're working.

Cloudflare is now below 99.9% uptime, for anyone keeping track. I reckon my home PC is at least 99.9%.

  • Indeed. AWS too.

    I feel like the cloud hosting companies have lost the plot. "They can provide better uptime than us" is the entire rationale that a lot of small companies have when choosing to run everything in the cloud.

    If they cost more AND they're less reliable, what exactly is the reason to not self host?

    • > If they cost more AND they're less reliable, what exactly is the reason to not self host?

      Shifting liability. You're paying someone else for it to be their problem, and if everyone does it, no one will take flak for continuing to do so. What is the average tenure of a CIO or decision maker electing to move to or remain at a cloud provider? This is why you get picked to talk on stage at cloud provider conferences.

      (have been in the meetings where these decisions are made)

    • Plus, when you self-host, you can likely fix the issue yourself in a couple of hours max, instead of waiting indefinitely for a fix or support that might never come.

      12 replies →

    • Capex vs Opex and scale-out.

      For a start-up it's much easier to just pay the Cloud tax than it is to hire people with the appropriate skill sets to manage hardware or to front the cost.

      Larger companies on the other hand? Yeah, I don't see the reason to not self host.

  • TBF, it depends on the number of outages locally. In my area it is one outage every thunderstorm/snowstorm, so unfortunately the uptime of my laptop, even with the help of a large, portable battery charging station (which can charge multiple laptops at the same time), is not optimistic.

    I sometimes fancy that I could just take cash, go into the woods, build a small solar array, collect & cleanse river water, and buy a Starlink console.

    • Yeah, I'd guess I average a power drop once a month or so at home. Never calculated the nines of uptime average, but it's not that infrequent.

      I know when I need to reset the clock on my microwave oven.

      1 reply →

  • When a piece of hardware goes or a careless backup process fails, downtime of a self-hosted service can be measured in days or weeks.

What I'm missing here is a test environment. Gradual or not, why are they deploying straight to prod? At Cloudflare's scale, there should be a dedicated room in Cloudflare HQ with a full, isolated, model-scale deployment of their entire system. All changes should go there first, with tests run for every possible scenario.

Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you; good procedure will.

  • This is kinda what I'm thinking. We're absolutely not at the scale Cloudflare is at.

    But we run software and configuration changes through three tiers - first stage for the dev-team only, second stage with internal customers and other teams depending on it for integration and internal usage -- and finally production. Some teams have also split production into different rings depending on the criticality of the customers and the number of customers.

    This led to a bunch of discussions early on, because teams with simpler software and very good testing usually push through dev and testing with little or no problem. And that's fine. If you have a track record of good changes, there is little reason to artificially prolong deployment in dev and test just because. If you want to, just go through it in minutes.

    But after a few spicy production incidents, even the better and faster teams understood and accepted that once technical velocity exists, actual velocity is a choice, or a throttle if you want an analogy.

    If you do good, by all means, promote from test to prod within minutes. If you fuck up production several times in a row and start threatening SLAs, slow down, spend more resources on manual testing and improving automated testing, give changes time to simmer in the internally productive environment, spend more time between promotions from production ring to production ring.

    And this is on top of considerations of e.g. change risk. Some frontend-only application can move much faster than the PostgreSQL team, because one rollback is a container restart, and the other could be a multi-hour recovery from backups.

  • I am sure they have this. What tends to happen is that the gradual rollout system becomes too slow for some rare, low latency rollout requirements, so a config system is introduced that fulfills the requirements. For example, let’s say you have a gradual rollout for binaries (slow) and configuration (fast). Over time, the fast rollout of the configuration system will cause outages, so it’s slowed down. Then a requirement pops up for which the config system is too slow and someone identifies a global system with no gradual rollout (e.g. a database) to be used as the solution. That solution will be compliant with all the processes that have been introduced to the letter, because so far nobody has thought of using a single database row for global configuration yet. Add new processes whenever this happens and at some point everything will be too slow and taking on more risk becomes necessary to stay competitive. So processes are adjusted. Repeat forever.

  • > Languages with strong type systems won't save you, good procedure will.

    One of the items in the list of procedures is to use types to encode rules of your system.

The Internet's packet-switching-based architecture was originally designed to withstand this type of outage [1].

Some people even go further, speculating that the original military DARPA network precursor to the modern Internet was designed to ensure the continuity of command and control (C&C) of US military operations in the event of an all-out nuclear attack during the Cold War.

This is the time for Internet researchers to redefine how Internet applications are built and operated. The local-first paradigm is the first step in the right direction (pardon the pun) [2].

[1] The Real Internet Architecture: Past, Present, and Future Evolution:

https://press.princeton.edu/books/paperback/9780691255804/th...

[2] Local-first software You own your data, in spite of the cloud:

https://www.inkandswitch.com/essay/local-first/

The lesson presented by the last few big outages is that entropy is, in fact, inescapable. The comprehensibility of a system cannot keep up with its growing and aging complexity forever. The rate of unknown unknowns will increase.

The good news is that a more decentralized internet with human brain scoped components is better for innovation, progress, and freedom anyway.

  • Yet my dedicated server has been up since 2015 with zero downtime.

    I don't think this is an entropy issue; it's human error bubbling up, and Cloudflare charges a premium for it.

    My faith in Cloudflare is shook for sure. Two major outages weeks apart, and this won't be the last.

    • Why is the stability of your dedicated server a counterpoint that cloud behemoths can't keep up with their increasing entropy? Seems more like a supporting argument of OP at best, a non sequitur at worst.

    • Yeah, because it's not complex. It's 1 server. Get back to us when your 100k-server homelab data center that does a million different things has 10 years of uptime.

  • I'm not sure how decentralization helps though. People in a bazaar are going to care even less about sharing shadow knowledge. Linux IMO succeeds not because of the bazaar but because of Linus.

What's going on with Cloudflare's software team?

I have seen similar bugs in cloudflare API recently as well.

There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
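
For what it's worth, the fix for that particular pattern is cheap: do the authorization check before any other work. A minimal sketch of the guard-first shape (the function name, plan names, and error strings below are hypothetical, not Cloudflare's actual API code):

    // Hypothetical handler: gate on the plan first, then do the work,
    // instead of validating the plan as the last step.
    fn handle_enterprise_endpoint(plan: &str, payload: &str) -> Result<String, &'static str> {
        if plan != "enterprise" {
            return Err("this feature requires an enterprise plan"); // fail fast
        }
        // ...expensive validation and processing only happen after the gate...
        Ok(format!("accepted {} bytes", payload.len()))
    }

    fn main() {
        println!("{:?}", handle_enterprise_endpoint("free", "{}"));
        println!("{:?}", handle_enterprise_endpoint("enterprise", "{}"));
    }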

What's the culture like at Cloudflare re: ops/deployment safety?

They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?

Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

Pure speculation, but to me it sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.

  • One thing to keep in mind when judging what's 'appropriate' is that Cloudflare was effectively responding to an ongoing security incident outside of their control (the React Server RCE vulnerability). Part of Cloudflare's value proposition is being quick to react to such threats. That changes the equation a bit: every hour you wait longer to deploy, your customers are actively getting hacked through a known high-severity vulnerability.

    In this case it's not just a matter of 'hold back for another day to make sure it's done right', like when adding a new feature to a normal SaaS application. In Cloudflare's case moving slower also comes with a real cost.

    That isn't to say it didn't work out badly this time, just that the calculation is a bit different.

    • To clarify, I'm not trying to imply that I definitely wouldn't have made the same decision, or that cowboy decisions aren't ever the right call.

      However, this preliminary report doesn't really justify the decision to use the same deployment system responsible for the 11/18 outage. Deployment safety should have been the focus of this report, not the technical details. The question I want answered isn't "are there bugs in Cloudflare's systems"; it's "has Cloudflare learned from its recent mistakes to respond appropriately to events".

      7 replies →

    • Cloudflare had already decided this was a rule that could be rolled out using their gradual deployment system. They did not view it as being so urgent that it required immediate global roll out.

  • Rollback is a reliable strategy when the rollback process is well understood. If a rollback process is not well known and well exercised, then it is a risk in itself.

    I'm not sure of the nature of the rollback process in this case, but leaning on ill-founded assumptions is a bad practice. I do agree that a global rollout is a problem.

    • Rollback carries with it the contextual understanding of complete atomicity; otherwise it's slightly better than a yeet. It's similar to backups that are untested.

      2 replies →

    • Global rollout of security code on a timeframe of seconds is part of Cloudflare's value proposition.

      In this case they got unlucky with an incident before they finished work on planned changes from the last incident.

      1 reply →

  • > They saw errors related to a deployment, and because it was related to a security issue instead of rolling it back they decided to make another deployment with global blast radius instead?

    Note that the two deployments were of different components.

    Basically, imagine the following scenario: a patch for a critical vulnerability gets released; during rollout you get a few reports of it causing the screensaver to show a corrupt video buffer; you roll out a GPO to use a blank screensaver instead of the intended corporate branding; a crash in a script parsing the GPOs on this new value prevents users from logging in.

    There's no direct technical link between the two issues. A mitigation of the first one merely exposed a latent bug in the second one. In hindsight it is easy to say that the right approach is obviously to roll back, but in practice a roll forward is often the better choice - both from an ops perspective and from a safety perspective.

    Given the above scenario, how many people are genuinely willing to do a full rollback, file a ticket with Microsoft, and hope they'll get around to fixing it some time soon? I think in practice the vast majority of us will just look for a suitable temporary workaround instead.

  • Roll back is not always the right answer. I can’t speak to its appropriateness in this particular situation of course, but sometimes “roll forward” is the better solution.

    • Like the other poster said, roll back should be the right answer the vast majority of the time. But it's also important to recognize that roll forward should be a replacement for the deployment you decided not to roll back, not a parallel deployment through another system.

      I won't say never, but a situation where the right answer to avoid a rollback (that it sounds like was technically fine to do, just undesirable from a security/business perspective) is a parallel deployment through a radioactive, global blast radius, near instantaneous deployment system that is under intense scrutiny after another recent outage should be about as probable as a bowl of petunias in orbit

      6 replies →

    • You want to build a world where roll back is 95% the right thing to do. So that it almost always works and you don't even have to think about it.

      During an incident, the incident lead should be able to say to your team's on call: "can you roll back? If so, roll back" and the oncall engineer should know if it's okay. By default it should be if you're writing code mindfully.

      Certain well-understood migrations are the only cases where roll back might not be acceptable.

      Always keep your services in a "rollback-able", "graceful fail", "fail open" state.

      This requires tremendous engineering consciousness across the entire org. Every team must be a diligent custodian of this. And even then, it will sometimes break down.

      Never make code changes you can't roll back from without reason and without informing the team. Service calls, data write formats, etc.

      I've been in the line of billion dollar transaction value services for most of my career. And unfortunately I've been in billion dollar outages.

      2 replies →

  • The question is perhaps what the shape and status of their tech stack is. Obviously, they are running at massive scale, and they have grown extremely aggressively over the years. What's more, especially over the last few years, they have been adding new product after new product. How much tech debt have they accumulated with that "move fast" approach that is now starting to rear its head?

    • I think this is probably a bigger root cause and is going to show up in different ways in the future. The mere act of adding new products to an existing architecture/system is bound to create knowledge silos around operations, and tech debt. There is a good reason why big companies keep smart people on their payroll to change just a couple of lines after a week of debate.

  • > this sounds like the sort of cowboy decision

    Ouch. Harsh, given that Cloudflare is being over-honest (right down to admitting to disabling the internal tool) and the outage's relatively limited impact (time-wise and number-of-customers-wise). It was just an unfortunate latent bug: Nov 18 was Rust's unwrap; Dec 5 it's Lua's turn, with its dynamic typing.

    Now, the real cowboy decision I want to see is Cloudflare [0] running a company-wide Rust/Lua code-review with Codex / Claude...

    cf TFA:

      if rule_result.action == "execute" then
        rule_result.execute.results = ruleset_results[tonumber(rule_result.execute.results_index)]
      end
    
      This code expects that, if the ruleset has action="execute", the "rule_result.execute" object will exist ... error in the [Lua] code, which had existed undetected for many years ... prevented by languages with strong type systems. In our replacement [FL2 proxy] ... code written in Rust ... the error did not occur.
    

    [0] https://news.ycombinator.com/item?id=44159166

  • From the post:

    “We have spoken directly with hundreds of customers following that incident and shared our plans to make changes to prevent single updates from causing widespread impact like this. We believe these changes would have helped prevent the impact of today’s incident but, unfortunately, we have not finished deploying them yet.

    “We know it is disappointing that this work has not been completed yet. It remains our first priority across the organization.”

  • Where I work, all teams were notified about the React CVE.

    Cloudflare made it less of an expedite.

  • > Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.

    Also, there seems to be insufficient testing before deployment, with very junior-level mistakes.

    > As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:

    Where was the testing for this one? If ANY exception happened during the rules checking, the deployment should fail and roll back. Instead, they didn't assess that as a likely risk and pressed on with the deployment "fix".

    I guess those at Cloudflare are not learning anything from the previous disaster.

  • > more to the story

    From a more tinfoil-wearing angle, it may not even be a regular deployment, given the idea of Cloudflare being "the largest MitM attack in history". ("Maybe not even by Cloudflare but by NSA", would say some conspiracy theorists, which is, of course, completely bonkers: NSA is supposed to employ engineers who never let such blunders blow their cover.)

The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms. Whenever we did a config change or deployment, we were supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards, covering the systems and key business metrics that would be affected by the deployment, needed to be prepared beforehand and reviewed by teammates.

I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.

  • That is also true at Cloudflare, for what it’s worth. However, the company is so big, and there are so many different products all shipping at the same time, that it can be hard to correlate an issue to your release, especially since there’s a 5 min lag (if I recall correctly) in the monitoring dashboards to gather all the telemetry from thousands of servers worldwide.

    Comparing the difficulty of running the world’s internet traffic, across hundreds of customer products, to your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.

    • The fintech company I worked at does handle millions of QPS and has thousands of servers. It is on the same order of magnitude, or at least 0.1x the scale, not to mention the complexity of business logic involving monetary transactions.

      If there’s indeed a 5 min lag in the monitoring dashboards at Cloudflare, I honestly think that's a pretty big concern.

      For example, a simple curl script on your top 100 customers' homepage that runs every 30 seconds would have given the warning and notifications within a minute. If you stagger deployments at 5 minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3 minutes.
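
      For illustration, a crude version of that kind of probe, as a sketch only (the URLs are placeholders, and it shells out to curl, which is assumed to be installed, rather than using an HTTP client library):

        use std::{process::Command, thread, time::Duration};

        // Hypothetical external prober: hit a few key pages every 30 seconds and
        // complain loudly as soon as they start returning 5xx (or stop responding).
        fn main() {
            let urls = ["https://example.com/", "https://example.org/"]; // placeholder list
            loop {
                for url in urls {
                    let out = Command::new("curl")
                        .args(["-s", "-o", "/dev/null", "-w", "%{http_code}", "--max-time", "10", url])
                        .output()
                        .expect("failed to run curl");
                    let status = String::from_utf8_lossy(&out.stdout);
                    if status.starts_with('5') || status == "000" {
                        // A real setup would page someone here instead of printing.
                        eprintln!("ALERT: {url} returned {status}");
                    }
                }
                thread::sleep(Duration::from_secs(30));
            }
        }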

    • > However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release

      This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective, and it's starting to look like a liability for large numbers of people, there are obvious solutions for that.

      13 replies →

    • With all due respect, engineers in finance can’t allow for outages like this because then you are losing massive amounts of money and potentially going out of business.

  • Cloudflare is orders of magnitude larger than any fintech. Rollouts likely take much longer, and having a human monitoring a dashboard doesn't scale.

    • That means they engineered their systems incorrectly then? Precisely because they are much bigger, they should be more resilient. You know who's bigger than Cloudflare? Tier-1 ISPs. If they had an outage, the whole internet would know about it; and they do have outages, except they don't cascade into a global mess like this.

      Just speculating based on my experience: it's more likely than not that they refused to invest in fail-safe architectures for cost reasons. Control plane and data plane should be separate; a React patch shouldn't affect traffic forwarding.

      Forget manual rollbacks, there should be automated reversion to a known working state.

      6 replies →
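
      A minimal sketch of that kind of automated reversion, in Rust; `Config`, `parse_config`, and `error_rate_spiked` are hypothetical placeholders. The point is only that a rejected or unhealthy update never sticks around in place of the last known good state:

      ```rust
      // Sketch of automated reversion to a last-known-good config.
      #[derive(Clone)]
      struct Config; // stands in for rules, thresholds, buffer sizes, ...

      fn parse_config(raw: &str) -> Result<Config, String> {
          // Hypothetical validation; real checks would be far richer.
          if raw.trim().is_empty() {
              return Err("empty config".into());
          }
          Ok(Config)
      }

      // Placeholder health signal; in reality this would read 5xx telemetry.
      fn error_rate_spiked() -> bool {
          false
      }

      fn apply_update(raw: &str, active: &mut Config) {
          let last_known_good = active.clone();
          match parse_config(raw) {
              // Invalid updates never replace the running config.
              Err(e) => eprintln!("config rejected, keeping last known good: {e}"),
              Ok(candidate) => {
                  *active = candidate;
                  // If health degrades after the switch, revert automatically
                  // instead of waiting for a human to notice a dashboard.
                  if error_rate_spiked() {
                      *active = last_known_good;
                      eprintln!("error spike detected, reverted automatically");
                  }
              }
          }
      }

      fn main() {
          let mut active = Config;
          apply_update("", &mut active);            // rejected; old config kept
          apply_update("rule: allow", &mut active); // accepted; promoted
      }
      ```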

    • > Rollouts likely take much longer

      Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.

  • My guess is that CF has so many external customers that they need to move fast and try not to break things. My hunch is that their culture always favors moving fast. As long as they are not breaking too many things, customers won't leave them.

    • There is nothing wrong with moving fast and deploying fast.

      I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.

    • I think everyone favors moving fast. We humans want to see the results of our actions early.

  • Same here; my time at an F100 e-commerce retailer showed me the same thing. Every change-control-board justification needed an explicit back-out/restoration plan with exact steps to be taken, what was being monitored to ensure the plan was being held to, contacts for the groups anticipated to be affected, and emergency numbers/rooms for quick conferences if something did in fact happen.

    The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).

Today, after the Cloudflare outage, I noticed that almost all upload routes for my applications were being blocked.

After some investigation, I realized that none of these routes were passing the Cloudflare OWASP checks: the reported anomaly score totals 50, exceeding the pre-configured maximum of 40 (Medium).

Despite these being simple image or video uploads, the WAF is generating anomalies that make no sense, such as the following:

- Cloudflare OWASP Core Ruleset Score (+5), 933100: PHP Injection Attack: PHP Open Tag Found

- Cloudflare OWASP Core Ruleset Score (+5), 933180: PHP Injection Attack: Variable Function Call Found

For now, I’ve had to raise the OWASP Anomaly Score Threshold to 60 and enable the JS Challenge, but I believe something is wrong with the WAF after today’s outage.

As of this moment, the issue is still not resolved.

So Cloudflare:

- Made a last-minute, untested change on top of their change ("turning off our WAF rule testing tool").

- Did an immediate global rollout instead of a staged one.

It seems they should have enough learning cases by now to never do that again...

My understanding, paraphrased: "In order to gradually roll out one change, we had to globally push a different configuration change, which broke everything at once".

But a more important takeaway:

> This type of code error is prevented by languages with strong type systems

  • That's a bizarre takeaway for them to suggest, when they had exactly the same kind of bug with Rust like three weeks ago. (In both cases they had code implicitly expecting results to be available. When the results weren't available, they terminated processing of the request with an exception-like mechanism. And then they had the upstream services fail closed, despite the failing requests being to optional sidecars rather than on the critical query path.)

    • In fairness, the previous bug (with the Rust unwrap) should never have happened: someone explicitly called the panicking function, the review didn't catch it and the CI didn't catch it.

      It required a significant organizational failure to happen. These happen but they ought to be rarer than your average bug (unless your organization is fundamentally malfunctioning, that is)

      2 replies →

    • Yeah, my first thought was that had they used Rust, maybe we would've seen them point out a rule_result.unwrap() as the issue.

    • To be precise, the previous problem with Rust was because somebody copped out and used a temporary escape hatch function that absolutely has no place in production code.

      It was mostly an amateur mistake. Not Rust's fault. Rust could never gain adoption if it didn't have a few escape hatches.

      "Damned if they do, damned if they don't" kind of situation.

      There are even lints for the usage of the `unwrap` and `expect` functions (a sketch of enabling them follows this sub-thread).

      As the other sibling comment points out, the previous Cloudflare problem was an acute and extensive organizational failure.

      2 replies →
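
      For reference, `clippy::unwrap_used` and `clippy::expect_used` are real (allow-by-default) Clippy lints; the surrounding code here is just an illustrative sketch of opting into them and handling the `None` case explicitly instead:

      ```rust
      // Denying the Clippy lints that flag `unwrap`/`expect`; enforced when
      // the crate is checked with `cargo clippy`.
      #![deny(clippy::unwrap_used, clippy::expect_used)]

      use std::collections::HashMap;

      fn lookup(table: &HashMap<String, String>, key: &str) -> Option<String> {
          // table.get(key).unwrap().clone()  // <- would now fail the Clippy check
          table.get(key).cloned() // propagate the absence to the caller instead
      }

      fn main() {
          let table = HashMap::from([("a".to_string(), "1".to_string())]);
          match lookup(&table, "b") {
              Some(v) => println!("{v}"),
              None => println!("missing key, handled explicitly"),
          }
      }
      ```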

  • This is the exact same type of error that happened in their Rust code last time. Strong type systems don’t protect you from lazy programming.

    • It's not remotely the same type of error -- error non-handling is very visible in the Rust code, while the Lua code shows the happy path, with no indication that it could explode at runtime.

      Perhaps the similarity is in not testing the possible error path, which is an organizational problem.
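
      A small Rust illustration of that visibility difference; the names are hypothetical, and the Lua-ish happy path appears only as a comment:

      ```rust
      // In the Lua happy path, something like `results.rule_action` just
      // yields nil when the field is missing, and nothing in the code
      // signals that it can blow up later. In Rust the possible absence
      // lives in the type, so the caller has to write *something* for the
      // None case -- even a lazy `.unwrap()` is at least visible in review.
      struct EvalResults {
          rule_action: Option<String>, // absent when the rule was killswitched
      }

      fn describe(results: &EvalResults) -> String {
          match &results.rule_action {
              Some(action) => format!("action: {action}"),
              None => "no action recorded (rule skipped)".to_string(),
          }
      }

      fn main() {
          let skipped = EvalResults { rule_action: None };
          println!("{}", describe(&skipped));
      }
      ```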

The interesting part:

After rolling out a bad ruleset update, they tried a killswitch (rolled out immediately to 100%) which was a code path never executed before:

> However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset

> a straightforward error in the code, which had existed undetected for many years

  • > have never before applied a killswitch to a rule with an action of “execute”

    One might think a company on the scale of Cloudflare would have a suite of comprehensive tests to cover various scenarios.

    • I kinda think most companies out there are like that. Moving fast is the motto I heard the most.

      They are probably OK with occasional breaks as long as customers don't mind.

    • Yeah, the example they gave does feel like pretty isolated unit-test territory, or at least an integration test on a subset of the system that could be run in isolation.
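
      Something like the following table-driven test, with an invented `Action` enum and `evaluate_with_killswitch` standing in for the real ruleset engine, would have exercised the killswitch against every action kind, including "execute":

      ```rust
      // Hypothetical stand-ins for the real engine; the point is only the
      // shape of the test, which iterates over every action variant.
      #[derive(Debug, Clone, Copy)]
      enum Action {
          Block,
          Log,
          Skip,
          Execute, // runs a sub-ruleset
      }

      fn evaluate_with_killswitch(action: Action) -> Result<(), String> {
          // A killswitched rule should be skipped cleanly regardless of its
          // action, leaving the overall results in a consistent state.
          match action {
              Action::Block | Action::Log | Action::Skip | Action::Execute => Ok(()),
          }
      }

      #[cfg(test)]
      mod tests {
          use super::*;

          #[test]
          fn killswitch_is_safe_for_every_action_kind() {
              for action in [Action::Block, Action::Log, Action::Skip, Action::Execute] {
                  assert!(
                      evaluate_with_killswitch(action).is_ok(),
                      "killswitch broke on {action:?}"
                  );
              }
          }
      }
      ```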

Are there some underlying factors that resulted in the recent outages (e.g., new processes, layoffs, etc.), or is it just a series of pure coincidences?

How hard can it be for a company with 1,000 engineers to create a canary region before blasting their centralized changes out to everyone?

Every change is a deployment, even if it's config. Treat it as such.

Also you should know that a strongly typed language won't save you from every type of problem. And especially not if you allow things like unwrap().

It is just mind boggling that they very obviously have completely untested code which proxies requests for all their customers. If you don't want to write the tests then at least fuzz it.
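
For what it's worth, a cargo-fuzz target is only a few lines. A sketch, assuming the `libfuzzer-sys` crate and a hypothetical `my_proxy::parse_ruleset` entry point standing in for the code path that blew up:

```rust
// fuzz/fuzz_targets/ruleset.rs -- standard cargo-fuzz layout.
// The only property asserted is "never panic on arbitrary input";
// returning an Err is fine, crashing is not.
#![no_main]
use libfuzzer_sys::fuzz_target;

fuzz_target!(|data: &[u8]| {
    if let Ok(text) = std::str::from_utf8(data) {
        let _ = my_proxy::parse_ruleset(text); // hypothetical parser under test
    }
});
```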

I sometimes feel we'd be better off without all the paternalistic kitchensink features. The solid, properly engineered features used intentionally aren't causing these outages.

  • Agreed, I don't really like Cloudflare trying to magically fix every web exploit there is in frameworks my site has never used.

    • I’ve been downvoted enough on my comments on this blog post that I’m hesitant to add anything else, but here I agree with you. They’re trying to be everything to everyone; where does their customers’ accountability for running, you know, up-to-date packages come in? Like, don’t you take just a little bit of pride in your work, continually watching CVE lists and exploits and putting a minimum of effort toward patching your own shit, rather than pawning it off on a vendor? I simply can’t understand the mindset.

> provides customers with protection against malicious payloads, allowing them to be detected and blocked. To do this, Cloudflare’s proxy buffers HTTP request body content in memory for analysis.

I have mixed feelings about this.

On the one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me or not. Today it's protection, tomorrow it's censorship.

On the other hand, this is exactly what Cloudflare is good for: protecting sites from malicious requests.

  • We need a decentralized DDoS mitigation network based on incentives: donate X amount of bandwidth, get Y amount of protection from other peers. Yes, we've gotta do TLS inspection on every end for effective L7 mitigation, but at least filtering can be done without decrypting any packets.

"This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur." It's starting to sound like a broken record at this point, languages are still seen as equal and as a result, interchangeable.

> This is a straightforward error in the code, which had existed undetected for many years. This type of code error is prevented by languages with strong type systems. In our replacement for this code in our new FL2 proxy, which is written in Rust, the error did not occur.

Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise this straightforward error would have been detected immediately. And their implied solution seems to be not testing code when it's written, nor adding 100% coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.

  • Large scale infrastructure changes are often by nature completely untestable. The system is too large, there are too many moving parts to replicate with any kind of sane testing, so often, you do find out in prod, which is why robust and fast rollback procedures are usually desirable and implemented.

    • > Large scale infrastructure changes are often by nature completely untestable.

      You're changing the subject here and shifting focus from the specific to the vague. The two postmortems after the recent major Cloudflare outages both listed straightforward errors in source code that could have been tested and detected.

      Theoretical outages could theoretically have other causes, but these two specific outages had specific causes that we know.

      > which is why robust and fast rollback procedures are usually desirable and implemented.

      Yes, nobody is arguing against that. It's a red herring with regard to my point about source code testing.

      4 replies →

> This type of code error is prevented by languages with strong type systems.

True, as long as you don't call unwrap!

  • That's a different kind of error. And even then unwrap is opt-in whereas this is opt-out if you're lucky.

    Kind of funny that we get something showing the benefits of Rust so soon after everyone was ragging on about unwrap anyway!

Apart from Cloudflare's config system working a little too well at propagating failure modes:

the code quality on a very mission-critical path powering "half the internets" could've been better.

I'm not sure if Lua LSP / linting tools would've caught the issue (I've also never used Lua myself), but tools and methods exist to test mission-critical dynamically typed code.

A company with such a genuinely impressive concentration of talent could be expected to think about fuzzing this legacy crap somehow.

As for the `.unwrap()`-related incident: normally code like this should never pass review.

You just (almost) never unwrap in production code.

I'd start with code-quality tooling and, more importantly, the related processes, before even thinking about architecture changes.

Changing, in a global sense, an architecture that has served for years with 99.99(9)% uptime is not an obviously smart thing to do.

The architecture is doing great; it's just the impact that has been devastating because of the scale.

Everyone makes errors and that's fine, but there are ways not to roll shit into prod (reference to the famous meme pic where bugs do exactly that).

There's a lot of bad karma in this discussion. It's hard to run large services. Careful when you set a precedent of pillorying after an outage. It could be you next!

Yes, this is the second time in a month. Were folks expecting that to have been enough time for them to have made sweeping technical and organization changes? I say no—this doesn't mean they aren't trying or haven't learned any lessons from the last outage. It's a bit too soon to say that.

I see this event primarily as another example of the #1 class of major outages: a bad, rapid, global configuration change. (The last Cloudflare outage was too, but I'm not just talking about Cloudflare. Google has had many, many such outages; there was an inexplicable multi-year gap between recognizing this and having a good, widely available staged config rollout system for teams to drop into their systems.) Stuff like DoS-attack configuration needs to roll out globally quickly, but they really need to make it not quite this quick. Imagine they deployed to one server for one minute, then to one region for one minute on success, then everywhere on success (a sketch of that kind of gate follows this comment). Then this would have been a tiny blip rather than a huge deal.

(It can be a bit hard to define "success" when you're doing something like blocking bad requests that may even be a majority of traffic during a DDoS attack, but noticing 100% 5xx errors for 38% of your users due to a parsing bug is doable!)

As for the specific bug: meh. They should have had 100% branch coverage on something as critical (and likely small) as the parsing for this config. Arguably a statically typed language would have helped (but the `.unwrap()` error in the previous outage is a bit of a counterargument to that). But it just wouldn't have mattered that much if they caught it before global rollout.
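
A sketch of what that gate could look like, with hypothetical `deploy_to`, `error_rate`, and `rollback` hooks; the stages, bake times, and threshold are illustrative:

```rust
// Staged rollout gate: each stage bakes briefly; any error spike aborts
// and rolls back instead of continuing to a larger blast radius.
use std::{thread, time::Duration};

#[derive(Debug, Clone, Copy)]
enum Stage {
    Canary,    // one server
    OneRegion, // one region
    Global,    // everyone
}

fn deploy_to(stage: Stage) {
    println!("deploying config to {stage:?}");
}

fn error_rate(stage: Stage) -> f64 {
    // Placeholder: in reality this reads the 5xx rate from telemetry.
    let _ = stage;
    0.001
}

fn rollback() {
    println!("rolling back to previous config");
}

fn main() {
    for stage in [Stage::Canary, Stage::OneRegion, Stage::Global] {
        deploy_to(stage);
        thread::sleep(Duration::from_secs(60)); // bake time per stage
        if error_rate(stage) > 0.01 {
            rollback();
            return; // never reaches the global stage
        }
    }
    println!("rollout complete");
}
```

The essential property is that the global stage is unreachable unless the smaller blast radii stayed healthy.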

Having their changes fully propagate within 1 minute is pretty fantastic.

  • This is most likely a hard requirement for DDoS protection and detection at such a scale of deployment, which explains their architectural choices (ClickHouse & co.) and the need for super-low-latency config changes.

    Since attackers might rotate IPs more frequently than once per minute, this effectively means the whole fleet of servers has to react quickly to decisions made centrally.

> As part of our ongoing work to protect customers using React against a critical vulnerability, CVE-2025-55182, we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?

  • I think the buffer size is the limit on how much of the body they check for malicious data, so the old 128 KB limit would mean it's trivial to circumvent: just send 128 KB of OK data and put the exploit after it.

    • I got curious and I checked AWS WAF. Apparently AWS WAF default limit for CloudFront is 16KB and max is 64KB.

  • If the request data is larger than the limit it doesn’t get processed by the Cloudflare system. By increasing buffer size they process (and therefore protect) more requests.
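
    A toy illustration of the bypass both replies describe, with an invented `looks_malicious` check standing in for the WAF rules:

    ```rust
    // Toy model of a scanner that only inspects the first `limit` bytes of
    // the body. Padding the body past the limit hides the payload from
    // inspection, which is why raising the buffer toward the framework's
    // 1 MB default closes an easy bypass. Purely illustrative.
    fn looks_malicious(inspected: &[u8]) -> bool {
        inspected.windows(5).any(|w| w == b"<?php".as_slice())
    }

    fn waf_allows(body: &[u8], limit: usize) -> bool {
        let inspected = &body[..body.len().min(limit)];
        !looks_malicious(inspected)
    }

    fn main() {
        let mut body = vec![b' '; 128 * 1024]; // 128 KB of harmless padding...
        body.extend_from_slice(b"<?php system($_GET['cmd']); ?>"); // ...then the payload

        assert!(waf_allows(&body, 128 * 1024));   // old limit: bypassed
        assert!(!waf_allows(&body, 1024 * 1024)); // larger buffer: caught
        println!("payload hidden beyond the smaller inspection window");
    }
    ```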

> Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.

Doesn't Cloudflare rigorously test their changes before deployment to make sure that this does not happen again? This better not have been used to cover for the fact that they are using AI to fix issues like this one.

There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all, and I expected Cloudflare to learn from the previous outage very quickly.

But this is becoming quite a pattern; we might need to rank their unreliability right next to GitHub's (which goes down every week).

> This first change was being rolled out using our gradual deployment system.

So they are aware of some basic mitigation tactics guarding against errors

> This system does not perform gradual rollouts,

They just choose to YOLO

> Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”,

> However, we have never before applied a killswitch to a rule with an action of “execute”.

Do they do no testing? This isn't even fuzzing with “infinite” variations; it's a limited list of actions.

> existed undetected for many years. This type of code error is prevented by languages with strong type systems.

So the solution is also well known, just ignored for years, because "if it's not broken, don't fix it", right?

First, what Cloudflare does is hard and I want to start with that.

That being said, I think it’s worth a discussion. How much of the last 3 outages was because of JGC (the former CTO) retiring and Dane taking over?

Did JGC have a steady hand that’s missing? Or was it just time for outages that would have happened anyway?

Dane has maintained a culture of transparency which is fantastic, but did something get injected in the culture leading towards these issues? Will it become more or less stable since JGC left?

Curious for anyone with some insight or opinions.

(Also, if it wasn’t clear - huge Cloudflare fan and sending lots of good vibes to the team)

  • Looking at Dane's career history on LinkedIn, it appears that he has only ever been in product and some variety of manager, and his degree is in 'Engineering Management System'. It's an odd choice given that the previous two CTOs (Lee and John) were extremely technical and how core technology is to Cloudflare.

    As with any organisation where the CTO is not technical, there will be someone who the 'CTO' has to ask to understand technical situations. In my opinion, that person being asked is the real CTO, for any given situation.

I still don't understand what Cloudflare's business model is, yet they manage to make the news.

I don't see how their main product is DDoS protection, yet Cloudflare itself goes down for some reason.

This company makes zero sense to me.

  • Cloudflare protects against DDOS but also various forms of malicious traffic (bots, low reputation IP users, etc) and often with a DDOS or similar attacks, it's better to have the site go down from time to time than for the attackers to hammer the servers behind cloudflare and waste mass amounts of resources.

    i.e. it's the difference between "the site goes down for a few hours every few months" and "an attacker slammed your site and, through on-demand scaling or serverless cloud fees, blew your entire infrastructure budget for the year".

    Doubly so when your service is part of a larger platform and attacks on your service risk harming your reputation for the larger platform.

If I'm remembering correctly, there was another outage around 10 days ago.

It still surprises me that there are basically no free alternatives comparable to Cloudflare. Putting everything on CF creates a pretty serious single point of failure.

It's strange that in most industries you have at least two major players, like Coke vs. Pepsi or Nike vs. Adidas. But in the CDN/edge space, there doesn't seem to be a real free competitor that matches Cloudflare's feature set.

It feels very unhealthy for the ecosystem. Does anyone know why this is the case?

  • I reckon AWS is "free enough" for most of its users, but it's not as easy nor as safe for the common user.

    • Totally agree, AWS's free tier is great for many users, but it can definitely be tricky and risky for the average person.

Suggestion for Cloudflare: Create an early adopter option for free accounts.

Benefit: Earliest uptake of new features and security patches.

Drawback: Higher risk of outages.

I think this should be possible since they already differentiate between free, pro, and enterprise accounts. I do not know how the routing for that works, but I bet they could do this. Think crowd-sourced beta testers. Also a perk for anything where PCI-audit or FedRAMP security is prioritized over uptime.

  • I would for sure enable this, my personal server can handle being unreachable for a few hours in exchange for (potentially) interesting features.

  • They already do, in a way: the LaLiga blocking problems in Spain don't affect the paid accounts (i.e., large websites).

    Another suggestion is to do rollouts during the night shift in each country; right now they only take the US night into account.

The interesting aspect of the Cloudflare post, which is not clarified, is how they came to the risk assessment that it is OK to roll out a change globally and non-gradually without testing the procedure first. The only justification I can see is that the React/Next.js remote-command-execution vulnerabilities are being actively exploited. But if that is the case, they should say so.

I wonder if anyone internal could share a bit about the culture. I'm mostly interested in the following part:

If someone messes up royally, is there someone who says "if you break the build/whatever super critical, then your ass is the grass and I'm the lawn mower"?

From a customer perspective, I think there should be an option:

- prioritize security: get patches ASAP

- prioritize availability: get patches after a cooldown period

Because ultimately, it's a tradeoff that cannot be handled by Cloudflare. It depends on your business, your threat model.

I don't miss working with Lua in proxies. I think this is no big thing; they rolled back the change fairly quickly. Still bad, but the outage in mid-November was worse, since that was many bad decisions stacking up and it took too long to resolve.

They bypassed the gradual rollout system in order to meet a deadline for a CVE. They put security above availability; a tough tradeoff. Is there a non-prod environment where that one-off WAF-testing-tool change could have been tested?

Dang… I don’t even use React and it still brings down my sites. Good beats I guess.

The problem that irks me isn’t that Cloudflare is having outages (everyone does and will at some point, no matter how many 9’s your SLA states), it’s that the internet is so damn centralized that a Cloudflare issue can take out a continent-sized chunk of the internet. Kudos to them on their success story, but oh my god that’s way too many eggs in one basket in general.

I notice that this is the kind of thing that solid sociable tests ought to have caught. I am very curious how testable that code is (random procedural if-statements don't inspire high confidence.)

Curious if there isn't a way to ingest the incoming traffic at scale, but route it to a secondary infrastructure to make sure it's resolving correctly, before pushing it to production?

1.1.1.1 domain test server, whether a relay or endpoints including /cdn-cgi/trace is WAF testing error, for 500 HTTP network & Cloudflare managed R-W-X permissions

Ironically, this time around the issue was in the proxy they're going to phase out (and replace with the Rust one).

I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.

HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.

At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kind of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability and this is the result.

Say what you want, but I'd prefer to trust Cloudflare, which admits and acts upon its fuckups, rather than one that tries to cover them up or downplays them like some other major cloud providers.

@eastdakota: ignore the negative comments here, transparency is a very good strategy and this article shows a good plan to avoid further problems

  • > I truly believe they're really going to make resilience their #1 priority now

    I hope that was their #1 priority from the very start given the services they sell...

    Anyway, people always tend to overthink these black-swan events. Yes, two happened in quick succession, but what is the average frequency overall? Insignificant.

    • This is Cloudflare. They've repeatedly broken DNS for years.

      Looking across the errors, it points to some underlying practices: a lack of systems metaphors, modularity, and testability, and a reliance on super-generic configuration instead of software with enforced semantics.

    • I think they have to strike a balance between being extremely fast (reacting to vulnerabilities and DDOS attacks) while still being resilient. I don't think it's an easy situation

  • I would very much like for him not to ignore the negativity, given that, you know, they are breaking the entire fucking Internet every time something like this happens.

    • This is the kind of comment I wish he would ignore.

      You can be angry - but that doesn't help anyone. They fucked up, yes, they admitted it and they provided plans on how to address that.

      I don't think they do these things on purpose. Of course given their good market penetration they end up disrupting a lot of customers - and they should focus on slow rollouts - but I also believe that in a DDOS protection system (or WAF) you don't want or have the luxury to wait for days until your rule is applied.

      5 replies →

  • > HugOps

    This childish nonsense needs to end.

    Ops are heavily rewarded because they're supposed to be responsible. If they're not then the associated rewards for it need to stop as well.

    • I have never seen an Ops team rewarded for avoiding incidents (by focusing on tech-debt reduction); instead they get the opposite - blamed when things go wrong.

      I think it's human nature (it's hard to realize something is going well until it breaks), but still has a very negative psychological effect. I can barely imagine the stress the team is going through right now.

      6 replies →

    • Ops has never been "rewarded" at any org I've ever been at or heard about, including physical infra companies.

Is it crazy to anyone else that they deploy every 5 minutes? And that it's not just config updates, but actual code changes with this "execute" action.

  • No: I've been at plenty of places where we get to continuous deployment, where any given change is deployed on demand.

    What is wild is that they are deploying without first testing in a staging environment.

  • Config updates are not so clear cut from code changes.

    Once I worked with a team in the anti-abuse space where the policy was that code deployments must happen over 5 days while config updates can take a few minutes. Then an engineer on the team argued that deploying new Python code doesn't count as a code change because the CPython interpreter did not change; it didn't even restart. And indeed, given how dynamic Python is, it is totally possible to import new Python modules that did not exist when the interpreter process was launched.

A lot of these kinds of bugs feel like they could be caught by a simple review bot like Greptile... I wonder if Cloudflare uses an equivalent tool internally?

  • What makes greptile a better choice compared to claude code or codex, in your opinion?

  • That has not been my experience with those tools.

    Super-procedural code in particular is too complex for humans to follow, much less AI.

Make faster websites:

> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.

Why is the Next.js limit 1 MB? It's not enough for uploading user-generated content (photographs, scanned invoices), but a 1 MB request body even for multiple JSON API calls is ridiculous. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request-body limit. Otherwise all web applications will become as slow as the MS Office suite or Reddit.

When should we just give up on Cloudflare? Seems like this just keeps happening. Like some kind of backdoor triggered willy nilly, Hmmm?

  • Now. Right now. Seriously, stop using this terrible service. We also need to change the narrative that step 1 in every tutorial is "sign up for Cloudflare". This is partly a culture problem.

Is it just me, or did Cloudflare outages increase since LLM "engineers" were hired remotely? Do you think there is a correlation?

  • They've always been flakey. At least these only impacted their own customers instead of taking down the internet.

Before, Cloudflare suffered an outage due to React's useEffect; now it's React again, this time trying to mitigate security issues around React Server Components.

At some point they'll have to admit this React thing ain't working and just use classic server-rendered pages, since their dashboards are simple toggle controls.

As a reliability statistician (and web user!), I'd love to see Cloudflare investing in reliability statistics. :)

I have to wonder if there is a relation to the rising prevalence of coding LLMs.

Every time they screw up they write an elaborate postmortem and pat themselves on the back. Don't get me wrong, better have the postmortem than not. But at this point it seems like the only thing they are good at is writing incident postmortem blog posts.

I am not sure if it's just me, or there have really been too many outages this year to count. Is it the AI slop making it into production?

Honestly, a lot of these problems come from not testing in a staging environment; isn't this software-engineering basics?

Is it just me, or should they have just reverted instead of making _another_ change as a result of the first one?

Also, it's very, very weird that they had not caught this seemingly obvious bug in proxy buffer-size handling. It points to change no. 2, made in "reactive" mode to change no. 1 that broke things, having NOT BEEN TESTED AT ALL! Which is the core reason they should never have deployed it, but rather reverted to a known good state and then tested BOTH changes combined.

Classic. Things always get worse before they get better. I remember when Netflix was going through their annus horribilis, and AWS before that, and Twitter before that, and so on. Everyone goes through this. Good luck to you guys getting to FL2 quickly enough that this class of error reduces.

I’m really sick of constantly seeing cloudflare, and their bullshit captchas. Please, look at how much grief they’re causing trying to be the gateway to the internet. Don’t give them this power

> This change was being rolled out using our gradual deployment system, and, as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules. As this was an internal tool, and the fix being rolled out was a security improvement, we decided to disable the tool for the time being as it was not required to serve or protect customer traffic.

Come on.

This PM raises more questions than it answers, such as why exactly China would have been immune.

It's not an outage, it's an Availability Incident™.

https://blog.cloudflare.com/5-december-2025-outage/#what-abo...

  • You jest, but recently I also felt compelled to stop using the word (planned) outage where I work, because it legitimately creates confusion around the (expected) character of impact.

    An outage is the nuclear-wasteland situation, which, given modern architectural choices, is rather challenging to manifest. Avoiding the word is face-saving, but also more correct.

  • From earlier in the very same blog post (emphasis added):

    > This system does not perform gradual rollouts, but rather propagates changes within seconds to the entire fleet of servers in our network and is under review following the outage we experienced on November 18.

Some nonsense again. The level of negligence there is astounding. This is frightening because this entity is exposed daily to a large portion of our personal data going over the wire, as well as business data. It's just a matter of time before a disaster occurs. Some regulatory body must take this in hand right now.

I wonder why they cannot do a partial rollout; like the other outage, they had to do a global rollout.

  • I really don't see how it would've helped. In Go or Rust you'd just get a panic, which is in no way different.

  • The article mentions that this Lua-based proxy is the old generation one, which is going to be replaced by the Rust based one (FL2) and that didn't fail on this scenario.

    So, if anything, their efforts towards a typed language were justified. They just didn't manage to migrate everything in time before this incident - which is ironically a good thing, since this incident was caused mostly by a rushed change in response to an actively exploited vulnerability.

> Customers that did not have the configuration above applied were not impacted. Customer traffic served by our China network was also not impacted.

Interesting.

  • They kinda buried the lede there: a 28% failure rate for 100% of customers isn't the same as a 100% failure rate for 28% of customers.

Unwrap() strikes again