
Comment by lordofgibbons

4 hours ago

How did we get to a place where either Cloudflare or AWS having an outage means a large part of the web going down? This centralization is very worrying.

Because no one cares enough, including users.

Oddly this centralization allows a complete deferral of blame without you even doing anything: if you’re down, that’s bad. But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.

It also reduces your incentive to change: if “the internet is down,” people will put down their device and do something else. Even if your website is up, they’ll assume it isn’t.

I’m not saying this is a good thing but I’m simply being realistic about why we ended up where we are.

  • As a user I do care, because I waste so much time on Cloudflare's "prove you are human" blocking page (why do I have to prove it over and over again?), and I frequently run into websites blocking me entirely because of some bad IP blacklist used along with Cloudflare.

    • Unfortunately the internet sucks in 2025.

      If you have a site with valuable content, the LLM crawlers hound you to no end. CF is basically a protection racket at this point for many sites. It doesn't even stop the more determined ones, but it keeps some away.

      7 replies →

    • I just realized: why don't they have some "definitely human" third-party cookie that caches your humanness for 24h or so? I'm sure there's a reason (I've heard third-party cookies are less respected now), but can someone chime in on why this wouldn't work and save a ton of compute?

      7 replies →

    • I hate it just as much (and the challenge time seems to be getting longer; 10s lately for me, what the hell?)

      But we can all say thank you to all the AI crawlers who hammer websites with impossible traffic.

      1 reply →

  • Users have no options because... everything has been centralized. So it doesn't matter if users care or not.

    Users are never a consideration today anyway.

    • There absolutely are options, but we aren't using them because nobody cares enough about these downsides. bsky is up, and with Mastodon you even have a choice between tons of servers, or setting up your own. Yet nobody cares enough about the occasional outage to switch. It's such a minor inconvenience that it won't move the needle one bit. If people actually cared, businesses would lose customers and correct the issue.

    • It is a trade-off between convenience and freedom. Netflix vs buying your movies. Spotify vs MP3s. Most tech products have alternatives, but you need to be flexible and adjust your expectations. Most people are not willing to do that.

      3 replies →

  • 100% this. While in my professional capacity I'm all in for reliability and redundancy, as an individual I quite like these situations when it's obvious that I won't be getting any work done and it's out of my control, so I can go run some errands, read a book, or just finish early.

  • Who cares if a couple of websites are down a day or even two?

    As long as HN is up and running, everything is going to be O.K.!

    • Wealthy, investment-bloated software companies will be fine.

      Smaller companies that provide real-world services or goods for a much more meagre living, and that rely on some of the services sold to them by said software companies, will be impacted much more severely.

      Losing a day or two of sales to someone who relies on making sales every day can be a growing hardship.

      This doesn’t just impact developers. It’s exactly this kind of myopic thinking that leads to scenarios like mass outages.

      3 replies →

  • > But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.

    In my direct experience, this isn't true if you're running something even vaguely mission-critical for your customers. Your customer's workers just know that they can't do their job for the day, and your customer's management just knows that the solution they shepherded through their organization is failing.

    • It's really quite funny: many of the ACTUALLY vital systems to running the world as we know it run on very different software. Cloudflare appears to have a much higher percentage of non-vital systems running on it than, say, something like Akamai.

      If Akamai went down, I have a feeling you'd see a whole lot more real-life chaos.

  • > if “the internet is down” people will put down their device and do something else

    In this case, the internet should be down more often.

    • If the internet being down is what it takes to get you to put it down once in a while, I think that's probably the problem.

  • Which "user" are you referring to? Cloudflare users or end product users?

    End product users have no power, they can complain to support and maybe get a free month of service, but the 0.1% of customers that do that aren't going to turn the tide and have anything change.

    Engineering teams using these services also get "covered" by them - they can finger point and say "everyone else was down too."

  • Many people care, but none of them can (sufficiently) change the underlying incentive structure to effect the necessary changes.

  • This is essentially the entire IT excuse for going to anything cloud. I see IT engineers all the time justifying it because the downtime stops being their problem and they stop being the ones to blame for it. There's zero personal responsibility for trying to preserve service, because it isn't "their problem" anymore. Anyone who thinks the cloud makes service more reliable is absolutely kidding themselves, because everyone who made the decision to go that way already knows it isn't true; it just won't be their problem to fix.

    If anyone in the industry actually cared about reliability and took a personal stake in their system being up, everyone would be back on-prem.

    • Reliability is not even how the cloud got sold to the C-suite. Good God, when my last company started putting things on Azure back in 2015, stuff would break weekly, usually on Monday mornings.

      No, the value proposition was always about saving money, turning CapEx into OpEx. Direct quote from my former CEO maybe 9 years ago: We are getting out of the business of buying servers.

      Cloud engineering involves architecting for unexpected events: retry patterns, availability zones, multi-region failover, that sort of thing.

      Now - does it all add up to cost savings? I could not tell you. I have seen some case studies, but I also have been around long enough to take those with a big grain of salt.

      11 replies →

    • I mean in the end it's about making a trade off that makes sense for your business.

      If the business can live with a couple of hours downtime per year when "cloud" is down, and they think they can ship faster / have less crew / (insert perceived benefit), then I don't know why that is a problem.

  • More like "don't have a choice". It's not like customers are going to defect to the competition, because before you could even switch, the service would be back up.

    Frankly it's a blessing, always being able to blame the cloud that management forced the company to migrate to in order to be "cheaper" (which half of the time turns out to be false anyway).

  • > It also reduces your incentive to change, if “the internet is down” people will put down their device and do something else. Even if your web site is up they’ll assume it isn’t.

    I agree. When people talk about the enshittification of the internet, Cloudflare plays a significant role.

  • > Because no one cares enough, including users.

    When have users been asked about anything?

  • But Spotify was not down. One social media site was down.

    This:

    > if you’re down, that’s bad. But if you’re down, Spotify is down, social media is down… then “the internet is broken” and you don’t look so bad.

    is just marketing. If you are down with some other websites it is still bad.

    • Admittedly, when I wrote that I was thinking about the recent AWS outage. Anecdotally, I asked friends and family about their experience and they assumed the internet was down. Almost everything at my work runs on Google Cloud, so we were still running, but we observed a notable dip in traffic during the outage all the same.

      > it is still bad

      No doubt. But there's a calculation to make: is it bad enough to spend the extra money on mitigations, to hire extra devops folks to manage it all… and in the majority of end-user-facing cases the answer is no, it isn't.

      1 reply →

    • > If you are down with some other websites it is still bad.

      In some cases, absolutely. For the vast majority, it really, really doesn't matter.

      (Source: my personal website is down and nobody cares, including me)

  • > Because no one cares enough, including users.

    this is like a bad motivational-speaker talk: heavy exhortations with a dramatic lack of actual reasoning.

    Systems are difficult, people. It is the "incentives" of parties and lock-in by tech design and vendors, not a lack of individual effort.

  • Eh? It's because they are offering a service too good to refuse.

    The internet these days is fucking dangerous and murderous as hell. We need Cloudflare just to keep services up under the deluge of AI data scrapers and other garbage.

Many reasons, but DDoS protection has massive network effects. The more customers you have (and therefore the more bandwidth you provision), the easier it is to hold up against a DDoS, as a DDoS usually targets just one customer.

So there are massive economies of scale. A small CDN with (say) 10,000 customers and 10 Mbit/s provisioned per customer can handle a 100 Gbit/s DDoS (way too simplistic, but hopefully you get the idea) - way too small.

If you have the same traffic provisioned on average per customer and have 1 million customers, you can handle a DDoS 100x the size.
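A rough sanity check of that arithmetic in Python (my own illustration, using the simplistic numbers above):

    # Back-of-envelope: aggregate bandwidth a CDN can throw at a DDoS,
    # assuming provisioning scales with customer count and the attack
    # targets a single customer.
    MBIT_PER_CUSTOMER = 10  # provisioned per customer, as in the example

    def absorbable_gbits(customers: int) -> float:
        """Total provisioned bandwidth (Gbit/s) available to soak up an attack."""
        return customers * MBIT_PER_CUSTOMER / 1000

    print(absorbable_gbits(10_000))     # small CDN:   100 Gbit/s
    print(absorbable_gbits(1_000_000))  # big CDN:  10,000 Gbit/s (100x)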

The only way to compete with this is to massively overprovision bandwidth per customer (which is expensive, as those customers won't pay more just so you can have more redundancy despite being smaller).

In a way (like many things in infrastructure), CDNs are natural monopolies. The bigger you get -> the more bandwidth and PoPs you can have -> the more attractive you are to customers (and this repeats over and over).

It was probably very astute of Cloudflare to realise that offering such a generous free plan was a key step in this.

  • Your argument is technically flawed.

    In a CDN, customers consume bandwidth; they do not contribute it. If Cloudflare adds 1 million free customers, they do not magically acquire 1 million extra pipes to the internet backbone. They acquire 1 million new liabilities that require more infrastructure investment.

    All you are doing is echoing their pitch book. Of course they want to skim their share of the pie.

    • > In a CDN, customers consume bandwidth; they do not contribute it

      They contribute money which buys infrastructure.

      > If Cloudflare adds 1 million free customers,

      Is the free tier really made up of customers? Regardless, most of them are so small that it doesn't cost Cloudflare much anyway. The infrastructure is already there. It's worth it to them for the goodwill it generates, which leads to future paying customers. It probably also gives them visibility into what is good vs bad traffic.

      1 million small sites could very well cost less to cloudflare than 1 big site.

    • You're missing the economies of scale part.

      OP is saying it's cheaper overall for a 10 million customer company to add infrastructure for 1 million more than it is for a 10,000 customer company to add infrastructure for 1000 more people.

      If you're looking at this as a "share of the pie", it's probably not going to make sense. The industry is not zero sum.

    • I imagine every single customer is provisioned based on some expected peak traffic, and that's what they base their capital investment in bandwidth on.

      However, most customers are rarely at their peak. Assuming attacks on different customers are uncorrelated, that leaves tremendous spare capacity that is frequently doing nothing and can be used to eat DDoS attacks. Cloudflare advertises this spare capacity as "DDoS protection."

      I suppose in theory it might be possible to massively optimise utilisation of your links, but that would come at the cost of DDoS protection and might not improve your margin very meaningfully, especially if customers care a lot about being online.
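      As a toy illustration of that headroom argument (the numbers are mine, purely illustrative):

          # Toy model: capacity is provisioned for each customer's peak, but at
          # any instant most customers sit far below peak. The gap is "free"
          # DDoS headroom, as long as attacks on customers are uncorrelated.
          customers = 100_000
          peak_mbit = 10      # provisioned per customer
          typical_mbit = 1    # actually used most of the time

          provisioned = customers * peak_mbit / 1000   # Gbit/s
          in_use = customers * typical_mbit / 1000     # Gbit/s
          print(provisioned - in_use)                  # 900 Gbit/s of idle headroom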

  • And how many companies want to also be able to build out their own CDN?

    Not every company can be an expert at everything.

    But perhaps many of us could buy a different CDN than the major players if we want to reduce the likelihood of mass outages like this though.

  • In my opinion, DDoS is possible only because there is no network protocol for a host to control traffic filtering at upstream providers (deny traffic from certain subnets or countries). If there were, everybody would prefer to write their own systems rather than rely on a harmful monopoly.

    • The recent Azure DDoS used 500k botnet IPs. These will have been widely distributed across subnets and countries, so your blocking approach would not have been an effective mitigation.

      Identifying and dynamically blocking the 500k offending IPs would certainly be possible technically -- 500k /32s is not a hard filtering problem -- but I seriously question the operational ability of internet providers to perform such granular blocking in real-time against dynamic targets.

      I also have concerns that automated blocking protocols would be widely abused by bad actors who are able to engineer their way into the network at a carrier level (i.e. certain governments).

      3 replies →

    • What traffic would you request the upstream providers to block if getting hit by Aisuru? Considering the botnet consists of residential routers, those are the same networks your users will be originating from. Sure, in best case, if your site is very regional, you can just block all traffic outside your country - but most services don't have this luxury.

      Blocking individual IP addresses? Sure, but consider that before your service detects enough anomalous traffic from one particular IP and is able to send the upstream block request, your service will already be down from the aggregate traffic. Even a "slow" DDoS with <10 packets per second per source is enough to saturate your 10 Gbps link if the attacker has a million machines to originate traffic from.
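      The arithmetic holds; a quick check in Python (assuming ~1,500-byte packets, my assumption):

          # 1M sources at 10 packets/s each vs a 10 Gbit/s link.
          sources = 1_000_000
          pps_per_source = 10
          packet_bytes = 1_500  # assumed full-size packets

          aggregate_gbits = sources * pps_per_source * packet_bytes * 8 / 1e9
          print(aggregate_gbits)  # ~120 Gbit/s, an order of magnitude over the link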

      6 replies →

    • > there is no network protocol for a host to control traffic filtering on upstream providers (deny traffic from certain subnets or countries).

      There is no network protocol per se, but there are commercial solutions like Fortinet that can block countries, IIRC. Note that it's only IP-range based, so it's not worth much.

      3 replies →

Yeah, I went to HN after the third web page didn't work. I am not just worried about the single point of failure, I am much more worried about this centralization eventually shaping the future standards of the web and making it de facto impossible to self-host anything.

Well, that and the fact that when 99% of traffic goes through a central party, that central party becomes very interesting for authoritarian governments to apply sweeping censorship rules to.

  • It is already nearly impossible, or very expensive, in my country to get a public IP address (even IPv6) which you could host on. The world is moving heavily towards being centrally dependent on these big cloud providers.

    • What part of the world has any IPv6 limitations? In the USA, an ISP will give you a /48 from their /32 if you have any colo arrangement, without even a blink. That gives you 2^16 networks with an essentially infinite number of hosts on each network. Zero additional charge.
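      For the curious, the subnet math behind that claim:

          print(2 ** (48 - 32))  # 65,536 possible /48 delegations in the ISP's /32
          print(2 ** (64 - 48))  # 65,536 standard /64 networks inside your /48
          print(2 ** 64)         # host addresses per /64: effectively infinite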

  • > eventually shaping the future standards of the web and making it de facto impossible to self-host anything

    Eventually?

Another one that worries me is Let's Encrypt.

It is not as bad as Cloudflare or AWS because certificates will not expire the instant there is an outage, but consider that:

- It serves about 2/3 of all websites

- TLS is becoming more and more critical over time. If certificates fail, the web may as well be down

- Certificate lifetimes are becoming shorter and shorter: now 90 days, but Let's Encrypt is considering 6 days, and 47 days is planned as the industry-wide maximum

- An outage is one thing, but should a compromise happen, that would be even more catastrophic

Let's Encrypt is a good guy now, but remember that Google used to be a good guy in the 2000s too!

  • Agree, I’ve thought about this one too. The history of SSL/TLS certs is pretty hacky anyway, in my opinion. The main problem they solve really should have been solved at the network layer, with ubiquitous IPsec and key distribution via DNS: most users just blindly trust whatever root CAs ship with their browser or OS, and the ecosystem has been full of implementation and operational issues.

    Let’s Encrypt is great at making the existing system less painful, and there are a few alternatives like ZeroSSL, but all of this automation is basically a pile of workarounds on top of a fundamentally inappropriate design.
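    One practical mitigation for the single-CA risk is teaching your ACME tooling to fall back to an alternate CA. A minimal sketch using certbot's --server flag (a sketch only: directory URLs, account registration, and EAB requirements vary by CA; ZeroSSL, for instance, requires External Account Binding credentials):

        # Sketch: try Let's Encrypt first, fall back to an alternate ACME CA.
        import subprocess

        ACME_DIRECTORIES = [
            "https://acme-v02.api.letsencrypt.org/directory",  # Let's Encrypt
            "https://acme.zerossl.com/v2/DV90",                # ZeroSSL (needs EAB flags)
        ]

        def issue_cert(domain: str) -> bool:
            """Attempt issuance against each CA in order; stop at first success."""
            for directory in ACME_DIRECTORIES:
                result = subprocess.run([
                    "certbot", "certonly", "--standalone", "--non-interactive",
                    "--agree-tos", "-d", domain, "--server", directory,
                ])
                if result.returncode == 0:
                    return True
            return False

        issue_cert("example.com")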

    • There's not really a way around the initial trust problem with consumer-oriented certs though. Your approach could reduce the number of initially trusted parties down to one, I think, but not any further.

  • Google was always a for-profit operation. Let's Encrypt/ISRG could still go rotten, but there are fewer incentives for them to do so as a non-profit.

Mostly since the AWS craze started a decade ago, developers have moved away from dedicated servers (which are actually cheaper, go figure), and that is causing all this mess.

It's genuinely insane that many companies design a great number of fallbacks... at the software level, but almost no thought is given to the hardware/infrastructure level. Common sense dictates that you should never host everything on a single provider.

  • I tried as hard as I could to stay self hosted (and my backend is, still), but getting constant DDoS attacks and not having the time to deal with fighting them 2-3x a month was what ultimately forced me to Cloudflare. It's still worse than before even with their layers of protection, and now I get to watch my site be down a while, with no ability to switch DNS to point back to my own proxy layer, since CF is down :/

    • This is wild. Was your website somehow controversial? I've been running many different websites for 30+ years now, and have never been the target of a DDoS. The closest I've seen was when one website had a blind time-based SQL injection vulnerability and the attacker was abusing it; all the SLEEP() calls injected into the database brought the server to a crawl. But that was just one attacker from a handful of IPs, hardly what I would call a DDoS.

      5 replies →

  • With the state of constant attack from AI scrapers and DDoS bots, you pretty much need a CDN from someone now if you run a serious business service. The poor guys with single on-prem boxes serving static HTML can /maybe/ weather some of this storm alone, but not everything.

  • I self hosted on one of the company’s servers back in the late 90s. Hard drive crashes (and a hack once, through an Apache bug) had our services (http, pop, smtp, nfs, smb, etc ) down for at least 2-3 days (full reinstall, reconfiguration, etc).

    Then, with regular VPSs I also had systems down for 1-2 days. Just last week the company that hosts NextCloud for us was down the whole weekend (from Friday evening) and we couldn’t get their attention until Monday.

    So far these huge outages that last 2-5 hours are still lower impact for me, and require me to take less action.

    • Solving an issue for a few, and creating issues for millions, perhaps including those same few. It is easier to sleep at night though, for the few.

  • I like the idea of having my own rack in a data center somewhere (or sharing the rack, whatever) but even a tiny cost is still more than free. And even then, that data center will also have outages, with none of the benefits of a Cloudflare Pages, GitHub Pages, etc.

  • > developers have gone away from Dedicated servers (which are actually cheaper, go figure)

    It depends on how you calculate your cost. If you only include the physical infrastructure, a dedicated server is cheaper. But with a dedicated server you lose a lot of flexibility. Need more resources? Just scale up your EC2 instance; with a dedicated server there is a lot more work involved.

    Do you want a 'production-ready' database? With AWS you can just click a few buttons and have an RDS instance ready to use. To roll out your own PG installation you need someone with a lot of knowledge (how to configure replication? backups? updates? ...).

    So if you include salaries in the calculation, the result changes a lot. And even if you already have some experts on your payroll, by putting them to work deploying a PG instance you won't be able to use them to build other things that may generate more value for your business than the premium you pay to AWS.

  • Cloud hosters are that hardware fallback. They started out offering better redundancy and scaling than your homemade breadbox, but it seems they lost something along the way, and now we have this.

  • Maintenance cost is the main issue for on-prem infra. Nowadays, add things like DDoS protection and/or scraping protection, which can require a dedicated team, or force your company to rely on some library or open-source project that is not guaranteed to be maintained forever (unless you support them, which I believe in)... Yeah, I can understand why companies shift off of on-prem nowadays.

  • ... dedis are cheaper if you are right-sized. If you are wrong-sized, they just plain crash, and you may or may not be able to afford the upgrade.

    I was at SoftLayer before I was at AWS, and what catalyzed the move was the time I needed to add another hard drive to a system and somehow they screwed it up. I couldn't put in a trouble ticket to get it fixed because my database record in their trouble-ticket system was corrupted. The next day I moved my stuff to AWS, and the day after that they had a top sales guy talk to me to try to get me to stay, but it was too late.

They're using Cloudflare for multicloud, but still have Cloudflare as a single point of failure. Should make a Cloudflare for Cloudflare to solve this.

  • Like the infamous "smiling through the pain" meme:

    "I added a load-balancer to improve system reliability" (happy)

    "Load balancer crashed" (smiling-through-the-pain)

    • Reliability has a very weird curve, frankly.

      Technically, a multi-node cluster with failover (or full-on active-active) will have far higher uptime than a single node.

      Practically, getting the multi-node cluster (for any non-trivial workload) to work right, reliably, and fail over in every case is far more work and far more code (that can have more bugs), and even if you do everything right and test what you can, unexpected stuff can still kill it. Like recently: we had an uncorrectable memory error which just happened to hit the ceph daemon in exactly the right way that one of the OSDs misbehaved and bogged down the entire cluster.

  • You jest, but this actually does exist. Multiple CDNs sell multi-CDN load balancing (divide traffic between 2+ CDNs per variously-complicated specifications, with failover) as a value add feature, and IIRC there is at least one company for which this is the marquee feature. It's also relatively doable in-house as these things go.

  • Failover to Akamai.

    • As someone who has worked for a CDN for over a decade, this is what most big customers do. Under normal circumstances, they send portions of traffic to different CDNs, usually based on cost (and or performance in various regions). When an issue happens, they will pull traffic from the problem CDN.

      Of course, if a big incident happens for a big CDN, there might not be enough latent capacity in the other CDNs to take all the traffic. CDNs are a cutthroat business, with small margins, so there usually isn’t a TON of unused capacity laying around.

  • If there’s clearly a single point of failure shouldn’t it be called a single cloud pretending to be “multicloud”?

This might sound crazy as a software engineer, but I actually like the occasional "snow day" where everything goes down. It's healthy for us to all disconnect from the internet for a bit. The centralization unintentionally helps facilitate that. At least, that's my glass half full perspective.

  • I can understand that sentiment. Just don't lose sight of the impact it can have on every day people. My wife and I own a small theatre and we sell tickets through Eventbrite. It's not my full time job but it is hers. Eventbrite sent out an email this morning letting us know that they are impacted by the outage. Our event page appears to be working but I do wonder if it's impacting ticket sales for this weekend's shows.

    So while those of us in tech might like a "snow day", there are millions of small businesses and people trying to go about their day-to-day lives who get cut off because of someone else's fuck-ups when this happens.

    • Absolutely solid point; there are a couple of apps I use daily for productivity, chores, even alarm scheduling, where on the free versions the ads wouldn't load so I couldn't use them (though some of them have been updated already). Made me realize we're kind of like cyborgs, relying on technology integrated so deeply into our lives that all it takes is an EMP blast, like a monopolistic service going down, to bring -us- down until we take a breath and learn how to walk again. Wild time.

  • If the internet was just social media, SaaS productivity suites, and AI slop, sure...

    But there are systems that depend on Cloudflare, directly or not, and when they go down it can have a serious impact on somebody's livelihood.

  • I'm guessing you're employed and your salary is guaranteed regardless. Would you have the same outlook if you were the self-employed founder of an online business and every minute of outage was costing you money?

    • What are you paying in order to be down?

      Even if you were making a million a minute, typically it still didn't cost you a thing, nor have you lost anything.

      You're not making as much, sure, but that's neither a cost nor a loss.

      2 replies →

It's not only centralization in the sense that your website will be down if they are down; it is also a centralized MITM proxy. If you transfer sensitive data like chats over Cloudflare-"protected" endpoints, you also allow CF to transparently read and analyze it in plain text. It must be very easy for state agencies to spy on the internet nowadays; they would just ask CF to redirect traffic to them.

How did we get to a place where Cloudflare being down means we see an outage page, but on that page it tells us explicitly that the host we're trying to connect to is up, and it's just a Cloudflare problem.

If it can tell us that the host is up, surely it can just bypass itself to route traffic.

  • "... surely it can just ..."

    Congratulations, you've successfully completed Management Training 101.

Now that network effects and data lock-in have taken root, downtime is not as big of a concern as it was in the 2000s.

  • What does this even mean? Because people have locked in their data, they’re ok with downtime? I can’t imagine a world where this is true.

    • It costs a lot of money to move, you don't know if the alternative will be any better, and if it affects a lot of companies then it's nobody's fault. "Nobody ever got fired for buying Cloudflare/AWS" as they say.

    • it's not just that; it's the creation of a sort of status symbol, or at least a symbol of normality.

      there was a point (maybe still) where not having a Netflix subscription was seen as 'strange'.

      if that's the case in your social circles -- and these kinds of social things bother you -- you're not going to cancel the subscription over bad service until doing so becomes a socially accepted norm.

    • It's just that customers are more understanding when they see their Netflix not working either; otherwise they just think you're less professional. Try talking to customers after an outage and you will see.

  • except, y'know, where people's lives and livelihoods depend on access to information or on being able to do things at an exact time. AWS and Cloudflare are disqualifying themselves from hospitals and the military and whatnot.

    • For example, Cloudflare employees make money on promises to mitigate such attacks, but then can’t guarantee they will, and take all their customers down at once. It’s a shared pain model.

Because DDoS is a fact of life (and even if you aren't targeted by DDoS, the bot traffic probing you to see if you can be made part of the botnet is enough to take down a cheap $5 VPS). So we have to ask - why? Personally, I don't accept the hand-wavy explanation that botnets are "just a bunch of hacked IoT devices". No, your smart lightbulb isn't taking down Reddit. I slightly believe the secondary explanation that it's a bunch of hacked home routers. We know that home routers are full of things like suspicious oopsie definitely-not-government backdoors.

Because it's better to have a really convenient and cheap service that works 99% of the time than a resilient one that is more expensive or more cumbersome to use.

It's like github vs whatever else you can do with git that is truly decentralized. The centralization has such massive benefits that I'm very happy to pay the price of "when it's down I can't work".

Most developers don't care to know how the underlying infrastructure works (or why) and so they take whatever the public consensus is re: infra as a statement of fact (for the better part of the last 15 years or so that was "just use the cloud"). A shocking amount of technical decisions are socially, not technically enforced.

Because bots are a real thing.

And it’s hard to protect against DDoS without something like Cloudflare.

Look at the posts here.

Even the meager HN "hug of death" will take things down.

A lot of products use AWS because "we could build redundancy and multi-region if we need it" and then never build it.

  • I think some of the issues in the last outage actually affected multiple regions. IIRC internally some critical infrastructure for AWS depends on us-east-1 or at least it failed in a way that didn't allow failover.

I would be less worried if Cloudflare and AWS weren't involved in many more things than simply running DNS.

AWS - someone touches DynamoDB and it kills the DNS.

Cloudflare - someone touches functionality completely unrelated to DNS hosting and proxying and, naturally, it kills the DNS.

There is this critical infrastructure that just becomes one small part of a wider product offering, worked on by many hands, and this critical infrastructure gets taken down by what is essentially a side-effect.

It's a strong argument to move to providers that just do one thing and do it well.

The same reason we have centralization across the economy. Economies of scale are how you make a big business successful, and once you are on top it's hard to dislodge you.

This topic is raised every time there is an outage at Cloudflare, and the truth of the matter is that they offer an incredible service and there is no big enough competitor to challenge them. By definition their services are so good BECAUSE their adoption rate is so high.

It's very frustrating of course, and it's the nature of the beast.

IMO, centralization is inevitable because the fundamental forces drive things in that direction. Clouds are useful for a variety of reasons (technical, time to market, economic), so developers want to use them. But clouds are expensive to build and operate, so there are only a few organizations with the budget and competency to do it well. So, as the market matures you end up with 3 to 5 major cloud operators per region, with another handful of smaller specialists. And that’s just the way it works. Fighting against that is to completely swim upstream with every market force in opposition.

This was always the case. There was always a "us-east" in some capacity, under Equinix, etc. Except it used to be the only "zone," which is why the internet is still so brittle despite having multiple zones. People need to build out support for different zones. Old habits die hard, I guess.

Compliance. If you wanna sell your SaaS to a big corpo, their compliance teams will feel you know what you're doing if they read AWS or Cloudflare on your architecture, even if you do not quite know what you're doing.

Well the centralisation without rapid recovery and practices that provide substantial resiliency… that would be worrying.

But I dare say the folks at these organisations take these matters incredibly seriously and the centralisation problem is largely one of risk efficiency.

I think there is no excuse, however, not to have multi-region state and pilot-light architectures, just in case.

> How did we get to a place where either Cloudflare or AWS having an outage means a large part of the web going down?

As always, in the name of "security". When are we going to learn that anything done, either by the government or by a corporation, in the name of security is always bad for the average person?

Because they are great services, are generally pretty easy to get started with, and usually work as expected, which has led to broad adoption.

Currently at the public library and I can't use the customer inventory terminals to search for books. They're just a web browser interface to the public facing website, and it's hosted behind CF. Bananas.

Except businesses love it.

A lot (and I mean a lot) of people in IT like centralization specifically because it’s hard to blame people for doing something that everyone else is doing.

  • And HN users love it too. I've had people on this site say how great it is that their system routes 30% of traffic on the internet.

    I'd be horrified. That's not the internet or computing industries I grew up with, or started working in.

    But as long as the SPY keeps hitting > 10% returns each year, everyone's happy.

  • "No one gets fired for buying IBM!"

    • "No one gets fired for buying Microsoft" "No one gets fired for buying AWS" "No one gets fired for buying Cloudflare"

      Perhaps the most graceful death of a tech company is that sentiment? Before some perception shift?

We take the idea of the internet always being on for granted. Most people don’t understand the stack and assume that when sites go down it’s isolated, and although I agree with you, it’s just as much complacency and lack of oversight and enforcement delays in bureaucracy as it is centralization. But I guess that’s kind of the umbrella to those things… lol

Don't forget the CrowdStrike outage: one company had a bug that brought down almost everything. Who would have thought there were so many single points of failure across the entire Internet?

It's because single points of traffic concentration are the most surveillable architecture, so FVEY et al economically reward with one hand those companies who would build the architecture they want to surveil with the other hand.

Don't think there is anything wrong with a centralised service being down; you just make a conscious decision about whether you want that and can afford it.

People not being ready for Cloudflare/[insert hyperscaler] to possibly be down is the only fault.

The technical term for it is a man-in-the-middle. It's better to call it what it is; that way you aren't fooled into thinking it's not, because it is.

And all of these outages are happening not long after most of these companies dismissed a large number of experienced staff while moving jobs offshore to save on labor costs.

People use Cloudflare because it's a "free" way for most sites to not get exploited (WAF) or DDoSed (CDN/proxy) regularly. A DDoS can cost quite a bit more than a day of downtime; even just a thundering herd of legitimate users can explode an egress bill.

It sucks that there's not more competition in this space, but Cloudflare isn't widely used for no reason.

AWS also solves real problems people have. Maintaining infrastructure is expensive as is hardware service and maintenance. Redundancy is even harder and more expensive. You can run a fairly inexpensive and performant system on AWS for years for the cost of a single co-located server.

because efficiency trumps redundancy in the short term, which is all that matters in a super competitive environment.

When there is an accident on the interstate we should blame the centralization of traffic and advocate for no more highways.

Very worrying indeed.

Is avoiding single point of failure in anyone’s playbook? ¯\_(ツ)_/¯

  • We only care about it when it's time to complain about the work of individual people.

    Companies can always do as they please and people will rationalize anything.

Re: Cloudflare it is because developers actively pushed "just use Cloudflare" again and again and again.

It has been dead to me since the SSL cache vulnerability thing and the arrogance with which senior people expected others to solve their problems.

But consider how many people still do stupid things like use the default CDN offered by some third party library, or use google fonts directly; people are lazy and don't care.

It's not really. People are just very bad at putting the things around them into perspective.

Your power is provided by a power utility company. They usually serve an entire state, if not more than one (there are smaller ones too). That's "centralization" in that it's one company, and if they "go down", so do a lot of businesses. But actually it's not "centralized", in that 1) there are actually many different companies across the country/world, and 2) each company "decentralizes" most of its infrastructure to prevent massive outages.

And yes, power utilities have outages. But usually they are limited in scope and short-lived. They're so limited that most people don't notice when they happen, unless it's a giant weather system. Then if it's a (rare) large enough impact, people will say "we need to reform the power grid!". But later when they've calmed down, they realize that would be difficult to do without making things worse, and this event isn't common.

Large internet service providers like AWS, Cloudflare, etc, are basically internet utilities. Yes they are large, like power utilities. Yes they have outages, like power utilities. But the fact that a lot of the country uses them, isn't any worse than a lot of the country using a particular power company. And unlike the power companies, we're not really that dependent on internet service providers. You can't really change your power company; you can change an internet service provider.

Power didn't used to be as reliable as it is. Everything we have is incredibly new and modern. And as time has passed, we have learned how to deal with failures. Safety and reliability has increased throughout critical industries as we have learned to adapt to failures. But that doesn't mean there won't be failures, or that we can avoid them all.

We also have the freedom to architect our technology to work around outages. All the outages you have heard about recently could be worked around, if the people who built on them had tried:

- CDN goes down? Most people don't absolutely need a CDN. Point your DNS at your origins until the CDN comes back; a sketch of this follows the list. (And obviously, your DNS provider shouldn't be the same as your CDN...)

- The control plane goes down on dynamic cloud APIs? Enable a "limp mode" that persists existing infrastructure to serve your core needs. You should be able to service most (if not all) of your business needs without constantly calling a control plane.

- An AZ or region goes down? Use your disaster recovery plan: deploy infrastructure-as-code into another region or AZ. Destroy it when the az/region comes back.
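A hedged sketch of that first workaround: flipping an A record from the CDN back to the origin through a generic DNS provider's REST API. The endpoint, token, and record shape here are hypothetical placeholders, not any real provider's API:

    # Hypothetical DNS failover: repoint www from the CDN to the origin.
    # Adapt the endpoint and payload to your actual DNS provider's API
    # (which, per the above, should not be your CDN).
    import requests

    API = "https://dns.example-provider.com/v1"  # hypothetical endpoint
    HEADERS = {"Authorization": "Bearer <token>"}

    ORIGIN_IP = "203.0.113.10"  # your origin server (TEST-NET example address)

    def point_at_origin(zone_id: str, record_id: str) -> None:
        """Swap the record's target to the origin and drop the TTL."""
        resp = requests.patch(
            f"{API}/zones/{zone_id}/records/{record_id}",
            headers=HEADERS,
            json={"type": "A", "content": ORIGIN_IP, "ttl": 60},
        )
        resp.raise_for_status()

    point_at_origin("zone-123", "rec-456")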

...and all of that just to avoid a few hours of downtime per year? It's likely cheaper to just take the downtime. But that doesn't stop people from piling on when things go wrong, questioning whether the existence of a utility is a good idea.

Because "Cloudflare protection" blah blah, until Cloudflare is down itself, and then you are back to "who watches the watchmen".

Hacking software or hardware is so old school.

The target these days is the user.

The make-believe worm.

5 mins. of thought to figure out why these services exist?

Dialogue about mitigations/solutions? Alternative services? High availability strategies?

Nah! It's free to complain.

Me personally, I'd say those companies do a phenomenal job by being a de facto backbone of the modern web. Also Cloudflare, in particular, gives me a lot of things for free.