Comment by krick
5 days ago
It would be a good thing, if it would cause anything to change. It obviously won't. As if a single person reading this post wasn't aware that the Internet is centralized, and couldn't specifically name a few sources of centralization (Cloudflare, AWS, Gmail, GitHub). As if this is the first time it's happened. As if, after the last time AWS failed (or the one before that, or the one before…), anybody stopped using AWS. As if anybody could viably stop using them.
If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you. If you host on AWS and “the internet goes down”, then customers treat it akin to an act of God, like a natural disaster that affects everyone.
It’s not great being down for hours, but that will happen regardless. Most companies prefer the option that helps them avoid the ire of their customers.
Where it’s a bigger problem is when an entire critical industry, like retail banking in a country, all chooses AWS. When AWS goes down, all citizens lose access to their money. They can’t pay for groceries or transport. They’re stranded and starving; life grinds to a halt. But even then, this is not the bank’s problem, because they’re not doing worse than their competitors. It’s something for the banking regulator and government to worry about. I’m not saying the bank shouldn’t worry about it; I’m saying that in practice they don’t worry about it unless the regulator makes them worry.
I completely empathise with people frustrated with this status quo. It’s not great that we’ve normalised a few large outages a year. But for most companies, this is the rational thing to do. And barring a few critical industries like banking, it’s also rational for governments to not intervene.
> If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you.
Not just customers. Your management take the same view. Using hyperscalers is great CYA. The same for any replacement of internally provided services with external ones from big names.
Exactly. No one got fired for using AWS. Advocating for self-hosting or a smaller provider means you get blamed when the inevitable downtime comes around.
I think this really depends on your industry.
If you cannot give a patient life saving dialysis because you don't have a backup generator then you are likely facing some liability. If you cannot give a patient life saving dialysis because your scheduling software is down because of a major outage at a third party and you have no local redundancy then you are in a similar situation. Obviously this depends on your jurisdiction and probably we are in different ones, but I feel confident that you want to live in a district where a hospital is reasonably responsible for such foreseeable disasters.
Yeah, I mentioned banking because it’s what I was familiar with, but the medical industry is going to be similar.
But they do differ - it’s never ok for a hospital to be unable to dispense care. But it is somewhat ok for one bank to be down. We just assume that people have at least two bank accounts. The problem the banking regulator faces is that when AWS goes down, all banks go down simultaneously. Not terrible for any individual bank, but catastrophic for the country.
And now you see what a juicy target an AWS DC is for an adversary. They go down on their own now, but surely Russia or others are looking at this and thinking “damn, one missile at the right data center and life in this country grinds to a halt”.
>If anything, centralisation shields companies using a hyperscaler from criticism. You’ll see downtime no matter where you host. If you self host and go down for a few hours, customers blame you.
What if you host on AWS and only you go down? How does hosting on AWS shield you from criticism?
This discussion is assuming that the outage is entirely out of your control because the underlying datacenter you relied on went down.
Outages because of bad code do happen and the criticism is fully on the company. They can be mitigated by better testing and quick rollbacks, which is good. But outages at the datacenter level - nothing you can do about that. You just wait until the datacenter is fixed.
This discussion started because companies are actually fine with this state of affairs. They are risking major outages but so are all their competitors so it’s fine actually. The juice isn’t worth the squeeze to them, unless an external entity like the banking regulator makes them care.
I’m pretty cloudflare centric. I didn’t start that way. I had services spread out for redundancy. It was a huge pain. Then bots got even more aggressive than usual. I asked why I kept doing this to myself and finally decided my time was worth recapturing.
Did everything become inaccessible the last outage? Yep. Weighed against the time it saves me throughout the year I call it a wash. No plans to move.
I'm of a similar mindset... yeah, it's inconvenient when "everything" goes down... but realistically so many things go down now and then, it just happens.
Could just as easily be my home's internet connection, or a service I need from/at work, etc. It's always going to be something, it's just more noticeable when it affects so many other things.
To be honest, it's MUCH easier to have one source to blame when things go down. If a small-to-medium vendor's website goes down on a normal day, some poor IT guy is going to be fielding calls all day.
If that same vendor goes down because Cloudflare went down, oh well. Most customers already know and won't bother to ask when your site will be back up.
> It would be a good thing, if it would cause anything to change. It obviously won't.
I agree wholeheartedly. The only change is internal to these organizations (e.g. Cloudflare, AWS). Improvements will be made to the relevant systems, and some teams internally will also audit for similar behavior, add tests, and fix some bugs.
However, nothing external will change. The cycle of pretending you are going to implement multi-region fades after a week, and each company goes on leveraging all these services to the Nth degree, waiting for the next outage.
Not advocating that organizations should/could do much, it's all pros/cons. But the collective blast radius is still impressive.
the root cause is customers refusing to punish these downtimes.
Check out how hard customers punish blackouts from the grid - both via wallet, but also via voting/gov't. It's why grids are now more reliable.
So unless the backbone infrastructure gets the same flak, nothing is going to change. After all, any change is expensive, and the cost of that change needs to be worth it.
Is a little downtime such a bad thing? Trying to avoid some bumps and bruises in your business has diminishing returns.
> the root cause is customers refusing to punish these downtimes.
ok how do I punish cloudflare -- build my own globally-distributed content-delivery network just for myself so that I can be "decentralized"?
Or should I go to one of their even-larger competitors like AWS or GCP?
What exactly do you propose?
Grid reliability depends on where you live. In some places, UPS or even a generator is a must have. So it's a bad example, I would say.
> Check out how hard customers punish blackouts from the grid - both via wallet, but also via voting/gov't.
What? Since when has anyone ever been free to just up and stop paying for power from the grid? Are you going to pay $10,000 - $100,000 to have another power company install lines? Do you even have another power company in the area? State? Country? Do you even have permission for that to happen near your building? Any building?
The same is true for internet service, although personally I'd gladly pay $10,000 - $100,000 to have literally anything else at my location, but there are no proper other wired providers and I'll die before I ever install any sort of cellular router. Also this is a rented apartment so I'm fucked even if there were competition, although I plan to buy a house in a year or two.
Downtimes happen one way or another. The upside of using Cloudflare is that bringing things back online is their problem and not mine like when I self-host. :]
Their infrastructure went down for a pretty good reason (let the one who has never caused that kind of error cast the first stone) and was brought back within a reasonable time.
And even in multi-region, you experience a DNS failure and it all goes up in flames anyway. There's always going to be something.
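That intuition is easy to put numbers on. With regions that fail independently, adding regions drives compound availability up fast, but a single shared dependency (DNS, an auth provider, a CDN) caps the whole system at that dependency's own availability. A rough sketch; the availability figures below are illustrative assumptions, not measured values for any real provider:

```python
def combined_availability(region_avail: float, n_regions: int,
                          shared_avail: float) -> float:
    """Availability of N independent regions behind one shared dependency."""
    # At least one region must be up: 1 - P(all regions down simultaneously)
    at_least_one_region = 1 - (1 - region_avail) ** n_regions
    # ...and the shared dependency (e.g. DNS) must also be up.
    return at_least_one_region * shared_avail

# Assumed figures: each region 99.9% available, shared DNS 99.95%.
single = combined_availability(0.999, 1, 0.9995)
multi = combined_availability(0.999, 3, 0.9995)
print(f"1 region:  {single:.5f}")
print(f"3 regions: {multi:.5f}")
```

Three regions nearly eliminate the regional term, but the result still sits just below the shared dependency's 0.9995: that's the "DNS failure and it all goes up in flames" ceiling.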
It’s just a function of costs vs benefits. For most people, building redundancy at this layer costs far too much than the benefits.
If Cloudflare or AWS go down, the outage is usually so big that smaller players have an excuse and people accept that.
It’s as simple as that.
“Why isn’t your site working?” “Half the internet is down, here read this news article: …” “Oh, okay, let me know when it’s back!”
Same idea with the Crowdstrike bug: it seems like it didn't have much of an effect on their customers, certainly not with my company at least, and the stock quickly recovered, in fact doing very well. For me, it looks like nothing changed, no lessons learned.
What do you mean, no lesson learned? Seems like you haven't been paying attention... there's always a lesson learned.
I believe they mean that Crowdstrike learned that they could screw up on this level and keep their customers....
With the rise in unfriendly bots on the internet as well as DDoS botnets reaching 15 Tbps, I don’t think many people have much of a choice.
The cynic in me wonders how much blame the world's leading vendor of DDoS prevention might share in the creation of that particular problem.
They provide free services to DDoS-for-hire services and do not terminate the services when reported.
> As if anybody could viably stop using them.
To be fair, AWS (and GCP and Azure) at least are easy to replace with something else. And pretty much all alternatives are cheaper, less messy, etc. There are very few situations where you cannot viably do so.
We live in a world where you can get things like dedicated servers, etc. within similar time spans as creating a "compute engine" node on a big cloud provider.
The fact that cloud services imposed serious limitations on what applications could do (pushing things like state management and configuration into more unified patterns) means that running your own infrastructure is easier than ever: your devs won't end up whining at you unless you do something super custom just to make some project a bit easier. But if you really want to, you can.
GitHub also has become easy to get away from and indeed many individuals and companies did so.
CDNs are the bigger thing, but A) there are a lot of other CDNs, and B) having an image, or let's say an Ansible config, lets you quickly deploy something that might be close enough for your use case. Just pick any hosting company, or even a dozen around the world.
Of course if you allowed yourself to end up in a complete vendor lock in things might be different, but if you think that it's a good idea to be completely dependent on the whims of some other company maybe you deserve that state. As in don't run a business without having any kind of fallback for decisions you make. Yes, profit from that big benefit something might give you, but don't lock the door behind you.
Sure you might be lucky and sure maybe you are fine going for luck while it lasts. Just don't be surprised when it all shatters.
> As if anybody could viably stop using them.
It is as easy to not use them as it ever was. There has been no actual centralisation. Everything is done using open protocols. I don't know what more you could want.
Compare it to Windows where there is deep volume discounting and salespeople shmoozing CTOs and getting in with schools, healthcare providers etc etc. That's actual lock-in.
When the tubes go down the tubes it will be the fault of those who are complacent.
This is just silliness.
It’s too few and far between. It would force some changes if it were a monthly event. If businesses started losing connectivity for 8 hours every month, maybe the bigger ones would move to self-hosting, or at least some capacity of self-hosting.
Yeah, agree. But even in the case of an 8-hour monthly downtime (that's still almost a 99% SLA), building redundancy isn't beneficial for really small firms.
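For reference, the arithmetic behind that parenthetical, assuming a single 8-hour outage in a 30-day month:

```python
# One 8-hour outage in a 30-day month still leaves uptime near 99%.
hours_in_month = 30 * 24  # 720 hours
downtime_hours = 8
uptime = 1 - downtime_hours / hours_in_month
print(f"Monthly uptime: {uptime:.2%}")  # ~98.89%
```

So "everything down for a working day, once a month" is roughly the reliability a two-nines SLA already permits.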
> It obviously won't.
Here's where we separate the men from the boys, the women from the girls, the enbys from the enbetts, and the SREs from the DevOps. If you went down when Cloudflare went down, do you go multicloud so that can't happen again, or do you shrug your shoulders and say "well, everyone else is down"? Have some pride in your work, do better, be better, and strive for greatness. Have backup plans for your backup plans, and get out of the pit of mediocrity.
Or not, shit's expensive and kubernetes is too complicated and "no one" needs that.
You make the appropriate cost/benefit decision for your business and ignore apathy on one side and dogma on the other.
> As if anybody could viably stop using them.
You can, and even save money.
Same with the big Crowdstrike fail of 2024. Especially when everyone kept repeating the laughable statement that these guys have their shit in order, so it couldn't possibly be a simple fuckup on their end. Guess what, they don't, and it was. And nobody has realized the importance of diversity for resilience, so all the major stuff is still running on Windows and using Crowdstrike.
I wrote https://johannes.truschnigg.info/writing/2024-07-impending_g... in response to the CrowdStrike fallout, and was tempted to repost it for the recent CloudFlare whoopsie. It's just too bad that publishing rants won't change the darned status quo! :')
People will not do anything until something really disastrous happens. Even afterwards, memories can fade. Crowdstrike has not lost many customers.
Covid is a good parallel. A pandemic was always possible; there was always a reasonable chance of one over the course of decades. However, people did not take it seriously until it actually happened.
A lot of Asian countries are much better prepared for a tsunami than they were before 2004.
The UK was supposed to have emergency plans for a pandemic, but they were for a flu variant, and I suspect even those plans were under-resourced and not fit for purpose. We are supposed to have plans for a solar storm, but when another Carrington event occurs, I very much doubt we will deal with it smoothly.