Comment by rvnx
2 days ago
Absolutely possible. Though there is something curious:
https://www.cloudflarestatus.com/
At Cloudflare it started with: "Investigating - Cloudflare engineering is investigating an issue causing Access authentication to fail.".
So this would somewhat validate the theory that auth/quotas started failing right after Google, but what happened after that?! Pure snowballing? That sounds a bit crazy.
From the Cloudflare incident:
> Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable [...]
Surprising, but not entirely implausible for a GCP outage to spread to CF.
> outage of a 3rd party service that is a key dependency.
Good to know that Cloudflare has services seemingly based on GCP with no redundancy.
Probably unintentional. "We just read this config from this URL at startup" can easily snowball into "if that URL is unavailable, this service will go down globally, and all running instances will fail to restart when the devops team tries to do a pre-emptive rollback".
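A minimal sketch of that antipattern, with hypothetical names and URLs (no relation to Cloudflare's actual code): the service treats a remote config fetch as a hard startup dependency, with no local cache and no baked-in default, so every restart attempted during the upstream outage fails.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"time"
)

// configURL is a hypothetical remote config endpoint, standing in for
// "some object store or KV namespace hosted on a third-party cloud".
const configURL = "https://config.example.com/service-config.json"

// fetchConfig shows the risky pattern: a hard dependency on a remote URL
// at process startup, with no local cache and no baked-in default.
func fetchConfig() ([]byte, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(configURL)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	return io.ReadAll(resp.Body)
}

func main() {
	cfg, err := fetchConfig()
	if err != nil {
		// Every instance that restarts while the upstream is down dies here,
		// so a "pre-emptive rollback" just finishes taking the fleet down.
		log.Fatalf("cannot start without remote config: %v", err)
	}
	log.Printf("loaded %d bytes of config, starting service", len(cfg))
	// ... the actual service would run here ...
}
```

The usual mitigation is to fall back to the last successfully fetched copy cached on local disk, or to a conservative built-in default, so a config-store outage degrades behaviour instead of blocking startup.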
After reading about Cloudflare's infra in post-mortems, it has always been surprising how immature their stack is. Like, they used to run their entire global control plane in a single failure domain.
I'm not sure who is running the show there, but the whole thing seems kind of shoddy given Cloudflare's position as the backbone of a large portion of the internet.
I personally work at a place with a smaller market cap than Cloudflare, and we were hit by the exact same kind of incident (a datacenter power outage) with almost no downtime, whereas the entire Cloudflare API was down for nearly a day.
What's the alternative here? Do you want them to replicate their infrastructure across different cloud providers with automatic fail-over? That sounds -- heck -- I don't know if modern devops is really up to that. It would probably cause more problems than it would solve...
Redundancy ≠ immune to failure.
Google is an advertising company not a tech company. Do not rely on them performing anything critical that doesn't depend on ad revenue.
Content Delivery Thread
Doesn't Cloudflare have its own infrastructure? It's wild to me that both of these things are down, presumably together, with this size of a blast radius.
Cloudflare isn't a cloud in the traditional sense; it's a CDN with extra smarts in the CDN nodes. CF's comparative advantage is in doing clever things with just-big-enough shared-nothing clusters deployed at every edge POP imaginable; not in building f-off huge clusters out in the middle of nowhere that can host half the Internet, including all their own services.
As such, I wouldn't be overly surprised if all of CF's non-edge compute (including, for example, their control plane) is just tossed onto a "competitor" cloud like GCP. To CF, that infra is neither a revenue center, nor a huge cost center worth OpEx-optimizing through vertical integration.
But then you do expose yourself to huge issues like this if your control plane depends on a single cloud provider, especially for a company that wants to be THE reverse proxy and CDN for the internet, no?
They're pushing Workers more as a compute platform.
Plus their past outage reports indicate they should be running their own DC: https://blog.cloudflare.com/major-data-center-power-failure-...
The latest Cloudflare status update basically confirms that there is a dependency on GCP in their systems:
"Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable"
They lightly mentioned it in this interview a few weeks ago as well - I was surprised! https://youtu.be/C5-741uQPVU?t=1726s
Yeah, I saw that now too. Interesting; I'm definitely a little surprised that they have this big of an external dependency surface.
You'd think so wouldn't you?
Down Detector also reports Azure and Oracle Cloud; I can't see them also being dependent on GCP...
I guess Down Detector isn't a full source of truth though.
https://ocistatus.oraclecloud.com/#/ https://azure.status.microsoft/en-gb/status
Both green
Down Detector has a problem when whole clouds go down: unexpected dependencies. You see an app on a non-problematic cloud having trouble and report it to Down Detector, but that cloud is actually fine and its own stuff is running normally. What is really happening is that the app you are using depends on a different SaaS provider who runs on the problematic cloud, and that is what is killing it.
It's often things like "we applied backpressure like we're supposed to, so we gave the end user an error because the processing queue had built up above threshold, but the queue only built up because waiting for the timeout from SaaS X slowed processing down that much." (I have the scars from this more than once.)
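A toy sketch of that failure mode, with made-up names, queue sizes, and timeouts (no relation to any real system): the bounded queue and the rejection are the backpressure doing exactly what it's configured to do; the real cause is that each job now waits out a long timeout against the external SaaS, so the drain rate collapses and the threshold is hit.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

var errOverloaded = errors.New("processing queue above threshold, rejecting request")

// callSaaS stands in for the external dependency: normally fast, but during
// the upstream outage every call hangs until the 10s timeout before failing.
func callSaaS(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	<-ctx.Done() // pretend the request never answers and we wait out the deadline
	return ctx.Err()
}

func main() {
	// Bounded queue: the backpressure mechanism working "like it's supposed to".
	queue := make(chan string, 100)

	// One worker whose throughput collapses to ~0.1 jobs/s once every job
	// has to wait out the full timeout.
	go func() {
		for job := range queue {
			if err := callSaaS(context.Background()); err != nil {
				log.Printf("job %s failed: %v", job, err)
			}
		}
	}()

	// Ingest path: once the queue is full, new requests are rejected and the
	// end user sees an error, even though "our" infrastructure is healthy.
	enqueue := func(job string) error {
		select {
		case queue <- job:
			return nil
		default:
			return errOverloaded
		}
	}

	// Requests keep arriving at the normal rate; after ~100 of them the
	// queue is saturated and everything else bounces.
	for i := 0; i < 200; i++ {
		if err := enqueue(fmt.Sprintf("req-%d", i)); err != nil {
			log.Printf("user-facing error for req-%d: %v", i, err)
		}
	}
}
```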
Down Detector can have a poor signal-to-noise ratio, given that (I'm assuming) it's just users submitting "this is broken" for any particular app. Probably compounded by many people hearing of a GCP issue, checking their own cloud service, and reporting a problem at the same time.
Using Azure here, no issues reported so far.