Comment by whatevertrevor

3 days ago

Doesn't Cloudflare have its own infrastructure? It's wild to me that both of these are down, presumably together, with a blast radius this size.


derefr 3 days ago

Cloudflare isn't a cloud in the traditional sense; it's a CDN with extra smarts in the CDN nodes. CF's comparative advantage is in doing clever things with just-big-enough shared-nothing clusters deployed at every edge POP imaginable; not in building f-off huge clusters out in the middle of nowhere that can host half the Internet, including all their own services.

As such, I wouldn't be overly surprised if all of CF's non-edge compute (including, for example, their control plane) is just tossed onto a "competitor" cloud like GCP. To CF, that infra is neither a revenue center, nor a huge cost center worth OpEx-optimizing through vertical integration.

whatevertrevor 3 days ago
But then you expose yourself to huge issues like this if your control plane depends on a single cloud provider, especially as a company that wants to be THE reverse proxy and CDN for the internet, no?
- snowwrestler 3 days ago
  
  Cloudflare does not actually want to reverse proxy and CDN the whole internet. Their business model is B2B; they make most of their revenue from a set of companies who buy at high price points and represent a tiny percentage of the total sites behind CF.
  Scale is just a way to keep costs low. In addition to economies of scale, routing tons of traffic puts them in position to negotiate no-cost peering agreements with other bandwidth providers. Freemium scale is good marketing too.
  So there is no strategic reason to avoid dependencies on Google or other clouds. If they can save costs that way, they will.
  
- mbreese 3 days ago
  
  True, but how often do outages like this happen? And when outages do happen, does Cloudflare have any more exposure than Google? I mean, if Google can’t handle it, why should Cloudflare be expected to? It also looks like the Cloudflare services have been somewhat restored, so whatever dependency there is looks like it’s able to be somewhat decoupled.
  So long as the outages are rare, I don’t think there is much downside for Cloudflare to be tied to Google cloud. And if they can avoid the cost of a full cloud buildout (with multiple data centers and zones, etc…), even better.
arccy 3 days ago

They're pushing Workers more as a compute platform.
Plus their past outage reports indicate they should be running their own DC: https://blog.cloudflare.com/major-data-center-power-failure-...

smoe 3 days ago

The latest Cloudflare status update basically confirms that there is a dependency on GCP in their systems:

"Cloudflare’s critical Workers KV service went offline due to an outage of a 3rd party service that is a key dependency. As a result, certain Cloudflare products that rely on KV service to store and disseminate information are unavailable"

craigseeman 2 days ago

They lightly mentioned it in this interview a few weeks ago as well - I was surprised! https://youtu.be/C5-741uQPVU?t=1726s
whatevertrevor 3 days ago
Yeah I saw that now too. Interesting, I'm definitely a little surprised that they have this big of an external dependency surface.
- smoe 3 days ago
  
  Definitely very surprised to see that so many of the CF products meant to compete with the big cloud providers have such a dependence on GCP.

cyberpunk 3 days ago

You'd think so wouldn't you?

Down Detector also reports Azure and Oracle Cloud; I can't see them also being dependent on GCP...

I guess down detector isn't a full source of truth though.

https://ocistatus.oraclecloud.com/#/ https://azure.status.microsoft/en-gb/status

Both green

mandevil 3 days ago
Down Detector has a problem when whole clouds go down: unexpected dependencies. You see that an app on a non-problematic cloud is having trouble and report it to Down Detector, but that cloud is actually fine; its own infrastructure is running normally. What is really happening is that the app you are using depends on a different SaaS provider that runs on the problematic cloud, and that is killing it.
It's often things like "we got backpressure like we're supposed to, so we gave the end user an error because the processing queue had built up above threshold, but it was because waiting for the timeout from SaaS X slowed down the processing so much that the queue built up." (Have the scars from this more than once.)
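
To make the failure mode above concrete, here is a minimal sketch of it as a toy simulation, not anyone's real system. Everything in it is an assumption chosen for illustration: the `simulate` function, the one-second tick, the arrival rate, and the queue threshold are all hypothetical. It models a single worker whose per-item cost equals the latency of a downstream call, and a queue that rejects new requests once it reaches a threshold.

```python
from collections import deque


def simulate(dependency_latency_s, ticks=30, arrivals_per_tick=10,
             queue_threshold=50):
    """Toy model: one worker drains a bounded queue, one tick = 1 second.

    All parameters are illustrative assumptions, not measured values.
    Returns (accepted, rejected) request counts.
    """
    queue = deque()
    accepted = rejected = 0
    budget = 0.0  # seconds of worker time available so far

    for _ in range(ticks):
        # Backpressure: reject new requests once the queue hits the threshold.
        for _ in range(arrivals_per_tick):
            if len(queue) >= queue_threshold:
                rejected += 1
            else:
                queue.append(dependency_latency_s)
                accepted += 1

        # The worker gains one second per tick; each queued item costs as
        # long as the downstream call takes (or takes to time out).
        budget += 1.0
        while queue and budget >= queue[0]:
            budget -= queue.popleft()
        if not queue:
            budget = 0.0  # idle time cannot be saved up for later

    return accepted, rejected


if __name__ == "__main__":
    print("healthy dependency (50 ms):", simulate(0.05))
    print("dependency timing out (2 s):", simulate(2.0))
```

With the fast dependency the worker keeps up and nothing is rejected. With the slow one the queue fills within a few ticks and most requests get an error, even though the service itself and the cloud it runs on are healthy, which is the kind of report that ends up on Down Detector against the wrong provider.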
- spwa4 3 days ago
  
  Surely if you build a status detector you realize that colo or dedicated are your only options, no? Obviously you cannot host such a service in the cloud.
  
iFred 3 days ago

Down Detector can have a poor signal-to-noise ratio, given that it relies (I am assuming) on users submitting "this is broken" for any particular app. That is probably compounded by many people hearing of a GCP issue, checking their own cloud service, and reporting the problem at the same time.
basfo 3 days ago

Using Azure here, no issues reported so far.