Comment by dehrmann

2 months ago

Cloudflare is orders of magnitude larger than any fintech. Rollouts likely take much longer, and having a human monitoring a dashboard doesn't scale.

9 comments

dehrmann

notepad0x90 2 months ago

That means they engineered their systems incorrectly then? Precisely because they are much bigger, they should be more resilient. You know who's bigger than Cloudflare? tier-1 ISPs, if they had an outage the whole internet would know about it, and they do have outages except they don't cascade into a global mess like this.

Just speculating based on my experience: It's more likely than not that they likely refused to invest in fail-safe architectures for cost reasons. Control-plane and data-plane should be separate, a react patch shouldn't affect traffic forwarding.

Forget manual rollbacks, there should be automated reversion to a known working state.

vlovich123 2 months ago
> Control-plane and data-plane should be separate
They are separate.
> a react patch shouldn't affect traffic forwarding.
If you can’t even bother to read the blog post maybe you shouldn’t be so confident in your own analysis of what should and shouldn’t have happened?
This was a configuration change to change the buffered size of a body from 256kb to 1mib.
The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.
- notepad0x90 2 months ago
  
  You really should take some of your pill.
  > Instead, it was triggered by changes being made to our body parsing logic while attempting to detect and mitigate an industry-wide vulnerability disclosed this week in React Server Components.
  > Unfortunately, in our FL1 version of our proxy, under certain circumstances, the second change of turning off our WAF rule testing tool caused an error state that resulted in 500 HTTP error codes to be served from our network.
  The body parsing logic is in react or nextjs, that's my takeaway, is it that incorrect? and the WAF rule testing tool (control plane) was interdependent with the WAF's body parsing logic, is that also incorrect?
  > This was a configuration change to change the buffered size of a body from 256kb to 1mib.
  Yes, and if it was resilient,the body parsing is done on a discrete forwarding plane. Any config changes should be auto-tested for forwarding failures by the separate control plane and auto-revered when there are errors. If the waf rule testing tool was part of that test then it being down shouldn't have affected data-plane because it would be a separate system.
  data/control plane separate means the run time of the two and any dependencies they have are separate. It isn't cheap to do this right, that's why I speculated (I made clear i was speculating) that it was because they wanted to save costs.
  > The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.
  Please tone down the rage a bit and leave room for some discussion. You should take your own pill and be curious about what I meant instead of taking a rage-first approach.
  
  4 replies →

cowsandmilk 2 months ago

> Rollouts likely take much longer

Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.

paradite 2 months ago

The blog post said the rollout of the config change took 1 minute.