Comment by paradite

2 months ago

The deployment pattern from Cloudflare looks insane to me.

I've worked at one of the top fintech firms, whenever we do a config change or deployment, we are supposed to have rollback plan ready and monitor key dashboards for 15-30 minutes.

The dashboards need to be prepared beforehand on systems and key business metrics that would be affected by the deployment and reviewed by teammates.

I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.

For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.

39 comments

paradite

vlovich123 2 months ago

That is also true at Cloudflare for what it’s worth. However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release, especially since there’s a 5 min lag (if I recall correctly) in the monitoring dashboards to get all the telemetry from thousands of servers worldwide.

Comparing the difficulty of running the world’s internet traffic with hundreds of customer products with your fintech experience is like saying “I can lift 10 pounds. I don’t know why these guys are struggling to lift 500 pounds”.

paradite 2 months ago

The fintech company I worked at does handle millions of QPS has has thousands of servers. It is on the same order of magnitude or at least 0.1x scale, not to mention the complexity of business logic involving monetary transactions.
If there’s indeed a 5 min lag in monitoring dashboard in Cloudflare, I honestly think that's a pretty big concern.
For example, a simple curl script on your top 100 customers' homepage that runs every 30 seconds would have given the warning and notifications within a minute. If you stagger deployments at 5 minute intervals, you could have identified the issue and initiated the rollback within 2 minutes and completed it within 3 minutes.
autoexec 2 months ago
> However, the company is so big that there’s so many different products all shipping at the same time it can be hard to correlate it to your release
This kind of thing would be more understandable for a company without hundreds of billions of dollars, and for one that hasn't centralized so much of the internet. If a company has grown too large and complex to be well managed and effective and it's starting to look like a liability for large numbers of people there are obvious solutions for that.
- evanelias 2 months ago
  
  What "hundreds of billions of dollars"? Cloudflare's annual revenue is around $2 billion, and they are not yet profitable.
  
  6 replies →
- pulkitsh1234 2 months ago
  
  Genuinely curious, how to actually implement detection systems for a large scale global infra which that works with < 1 minute SLO ? Given cost is no constraint.
  
  2 replies →
- vlovich123 2 months ago
  
  Can you name a major cloud provider that doesn’t have major outages?
  If this were purely a money problem it would have been solved ages ago. It’s a difficult problem to solve. Also, they’re the youngest of the major cloud providers and have a fraction of the resources that Google, Amazon, and Microsoft have.
  
  2 replies →
theplatman 2 months ago

With all due respect, engineers in finance can’t allow for outages like this because then you are losing massive amounts of money and potentially going out of business.

dehrmann 2 months ago

Cloudflare is orders of magnitude larger than any fintech. Rollouts likely take much longer, and having a human monitoring a dashboard doesn't scale.

notepad0x90 2 months ago
That means they engineered their systems incorrectly then? Precisely because they are much bigger, they should be more resilient. You know who's bigger than Cloudflare? tier-1 ISPs, if they had an outage the whole internet would know about it, and they do have outages except they don't cascade into a global mess like this.
Just speculating based on my experience: It's more likely than not that they likely refused to invest in fail-safe architectures for cost reasons. Control-plane and data-plane should be separate, a react patch shouldn't affect traffic forwarding.
Forget manual rollbacks, there should be automated reversion to a known working state.
- vlovich123 2 months ago
  
  > Control-plane and data-plane should be separate
  They are separate.
  > a react patch shouldn't affect traffic forwarding.
  If you can’t even bother to read the blog post maybe you shouldn’t be so confident in your own analysis of what should and shouldn’t have happened?
  This was a configuration change to change the buffered size of a body from 256kb to 1mib.
  The ability to be so wrong in so few words with such confidence is impressive but you may want to take more of a curiosity first approach rather than reaction first.
  
  5 replies →
cowsandmilk 2 months ago

> Rollouts likely take much longer
Cloudflare’s own post says the configuration change that resulted in the outage rolled out in seconds.
paradite 2 months ago

The blog post said the rollout of the config change took 1 minute.

markus_zhang 2 months ago

My guess is that CF has so many external customers that they need to move fast and try not to break things. My hunch is that their culture always favors moving fast. As long as they are not breaking too many things, customers won't leave them.

paradite 2 months ago

There is nothing wrong with moving fast and deploying fast.
I'm more talking about how slow it was to detect the issue caused by the config change, and perform the rollback of the config change. It took 20 minutes.
linhns 2 months ago

I think everyone favors moving fast. We humans want to see results of our action early.

theideaofcoffee 2 months ago

Same, my time at a F100 ecommerce retailer showed me the same. Every change control board justification needed an explicit back-out/restoration plan with exact steps to be taken, what was being monitored to ensure that was being held to, contacts of prominent groups anticipated to have an effect, emergency numbers/rooms for quick conferences if in fact something did happen.

The process was pretty tight, almost no revenue-affecting outages from what I can remember because it was such a collaborative effort (even though the board presentation seemed a bit spiky and confrontational at the time, everyone was working together).

prdonahue 2 months ago
And you moved at a glacial pace compared to Cloudflare. There are tradeoffs.
- theideaofcoffee 2 months ago
  
  Yes, of course, I want the organization that inserted itself into handling 20% of the world's internet traffic to move fast and break things. Like breaking the internet on a bi-weekly basis. Yep, great tradeoff there.
  Give me a break.
  
  3 replies →
lljk_kennedy 2 months ago

This sounds just as bad as yolo-merges, just on the other end of the spectrum.

nova22033 2 months ago

Speaking of fintech

https://www.henricodolfing.ch/case-study-4-the-440-million-s...

draw_down 2 months ago

[dead]