Comment by itzjacki

7 hours ago

A colleague of mine just came bursting through my office door in a panic, thinking he'd brought our site down, since this happened just as he made some changes to our Cloudflare config. He was pretty relieved to see this post.

Tell him it's worse than he thinks. He obviously brought the entire Cloudflare system down.

  • You joke and I think it's funny, but as a junior engineer I would be quite proud if some small change I made was able to take down the mighty Cloudflare.

    • If I were Cloudflare, it would mean an immediate job offer well above market. That junior engineer is either a genius, so lucky that they must have been bred by Pierson’s Puppeteers, or such a perfect manifestation of a human fuzzer that their skills must be utilized.

      3 replies →

    • I kind of did that back when they released Workers KV: I tried to bulk upload a lot of data and it brought the whole service down. Can confirm I was proud :D

    • It's also not exactly an uncommon way for this sort of huge multi-tenant service to go down. It's only as rare as it is because more or less all of them have had such outages in the past and built generic defenses: automated testing of customer changes, gradual rollout, automatic rollback (there are others, but those are the ones that don't require further explanation; a rough sketch of the rollout/rollback pattern follows this sub-thread).

    • Well, it's easy to cause damage by messing up the `rm` command, especially with the `-fr` options. So don't take it as a proxy for some great skill being required to cause damage.

      1 reply →
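
A minimal sketch of the gradual rollout plus automatic rollback pattern mentioned above. All of the names here (ServerPool, push_config, error_rate) are hypothetical, invented only to illustrate the pattern; this is not Cloudflare's actual tooling:

```python
# Hypothetical sketch: push a customer config pool by pool, canary first,
# and automatically roll back everything if the error rate regresses.
import random
import time
from dataclasses import dataclass


@dataclass
class ServerPool:
    name: str
    size: int

    def apply(self, config: dict) -> None:
        # Stand-in for actually pushing the config to every server in the pool.
        print(f"applying config v{config['version']} to {self.name} ({self.size} servers)")

    def error_rate(self) -> float:
        # Stand-in for sampling the pool's error rate after the push.
        return random.uniform(0.0, 0.02)


def push_config(config: dict, pools: list[ServerPool],
                baseline: float = 0.01, soak_seconds: float = 0.1) -> bool:
    """Roll out pool by pool; revert every pool touched so far on regression."""
    applied: list[ServerPool] = []
    for pool in pools:
        pool.apply(config)
        applied.append(pool)
        time.sleep(soak_seconds)           # let metrics accumulate
        if pool.error_rate() > baseline:   # automatic rollback trigger
            print(f"error rate regressed in {pool.name}, rolling back")
            for p in reversed(applied):
                p.apply({"version": config["version"] - 1})
            return False
    return True


if __name__ == "__main__":
    pools = [ServerPool("canary", 10), ServerPool("region-1", 500), ServerPool("global", 10000)]
    push_config({"version": 42}, pools)
```

The point is just that a bad customer config hits a small pool first and gets reverted automatically, instead of reaching every edge node at once.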

Well, you can never be sure that he didn't:

https://www.fastly.com/blog/summary-of-june-8-outage

  • It's also what caused the Azure Front Door global outage two weeks ago - https://aka.ms/air/YKYN-BWZ (a toy sketch of this failure shape follows the quote):

    "A specific sequence of customer configuration changes, performed across two different control plane build versions, resulted in incompatible customer configuration metadata being generated. These customer configuration changes themselves were valid and non-malicious – however they produced metadata that, when deployed to edge site servers, exposed a latent bug in the data plane. This incompatibility triggered a crash during asynchronous processing within the data plane service. This defect escaped detection due to a gap in our pre-production validation, since not all features are validated across different control plane build versions."

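A toy sketch of the failure described in that quote. The function names and metadata shapes below are entirely made up for illustration; the real control plane builds and data plane are of course far more involved:

```python
# Toy illustration: valid customer changes serialized by two different
# control plane builds produce metadata of two shapes, and a data plane
# consumer with a latent assumption crashes on one of them.


def control_plane_v1(change: dict) -> dict:
    # Older build: emits flat metadata without a "routes" wrapper.
    return {"customer": change["customer"], "origin": change["origin"]}


def control_plane_v2(change: dict) -> dict:
    # Newer build: nests routing info under a new "routes" key.
    return {"customer": change["customer"], "routes": [{"origin": change["origin"]}]}


def data_plane_apply(metadata: dict) -> None:
    # Latent bug: assumes every record has "routes"; v1-shaped records are
    # perfectly valid customer config but crash this code path.
    for route in metadata["routes"]:
        print(f"programming edge route to {route['origin']}")


if __name__ == "__main__":
    records = [
        control_plane_v2({"customer": "a", "origin": "a.example"}),
        control_plane_v1({"customer": "b", "origin": "b.example"}),
    ]
    for record in records:
        try:
            data_plane_apply(record)
        except KeyError as exc:
            # The "crash during asynchronous processing" from the postmortem.
            print(f"data plane crashed on missing key: {exc}")
```
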
  • Oh don't you worry. We are very much talking about the global outage as if he was the root cause. Like good colleagues :)

  • > May 12, we began a software deployment that introduced a bug that could be triggered by a specific customer configuration under specific circumstances.

    I'd love to know more about what those specific circumstances were!

  • I'm pretty sure I crashed Gmail using something weird in its filters. It was a few years ago. Every time I did something specific (I don't remember what), it would freeze and then display a 502 error for a while.

  • Damn, imagine being the customer responsible for that, oof

    • What do you imagine would be the result if you brought down Cloudflare with a legitimate config update (i.e. not specifically crafted to trigger known bugs) while not even working for them? If I were the customer "responsible" for this outage, I'd just be annoyed that their software is apparently so fragile.

    • I would be fine if it was my "fault", but I'm sure people in business would find a way to make me suffer.

      But on a personal level, this is like ordering something at a restaurant and the cook burning down the kitchen because they forgot to take your pizza out of the oven or something.

      I would be telling it to everyone over beers (but not my boss).

      1 reply →

Is there a word for that feeling of relief when someone else fucked up after initially thinking it was you?

  • What’s funny is that as I get older, this feeling of relief turns into more of a feeling of dread. The nice thing about problems that you cause is that you have considerable autonomy to fix them. When Cloudflare goes down, you’re sitting and waiting for a third party to fix something.

    • When others cause problems, you can put your feet up and surf the web while waiting for resolution. Oh, wait.

  • The problem is, I still get the wrong end of the stick when AWS or CF go down! Management doesn't care, understandably. They just want the money to keep coming in. It's hard to convince them that this is a pretty big problem. The only thing that will calm them down a bit is to tell them Twitter is also down. If that doesn't get them, I say ChatGPT is also down. Now NOBODY will get any work done! lol.

    • This is why you ALWAYS have a proposal ready. I literally had my ass saved by having tickets with the reliability/redundancy work clearly laid out, complete with comments from out-of-touch product/people managers deprioritizing it every time we tried to pull it off the backlog (in one infamous case, in favor of a notoriously poorly conceived and expensive failure of a project that haunted us again with lost opportunity cost).

      The hilarious part of the whole story is that the same PMs and product managers were (and I cannot emphasize this enough) absolutely militant, orthodox agile practitioners with Jira.

    • Every time a major cloud goes down, management asks why we don't have a backup service that we can switch to. Then I tell them that a bunch of services worth a lot more than us are also down. Do you really want to spend the insane amount of resources it would take to make sure our service stays up when the global internet is down?

      1 reply →

    • Who decided to go with AWS or CF? If it's a management decision, tell them you need the resources to build a fallback if they want their system to be more reliable than AWS or CF.

    • Haha yeah I just got off the phone and I said, look, either this gets fixed soon or there's going to be news headlines with photographs of giant queues of people milling around in airports.

  • When I'm debugging something, I'm not usually looking for the solution to the problem; I'm looking for sufficient evidence that I didn't cause the problem. Once I have that, the velocity at which I work slows down.

  • Schadenfriend?

    You gain relief, but you don't exactly derive pleasure, since it's someone you know who's getting the ass end of the deal.

  • Maybe this isn’t great, but I get a hint of that feeling when I’m on an airplane and hear a baby crying. For a number of years, if I heard a baby crying, it was probably my baby and I had to deal with it. But now my kids are past that phase, so when I hear the crying, after that initial jolt of panic I realize that it isn’t my problem, and that does give me the warm fuzzies. Even though I do feel bad for the baby and their parents.

    • Related situation: you're at a family gathering and everyone has young kids running around. You hear a thump, and then some kid starts screaming. Conversation stops and every parent keenly listens to the screams to try and figure out whose kid just got hurt, then some other parent jumps up - it's not your kid! #phewphoria

    • You're not alone in this feeling. I occasionally smile when it's not my kid.

I woke up getting bombarded by messages from multiple clients about sites not working. I shat my pants because I'd changed the config just yesterday. When I saw the status message "Cloudflare down" I was so relieved.

Good that he worked it out so quickly. I recently spent a day debugging email problems on Railway PaaS, because they silently closed an SMTP port without telling anyone.

You missed a great opportunity to deadpan him with something like "No, Bob, not just our site, you brought down the entire Internet, look at this post!"

Chances are still good that somewhere within Cloudflare someone really did do a global configuration push that brought down the internet.

When aliens study humans from this period, their book of fairy tales will include several where a terrible evil was triggered by a config push.

Wait for the post-mortem... It is a technical possibility: a race condition propagates one customer's config to all nodes... :-)