Comment by JohnMakin

3 months ago

Large scale infrastructure changes are often by nature completely untestable. The system is too large, there are too many moving parts to replicate with any kind of sane testing, so often, you do find out in prod, which is why robust and fast rollback procedures are usually desirable and implemented.

11 comments

lapcat 3 months ago

> Large scale infrastructure changes are often by nature completely untestable.

You're changing the subject here and shifting focus from the specific to the vague. The two postmortems after the recent major Cloudflare outages both listed straightforward errors in source code that could have been tested and detected.

Theoretical outages could theoretically have other causes, but these two specific outages had specific causes that we know.

> which is why robust and fast rollback procedures are usually desirable and implemented.

Yes, nobody is arguing against that. It's a red herring with regard to my point about source code testing.

JohnMakin 3 months ago
I am not changing the subject. These are glue logic scripts connecting massive pieces of infra together, spanning what is likely several teams and orgs over the course of many years. It is impossible to blurt something out like "well, source code testing" for something like this, when the source code inputs cannot possibly be tested outside the scale of the larger system. They're often completely unknowable as well.
With all due respect, it sounds like you have not worked on these types of systems, but out of curiosity - what type of test do you think would have prevented this?
- lapcat 3 months ago
  
  With all due respect, it sounds like you have never heard of unit tests.
  Cloudflare states that the compiler would prevent the bug in certain programming languages. So it seems ridiculous to suggest that the bug can't be detected outside the scale of a larger system.
  
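To make the unit-test point concrete, here is a minimal sketch in Python. It is not a reconstruction of Cloudflare's code, which this thread doesn't show; the function names and the `execute`/`action` fields are hypothetical, chosen only to illustrate an unguarded access to a value that may be missing. The idea is that a test feeding in the edge-case input fails on the unguarded version without needing the larger system.

```python
from typing import Optional


def rule_action_unsafe(rule: dict) -> str:
    # Hypothetical buggy path: assumes every rule carries an "execute" entry.
    return rule["execute"]["action"]


def rule_action_safe(rule: dict) -> Optional[str]:
    # Guarded version: returns None when the optional entry is absent.
    execute = rule.get("execute")
    return None if execute is None else execute.get("action")


def raises_on_missing_field(fn) -> bool:
    # Unit-test-style check: call fn with a rule that lacks "execute"
    # and report whether it raises instead of handling the gap.
    try:
        fn({"id": "r1"})
    except (KeyError, TypeError):
        return True
    return False
```

Calling `raises_on_missing_field(rule_action_unsafe)` flags the bug in isolation, while the guarded version passes. The `Optional[str]` return annotation is the kind of hint a static type checker can use to force callers to handle the missing case, which is roughly the role a compiler plays in stricter languages.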

roguecoder 3 months ago

Akamai manages it.

winddude 3 months ago
They don't. Akamai has had several outages as well, just no one notices. Akamai is way, way smaller than Cloudflare: 20% of internet traffic passes through CF networks, and I'm not sure it's even measurable on Akamai.
- andrewf 3 months ago
  
  Quickly Googling about, a commonly repeated figure is that Akamai served 15% to 30% of Internet traffic in the late 2010s. They probably have less of the market today due to others growing, but they're not a minnow.
  2024 revenue figures were $1.669 billion for Cloudflare, and $3.99 billion for Akamai, per Wikipedia.
  