Comment by blibble

2 days ago

this is really amateur-level stuff: NPEs, no error handling, no exponential backoff, no test coverage, no testing in staging, no gradual rollout, fail-deadly
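
(For context, "exponential backoff" here just means retrying with increasing, capped, jittered delays instead of hammering the backend in a tight loop. A minimal sketch in Python, with made-up names rather than anything from Google's actual code:)

```python
import random
import time

def call_with_backoff(fn, max_attempts=5, base_delay=0.5, max_delay=30.0):
    """Retry fn() with capped exponential backoff and jitter (illustrative only)."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # real code would catch only retryable errors
            if attempt == max_attempts - 1:
                raise  # give up after the last attempt instead of retrying forever
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter spreads retries out
```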

I read their SRE books; all of this stuff is in there: https://sre.google/sre-book/table-of-contents/ https://google.github.io/building-secure-and-reliable-system...

have standards slipped? or was the book just marketing

IMO, it's that any defense, human or automated, is imperfect, and life is a series of tradeoffs.

You can write as many unit tests as you want, plus integration tests that check the system behaves as you expect on sample data, static analysis that screams if you're doing something visibly unsafe, staged rollouts from nightly builds to production, and so on and so on, but eventually, at large enough scale, you're going to find a gap in those layered safety measures, and if you're unlucky, it's going to be a gap in all of them at once.

It's the same reasoning from said book about why each additional nine always costs much more than the previous one - eventually you're doing things like running complete copies of your stack on stable builds from months ago and replaying all the traffic to them so you can fail over on a moment's notice, which also means you can't roll out new features until the backup copies support them too, and that's a level of cost/benefit that nobody can pay if the service is large enough.

In my work on OpenZFS, a number of bugs have come from things like "this code in isolation works as expected, but an edge case we didn't know about in data written 10 years ago came up", or "this range of Red Hat kernels from 3 Red Hat releases ago has buggy behavior, and since we test on the latest kernel of that release, we didn't catch it".

Eventually, if there's enough complexity in the system, you cannot feasibly test even all the variation you know about, so you make tradeoffs based on what gets enough benefit for the cost.

(I'm an SRE at Google, not on any team related to this incident, all opinions unofficial/my own, etc.)

Nearly every global outage at Google has looked vaguely like this: a bespoke system that rapidly deploys configs globally gets a bad config.

All the standard tools for binary rollouts and config pushes will typically do some kind of gradual rollout.
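
As a rough illustration of what that looks like (the region names and helper functions below are hypothetical, not Google's internal tooling): the change goes out wave by wave, soaks for a while, and the push aborts at the first unhealthy wave.

```python
import time

# Hypothetical wave ordering; real tooling sequences waves by blast radius.
WAVES = [
    ["staging"],
    ["us-central1"],
    ["us-east1", "europe-west1"],
    ["asia-east1", "europe-north1"],
]

def push_config(config, apply_to_region, region_is_healthy, soak_seconds=600):
    """Apply `config` one wave at a time, aborting as soon as a wave looks unhealthy."""
    for wave in WAVES:
        for region in wave:
            apply_to_region(region, config)
        time.sleep(soak_seconds)  # let the change soak before widening the blast radius
        if not all(region_is_healthy(region) for region in wave):
            raise RuntimeError(f"aborting rollout: wave {wave} unhealthy")
```

The mechanics vary, but the key property is that a bad config can only hurt the first wave it reaches.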

In some ways Google Cloud had actually greatly improved the situation since a bunch of global systems were forced to become regional and/or become much more reliable. Google also used to have short global outages that weren't publicly remarked on (at the time, if you couldn't connect to Google, you assumed your own ISP was broken), so this event wasn't as rare as you might think. Overall I don't think there is a worsening trend unless someone has a spreadsheet of incidents proving otherwise.

[I was an SRE at Google several years ago]

  • From the OP:

    > a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds

    If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag that “quota policy changes must use the standard gradual rollout tooling.”

    The baseline can’t be “the trend isn’t worsening”; the baseline should be that if global config rollouts are commonly the cause of problems, there should be increasingly elevated standards for when config systems can bypass best practices. Clearly that didn’t happen here.

    • One of the problems has been that most users have requested that quotas be updated as fast as possible and be consistent across regions, even for global quotas. As such, people have been prioritising user experience over availability.

      I hope the pendulum now swings the other way in this discussion.

      [disclaimer: I worked as a GCP SRE for a long time, but left a while ago]

      3 replies →

As an outsider, my quick guess is that at some point, after enough layoffs and the CEO accusing everyone of being lazy, people start focusing on speed/perceived output over quality. After a while the culture shifts so that if you block such things, you're the problem and will be ostracized.

  • As an outsider, what I perceive is quite different:

    HN likes to pretend that FAANG is the pinnacle of existence. The best engineers, the best standards, the most “that wouldn’t have happened here,” the yardstick by which all companies should be measured for engineering prowess.

    Incidents like this happening repeatedly reveal that’s mostly a myth. They aren’t much smarter, their standards are somewhat wishful thinking, and their accomplishments are mostly rooted in the problems they needed to solve, just like any other company.

    • That’s just PR that serves these companies. I’ve never seen them that way. The stupid avoidable bugs and terrible UX in a lot of their products tell you enough at the surface level. What’s true is that these companies do hire some amazing specialists, but that doesn’t make them the pinnacle of engineering overall.

    • You might be right to some extent, but not entirely. For example, there have been almost no incidents in AWS where one customer was able to access the data of another customer because of an AWS fault. The cases so far, like Superglue etc., were very limited, and IMHO AWS security is quite solid.

      So I would say there is a difference between AWS architects and engineers (although I know first hand that certain things are suboptimal, but...) and those of several other companies that have fewer customers but have experienced successful attacks (or data loss). Even if you take Microsoft, there is a huge difference in security posture between AWS and Azure (and I say this as a big fan of the so-called "private cloud" (previously known as just your own infra)).

      3 replies →

    • > Incidents like this happening repeatedly reveal that’s mostly a myth. They aren’t much smarter, their standards are somewhat wishful thinking, and their accomplishments are mostly rooted in the problems they needed to solve, just like any other company.

      I think you're only seeing what you want to see, because somehow bringing FAANG engineers down a peg makes you feel better?

      A broken deployment caused by a once-in-a-lifetime configuration change, in a project that wasn't allocated the engineering effort for more robust and resilient deployment modes, doesn't turn any engineer into an incompetent fool. Sometimes you need to flip a switch, and you can't spare a team for a year to refactor the whole thing.

      3 replies →

I wish they would share more details here. Your take isn't fully correct. There was testing, just not for the bad input (the blank fields in the policy). They also didn't say there was no testing in staging, just that a flag would have caught it.

Opinions are my own.

  • > There was testing, just not for the bad input (the blank fields in the policy).

    But if you change a schema, be it DB, protobuf, or whatever, this is the main thing your tests should be covering.

    This is why people are so amazed by it.
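
    As a sketch of the kind of test being described (the policy shape, field name, and default below are invented for illustration; the real Service Control schema isn't public), the point is simply that a policy row with blank fields must degrade gracefully rather than crash:

    ```python
    DEFAULT_LIMIT = 1000  # hypothetical default, purely illustrative

    def quota_limit_for(policy: dict) -> int:
        """Toy policy reader standing in for whatever the real code does."""
        value = policy.get("quota_limit")
        if value in (None, ""):  # blank/missing field: fall back instead of raising
            return DEFAULT_LIMIT
        return int(value)

    def test_blank_fields_do_not_crash():
        assert quota_limit_for({}) == DEFAULT_LIMIT
        assert quota_limit_for({"quota_limit": ""}) == DEFAULT_LIMIT
        assert quota_limit_for({"quota_limit": "500"}) == 500
    ```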

...the constant familiarity with even the most dangerous instruments soon makes men lose their first caution in handling them; they readily, therefore, come to think that the rules laid down for their guidance are unnecessarily strict - report on the explosion of a gunpowder magazine at Erith, 1864

  • The incredible reliability and high standards in most of the air travel industry prove this wrong.

    • Yes, I'm not saying a quote from 1864 is the last word on work culture. Nevertheless, it does capture something of human nature: what is now known as "normalisation of deviance".

That book was written when Google had 40% of the engineers it had when I left a couple of years ago (not sure how many there are now, with the layoffs). I'm guessing those hires haven't read it yet. So yeah, reads like standards slipping to me.

Try running a few Lighthouse measurements on their web pages and you’ll see they don’t maintain the highest engineering standards.

  • Yep, people (including me) like to shit on Next.js, but I have a fairly complex app, and it’s still at 100 100 100 100

At Google scale, if their standards were not sky high, such incidents would be happening daily. That they happen once in a blue moon indicates that they are really meticulous with all those processes and safeguards almost all the time.

- You do know their AZs are just firewalls across the same datacenter?

- And they used machines without ECC and their index got corrupted because of it? And instead of hanging their heads in shame and taking lessons from IBM old-timers, they published a paper about it?

- What really accelerated the demise of Google+ was that an API issue allowed the harvesting of private profile fields for millions of users, and they hid that for months fearing the backlash...

Don't worry, you will have plenty more outages from the land of "we only hire the best"...

  • I wonder if machines without ECC could perhaps explain why our apps periodically see TCP streams with scrambled contents.

    On GKE, we see different services (like Postgres and NATS), running in separate containers on the same VM, receive or send stream contents (e.g. HTTP responses) in which packets of the stream have been mangled with the contents of other packets. We've been seeing it since 2024, and all the investigation we've done points to something outside our apps and deeper in the system. We've only seen it in one Kubernetes cluster; it lasts 2-3 hours and then magically resolves itself, and draining the node also fixes it.

    If there are physical nodes with faulty RAM, I bet something like this could happen. Or there's a bug in their SDN or their patched version of the Linux kernel.

    • You could start by running similar parallel infra on AWS, where ECC is everywhere... And check the corrupted TCP streams for single-bit-flip patterns (see the sketch below), and maybe correlate the timing with memory pressure, since that's when RAM errors typically show up. If it's more than just bit flips, it could be something else.

      3 replies →
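
      A rough sketch of that bit-flip check, assuming you can capture the payload both as sent and as received (hypothetical helper, not a GKE or AWS tool):

      ```python
      def bit_flip_report(sent: bytes, received: bytes):
          """Return (byte_offset, flipped_bit_count) pairs for the bytes that differ."""
          if len(sent) != len(received):
              return None  # truncation/reordering is a different failure mode
          return [
              (offset, bin(a ^ b).count("1"))
              for offset, (a, b) in enumerate(zip(sent, received))
              if a != b
          ]
      ```

      If most differing bytes show exactly one flipped bit, faulty RAM (or DMA/cache corruption) becomes a much stronger suspect than an SDN or kernel bug.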

Standards fallen?

Google’s standards (and, from what I can tell, most FAANG standards) are like beauty filters on Instagram. Judging yourself, or any company, against them is delusional.