I've actually seen this happen before. Most changes get deployed behind a gradually enabled (% of users/cities/...) feature flag and then cleaned up later on. There is a Slack notification from the central service that manages them telling you how many rollouts are complete, reminding you to clean them up. It escalates to SRE if you don't pay heed for long enough.
During the rollout window, the combinatorial explosion of code paths is rather annoying to deal with. On the other hand, it does encourage not having too many things going on at once. If some of the changes affect business metrics, it will be hard to glean any insights if too many of them are in flight at the same time.
And yet it's not even close to the productivity hurting rules and procedures followed for aerospace software.
Rules in aerospace are written in blood. Rules of internet software are written in inconvenience, so productivity is usually given much higher priority than reducing risk of catastrophic failure.
Whatever this "red-button" technology is, is pants. If you know you want to turn something off at incident + 10 mins, it should be off within a minute. Not "Preparing a change to trigger the red-button", but "the stop flag was set by an operator in a minute and was synched globally within seconds".
I mean, it's not like they don't have that technology: the worldwide sync was exactly what caused the outage.
Generally, even these emergency changes are done not entirely immediately to prevent a fix from making things worse. This is an operational choice though, not a technical limitation. My guess being involved in similar issues in the past is the ~15 minute delay preparing the change was either that it wasn't a normally used big red button, so it wasn't clear how to use it, or there was some other friction preparing the change.
More critically, some service allowed for unvalidated instant rollout of policy changes to prod. The actual bug in the code is largely irrelevant compared to the first issue.
Processes and systems have flaws that can be fixed, humans always will make mistakes.
It doesn't exactly, no. There are linters and a compiler-enforced check preventing unused variables. Overall it's pretty easy to accidentally drop errors or overwrite them before checking.
That is still the most bizarre error handling pattern that is just completely accepted by the Go community. And after a decade of trying to fix it, the Go team just recently said that they've given up.
It's not that bizarre, it's exactly how errors used to be handled in older languages like C. Golang is not strange, it's just outdated. It was created by Unix greybeards, after all
Really, whatever null condition caused the crash is mostly irrelevant. The big problem is instantaneous global replication of policy changes. All code can fail unexpectedly; the gradual rollout of the binary was pointless, since the new path doesn't take effect until a policy changes, yet policy changes propagate near instantaneously.
The incident report is interesting. Fast reaction time by the SRE team (2 minutes), then the "red button" rollout. But then "Within some of our larger regions, such as us-central-1, as Service Control tasks restarted, it created a herd effect on the underlying infrastructure it depends on (i.e. that Spanner table), overloading the infrastructure. Service Control did not have the appropriate randomized exponential backoff implemented to avoid this. It took up to ~2h 40 mins to fully resolve in us-central-1 as we throttled task creation to minimize the impact on the underlying infrastructure and routed traffic to multi-regional databases to reduce the load."
In my experience this happens more often than not: in an exceptional situation like the recovery of many nodes, quotas that make sense in regular operations get exceeded quickly, and you run into another failure scenario. As long as the underlying infrastructure can cope with it, it's good if you can disable quotas temporarily and quickly, or throttle the recovery operations, which naturally take longer in that case.
this is really amateur level stuff: NPEs, no error handling, no exponential backoff, no test coverage, no testing in staging, no gradual rollout, fail deadly
I read their SRE books, all of this stuff is in there: https://sre.google/sre-book/table-of-contents/ https://google.github.io/building-secure-and-reliable-system...
have standards slipped? or was the book just marketing
Nearly every global outage at Google has looked vaguely like this. I.e. a bespoke system that rapidly deploys configs globally gets a bad config.
All the standard tools for binary rollouts and config pushes will typically do some kind of gradual rollout.
In some ways Google Cloud had actually greatly improved the situation since a bunch of global systems were forced to become regional and/or become much more reliable. Google also used to have short global outages that weren't publicly remarked on (at the time, if you couldn't connect to Google, you assumed your own ISP was broken), so this event wasn't as rare as you might think. Overall I don't think there is a worsening trend unless someone has a spreadsheet of incidents proving otherwise.
[I was an SRE at Google several years ago]
From the OP:
> a policy change was inserted into the regional Spanner tables that Service Control uses for policies. Given the global nature of quota management, this metadata was replicated globally within seconds
If there’s a root cause here, it’s that “given the global nature of quota management” wasn’t seen as a red flag that “quota policy changes must use the standard gradual rollout tooling.”
The baseline can’t be “the trend isn’t worsening;” the baseline should be that if global config rollouts are commonly the cause of problems, there should be increasingly elevated standards for when config systems can bypass best practices. Clearly that didn’t happen here.
As an outsider my quick guess is that at some point after enough layoffs and the CEO accusing everyone of being lazy, people focus on speed/perceived output over quality. After a while the culture shifts so if you block such things then you're the problem and will be ostracized.
As an outsider, what I perceive is quite different:
HN likes to pretend that FAANG is the pinnacle of existence. The best engineers, the best standards, the most “that wouldn’t have happened here,” the yardstick by which all companies should be measured for engineering prowess.
Incidents like this repeatedly happening reveal that’s mostly a myth. They aren’t much smarter, their standards are somewhat wishful thinking, their accomplishments are mostly rooted in the problems they needed to solve just like any other company.
IMO, it's that any defense, humans or automated, is imperfect, and life is a series of tradeoffs.
You can write as many unit tests as you want, and integration tests that your system works as you expect on sample data, static analysis to scream if you're doing something visibly unsafe, staged rollout from nightly builds to production, and so on and so on, but eventually, at large enough scale, you're going to find a gap in those layered safety measures, and if you're unlucky, it's going to be a gap in all of them at once.
It's the same reasoning from said book as why getting another nine is always going to involve much more work than the previous ones - eventually you're doing things like setting up complete copies of your stack running stable builds from months ago and replaying all the traffic to them in order to be able to fail over to them on a moment's notice, meaning that you also can't roll out new features until the backup copies support it too, and that's a level of cost/benefit that nobody can pay if the service is large enough.
When working on OpenZFS, a number of bugs have come from things like "this code in isolation works as expected, but an edge case we didn't know about in data written 10 years ago came up", or "this range of Red Hat kernels from 3 Red Hat releases ago has buggy behavior, and since we test on the latest kernel of that release, we didn't catch it".
Eventually, if there's enough complexity in the system, you cannot feasibly test even all the variation you know about, so you make tradeoffs based on what gets enough benefit for the cost.
(I'm an SRE at Google, not on any team related to this incident, all opinions unofficial/my own, etc.)
I wish they would share more details here. Your take isn't fully correct. There was testing, just not for the bad input (the blank fields in the policy). They also didn't say there was no testing in staging, just that a flag would have caught it.
Opinions are my own.
> There was testing, just not for the bad input (the blank fields in the policy).
But if you change a schema, be it DB, protobuf, whatever, this is the major thing your tests should be covering.
This is why people are so amazed by it.
...the constant familiarity with even the most dangerous instruments soon makes men lose their first caution in handling them; they readily, therefore, come to think that the rules laid down for their guidance are unnecessarily strict - report on the explosion of a gunpowder magazine at Erith, 1864
The incredible reliability and high standards in most of the air travel industry prove this wrong.
That book was written when Google had about 40% of the engineers it had when I left a couple of years ago (not sure how many there are now, after the layoffs). I'm guessing those hires haven't read it yet. So yeah, reads like standards slipping to me.
Try a few Lighthouse measurements on their web pages and you'll see they don't maintain the highest engineering standards.
Yep, people (including me) like to shit on Next.js, but I have a fairly complex app, and it’s still at 100 100 100 100
At google scale, if their standards were not sky high, such incidents would be happening daily. That it happens once in a blue moon indicates that they are really meticulous with all those processes and safeguards almost all the time.
someone vibe coded a push to prod on friday?
Ironically Gemini wouldn’t forget a null check
- You do know their AZs are just firewalls across the same datacenter?
- And they used machines without ECC and their index got corrupted because of it? And instead of hiding their heads in shame and taking lessons from IBM old-timers, they published a paper about it?
- What really accelerated the demise of Google+ was that an API issue allowed the harvesting of private profile fields for millions of users, and they hid that for months fearing the backlash...
Don't worry, you will have plenty more outages from the land of "we only hire the best"...
I wonder if machines without ECC could perhaps explain why our apps periodically see TCP streams with scrambled contents.
On GKE, we see different services (like Postgres and NATS) running on the same VM in different containers receive/send stream contents (e.g. HTTP responses) where the packets of the stream have been mangled with the contents of other packets. We've been seeing it since 2024, and all the investigation we've done points to something outside our apps and deeper in the system. We've only seen it in one Kubernetes cluster, and it lasts 2-3 hours and then magically resolves itself; draining the node also fixes it.
If there are physical nodes with faulty RAM, I bet something like this could happen. Or there's a bug in their SDN or their patched version of the Linux kernel.
Standards fallen?
Google’s standards, and from what I can tell, most FAANG standards are like beauty filters on Instagram. Judging yourself, or any company, against them is delusional.
I work on Cloud, but not this service. In general:
- All the code has unit tests and integration tests
- Binary and config file changes roll out slowly job by job, region by region, typically over several days. Canary analysis verifies these slow rollouts.
- Even panic rollbacks are done relatively slowly to avoid making the situation worse, for example by globally overloading databases with job restarts. A 40m outage is better than a 4-hour outage.
I have no insider knowledge of this incident, but my read of the PM is: The code was tested, but not this edge case. The quota policy config is not rolled out as a config file, but by updating a database. The database was configured for replication which meant the change appeared in all the databases globally within seconds instead of applying job by job, region by region, like a binary or config file change.
I agree on the frustration with null pointers, though if this was a situation the engineers thought was impossible it could have just as likely been an assert() in another language making all the requests fail policy checks as well.
Rewriting a critical service like this in another language seems way higher risk than making sure all policy checks are flag guarded, that all quota policy checks fail open, and that db changes roll out slowly region by region.
Disclaimer: this is all unofficial and my personal opinions.
> it could have just as likely been an assert() in another language
Asserts are much easier to forbid by policy.
That's fair, though `if (isInvalidPolicy) reject();` causes the same outage. So the eng process policy change seems to be failing open and slow rollouts to catch that case too.
> The code was tested, but not this edge case.
so... it wasn't tested
So you need to write a test for every single possible case to consider your code tested?
> Rewriting a critical service like this in another language seems way higher risk than making sure all policy checks are flag guarded
So like, the requirements are unknown? Or this service isn't critical enough to staff a careful migration?
How is the fact that it was a database change and not a binary or a config supposed to make it OK? A change is a change; global changes that go everywhere at once are a recipe for disaster, no matter what kind of change we're talking about. This is a second CrowdStrike.
This is the core point. A canary deployment that is not preceded by deploying data that exercises the new code path in the binary proves nothing useful at all, while promoting a false sense of security.
It’s interesting that multi-region is often touted as a mechanism for resilience and availability, but for the most part, large cloud providers seem hopelessly intertwined across regions during outages like these.
> Without the appropriate error handling, the null pointer caused the binary to crash.
We must be at the trillion dollar mistake by now, right?
I wonder how many SLAs they just cooked for the year
of their own or their customers with theirs?
If only we had a language that could prevent those /s
Which one do you have in mind? Haskell?
Good luck re-writing 25 years of C++ though.
I am an insider, hence the throw away account.
The root cause of this incident was leadership driving velocity by cutting corners. It has been going on for years, eventually over the cliff.
This specific failure mode is known as query of death. A query triggers an existing bug that causes the server to crash. It is inevitable for C++ servers.
Service Control is in C++. It uses a comprehensive set of engineering guidelines to minimize and tolerate query of death and other failure modes. Before this incident, it had no major incident in the previous decade.
This incident is related to a new global quota policy. It was built quickly under leadership pressure, cutting corners. Such features should be built in a secondary service, or at least following the established engineering guidelines.
Regarding the action items mentioned in the report, the established engineering guidelines far exceed them. The team has been keeping up with those standards as much as they can.
> This policy data contained unintended blank fields. Service Control, then regionally exercised quota checks on policies in each regional datastore. This pulled in blank fields for this respective policy change and exercised the code path that hit the null pointer causing the binaries to go into a crash loop.
Another example of Hoare's "billion-dollar mistake", in multiple Google systems:
- Why is it possible to insert unintended "blank fields" (nulls)? The configuration should have a schema type that doesn't allow unintended nulls. Unfortunately Spanner itself is SQL-like, so fields must be declared NOT NULL explicitly; the default is nullable.
- Even so, the program that manages these policies will have its own type system and possibly an application level schema language for the configuration. This is another opportunity to make invalid states unrepresentable.
- Then in Service Control, there's an opportunity to apply "schema on read" as you deserialize policies from the data store into application objects; again, either a programming-language type or an application-level schema could be used to validate that policy rows have the expected shape before they leave the data layer. Perhaps the null pointer error occurred in this layer, but since this issue occurred in a new code path, it sounds more likely that the invalid data escaped the data layer into application code.
- Finally, the Service Control application is written in a language that allows for null pointer references.
If I were a maintainer of this system, the minimally invasive change I would be thinking about is how to introduce an application-level schema to the policy writer and the policy reader that uses a "tagged enum type" or "union type" or "sum type" to represent policies, so that null is unrepresentable. Ideally each new kind of policy could be expressed as a new variant added to the union type. You can add this in app code without rewriting the whole program in a safe language. Unfortunately it seems proto3, Google's usual schema language, doesn't have this constraint.
Example of one that does: https://github.com/stepchowfun/typical
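To make the "sum type" idea concrete, here is a minimal C++ sketch under assumed names (QuotaPolicy, ParsePolicyRow and the field names are hypothetical, not Service Control's actual types): blank fields are rejected at the data-layer boundary, so downstream code never sees a half-populated policy.

    #include <cstdint>
    #include <optional>
    #include <string>
    #include <variant>

    // Hypothetical policy variants: each alternative carries only the fields
    // that are meaningful for that kind of policy, and none of them is nullable.
    struct RateLimitPolicy   { std::string metric; int64_t limit; };
    struct RegionQuotaPolicy { std::string region; int64_t quota; };
    using QuotaPolicy = std::variant<RateLimitPolicy, RegionQuotaPolicy>;

    // Raw row as it comes out of the datastore: anything may be blank.
    struct PolicyRow {
      std::optional<std::string> kind, metric, region;
      std::optional<int64_t> limit, quota;
    };

    // Schema on read: either a fully populated QuotaPolicy or no policy at all.
    // A row with blank fields becomes nullopt instead of escaping the data layer.
    std::optional<QuotaPolicy> ParsePolicyRow(const PolicyRow& row) {
      if (!row.kind) return std::nullopt;
      if (*row.kind == "rate_limit" && row.metric && row.limit)
        return QuotaPolicy{RateLimitPolicy{*row.metric, *row.limit}};
      if (*row.kind == "region_quota" && row.region && row.quota)
        return QuotaPolicy{RegionQuotaPolicy{*row.region, *row.quota}};
      return std::nullopt;  // unknown kind or blank fields: rejected at the boundary
    }

Downstream code then matches on the variant, and a missing policy is an explicit case the caller has to decide on (fail open, fail closed, alert) rather than a null waiting to be dereferenced.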
Everyone loves to criticise downtime when it happens to others, saying these are “junior level mistakes” and so on. Until it happens to them, and then there’s a convenient excuse as to why it was either unavoidable or unforeseeable. Truth is humans make mistakes and the expectations are too high.
When a brick and mortar business has to shut unexpectedly they’ll put a sign on the door, apologise and that’s that. Only in tech do we stress so much about a few hours per year. I wish everyone would relax a bit.
Google post-mortems never cease to amaze me, from seeing them inside the company to outside. The level of detail is amazing. The thing is, they will never make the same mistake again. They learn from it, put in the correct protocols and error handling, and then create an even more robust system. At the scale of Google there is always something going wrong; the point is how it is handled so it doesn't affect the customer/user and other systems. Honestly it's an ongoing thing you don't see unless you're inside, and even then, on a per-team basis you might see things no one else is seeing. It is probably the closest we're going to come to the most complex systems of the universe, because we as humans will never do better than this. Maybe AGI does, but we won't.
But this is a whole series of junior level mistakes:
* Not dealing with null data properly
* Not testing it properly
* Not having test coverage showing your new thing is tested
* Not exercising it on a subset of prod after deployment to show it works without falling over before it gets pushed absolutely everywhere
Standards in this industry have dropped over the years, but by this much? If you had done this 10 years ago as a Google customer for something far less critical everyone on their side would be smugly lolling at you, and rightly so.
> junior level mistakes
> Not dealing with null data properly
This is _hardly_ a "junior level mistake". That kind of bug is pervasive in all the languages they're likely using for this service (Go, Java, C++) written even by the most "senior" developers.
As I understand it the outage was caused by several mistakes:
1) A global feature release that went everywhere at the same time
2) Null pointer dereference
3) Lack of appropriate retry policies that resulted in a thundering herd problem
All of these are absolutely standard mistakes that everyone who's worked in the industry for some time has seen numerous times. There is nothing novel here, no weird distributed-systems logic, no Google scale, just rookie mistakes all the way.
Null pointer dereference has nothing to do with the problem; they have billions of lines of code. Do you think that kind of issue will never happen?
This is 100% a process problem.
This error was an uncaught null pointer issue.
For a company the size and quality of Google to be bringing down the majority of their stack with this type of error really suggests they do not implement appropriate mitigations after serious issues.
> They will never make the same mistake again
They rolled out a change without feature flagging, didn’t implement exponential backoffs in the clients, didn’t implement load shedding in the servers.
This is all in the google SRE book from many years ago.
This is literally the same mistake that has been made many times before. Of course it will be made again. “New feature is rolled out carefully with a bug that remains latent until triggered by new data” could summarize most global outages.
The thing is, nobody is perfect. Except armchair HN commenters on threads about FAANG outages, of course.
So yes, for some of them. But not this one: this one is a major embarrassment.
Amateur hour in Mountain View.
I recently started as a GCP SRE. I don't have insider knowledge about this, and my views on it are my own.
The most important thing to look at is how much had to go wrong for this to surface. It had to be a bug without test coverage that wasn't covered by staged rollouts or guarded by a feature flag. That essentially means a config-in-db change. Detection was fast, but rolling out the fix was slow out of fear of making things worse.
The NPE aspect is less interesting. It could have been any number of similar "this can't happen" errors. It could have been that mutually exclusive fields were present in a JSON object and the handling logic did funny things. Validation during mutation makes sense, but the rollout strategy is more important, since it can catch and mitigate things you haven't thought of.
Such absurdity that as a customer we (HN) knew more than the official support
Regardless of the business need for near-instantaneous global consistency of the data (i.e. quota management settings are global), changes need to be propagated incrementally, with sufficient time to validate and detect issues.
This reads to me like someone finally won an argument they’d been having for some time.
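As a hedged sketch of what incremental propagation could look like without giving up a single global source of truth (the names and fields here are hypothetical, not how Spanner or Service Control actually model this): replicate the policy row everywhere immediately, but attach a per-region activation time so enforcement ramps region by region.

    #include <chrono>
    #include <map>
    #include <string>

    using Clock = std::chrono::system_clock;

    // Hypothetical: a policy row replicates globally within seconds, but carries
    // a rollout schedule instead of taking effect everywhere at once.
    struct PolicyRecord {
      std::string policy_id;
      std::map<std::string, Clock::time_point> activate_at;  // region -> wave time
    };

    // A region only starts enforcing the new policy after its scheduled wave;
    // until then it keeps serving the previous known-good policy.
    bool IsActiveInRegion(const PolicyRecord& p, const std::string& region,
                          Clock::time_point now = Clock::now()) {
      auto it = p.activate_at.find(region);
      if (it == p.activate_at.end()) return false;  // not scheduled here yet
      return now >= it->second;
    }

Replication stays as fast as the business wants; only enforcement is staggered, so a policy that crashes its readers takes out a canary region rather than every region at once.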
Null pointers strike again.
More like yolo deploys without proper scaffolding in place to handle SHTF. Also, if your red-button takes 40 minutes and a deploy to mitigate anything, it isn't a red-button.
> Without the appropriate error handling, the null pointer caused the binary to crash. Feature flags are used to gradually enable the feature region by region per project, starting with internal projects, to enable us to catch issues. If this had been flag protected, the issue would have been caught in staging.
So some combination of both.
Has there been any confirmation it was golang?
Pretty sure it's C++.
inb4 someone yells Rust. No, your `.unwrap()` happily panics in Rust too.
If you write code that crashes on a blank field you'll manage to do that with any language
Err yeah but the point is languages without null pointers all over the place make it harder to do that in the first place. You normally get some kind of type error at compile time.
> The issue with this change was that it did not have appropriate error handling nor was it feature flag protected.
I've been there. The product guy needs the new feature enabled for everyone, and he needs it yesterday. Suggestions of feature flagging are ignored outright. The feature is then shipped for every user, and fun ensues.
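For what it's worth, the guard everyone is talking about doesn't have to be elaborate. A rough sketch with made-up names (FlagEnabled, LookupQuotaPolicy and friends are placeholders, not Google APIs), assuming the flag defaults to off and the new path fails open on bad data:

    #include <iostream>
    #include <optional>
    #include <string>

    struct Request { std::string project; };
    struct Policy  { /* parsed, validated policy fields */ };

    // Placeholder flag lookup; a real one would come from a flag service and be
    // ramped region by region, starting with internal projects.
    bool FlagEnabled(const std::string& name) { return false; }  // default: off

    // Placeholder lookup; returns nullopt for missing or malformed policies.
    std::optional<Policy> LookupQuotaPolicy(const Request&) { return std::nullopt; }

    // New quota-policy check, flag-guarded and failing open: a blank or malformed
    // policy logs loudly but never turns into a global reject or a crash loop.
    bool CheckQuota(const Request& req) {
      if (!FlagEnabled("new_quota_policy_checks")) return true;  // old behavior
      std::optional<Policy> policy = LookupQuotaPolicy(req);
      if (!policy) {
        std::cerr << "quota policy missing/invalid for " << req.project
                  << "; failing open\n";
        return true;  // serve the request, page a human
      }
      // ... evaluate *policy against the request ...
      return true;
    }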
Usually Google and FAANG outages in general are due to things that only happen at Google scale, but this incident looks like something out of a generic small/medium company with 30 engineers at most.
I've seen it happen at companies with 100s of engineers, some of them ex-FAANG, including ex-Googlers. The average FAANG engineer has a way more pedestrian work ethic than the FAANG engineers on HN want you to believe.
Why did it take so long for Google to update their status page at https://www.google.com/appsstatus/dashboard/? According to this report, the issue started at 10:49 am PT, but when I checked the Google status page at 11:35 am PT, everything was still green. I think this is something else they need to investigate.
This is explained in the link you’re commenting on.
> We will modularize Service Control’s architecture, so the functionality is isolated and fails open. Thus, if a corresponding check fails, Service Control can still serve API requests.
If I understood it correctly, this service checks proper authorisation among other things, so isn’t failing open a security risk?
This was the bit that I spotted as potentially conflicting as well. Having managed (and sanitised!) tech&security policies at a small tech company, the fail-open vs. fail-closed decisions are rarely clear cut. What makes it worse is that a panicked C-suite member can make a blanket policy decision without consulting anyone outside their own circle.
The downstream effects tend to be pretty grim, and to make things worse, they start to show up only after 6 months. It's also a coinflip whether a reverse decision will be made after another major outage - itself directly attributable to the decisions made in the aftermath of the previous one.
What makes these kinds of issues particularly challenging is that by their very definition, the conditions and rules will be codified deep inside nested error handling paths. As an engineer maintaining these systems, you are outside of the battle tested happy paths and first-level unhappy paths. The conditions to end up in these second/third-level failure modes are not necessarily well understood, let alone reproducible at will. It's like writing code in C, and having all your multi-level error conditions be declared 'volatile' because they may be changed by an external force at any time, behind your back.
I think this was probably specific to the quota checks.
The big issue here (other than the feature rollout) is the lack of throttling. Exponential backoff is a fairly standard integration for scaled applications. Most cloud services use it. I’m surprised it wasn’t implemented for something as fundamental as Service Control.
My reading is that this is on start up. E.g., some config needs to be read when tasks come up. It's easy to have backoff in your normal API path, but miss it in all the other places your code talks to services.
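For reference, a minimal sketch of the randomized exponential backoff the report says was missing — capped, with full jitter, wrapped around a hypothetical startup fetch so restarting tasks spread out instead of hitting the datastore in lockstep:

    #include <algorithm>
    #include <chrono>
    #include <cmath>
    #include <random>
    #include <thread>

    // Retry fetch() (returns true on success) with capped exponential backoff and
    // full jitter. The randomization is what breaks up the thundering herd when
    // thousands of tasks restart and reach for the same backend at the same time.
    template <typename Fetch>
    bool FetchWithBackoff(Fetch fetch, int max_attempts = 10) {
      std::mt19937 rng{std::random_device{}()};
      const double base_ms = 100.0, cap_ms = 60000.0;
      for (int attempt = 0; attempt < max_attempts; ++attempt) {
        if (fetch()) return true;
        double ceiling = std::min(cap_ms, base_ms * std::pow(2.0, attempt));
        std::uniform_real_distribution<double> jitter(0.0, ceiling);
        std::this_thread::sleep_for(
            std::chrono::duration<double, std::milli>(jitter(rng)));
      }
      return false;  // give up; the caller decides whether to keep the old config
    }

Called at task startup as something like FetchWithBackoff([&] { return TryLoadPolicies(); }), where TryLoadPolicies is a placeholder for whatever actually reads the policy table.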
Does Borg not back off your task if it crash-loops? That is how k8s does it.
Maybe it's just my lack of reading comprehension but some of the wording in this report feels off:
> this code change came with a red-button to turn off that particular policy serving path.
> the root cause was identified and the red-button (to disable the serving path) was being put in place
So the red-button was or wasn't in place on May 29th? The first sentence implies it was ready to be used but the second implies it had to be added. A red-button sounds like a thing that's already in place and can be triggered immediately, but this sounds like an additional change had to be deployed?
> Without the appropriate error handling, the null pointer caused the binary to crash
This is the first mention of a null pointer (and _the_ null pointer too, not just _a_ null pointer). This implies the specific null pointer that would have caused a problem was known at this point? And this wasn't an early issue?
I don't mean to play armchair architect and genuinely want to understand this from a blameless post-mortem point-of-view given the scale of the incident, but the wording in this report doesn't quite add up.
(Edit for formatting)
So code that was untested (the code path that failed was never exercised), perhaps with no test environment, and not even properly peer reviewed ("it did not have appropriate error handling nor was it feature flag protected") was pushed to production. What a surprise!
Continuous Integration/Continuous Disaster
No amount of "whatever" can prevent bugs from reaching production
I would not be surprised if the code was AI generated.
I like the faith you have that people weren't making null-pointer mistakes before LLMs.
There absolutely is a test environment, it was absolutely reviewed and Google has absolutely spent Moon-landing money on testing and in particular static analysis.
Moon landing money on static analysis that failed to identify the existence of a completely untested code path? Or even to shake this out with random data generation?
This is a dumbfounding level of mistake for an organization such as Google.
ok so what gives then?
(Throwaway since I was part of a related team a while back)
Service Control (Chemist) is a somewhat old service, been around for about a decade, and is critical for a lot of GCP APIs for authn, authz, auditing, quota etc. Almost mandated in Cloud.
There's a proxy in the path of most GCP APIs, that calls Chemist before forwarding requests to the backend. (Hence I don't think fail open mitigation mentioned in post-mortem will work)
Both Chemist and the proxy are written in C++, and have picked up a ton of legacy cruft over the years.
The teams have extensive static analysis & testing, gradual rollouts, feature flags, red buttons and strong monitoring/alerting systems in place. The SREs in particular are pretty amazing.
Since Chemist handles a lot of policy checks like IAM, quotas, etc., other teams involved in those areas have contributed to the codebase. Over time, shortcuts have been taken so those teams don’t have to go through Chemist's approval for every change.
However, in the past few years, the organization’s seen a lot of churn and a lot of offshoring too. Which has led to a bigger focus on flashy, new projects led by L8/L9s to justify headcount instead of prioritizing quality, maintenance, and reliability. This shift has contributed to a drop in quality standards and increased pressure to ship things out faster (and one of the reasons I ended up leaving Cloud).
Also, many of the server/service best practices that are common at Google are not so common here.
That said, in this specific case, it seems like the issue is more about lackluster code and code review (iirc the code was merged despite some failures). And pushing config changes instantly through Spanner made it worse.
I saw a lot of third party services (i.e. CloudFlare) go down, but did any non-Cloud Google properties see an impact?
It would say something if core Google products don't or won't take a dependency on Google Cloud…
I would be interested in seeing the elapsed time to recovery for each location up to us-central-1.
Is this information available anywhere?
> If this had been flag protected, the issue would have been caught in staging.
I’m a bit confused by this; it seems the new code was enabled by default, so it should have been caught in staging anyway.
> We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage.
No one wondered hm maybe this isn’t a good idea?
Getting approval from marketing/senior leadership to set up this kind of thing outside the normal infrastructure is very difficult, so it was probably decided against.
If this wasn’t vibe coded I’ll eat a frog or something.
Guess all that leet code screening only goes so far, huh?
No error handling, empty fields no one noticed. Was this change carelessly vibe coded?
> We will improve our static analysis and testing practices to correctly handle errors and if need be fail open.
> Without the appropriate error handling, the null pointer caused the binary to crash.
Even worse if this was AI-generated C or C++ code. Wasn't this tested before deployment?
This is why you write tests before the actual code, and why vibe-coding is a scam as well. This would also never have happened if it was written in Rust.
I expect far better than this from Google and we are still dealing with null pointer crashes to this day.
I'm as much of a Rust advocate as anyone, but what does vibe-coding have to do with any of this?
Seems to just be the new thing to casually hate on.
"If this was vibe coded, this is even worse. This proves that vibe coding is bad."
>We will enforce all changes to critical binaries to be feature flag protected and disabled by default.
If this were actually enforced, not only would it kill developer productivity, but how could it even be done? When the compiler version bumps, that's a lot of changes that need to be gated. Every team that works on a dependency would have to add feature flags for every change and bug fix they make.
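For illustration only, a minimal Go sketch of what a default-off flag guard around a new code path might look like; the names, the env-var mechanism, and the stubbed functions are all hypothetical:

    package main

    import (
        "fmt"
        "os"
    )

    type request struct{ project string }
    type decision struct{ allow bool }

    // newQuotaPathEnabled gates a hypothetical new policy-serving path. It is off
    // by default, so the new code only runs where the flag is explicitly set
    // (e.g. staging or a canary region) before it reaches everything else.
    func newQuotaPathEnabled() bool {
        return os.Getenv("ENABLE_NEW_QUOTA_PATH") == "true"
    }

    func checkQuotaV1(r request) (decision, error) { return decision{allow: true}, nil } // existing path
    func checkQuotaV2(r request) (decision, error) { return decision{allow: true}, nil } // new path

    func checkQuota(r request) (decision, error) {
        if newQuotaPathEnabled() {
            return checkQuotaV2(r)
        }
        return checkQuotaV1(r)
    }

    func main() {
        d, err := checkQuota(request{project: "demo"})
        fmt.Println(d, err)
    }

The point of the guard is that a broken new path can only crash the instances where someone has deliberately turned it on.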
Yeah, some of their takeaways seem overly stifling. This sounds like it wasn't a case of a broken process or missing ingredients. All the tools to prevent it were there (feature flags, null handling, basic static analysis tools). Someone just didn't know to use them.
This also got a laugh:
> We posted our first incident report to Cloud Service Health about ~1h after the start of the crashes, due to the Cloud Service Health infrastructure being down due to this outage. For some customers, the monitoring infrastructure they had running on Google Cloud was also failing, leaving them without a signal of the incident or an understanding of the impact to their business and/or infrastructure.
You should always have at least some kind of basic monitoring that's on completely separate infrastructure, ideally from a different vendor. (And maybe Google should too)
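As a hedged sketch of that advice, a trivial external probe in Go, meant to run from infrastructure that does not depend on the platform being monitored (the endpoint and interval are made up):

    package main

    import (
        "log"
        "net/http"
        "time"
    )

    func main() {
        const target = "https://status.example.com/healthz" // hypothetical endpoint to watch
        client := &http.Client{Timeout: 5 * time.Second}
        for {
            resp, err := client.Get(target)
            if err != nil {
                log.Printf("probe failed: %v", err) // in practice, page someone via a separate channel
            } else {
                if resp.StatusCode != http.StatusOK {
                    log.Printf("probe got status %d", resp.StatusCode)
                }
                resp.Body.Close()
            }
            time.Sleep(30 * time.Second)
        }
    }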
this is the worst GCP outage I can remember
> It took up to ~2h 40 mins to fully resolve in us-central-1
this would have cost their customers tens of millions, maybe north of $100M.
not surprised they'd have an extreme write-up like this.
I've actually seen this happen before. Most changes get deployed behind a gradually enabled (% of users/cities/...) feature flag and then cleaned up later on. There is a Slack notification from the central service which manages them that tells you how many rollouts are complete, reminding you to clean them up. It escalates to SRE if you don't pay heed for long enough.
During the rollout window, the combinatorial explosion of code paths is rather annoying to deal with. On the other hand, it does encourage not having too many things going on at once. If some of the changes affect business metrics, it will be hard to glean any insights with too many of them in flight at the same time.
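As a rough sketch of that kind of percentage gate (the hashing scheme and ramp schedule here are assumptions, not how Google's tooling works):

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // rolloutEnabled deterministically buckets an ID (user, city, region, ...)
    // into 0-99 and enables the flag for the given percentage. The same ID always
    // lands in the same bucket, so behaviour stays stable as the percentage ramps.
    func rolloutEnabled(id string, percent uint32) bool {
        h := fnv.New32a()
        h.Write([]byte(id))
        return h.Sum32()%100 < percent
    }

    func main() {
        for _, p := range []uint32{1, 10, 50, 100} { // a hypothetical ramp over days
            fmt.Println(p, rolloutEnabled("user-1234", p))
        }
    }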
And yet it's not even close to the productivity hurting rules and procedures followed for aerospace software.
Rules in aerospace are written in blood. Rules of internet software are written in inconvenience, so productivity is usually given much higher priority than reducing risk of catastrophic failure.
Whatever this "red-button" technology is, is pants. If you know you want to turn something off at incident + 10 mins, it should be off within a minute. Not "Preparing a change to trigger the red-button", but "the stop flag was set by an operator in a minute and was synched globally within seconds".
I mean, it's not like they don't have that technology: the worldwide sync was exactly what caused the outage.
At $WORK we use Consul for this job.
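For what that looks like in practice, a minimal sketch with the official Consul Go client; the key name and the choice to read on demand rather than set up a watch are my assumptions:

    package main

    import (
        "fmt"
        "log"

        "github.com/hashicorp/consul/api"
    )

    func main() {
        // Connect to the local Consul agent (default address 127.0.0.1:8500).
        client, err := api.NewClient(api.DefaultConfig())
        if err != nil {
            log.Fatal(err)
        }

        // Read a hypothetical kill-switch key. An operator flips it with
        // `consul kv put service/policy-check/disabled true` and every reader
        // picks up the change on its next poll.
        pair, _, err := client.KV().Get("service/policy-check/disabled", nil)
        if err != nil {
            log.Fatal(err)
        }
        disabled := pair != nil && string(pair.Value) == "true"
        fmt.Println("policy check disabled:", disabled)
    }

The appeal is that flipping the switch is a data change, not a build-and-deploy, which is exactly the distinction being argued about above.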
Generally, even these emergency changes are not applied completely immediately, to prevent the fix from making things worse. That is an operational choice though, not a technical limitation. My guess, having been involved in similar issues in the past, is that the ~15 minute delay preparing the change was either because this wasn't a normally used big red button, so it wasn't clear how to use it, or because there was some other friction in preparing the change.
What is the difference between a red button and a feature flag, anyway? The report says there was no feature flagging, yet they had this "red button".
It sounds to me like something needed to be recompiled and redeployed.
TLDR a dev forgot an if err != nil { return 0, err } in some critical service
More critically, some service allowed for unvalidated instant rollout of policy changes to prod. The actual bug in the code is largely irrelevant compared to the first issue.
Processes and systems have flaws that can be fixed; humans will always make mistakes.
Does the Go compiler not force you to handle errors?
It doesn't exactly, no. There are linters and a compiler-enforced check preventing unused variables. Overall it's pretty easy to accidentally drop errors or overwrite them before checking.
So no, it doesn't.
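A small example of what "accidentally dropping or overwriting errors" looks like; this compiles without any complaint from the compiler:

    package main

    import (
        "fmt"
        "strconv"
    )

    func main() {
        // Compiles fine: the error is explicitly discarded with the blank identifier.
        n, _ := strconv.Atoi("not a number")

        // Also compiles fine: err from the first call is overwritten by the second
        // before it is ever checked, so the first failure is silently lost.
        a, err := strconv.Atoi("oops")
        b, err := strconv.Atoi("42")
        if err != nil { // only sees the second (nil) error
            fmt.Println("never reached:", err)
        }
        fmt.Println(n, a, b)
    }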
No language forces you to handle errors. Not even Rust.
Probably C++ or Java, though.
That is still the most bizarre error handling pattern that is just completely accepted by the Go community. And after a decade of trying to fix it, the Go team just recently said that they've given up.
It's not that bizarre; it's exactly how errors used to be handled in older languages like C. Golang is not strange, it's just outdated. It was created by Unix greybeards, after all.
TLDR, unexpected blank fields
> policy change was inserted into the regional Spanner tables
> This policy data contained unintended blank fields
> Service Control... pulled in blank fields... hit null pointer causing the binaries to go into a crash loop
Really, whatever null condition caused the crash is mostly irrelevant. The big problem is the instantaneous global replication of policy changes. All code can fail unexpectedly, so the gradual rollout of the code was pointless: the new path doesn't take effect until a policy change arrives, yet policy changes replicate near instantaneously.
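The service is reportedly C++, but as a language-agnostic sketch in Go (field names and the fallback value are hypothetical), the defensive version of that lookup is not much code:

    package main

    import (
        "errors"
        "fmt"
    )

    // policyRow mimics a replicated policy record; pointer fields can be nil when
    // the source row had blank columns.
    type policyRow struct {
        Project *string
        Limit   *int64
    }

    // quotaLimit validates the row before use, returning an error instead of
    // dereferencing a missing field.
    func quotaLimit(p *policyRow) (int64, error) {
        if p == nil || p.Project == nil || p.Limit == nil {
            return 0, errors.New("policy row has blank fields")
        }
        return *p.Limit, nil
    }

    func main() {
        limit, err := quotaLimit(&policyRow{}) // row with blank fields
        if err != nil {
            // Fail open: fall back to a permissive default rather than crash-looping.
            fmt.Println("falling back to default limit:", err)
            limit = 1_000_000
        }
        fmt.Println("effective limit:", limit)
    }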
lol at whoever approved the report not catching the fuckup of “red-button” instead of “big red button”.
With all of the basic mistakes in the content of this report, how is removal of the word "big" worthy of description as a fuckup?
Still not rewriting in Rust?
For all practical purposes impossible at this scale. The issue is not really the bug tbh.
What is the issue?
They probably will now
enough with rust already