Comment by asim
2 days ago
Google post-mortems never cease to amaze me, from seeing them inside the company to seeing them outside. The level of detail is amazing. The thing is, they will never make the same mistake again. They learn from it, put in the correct protocols and error handling, and then build an even more robust system. At Google's scale there is always something going wrong; the point is how it's handled so it doesn't affect the customer/user and other systems. Honestly it's an ongoing thing you don't see unless you're inside, and even then, on a per-team basis you might see things no one else is seeing. It's probably the closest we're going to come to the most complex systems in the universe, because we as humans will never do better than this. Maybe AGI does, but we won't.
But this is a whole series of junior level mistakes:
* Not dealing with null data properly
* Not testing it properly
* Not having test coverage showing your new thing is tested (see the test sketch after this list)
* Not exercising it on a subset of prod after deployment to show it works without falling over before it gets pushed absolutely everywhere
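To make the test-coverage point concrete: below is a sketch of the kind of table-driven Go test that would have exercised the blank-field case before rollout. The `regionFromPolicy` function, its error, and the field values are all made up for illustration; nothing here comes from Google's report.

```go
package policy

import (
	"errors"
	"testing"
)

// Hypothetical stand-ins: none of these names come from the actual service.
var errBlankField = errors.New("policy field is blank")

// regionFromPolicy represents whatever code consumed the new policy data.
// The point is that it must handle a blank field instead of assuming it's set.
func regionFromPolicy(field string) (string, error) {
	if field == "" {
		return "", errBlankField
	}
	return field, nil
}

// A table-driven test whose whole job is to prove the blank-field case is covered.
func TestRegionFromPolicy(t *testing.T) {
	cases := []struct {
		name    string
		in      string
		wantErr bool
	}{
		{"populated field", "us-central1", false},
		{"blank field", "", true},
	}
	for _, tc := range cases {
		t.Run(tc.name, func(t *testing.T) {
			_, err := regionFromPolicy(tc.in)
			if (err != nil) != tc.wantErr {
				t.Fatalf("regionFromPolicy(%q): err = %v, wantErr = %v", tc.in, err, tc.wantErr)
			}
		})
	}
}
```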
Standards in this industry have dropped over the years, but by this much? If you had done this 10 years ago as a Google customer, for something far less critical, everyone on their side would have been smugly lolling at you, and rightly so.
> junior level mistakes
> Not dealing with null data properly
This is _hardly_ a "junior level mistake". That kind of bug is pervasive in all the languages they're likely using for this service (Go, Java, C++), even in code written by the most "senior" developers.
From their report: "This policy data contained unintended blank fields."
This was reading fields from a database that were coming back null, not some pointer that mutates to null after a series of nasty state transitions, so this is very much in the junior category.
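And the defensive version of that read is not exotic. A rough Go sketch, with entirely made-up type and field names, of validating a policy record before anything dereferences it:

```go
package policy

import "fmt"

// Policy is a made-up stand-in for the record whose blank fields triggered the crash.
type Policy struct {
	Name   string
	Quota  *int // nil when the upstream row has no value
	Region string
}

// Validate rejects records with missing data instead of letting a nil or
// blank field propagate into code that assumes it is populated.
func (p *Policy) Validate() error {
	if p == nil {
		return fmt.Errorf("policy is nil")
	}
	if p.Name == "" || p.Region == "" {
		return fmt.Errorf("policy %q has blank required fields", p.Name)
	}
	if p.Quota == nil {
		return fmt.Errorf("policy %q has no quota set", p.Name)
	}
	return nil
}

// Caller: skip or reject bad rows; don't crash the serving path.
func applyPolicies(rows []*Policy) {
	for _, p := range rows {
		if err := p.Validate(); err != nil {
			// Log and fall back to the last known-good policy rather than dereferencing nil.
			fmt.Println("skipping invalid policy:", err)
			continue
		}
		_ = *p.Quota // safe here: Validate guarantees Quota is non-nil
	}
}
```

The design choice that matters is failing per record (skip, or fall back to the last known-good data) instead of letting one bad row take down the serving path.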
2 replies →
As I understand it the outage was caused by several mistakes:
1) A global feature release that went everywhere at the same time
2) Null pointer dereference
3) Lack of appropriate retry policies that resulted in a thundering herd problem
All of these are absolutely standard mistakes that everyone who's worked in the industry for some time has seen numerous times. There is nothing novel here: no weird distributed-systems logic, no Google scale, just rookie mistakes all the way.
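On point 3, the textbook client-side fix is capped exponential backoff with jitter. A minimal Go sketch; `fetchPolicy`, the delays, and the retry cap are all assumptions for illustration, not anything from the report:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// fetchPolicy is a placeholder for the call that was hammering the backend.
func fetchPolicy(ctx context.Context) error {
	return errors.New("unavailable") // pretend the service is still recovering
}

// callWithBackoff retries with capped exponential backoff plus full jitter,
// so a fleet of clients doesn't retry in lockstep (the thundering herd).
func callWithBackoff(ctx context.Context) error {
	const (
		baseDelay = 100 * time.Millisecond
		maxDelay  = 30 * time.Second
		maxTries  = 8
	)
	delay := baseDelay
	var err error
	for i := 0; i < maxTries; i++ {
		if err = fetchPolicy(ctx); err == nil {
			return nil
		}
		sleep := time.Duration(rand.Int63n(int64(delay))) // full jitter: [0, delay)
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
		if delay *= 2; delay > maxDelay {
			delay = maxDelay
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxTries, err)
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	fmt.Println(callWithBackoff(ctx))
}
```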
Null pointer dereference has nothing to do with that problem. They have billions of lines of code; do you think that kind of issue will never happen?
This is 100% a process problem.
A null pointer crashing your entire service in 2025 is a process problem.
I mean, I don't want to be a "Keeps bringing up Rust in the comments" sort of guy, but null pointer dereferences are in fact a problem that can be solved essentially completely using strict enough typing and static analysis.
Most existing software stacks (including Google's C++, Go, Java) are by no means in a position to solve this problem, but that doesn't change that it is, in fact, a problem that fundamentally can be solved using types.
Of course, that'd require a full rewrite of the service.
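To make the contrast concrete, here is the gap in Go terms: the hypothetical snippet below compiles without complaint and panics at runtime, which is roughly the failure mode described in the report. The Rust equivalent would model the field as something like `Option<i64>` and refuse to compile until the `None` case is handled somehow, even if only by an explicit `unwrap`.

```go
package main

import "fmt"

// Policy is a made-up example; Quota is a pointer because the field may be absent.
type Policy struct {
	Quota *int
}

func main() {
	p := Policy{} // Quota left nil, like an "unintended blank field"

	// This compiles without a murmur and panics at runtime:
	//   panic: runtime error: invalid memory address or nil pointer dereference
	fmt.Println(*p.Quota)
}
```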
8 replies →
> They will never make the same mistake again
They rolled out a change without feature flagging, didn't implement exponential backoff in the clients, and didn't implement load shedding in the servers.
This is all in the Google SRE book from many years ago.
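Load shedding in particular is a few lines of middleware, not a research problem. A rough Go sketch using a buffered channel as a semaphore; the capacity number is arbitrary and purely illustrative:

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// shedLoad caps the number of in-flight requests; anything beyond that is
// rejected with a 503 instead of piling up and taking the whole server down.
func shedLoad(maxInFlight int, next http.Handler) http.Handler {
	sem := make(chan struct{}, maxInFlight)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			next.ServeHTTP(w, r)
		default:
			// Over capacity: shed the request cheaply and tell clients to back off.
			w.Header().Set("Retry-After", "1")
			http.Error(w, "overloaded, try again later", http.StatusServiceUnavailable)
		}
	})
}

func main() {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	// 500 concurrent requests is an arbitrary cap for illustration.
	log.Fatal(http.ListenAndServe(":8080", shedLoad(500, ok)))
}
```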
This error was an uncaught null pointer issue.
For a company of Google's size and quality to bring down the majority of their stack with this type of error really suggests they do not implement appropriate mitigations after serious issues.
This is literally the same mistake that has been made many times before. Of course it will be made again. “New feature is rolled out carefully with a bug that remains latent until triggered by new data” could summarize most global outages.
The thing is, nobody is perfect. Except armchair HN commenters on threads about FAANG outages, of course.
So yes, for some of them. But not this one: this one is a major embarrassment.
Amateur hour in Mountain View.