Comment by usrnm
2 days ago
As I understand it the outage was caused by several mistakes:
1) A global feature release that went everywhere at the same time
2) Null pointer dereference
3) Lack of appropriate retry policies that resulted in a thundering herd problem
All of these are absolutely standard mistakes that everyone who's worked in the industry for some time has seen numerous times. There is nothing novel here, no weird distributed system logic, no Google scale, just rookie mistakes all the way.
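For what it's worth, the usual fix for point 3 is capped exponential backoff with jitter, so restarting clients spread their retries out instead of all hitting the backend at once. A rough sketch, not anything from the actual incident: the function names and the `base`/`cap` parameters are made up for illustration, and the random number is passed in by the caller because the Rust standard library has no RNG.

```rust
use std::time::Duration;

/// Upper bound on the wait before retry number `attempt` (0-based):
/// `base * 2^attempt`, never more than `cap`.
/// (Hypothetical helper, named for this example only.)
fn backoff_ceiling(base: Duration, cap: Duration, attempt: u32) -> Duration {
    // Saturate instead of overflowing for very large attempt counts.
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// "Full jitter": scale the ceiling by a random fraction so that clients
/// which failed at the same moment do not retry at the same moment.
/// `unit_random` is assumed to be drawn uniformly from [0.0, 1.0) by the caller.
fn jittered_delay(base: Duration, cap: Duration, attempt: u32, unit_random: f64) -> Duration {
    let ceiling = backoff_ceiling(base, cap, attempt);
    ceiling.mul_f64(unit_random.clamp(0.0, 1.0))
}
```

With a 100 ms base and a 10 s cap, the ceiling goes 100 ms, 200 ms, 400 ms, 800 ms and so on until it hits the cap, and each client sleeps for a random fraction of that.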
Null pointer dereference has nothing to do with that problem. They have billions of LOC; you think that kind of issue will never happen?
This is 100% a process problem.
A null pointer crashing your entire service in 2025 is a process problem
I mean, I don't want to be a "Keeps bringing up Rust in the comments" sort of guy, but null pointer dereferences are in fact a problem that can be solved essentially completely using strict enough typing and static analysis.
Most existing software stacks (including Google's C++, Go, Java) are by no means in a position to solve this problem, but that doesn't change the fact that it is a problem that fundamentally can be solved using types.
Of course, that'd require a full rewrite of the service.
Rust only stops you crashing the process in the "oh, I didn't realise this field can be null" case.
It doesn't help with the "yes, in theory it can be null, but in practice it never will be" case. Once you write .expect() you are crashing the service just as badly as a dereference when your assumptions turn out to be wrong.
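To illustrate the difference, here is a minimal sketch. The `Policy` struct, its `quota` field, and `DEFAULT_QUOTA` are all hypothetical names invented for this example and say nothing about the real service. The type system forces you to acknowledge that the field can be missing, but what happens next is still your decision.

```rust
/// Hypothetical record whose `quota` field may legitimately be absent.
struct Policy {
    quota: Option<u32>,
}

/// Hypothetical fallback value, chosen only for this example.
const DEFAULT_QUOTA: u32 = 100;

/// "In practice it never will be None": panics when the assumption is wrong,
/// which takes the process down just like a null dereference would.
fn quota_or_crash(policy: &Policy) -> u32 {
    policy.quota.expect("quota is always set in practice")
}

/// Degrade gracefully: fall back to a default when the field is missing.
fn quota_or_default(policy: &Policy) -> u32 {
    policy.quota.unwrap_or(DEFAULT_QUOTA)
}

/// Surface the missing field as an error and let the caller decide,
/// for example by rejecting this one record and carrying on.
fn quota_checked(policy: &Policy) -> Result<u32, String> {
    policy.quota.ok_or_else(|| "policy has no quota".to_string())
}
```

All three compile fine. Only the first one crashes on bad data, and the compiler will not stop anyone from writing it.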
You're assuming that not having a crash would prevent the problem they had.
You just brought up Rust. Welcome to the club.