Comment by usrnm

2 days ago

As I understand it the outage was caused by several mistakes:

1) A global feature release that went everywhere at the same time

2) Null pointer dereference

3) Lack of appropriate retry policies that resulted in a thundering herd problem

All of these are absolutely standard mistakes that everyone who's worked in the industry for some time has seen numerous times. There is nothing novel here, no weird distributed system logic, no Google scale, just rookie mistakes all the way.
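
For what it's worth, point 3 is usually addressed with capped exponential backoff plus random jitter, so that clients which failed at the same moment do not all retry at the same moment. Below is a minimal sketch of that idea, not what the affected service actually runs. Every name in it (`XorShift64`, `backoff_ceiling`, `jittered_delay`) is hypothetical, and the tiny PRNG is only there so the example needs no external crates.

```rust
use std::time::Duration;

/// Tiny xorshift64 PRNG so the sketch needs no external crates.
/// Fine for spreading out retries, not for anything security-sensitive.
struct XorShift64(u64);

impl XorShift64 {
    fn new(seed: u64) -> Self {
        // xorshift must not start at zero, or it stays at zero forever.
        XorShift64(if seed == 0 { 0x9E37_79B9_7F4A_7C15 } else { seed })
    }

    fn next_u64(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Upper bound for the delay before retry number `attempt` (0-based):
/// base * 2^attempt, capped at `cap`.
fn backoff_ceiling(attempt: u32, base: Duration, cap: Duration) -> Duration {
    let factor = 2u32.checked_pow(attempt).unwrap_or(u32::MAX);
    base.checked_mul(factor).unwrap_or(cap).min(cap)
}

/// "Full jitter": pick a roughly uniform random delay in [0, ceiling],
/// so clients that failed together spread their retries over the window.
fn jittered_delay(
    attempt: u32,
    base: Duration,
    cap: Duration,
    rng: &mut XorShift64,
) -> Duration {
    let ceiling_ms = backoff_ceiling(attempt, base, cap).as_millis() as u64;
    Duration::from_millis(rng.next_u64() % ceiling_ms.saturating_add(1))
}
```

A client would seed the generator differently per process (for example from its PID or the clock) and sleep for `jittered_delay(attempt, base, cap, &mut rng)` before each retry. Without the jitter, every client computes the same delay and the herd just arrives later.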

11 comments

Thaxll 2 days ago

The null pointer dereference has nothing to do with that problem. They have billions of LOC; do you think that kind of issue will never happen?

This is 100% a process problem.

usrnm 2 days ago

A null pointer crashing your entire service in 2025 is a process problem.

Mond_ 2 days ago

I mean, I don't want to be a "Keeps bringing up Rust in the comments" sort of guy, but null pointer dereferences are in fact a problem that can be solved essentially completely using strict enough typing and static analysis.

Most existing software stacks (including Google's C++, Go, Java) are by no means in a position to solve this problem, but that doesn't change that it is, in fact, a problem that fundamentally can be solved using types.

Of course, that'd require a full rewrite of the service.

- bjackman 15 hours ago
  
  Rust only stops you crashing the process in the "oh, I didn't realise this field can be null" case.
  It doesn't help with the "yes, in theory it can be null, but in practice it never will be" case. Once you write .expect() you are crashing the service just as badly as a dereference when your assumptions turn out to be wrong.
- Thaxll 2 days ago
  
  You're assuming that not having a crash would prevent the problem they had.
  
- koakuma-chan 2 days ago
  
  You just brought up Rust. Welcome to the club.
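
To make the disagreement between Mond_ and bjackman concrete, here is a minimal sketch of the two cases. The `Policy` type and its `quota` field are hypothetical stand-ins for a field that "never is null in practice"; nothing here is taken from the actual service. The type system forces you to acknowledge that the value can be absent, but it does not choose between crashing and recovering for you.

```rust
/// Hypothetical record; `quota` stands in for the field that is
/// "never null in practice".
struct Policy {
    quota: Option<u32>,
}

/// Encodes the assumption that the field is always set. If that
/// assumption is ever wrong, this panics, which is no better for
/// availability than a null dereference.
fn quota_or_panic(policy: &Policy) -> u32 {
    policy.quota.expect("quota is always set")
}

/// Treats a missing field as bad input and hands the decision to the
/// caller, who can skip the record, fall back to a default, or alert.
fn quota_checked(policy: &Policy) -> Result<u32, String> {
    policy
        .quota
        .ok_or_else(|| "policy has no quota".to_string())
}
```

With `quota: None`, `quota_checked` returns an `Err` that the caller can log and skip, while `quota_or_panic` takes the thread down. Both compile, so which one ends up in production is still a review and process question.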